Trained model can generate correct text but incorrect speech #13

chentuochao · 2024-07-27T09:57:28Z

I tried to reproduce the training of the fr-en simultaneous model. I followed the instructions to prepare the dataset and ran the script train.simul-s2st.sh.
Training seems to go fine, but during evaluation of our trained model (using ./simuleval.simul-s2st.sh), weird behavior occurs.
Here is the training log:

During inference, when I ran the eval script on the example you provided, the model output the correct text translation but the output speech was incorrect (almost silent). I printed the text output and the speech-unit output as follows:

Do you know what the problem might be?

Thank you

zhangshaolei1998 · 2024-07-28T03:19:48Z

Have you tried testing directly with the model we provide, and does the same thing happen?

If not, it may be a problem with the training script. Could you share the training script?

chentuochao · 2024-07-28T04:43:08Z

Thank you for your kind reply!
I also tried the pretrained model you provide, and it works well. The weird issue only happens with my trained model.
Here is the training script I am using:

export CUDA_VISIBLE_DEVICES=0

LANG=fr
DATA_ROOT=/scr/data/zhangshaolei/datasets/cvss/cvss-c
DATA=$DATA_ROOT/${LANG}-en/fbank2unit
model=streamspeech.simul-s2st.${LANG}-en

fairseq-train $DATA \
  --user-dir researches/ctc_unity \
  --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
  --task speech_to_speech_ctc --target-is-code --target-code-size 1000 --vocoder code_hifigan  \
  --criterion speech_to_unit_2pass_ctc_asr_st --label-smoothing 0.1 --rdrop-alpha 0.0 \
  --arch streamspeech --share-decoder-input-output-embed \
  --encoder-layers 12 --encoder-embed-dim 256 --encoder-ffn-embed-dim 2048 --encoder-attention-heads 4 \
  --translation-decoder-layers 4 --synthesizer-encoder-layers 2 \
  --decoder-layers 2  --decoder-embed-dim 512 --decoder-ffn-embed-dim 2048 --decoder-attention-heads 8 \
  --k1 0 --k2 0 --n1 1 --n2 -1 \
  --chunk-size 8 --multichunk \
  --uni-encoder \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --ctc-upsample-rate 25 \
  --save-dir checkpoints/$model \
  --validate-interval 1000 --validate-interval-updates 1000 \
  --save-interval 1 --save-interval-updates 1000 \
  --keep-last-epochs 15 \
  --no-progress-bar --log-format json --log-interval 100 \
  --lr 0.001 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 1.0 \
  --max-tokens 22000 --max-target-positions 1200 --update-freq 2 \
  --attn-type espnet --pos-enc-type rel_pos \
  --keep-interval-updates 40 \
  --keep-best-checkpoints 20 \
  --seed 1 --fp16 --num-workers 8

config_gcmvn.yaml

global_cmvn:
  stats_npz_path: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocoder:
  checkpoint: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
  config: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json
  type: code_hifigan

config_mtl_asr_st_ctcst.yaml

target_unigram:
   decoder_type: transformer
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
   loss_weight: 8.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 4
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1
source_unigram:
   decoder_type: ctc
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000
   loss_weight: 4.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 0
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1
ctc_target_unigram:
   decoder_type: ctc
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
   loss_weight: 4.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 0
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1

I also attach the model we trained here (https://drive.google.com/file/d/1rdOEt1NSt8oxUBHL0WfM_CCtKczt6TzO/view?usp=share_link)

zhangshaolei1998 · 2024-07-29T06:56:05Z

There seems to be no problem with the training script. Problems with generating short speech are often caused by the non-autoregressive text-to-unit generation module. Have you modified this part of the code?

chentuochao · 2024-07-30T09:54:35Z

Yeah, I think the problem should be in the non-autoregressive text-to-unit generation module. I did not change any part of the training code or the model. Do you have any idea what is happening?
I will re-download the GitHub repo and train again to see whether I am still facing the problem, and I will post an update in this issue.

zhangshaolei1998 · 2024-07-30T10:21:52Z

Sorry, I haven't encountered this problem before, so I don't yet know how to solve it.

Maybe you can retrain with the latest code and record the final loss. We can see whether the loss after convergence is within the normal range.
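
A minimal sketch for pulling the loss curve out of the training log, assuming the log was written with --log-format json (as in the training script above) and that each stats line ends with a JSON object. The key names `loss` and `num_updates`, the helper name `extract_losses`, and the sample lines are assumptions for illustration; the sample lines are synthetic, not real training output.

```python
import json


def extract_losses(log_lines, key="loss"):
    """Collect (num_updates, value) pairs from log lines that end in a JSON object.

    Assumption: each stats line contains a JSON dict with `num_updates` and
    `key`; lines without such a dict are skipped silently.
    """
    points = []
    for line in log_lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            stats = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # a brace appeared, but the rest is not valid JSON
        if not isinstance(stats, dict):
            continue
        if key not in stats or "num_updates" not in stats:
            continue
        try:
            # Values may be logged as strings, so convert explicitly.
            step = int(float(stats["num_updates"]))
            value = float(stats[key])
        except (TypeError, ValueError):
            continue
        points.append((step, value))
    return points


# Synthetic lines in an assumed format (NOT real training output).
sample = [
    'prefix | {"loss": "9.5", "num_updates": "100"}',
    "a line without any stats",
    'prefix | {"loss": "7.25", "num_updates": "200"}',
]
print(extract_losses(sample))
```

Calling the same function with a different `key` (for example one of the per-task loss names that appear in your own log) makes it possible to compare the individual loss terms between a working and a failing run.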

Lili-q · 2024-08-02T10:33:02Z

Hello, I also trained a fr-en streaming S2ST model exactly according to the tutorial, without making any changes to the code, and encountered a problem similar to yours.

The streaming ASR result is normal, but the simultaneous translation result is incorrect, the corresponding tokens are also abnormal (very short), and the synthesized audio is less than 1 s long, with almost no sound.

I tested the same source audio using my own trained model and the pre-trained model provided by the author. See the following pictures.

a. Result on my own trained model:

b. Results on the pre-trained model provided by the author

Did you solve your problem?

chentuochao · 2024-08-05T08:23:58Z

Dear authors,
I redid the whole pipeline, but I still have the same issue.
Here are all the commands we used after installing the environment:

bash 0.download_pretrain_models.sh

# changed the env variables
bash preprocess.sh

# changed the paths in config_gcmvn.yaml

# copy and paste config_mtl_asr_st_ctcst.yaml to fbank2unit

# changed paths in train.simul-s2st.sh
bash train.simul-s2st.sh

# changed paths in simuleval.simul-s2st.sh
bash simuleval.simul-s2st.sh
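
Since several of the steps above involve editing paths by hand, a quick check that every path in the config files actually exists can rule out one class of mistakes. This is a minimal sketch: the helper name `missing_paths` and the regular expression are assumptions, it only looks at `key: /path` or `key: ./path` lines, and it does not parse YAML properly.

```python
import os
import re

# Assumption: path values start with "/", "./" or "../" and end the line.
PATH_VALUE = re.compile(r":\s*(\.{0,2}/\S+)\s*$")


def missing_paths(config_text, base_dir="."):
    """Return the path-like values in config_text that do not exist on disk.

    Relative paths are resolved against base_dir (for example the directory
    from which the training script is launched).
    """
    missing = []
    for line in config_text.splitlines():
        match = PATH_VALUE.search(line)
        if match is None:
            continue  # not a "key: path" line
        path = match.group(1)
        full = path if os.path.isabs(path) else os.path.join(base_dir, path)
        if not os.path.exists(full):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Hypothetical usage: run from the directory that contains the config file.
    config_file = "config_gcmvn.yaml"
    if os.path.exists(config_file):
        with open(config_file, encoding="utf-8") as handle:
            for path in missing_paths(handle.read()):
                print("missing:", path)
```

An empty result only means the files exist; it does not confirm that they are the right files for this language pair.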

Do you know what the potential problem is?

EmreOzkose · 2024-08-31T17:24:01Z

I have the same issue. Is there any update?

chentuochao · 2024-08-31T20:31:54Z

Hi Emre,
I found that this bug is related to the loss function, and the author pushed a fix for the loss in the most recent commit. Just pull it and the problem will be solved.

EmreOzkose · 2024-09-02T11:04:45Z

I am training on my own data and applied the loss bug fix. ASR and translation seem okay (WER decreases to ~30%). However, I still cannot get meaningful audio output after the fix: the outputs are very short and sound like noise.

EmreOzkose · 2024-09-02T11:52:54Z

I use a different HuBERT model to extract source units. Does that affect this?

EmreOzkose · 2024-09-03T06:20:42Z

That was the problem :). I misunderstood part of the model. When I changed back to the original HuBERT, the problem was solved.
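
For anyone training on custom data: the vocoder configured earlier in this thread is the mHuBERT.layer11.km1000.en unit-based HiFi-GAN and training uses --target-code-size 1000, so the units presumably have to come from the matching unit extractor. A minimal sanity check on an extracted unit sequence is sketched below; the helper name `check_units` and the space-separated integer format are assumptions. It only catches out-of-range IDs. Units from a different HuBERT model with the same number of clusters would pass this check and still be incompatible with the vocoder.

```python
def check_units(unit_line, code_size=1000):
    """Summarise one space-separated unit sequence.

    Assumption: units are stored as integers separated by whitespace, and
    valid IDs lie in [0, code_size), matching --target-code-size.
    """
    units = [int(token) for token in unit_line.split()]
    out_of_range = [u for u in units if not 0 <= u < code_size]
    return {
        "num_units": len(units),
        "num_distinct": len(set(units)),
        "out_of_range": out_of_range,
    }


# Made-up sequence for illustration; the last ID is deliberately invalid.
print(check_units("12 12 873 4 1000"))
```

A very small `num_distinct` across many utterances, or any entry in `out_of_range`, suggests the units were not produced by the extractor the vocoder expects.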
