
Adding ILM beam search and decoding #1291

Open · wants to merge 2 commits into master

Conversation

AmirHussein96 (Contributor) commented Oct 5, 2023

This is a LibriSpeech zipformer recipe using the HAT loss from k2-fsa/k2#1244. The recipe includes HAT training, greedy decoding, modified beam search decoding, and ILM subtraction with RNN-LM shallow fusion.

So far, @desh2608 and I have tested this on LibriSpeech, and the results are similar to regular RNN-LM shallow fusion. However, the intended use is adaptation to a new domain with an external RNN-LM trained on that domain.
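
For context, HAT factors the transducer output into a Bernoulli blank distribution and a label distribution, which makes the internal LM explicit and therefore subtractable. A rough sketch of the fused score used in the ILME decoding (the notation here is illustrative, not taken from the recipe's code):

$$
P(\text{blank}\mid \mathbf{x}, y_{<u}) = \sigma(z_b), \qquad
P(y_u = k\mid \mathbf{x}, y_{<u}) = \bigl(1-\sigma(z_b)\bigr)\,\mathrm{softmax}(z_\ell)_k
$$

$$
\text{score}(k) \;=\; \log P_{\text{HAT}}(y_u = k \mid \mathbf{x}, y_{<u})
\;+\; \lambda_{\text{LM}} \log P_{\text{LM}}(k \mid y_{<u})
\;-\; \lambda_{\text{ILM}} \log P_{\text{ILM}}(k \mid y_{<u})
$$

where $z_b$ and $z_\ell$ are the joiner's blank and label logits, $P_{\text{ILM}}$ is estimated by running the joiner on the prediction-network output alone (no encoder contribution), and $\lambda_{\text{LM}}$, $\lambda_{\text{ILM}}$ correspond to the LM and ILM scales reported in the tables below.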

| Model | Train | Decode | LM scale | ILM scale | test-clean | test-other |
|---|---|---|---|---|---|---|
| Zipformer-HAT | train-960 | greedy_search | - | - | 2.22 | 5.01 |
| | | modified_beam_search | 0 | 0 | 2.18 | 4.96 |
| | | + RNNLM shallow fusion | 0.29 | 0 | 1.96 | 4.55 |
| | | - ILME | 0.29 | 0.1 | 1.95 | 4.55 |
| | | - ILME | 0.29 | 0.3 | 1.97 | 4.5 |

desh2608 (Collaborator) commented Oct 5, 2023

@AmirHussein96 if you have some time, you can try out the experiment suggested by @marcoyang1998: #1271 (comment).

@marcoyang1998 do you have an RNNLM trained on GigaSpeech?

marcoyang1998 (Collaborator) commented:

I believe @yfyeung has an RNNLM trained on GigaSpeech. @yfyeung, would you mind sharing one? Maybe you could upload it to Hugging Face.

desh2608 linked an issue on Oct 5, 2023 that may be closed by this pull request: Hybrid autoregressive transducer.
yfyeung (Collaborator) commented Oct 8, 2023

Yeah, I have an RNNLM trained on GigaSpeech, but not in icefall style.

https://huggingface.co/yfyeung/icefall-asr-gigaspeech-rnn_lm-2023-10-08

yfyeung (Collaborator) commented Oct 9, 2023

@AmirHussein96 I noticed that you modified k2.rnnt_loss_pruned in k2. Would you mind sharing your branch?

desh2608 (Collaborator) commented Oct 9, 2023

> @AmirHussein96 I noticed that you modified k2.rnnt_loss_pruned in k2. Would you mind sharing your branch?

check this: k2-fsa/k2#1244

AmirHussein96 (Contributor, Author) commented Oct 10, 2023

I conducted benchmarking on the following scenario: a Zipformer initially trained on LibriSpeech and then adapted to GigaSpeech using text only. For the adaptation, I used the GigaSpeech transcripts corresponding to the 1000h M subset to train the RNN-LM. Below is a comparison of several methods: RNN-LM shallow fusion (SF), RNN-LM LODR with a bigram, and RNN-LM shallow fusion combined with our ILME implementation.

| Method | LM scale | ILM / LODR scale | giga dev | giga test |
|---|---|---|---|---|
| modified_beam_search (baseline) | 0 | 0 | 20.81 | 19.95 |
| + RNNLM SF | 0.1 | 0 | 20.3 | 19.55 |
| + RNNLM SF | 0.29 | 0 | 19.88 | 19.21 |
| + RNNLM SF | 0.45 | 0 | 20.1 | 19.46 |
| + RNNLM SF LODR (bigram) | 0.45 | 0.16 | 20.42 | 19.6 |
| + RNNLM SF - ILME | 0.29 | 0.1 | 19.7 | 18.96 |
| + RNNLM SF - ILME | 0.45 | 0.1 | 19.54 | 18.89 |
| + RNNLM SF - ILME | 0.29 | 0.2 | 19.84 | 18.99 |

Choice of ILM/LODR and RNNLM weights:
- ILM: [0.05, 0.2] with a step of 0.05
- LODR: [0.02, 0.45] with a step of 0.05
- RNNLM: [0.05, 0.45] with a step of 0.05
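
A minimal sketch of how such a grid could be enumerated during tuning (ranges and step taken from the list above; the actual decode.py invocation is elided):

import numpy as np

# Tuning grids from the ranges listed above (step 0.05).
ilm_scales = np.round(np.arange(0.05, 0.20 + 1e-6, 0.05), 2)
lodr_scales = np.round(np.arange(0.02, 0.45 + 1e-6, 0.05), 2)
rnnlm_scales = np.round(np.arange(0.05, 0.45 + 1e-6, 0.05), 2)

for lm_scale in rnnlm_scales:
    for ilm_scale in ilm_scales:
        # one decoding run per (lm_scale, ilm_scale) pair
        print(f"lm_scale={lm_scale:.2f} ilm_scale={ilm_scale:.2f}")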

The RNN-LM configuration and training command are as follows:

./rnn_lm/train.py \
    --world-size 4 \
    --exp-dir ./rnn_lm/exp \
    --num-epochs 30 \
    --start-epoch 19 \
    --use-fp16 0 \
    --tie-weights 1 \
    --embedding-dim 512 \
    --hidden-dim 512 \
    --num-layers 2 \
    --batch-size 300 \
    --lr 0.0001 \
    --lm-data data/lm_training_bpe_500/sorted_lm_data.pt \
    --lm-data-valid data/lm_training_bpe_500/sorted_lm_data-valid.pt

RNNLM results on dev: total nll: 776663.5668945312, num tokens: 261759, num sentences: 5715, ppl: 19.435
RNNLM results on test: total nll: 2401851.5998535156, num tokens: 805072, num sentences: 19930, ppl: 19.755
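
As a sanity check, the reported perplexities follow directly from the total NLL and token counts, ppl = exp(total_nll / num_tokens); a quick verification:

import math

def ppl(total_nll: float, num_tokens: int) -> float:
    # perplexity is the exponentiated average per-token negative log-likelihood
    return math.exp(total_nll / num_tokens)

print(ppl(776663.5668945312, 261759))   # ~19.435 (giga dev)
print(ppl(2401851.5998535156, 805072))  # ~19.755 (giga test)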

marcoyang1998 (Collaborator) commented:

@AmirHussein96 I noticed that you are using a positive scale for LODR; it should be negative. You can check the code here:

hyp_log_prob += (
    lm_score[new_token] * lm_scale
    + LODR_lm_scale * current_ngram_score
    + context_score
)  # add the lm score

Would you mind re-running the decoding experiments with LODR? Thanks!

AmirHussein96 (Contributor, Author) commented Oct 10, 2023

icefall/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py

@marcoyang1998 I used the implementation of modified_beam_search_lm_rescore_LODR() below, which uses a negative weight for LODR:

am_scores.values / lm_scale + lm_scores - LODR_scores * lodr_scale

AmirHussein96 (Contributor, Author) commented Oct 10, 2023

@marcoyang1998 I tried modified_beam_search_LODR with LODR_scale=-0.24 from https://k2-fsa.github.io/icefall/decoding-with-langugage-models/LODR.html and also LODR_scale=-0.16 from my best modified_beam_search_lm_rescore_LODR() results.

| Method | beam | LM scale | ILM / LODR scale | giga dev | giga test |
|---|---|---|---|---|---|
| modified_beam_search (baseline) | 4 | 0 | 0 | 20.81 | 19.95 |
| + RNNLM SF | 4 | 0.1 | 0 | 20.3 | 19.55 |
| + RNNLM SF | 4 | 0.29 | 0 | 19.88 | 19.21 |
| + RNNLM SF | 4 | 0.45 | 0 | 20.1 | 19.46 |
| + RNNLM SF | 12 | 0.29 | 0 | 19.77 | 19.01 |
| + RNNLM lm_rescore_LODR (bigram) | 4 | 0.45 | 0.16 | 20.42 | 19.6 |
| + RNNLM LODR (bigram) | 4 | 0.45 | -0.24 | 19.38 | 18.71 |
| + RNNLM LODR (bigram) | 4 | 0.45 | -0.16 | 19.47 | 18.85 |
| + RNNLM LODR (bigram) | 12 | 0.45 | -0.24 | 19.1 | 18.44 |
| + RNNLM SF - ILME | 4 | 0.29 | 0.1 | 19.7 | 18.96 |
| + RNNLM SF - ILME | 4 | 0.45 | 0.1 | 19.54 | 18.89 |
| + RNNLM SF - ILME | 4 | 0.29 | 0.2 | 19.84 | 18.99 |
| + RNNLM SF - ILME | 12 | 0.45 | 0.1 | 19.21 | 18.57 |

The LODR results are now much better, so I think modified_beam_search_lm_rescore_LODR() should be removed from beam_search.py.

The decoding command is below:

for method in modified_beam_search_LODR; do
  ./zipformer_hat/decode.py \
  --epoch 40 --avg 16 --use-averaged-model True \
  --beam-size 4 \
  --exp-dir ./zipformer_hat/exp \
  --bpe-model data/lang_bpe_500/bpe.model \
  --max-contexts 4 \
  --max-states 8 \
  --max-duration 800 \
  --decoding-method $method \
  --use-shallow-fusion 1 \
  --lm-type rnn \
  --lm-exp-dir rnn_lm/exp \
  --lm-epoch 25 \
  --lm-scale 0.45 \
  --lm-avg 5 \
  --lm-vocab-size 500 \
  --rnn-lm-embedding-dim 512 \
  --rnn-lm-hidden-dim 512 \
  --rnn-lm-num-layers 2 \
  --tokens-ngram 2 \
  --ngram-lm-scale $LODR_scale
done

marcoyang1998 (Collaborator) commented:

> The LODR results are now much better, so I think modified_beam_search_lm_rescore_LODR() should be removed from beam_search.py.

Please have a look at #1017 and https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/index.html for a comparison between different decoding methods with language models.

> Another important comment is that the current ILME implementation is Shallow Fusion so it can be used in streaming but LODR is a language model rescoring.

LODR works in both shallow fusion and rescoring: modified_beam_search_LODR is the shallow-fusion type of LODR and modified_beam_search_lm_rescore_LODR is the rescoring type. You usually need to set a large --beam-size to achieve good results with rescoring-type methods (see https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/rescoring.html#id3).
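
A schematic contrast between the two (hypothetical helper functions, not the actual beam_search.py implementations):

# Shallow-fusion LODR (modified_beam_search_LODR): the RNN-LM score and the
# bigram correction are applied token by token during the search, so it also
# works in a streaming setup.  The LODR scale is passed as a negative value,
# so the bigram estimate is effectively subtracted.
def shallow_fusion_step(hyp_log_prob, lm_score, bigram_score, lm_scale, lodr_scale):
    return hyp_log_prob + lm_scale * lm_score + lodr_scale * bigram_score

# Rescoring LODR (modified_beam_search_lm_rescore_LODR): a plain modified beam
# search produces an n-best list first, and the finished hypotheses are then
# re-ranked; a large --beam-size helps because the n-best list must already
# contain good candidates.
def rescore_nbest(am_scores, lm_scores, bigram_scores, lm_scale, lodr_scale):
    return [am + lm_scale * lm - lodr_scale * bg
            for am, lm, bg in zip(am_scores, lm_scores, bigram_scores)]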

JuanPZuluaga commented:
Hi, sorry to step into this conversation. I have a question regarding the LM: is there any reason why an RNNLM is preferred over a Transformer-based LM for these experiments?

Thanks.

AmirHussein96 (Contributor, Author) commented:

> Hi, sorry to step into this conversation. I have a question regarding the LM: is there any reason why an RNNLM is preferred over a Transformer-based LM for these experiments?
>
> Thanks.

The primary reason for choosing an RNN-LM is its computational efficiency and suitability for streaming applications. Additionally, the improvement from using a Transformer-LM rather than an RNN-LM for rescoring is minimal.

AmirHussein96 (Contributor, Author) commented:

> @marcoyang1998 I tried modified_beam_search_LODR with LODR_scale=-0.24 from https://k2-fsa.github.io/icefall/decoding-with-langugage-models/LODR.html and also LODR_scale=-0.16 from my best modified_beam_search_lm_rescore_LODR() results. [results table and decoding command quoted from the comment above]

@marcoyang1998, you can check the updated table with beam 12. The results in the updated table show very close performance, with slight improvements of LODR over ILME. These results align with the findings presented in the LODR paper: https://arxiv.org/pdf/2203.16776.pdf. Additionally, I conducted an MPSSWE statistical test, which indicates that there is no statistically significant difference between LODR and ILME.

| | baseline | RNNLM SF | LODR | ILME |
|---|---|---|---|---|
| RNNLM SF | <0.001 | - | <0.001 | <0.001 |
| LODR | <0.001 | <0.001 | - | 1 |
| ILME | <0.001 | <0.001 | 1 | - |

danpovey (Collaborator) commented Oct 12, 2023 via email

AmirHussein96 (Contributor, Author) commented Oct 12, 2023

> Did you see any difference between zipformer with normal RNN-T and zipformer-HAT?

Yes, we compared the zipformer with the zipformer-HAT using greedy and modified beam search, and the performance is almost the same.

AmirHussein96 (Contributor, Author) commented:

Please let me know if any modifications are needed to finalize the merging of the pull request.

desh2608 (Collaborator) commented:

> Please let me know if any modifications are needed to finalize the merging of the pull request.

@AmirHussein96 this needs the k2 PR (k2-fsa/k2#1244) to be merged first.

@csukuangfj besides ILM, I am also using HAT for joint speaker diarization (with my SURT model), and Amir is using it for joint language ID in code-switched ASR. We will make PRs for those recipes in the coming months, but it would be great to have these ones checked in first.

csukuangfj (Collaborator) commented:

@marcoyang1998 Could you have a look at this PR?

export CUDA_VISIBLE_DEVICES="0,1,2,3"

# For non-streaming model training:
./zipformer/train.py \
A collaborator left a review comment on this snippet:
Please update the recipe name.

marcoyang1998 (Collaborator) commented:

Could you please add a section about HAT (WERs, training command, decoding command etc.) in RESULTS.md?

marcoyang1998 (Collaborator) commented:

I had a glance and left a few comments. The rest looked fine, thanks for the work!

Would you mind uploading your HAT model to huggingface so that other people can try it?

desh2608 (Collaborator) commented:

@AmirHussein96 if you have some time, can we make a final push to get this checked in?
