State of the Art for conformer and beam decoding #106

Open
abhinavg4 opened this issue Jan 8, 2021 · 17 comments

abhinavg4 commented Jan 8, 2021

Hi, thanks for developing this great toolkit. I have two questions about the Conformer model:

  1. For the Conformer model in examples/conformer, I think almost all the parameters match Conformer (S) from https://arxiv.org/pdf/2005.08100.pdf. However, the performance gap between the paper and the model in examples/conformer seems quite big (2.7 vs. 6.44 WER on test-clean). What do you think might be the reason for this?

One reason I can see is that the 2.7 is obtained with beam search whereas the 6.44 is not, but I don't think beam search alone can account for that difference. Can you give me some pointers on how to reduce this gap? Also, did you try decoding examples/conformer with beam search?

  2. I was trying to decode examples/conformer with beam search via test_subword_conformer.py, using the pre-trained model provided via Drive. For this I just modified the beam-width parameter in config.yml, but decoding takes a very long time (about 30 minutes per batch, with ~650 batches in test-clean) on an NVIDIA P40 with 24 GB of memory.

Is this the expected behaviour, or do I need to do something more than changing the beam width from 0 to 4/8? What was the decoding time for you?

Thanks,
Abhinav

nglehuy (Collaborator) commented Jan 8, 2021

Hi @abhigarg-iitk

  1. I tried using beam search but the WER was still around 6%. The only reason I can think of is that the model is not fully converged, since it was trained for only about 25 epochs. You can see in the transducer loss plot in the example that the gap between validation loss and training loss was still big, and it seems the losses could drop further with longer training. Unfortunately, at this time I don't have the resources to continue training the model.
  2. I tested on CPU; greedy and beam search together took only around 60 s per batch with batch_size=1. Could you be mistaking the total time over all batches for the time per batch? I think on GPU you should decode with a larger batch size instead of 1.

nglehuy added the question label on Jan 8, 2021
abhinavg4 (Author) commented Jan 12, 2021

  1. Thanks, I will look into that.

  2. Actually I tested on GPU as well as CPU, and you are right: for both it takes about 60 s per batch with batch_size=1. However, using a batch size greater than 1 does not help, since the decoder still loops through each element of the batch one by one. I agree with your comment in Conformer decode speed too slow #58, but I think we can sort the test data by length to minimize padding (see the tf.data bucketing sketch after this list). I also think we could add an option to run beam search over the whole batch instead of iterating over it one element at a time. Are you planning to add an option for batch beam decoding?

    2.1. My initial impression was that parallel_iterations in the while_loop might process the elements of a batch in parallel, but in practice I didn't observe that; the beam search was still iterating over the samples one by one.

    2.2. Please also have a look at Bug related to batch beam decoding #110.
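Something like the following is what I have in mind for the length-based sorting (just a rough tf.data sketch, where `test_dataset` stands in for however the test pipeline yields (signal, transcript) pairs):

```python
import tensorflow as tf

# test_dataset: placeholder for a tf.data.Dataset of (signal, transcript) pairs.
def signal_length(signal, transcript):
    # Bucket key: number of audio samples in the utterance.
    return tf.shape(signal)[0]

bucketed = test_dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=signal_length,
        bucket_boundaries=[16000 * s for s in (5, 10, 20)],  # 5 s / 10 s / 20 s at 16 kHz
        bucket_batch_sizes=[16, 8, 4, 2],                     # one batch size per bucket
    )
)
# Each emitted batch now contains utterances of similar length, so batched
# (beam) decoding wastes much less time on padding.
```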

Thanks

nglehuy (Collaborator) commented Jan 16, 2021

@abhigarg-iitk I'm also planning to use the batch dimension directly in decoding; I'll find a way to do that ASAP.

gandroz (Contributor) commented Jan 26, 2021

Hi @abhigarg-iitk
I agree with you; I did not succeed in reaching the same WER either, even after more than 50 epochs (maybe still not fully converged). However, the difference between the beam search and greedy search results is not that big, which tells me one of two things:

  • at such a WER, the model is efficient enough that beam search does not improve the predictions, or
  • there is an issue in the beam search implementation.

Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried a vocab of size 8000, but training does not fit in memory.
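For reference, rebuilding the subword vocab with SentencePiece looks roughly like this (a sketch; the transcript file name and options are placeholders, not the exact recipe used in the example):

```python
import sentencepiece as spm

# Placeholder transcript dump and options, not the example's exact recipe.
spm.SentencePieceTrainer.Train(
    "--input=librispeech_train_transcripts.txt "
    "--model_prefix=conformer_subwords "
    "--model_type=unigram "
    "--vocab_size=1000"
)
```

Note that a larger vocab mainly grows the joint network output, which the RNN-T loss materializes as a [batch, T, U, vocab] tensor, so memory climbs quickly with vocab size; that would explain why 8000 did not fit.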

abhinavg4 (Author) commented Jan 27, 2021

Hi @gandroz,

In my opinion:

- at such a WER, the model is efficient enough that beam search does not improve the predictions, or
- there is an issue in the beam search implementation

I think the first statement might be true: unless we have shallow fusion with an LM, beam search might not be that effective (for reference, see Table 3 in this work). That said, #123 makes a good point; maybe we can have a look at some of the standard beam search implementations in ESPnet.
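For clarity, shallow fusion here just means re-scoring candidate tokens inside the beam search with an external LM, roughly as in this minimal sketch (the weight and array names are placeholders):

```python
import numpy as np

def fused_scores(am_log_probs: np.ndarray,
                 lm_log_probs: np.ndarray,
                 lm_weight: float = 0.3) -> np.ndarray:
    """Shallow fusion: combine transducer and external-LM token scores as
    log P_AM(y|x) + lm_weight * log P_LM(y). The 0.3 weight is illustrative."""
    return am_log_probs + lm_weight * lm_log_probs

# Inside each beam-search expansion step, hypotheses would be ranked by
# fused_scores(...) instead of the acoustic log-probabilities alone.
```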

Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried a vocab of size 8000, but training does not fit in memory.

Although the Conformer paper doesn't explicitly mention the vocab size, the ContextNet paper mentions using a 1k word-piece model, and I assume Conformer might be using the same vocab. Moreover, maybe we can infer the vocab size from the number of parameters mentioned in the Conformer paper.
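A back-of-envelope way to do that, assuming (hypothetically) that only the prediction-net embedding and the joint output layer scale with the vocab size; the dims below are guesses, not numbers from the paper:

```python
def vocab_dependent_params(vocab_size, embed_dim=320, joint_dim=640):
    """Parameters that scale with the vocab size V in a transducer model:
    the prediction-net embedding (V * embed_dim) and the joint network's
    final dense layer over the vocab ((joint_dim + 1) * V, with bias)."""
    return vocab_size * embed_dim + (joint_dim + 1) * vocab_size

for v in (1000, 4096):
    print(f"vocab {v}: ~{vocab_dependent_params(v) / 1e6:.2f}M vocab-dependent params")
```

Comparing those deltas against the totals reported in the paper could at least rule some vocab sizes out.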

gandroz (Contributor) commented Jan 27, 2021

Hi @abhigarg-iitk

I contacted the first author of the paper and here is his answer:

Regarding the tokenizer, we use an internal word-piece model with a vocabulary of 1K. Regarding training recipes, we 'only' train on the Librispeech 970 hours Train-dataset.

ESPNet reproduced our results and integrated Conformer into their toolkit and posted strong results on Librispeech without Language model fusion: 1.9/4.9/2.1/4.9 (dev/devother/test/testother). Note, their model used a Transformer decoder (compared to our RNN-T decoder which should as well help improve over these results).

On the other hand, we also have open-sourced our implementation of the Conformer Layer in the encoder which might be helpful to refer to. Hope this helps!

abhinavg4 (Author) commented

Hi @gandroz,

Thanks for this answer. I had looked earlier into the Lingvo implementation of Conformer, and one strange contrast was the use of a feed-forward layer in the convolution module instead of the pointwise conv used in the original paper. Also, the class name says "Lightweight conv layer", which is also mentioned in the paper.

In fact, I also tried replacing the pointwise conv with feed-forward layers, but the results were somewhat worse, although I didn't check my implementation thoroughly.

Even ESPnet seems to use a pointwise conv and not a feed-forward layer (link).
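For reference, the convolution module as the paper describes it looks roughly like the sketch below; the dims, kernel size, and layer choices are assumptions for illustration, not the toolkit's actual implementation:

```python
import tensorflow as tf

def conformer_conv_module(x, dim, kernel_size=32):
    """Paper-style convolution module (sketch): LayerNorm -> pointwise Conv1D
    (2x expansion) -> GLU -> 1-D depthwise conv -> BatchNorm -> Swish ->
    pointwise Conv1D -> residual add. Dropout omitted for brevity."""
    residual = x
    x = tf.keras.layers.LayerNormalization()(x)
    x = tf.keras.layers.Conv1D(2 * dim, kernel_size=1)(x)       # pointwise expansion
    a, b = tf.split(x, num_or_size_splits=2, axis=-1)
    x = a * tf.nn.sigmoid(b)                                    # GLU activation
    x = tf.keras.layers.Conv1D(dim, kernel_size, padding="same",
                               groups=dim)(x)                   # depthwise conv
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.nn.swish(x)
    x = tf.keras.layers.Conv1D(dim, kernel_size=1)(x)           # pointwise projection
    return residual + x
```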

nglehuy (Collaborator) commented Jan 27, 2021

@abhigarg-iitk @gandroz Before changing the beam search, I had already adapted the beam search code from ESPnet here and tested it; the WER was lower but not much different from greedy.

nglehuy added the discussion label and removed the question label on Jan 27, 2021
nglehuy pinned this issue on Jan 27, 2021
gandroz (Contributor) commented Jan 27, 2021

I made a quick review of the model code and did not find any great difference from the ESPnet implementation. Maybe in the decoder... the paper refers to a single-layer LSTM, whereas the Transformer decoder in ESPnet adds MHA layers.
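For comparison, the decoder (prediction network) the paper describes is just an embedding followed by a single LSTM layer, something like this sketch (the dims are illustrative, not the paper's exact values):

```python
import tensorflow as tf

# Single-layer LSTM prediction network as described in the paper (sketch).
prediction_net = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=320),  # subword embedding
    tf.keras.layers.LSTM(320, return_sequences=True),           # single LSTM layer
])
```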

gandroz (Contributor) commented Jan 27, 2021

Also, from the ESPnet paper:

Our Conformer model consists of a Conformer encoder proposed in [7] and a Transformer decoder

In ASR tasks, the Conformer model predicts a target sequence Y of characters or byte-pair-encoding (BPE) tokens from an input sequence X of 80-dimensional log-mel filterbank features with/without 3-dimensional pitch features. X is first sub-sampled in a convolutional layer by a factor of 4, as in [4], and then fed into the encoder and decoder to compute the cross-entropy (CE) loss. The encoder output is also used to compute a connectionist temporal classification (CTC) loss [17] for joint CTC-attention training and decoding [18]. During inference, a token-level or word-level language model (LM) [19] is combined via shallow fusion.

So the ESPnet approach is definitely not purely the one described in the Conformer paper.
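In other words, ESPnet trains the Conformer encoder with the usual joint CTC-attention objective, roughly as below (the 0.3 weight is a commonly used value in ESPnet recipes, shown only as an illustration):

```python
def joint_ctc_attention_loss(ctc_loss, ce_loss, ctc_weight=0.3):
    """Multitask objective used in joint CTC-attention training:
    L = w * L_CTC + (1 - w) * L_CE."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * ce_loss
```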

nglehuy (Collaborator) commented Jan 29, 2021

There are two things I'm not sure about in the paper: variational noise, and the structure of the prediction and joint networks. I don't know whether they have dense layers right after the encoder and prediction net or only a dense layer after adding the two inputs, or whether there is a layer norm or projection in the prediction net.
The ContextNet paper says the structure is from this paper, which says the vocabulary is 4096 word-pieces.
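To make the ambiguity concrete, one common transducer joint-network layout is sketched below; this is only one possibility, with illustrative dims, not necessarily what the Conformer authors (or this repo) use:

```python
import tensorflow as tf

class JointNet(tf.keras.layers.Layer):
    """One possible RNN-T joint network: dense projections right after the
    encoder and prediction net, add, tanh, then a dense layer over the vocab.
    The alternative in question would drop the two projections and apply a
    single dense layer only after adding the two inputs."""
    def __init__(self, joint_dim=640, vocab_size=1000):
        super().__init__()
        self.enc_proj = tf.keras.layers.Dense(joint_dim)    # dense after encoder
        self.pred_proj = tf.keras.layers.Dense(joint_dim)   # dense after prediction net
        self.vocab_out = tf.keras.layers.Dense(vocab_size)

    def call(self, enc_out, pred_out):
        # enc_out: [B, T, D_enc], pred_out: [B, U, D_pred]
        joint = (self.enc_proj(enc_out)[:, :, tf.newaxis, :] +
                 self.pred_proj(pred_out)[:, tf.newaxis, :, :])  # [B, T, U, joint_dim]
        return self.vocab_out(tf.nn.tanh(joint))                 # [B, T, U, vocab]
```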

gandroz (Contributor) commented Jan 29, 2021

@usimarit I'm waiting for an answer about the joint network and the choices made by the Conformer team. I'll let you know when I have further details.

pourfard commented Jan 31, 2021

Hi,
After 10 epochs the transducer_loss is about 3.5 on my training data (300 hours), but the test results are not promising. What is the transducer_loss after 20 or 30 epochs on your training data (LibriSpeech)? Does it get under 1.0 after 20-30 epochs? Should I keep waiting for 20 more epochs? Every 5 epochs takes a day on my 1080 Ti.

Here is my training log.

@gandroz @usimarit @abhigarg-iitk

gandroz (Contributor) commented Jan 31, 2021

@pourfard I could not say... I'm using the whole training dataset (960 h), and after 50 epochs the loss was 22.16 on the training set and 6.3 on the dev set. And yes, it takes a very long time to train...

[attached: training and validation loss curves]

tund commented Feb 2, 2021

@usimarit I'm waiting for an answer about the joint network and the choices made by the Conformer team. I'll let you know when I have further details.

Hi @gandroz, have you heard anything back from the Conformer authors?

gandroz (Contributor) commented Feb 3, 2021

@tund Not yet, I'll try to poke him again tomorrow.

tund commented Feb 3, 2021

Thanks @gandroz
