
Training ASR models in multiple languages #91

Open
monatis opened this issue Dec 30, 2020 · 9 comments
Labels
help wanted Extra attention is needed

Comments

monatis commented Dec 30, 2020

TensorFlowASR makes it quite easy to train and deploy almost-SOTA ASR models, but it provides a pretrained model only in English. On the other hand, FAIR has recently published an open and free dataset in 8 languages (see the paper). It is in the public domain, large, and of comparable quality to LibriSpeech. So my suggestion is to form a volunteer working group to collaborate on training ASR models in multiple languages and share them publicly.

Maintainers of the repo can pin the issue and label it with help-wanted for visibility if this idea makes sense.

@nglehuy nglehuy pinned this issue Dec 30, 2020
@nglehuy nglehuy added the help wanted Extra attention is needed label Dec 30, 2020
nglehuy commented Dec 30, 2020

This is a great idea 😆

monatis commented Dec 30, 2020

Hi @usimarit! Thanks for such a great project and your support in boosting the visibility of this issue 😻 As a first step, I will write a helper script that automatically downloads the MLS dataset in a given language and prepares the transcription and alphabet files, and open a PR to add it to the repo. Then I can train a Conformer model in German using this script 🚀
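A minimal sketch of what such a preparation step might look like. This assumes MLS's usual layout (a `transcripts.txt` with tab-separated `<id>\t<transcript>` lines, where ids look like `speaker_book_utt` and map to `audio/<speaker>/<book>/<id>.flac`); the function name and output layout are illustrative, not TensorFlowASR's exact manifest format:

```python
import os

def mls_to_manifest(lines, audio_root):
    """Turn MLS transcripts.txt lines into (audio_path, transcript) rows.

    Each line is '<id>\t<transcript>'; the id 'speaker_book_utt' maps to
    audio_root/<speaker>/<book>/<id>.flac in the MLS directory layout.
    """
    rows = []
    for line in lines:
        utt_id, text = line.rstrip("\n").split("\t", 1)
        speaker, book, _ = utt_id.split("_", 2)
        path = os.path.join(audio_root, speaker, book, utt_id + ".flac")
        rows.append((path, text.lower()))
    return rows
```

The real script would additionally download and extract the archive and write these rows out in whatever transcript format the training pipeline expects.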

nglehuy commented Dec 31, 2020

Hi @monatis, just for your information, we should train using subwords instead of characters for a performance boost
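A toy illustration of why subwords help (this is a greedy longest-match segmenter with a made-up vocabulary, not TensorFlowASR's actual tokenizer): the target sequence gets much shorter than with characters, which shortens the decoder's prediction length.

```python
def segment(text, vocab, max_len=8):
    """Greedy longest-match segmentation; falls back to single characters."""
    pieces, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:
                pieces.append(piece)
                i += l
                break
    return pieces

vocab = {"hell", "o ", "wor", "ld"}
print(segment("hello world", vocab))  # 4 subword tokens vs. 11 character tokens
```

Real subword vocabularies are learned from the training transcripts (e.g. with BPE or a unigram model) rather than hand-written like the toy `vocab` above.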

monatis commented Dec 31, 2020

Hi @usimarit, yeah, I know that training with subwords yields better performance, but in #92 I'm automatically generating an alphabet file for those who want to use characters anyway.
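The character-alphabet generation mentioned here can be sketched roughly like this (a guess at the idea behind #92, not its actual code; the function name is illustrative):

```python
def build_alphabet(transcripts):
    """Collect every distinct character across the transcripts, sorted,
    so the result can be written one character per line as an alphabet file."""
    return sorted(set("".join(transcripts)))
```

Deriving the alphabet from the data itself matters for languages like German, where characters such as ä, ö, ü, and ß are absent from an English alphabet file.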

nglehuy commented Jan 26, 2021

Hi everyone, if you guys want to share your pretrained models: upload the .h5 or .pb files along with config.yml to any drive (Google, Dropbox, etc.), then add a section with the download link and your contact info to the README.md in the example directory that belongs to each model, like in the image below (the one I made):
[Screenshot: example pretrained-model section in a model's README.md]
And finally open a pull request to merge to the repo 😄
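Before uploading, the files the maintainer lists can be bundled into a single archive; a small sketch using the standard library (filenames here are placeholders, not a required layout):

```python
import os
import tarfile

def bundle(paths, out_path):
    """Pack the given files (e.g. config.yml plus the .h5/.pb weights)
    into one gzipped tarball, storing each under its base name."""
    with tarfile.open(out_path, "w:gz") as tar:
        for p in paths:
            tar.add(p, arcname=os.path.basename(p))
    return out_path
```

One archive is easier to mirror across drives and keeps the weights and the config.yml they were trained with from getting separated.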

monatis commented Jan 26, 2021

@usimarit Great. I've been quite busy for some time, but I'll be more active in this repo in the coming days and contribute pretrained models in other languages. Thanks!

JStumpp commented Feb 4, 2021

@monatis, did you already train a Conformer model in German?

monatis commented Feb 5, 2021

Hi @JStumpp I started to train it and hope to release it next week.

@christina284

Do you know which subset of LibriSpeech the English pretrained model was trained on?
