What should I do if I want to train a Japanese model? #219

Open
ymzlygw opened this issue Aug 23, 2021 · 3 comments

Comments


ymzlygw commented Aug 23, 2021

Hi. If I understand correctly, for English the model directly outputs character indices, which can then be mapped back to characters. For Japanese, what does the model output, and how do I create the mapping between indices and Japanese kanji?
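
The English case described above can be sketched as follows. This is only an illustration, not the library's actual code: the character list, the `encode`/`decode` helpers, and the choice of reserving index 0 for the blank token are all assumptions made for the example.

```python
# Hypothetical English character set: space, a-z, apostrophe.
ENGLISH_CHARACTERS = [" "] + [chr(c) for c in range(ord("a"), ord("z") + 1)] + ["'"]

# Assumption: index 0 is reserved for the blank token, so characters start at 1.
BLANK = 0
index_to_char = {i + 1: ch for i, ch in enumerate(ENGLISH_CHARACTERS)}
char_to_index = {ch: i for i, ch in index_to_char.items()}


def encode(text):
    """Turn a transcript into a list of character indices, skipping unknown characters."""
    return [char_to_index[ch] for ch in text.lower() if ch in char_to_index]


def decode(indices):
    """Turn model output indices back into text, dropping blanks."""
    return "".join(index_to_char[i] for i in indices if i != BLANK)
```

The same idea carries over to any language: the model only ever sees integer indices, and the vocabulary decides what each index means.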

ymzlygw (Author) commented Aug 24, 2021

I see the english_characters; what about Japanese? Also, to get the japanese_characters, should the token type be 'char' or 'bpe'?
ENGLISH_CHARACTERS = [a-z],

nglehuy (Collaborator) commented Oct 10, 2021

@ymzlygw I think for Japanese, Korean, and Chinese we should use subwords instead of characters. If you can define a vocabulary that contains all the characters of the language, as is done for English, then you can use character mode. As far as I know, those languages have characters that are combinations of "some characters in the alphabet", so I think defining a character vocabulary file would be a lot of work.
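
If character mode is still worth trying, one way to avoid listing every character by hand is to collect the characters that actually occur in the training transcripts. The snippet below is a minimal sketch under that assumption; `build_char_vocab`, the `min_count` parameter, the sample sentences, and the blank-at-index-0 convention are all hypothetical, and the format the library expects for its vocabulary file is not shown in this thread.

```python
from collections import Counter


def build_char_vocab(transcripts, min_count=1):
    """Collect the distinct characters seen in the transcripts.

    Characters are ordered by descending frequency, then by code point,
    so the resulting vocabulary is deterministic.
    """
    counts = Counter(ch for line in transcripts for ch in line.strip())
    kept = [ch for ch, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda ch: (-counts[ch], ch))


# Hypothetical sample transcripts; real data would come from the training set.
transcripts = ["今日は晴れです", "明日は雨です"]
vocab = build_char_vocab(transcripts)

# Assumption: index 0 is reserved for the blank token.
char_to_index = {ch: i + 1 for i, ch in enumerate(vocab)}
index_to_char = {i: ch for ch, i in char_to_index.items()}

encoded = [char_to_index[ch] for ch in "明日は晴れです"]
decoded = "".join(index_to_char[i] for i in encoded)
```

On a real corpus the vocabulary built this way can still run to thousands of kanji, which is the size problem that makes subwords attractive.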


psyma commented Feb 16, 2022

Hi, I tried to train a Chinese model, but the results do not seem good. I followed the same Conformer steps as for English. Could I get a suggestion on how to properly train a Chinese model? Thanks!
