
Huggingface already has an efficient implementation of this? #58

Open
laurislopata opened this issue Mar 19, 2024 · 3 comments

Comments

@laurislopata

When Karpathy claimed that an efficient implementation of the BPE training algorithm doesn't exist, I did some research and found this in Hugging Face's tokenizers library: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs

Isn't this exactly what Karpathy was creating?

laurislopata changed the title from "Huggingface already has an efficient implementation of this" to "Huggingface already has an efficient implementation of this?" Mar 19, 2024
@NLPV2011

Hugging Face's tokenizers don't support all non-English languages.

@AugustasMacijauskas

> Hugging Face's tokenizers don't support all non-English languages.

I too am convinced that HF already supports training a BPE tokenizer, but I'm relatively new to this, so could you elaborate? I thought that any text can be fed into their tokenizers and it just works?

@NLPV2011

> I too am convinced that HF already supports training a BPE tokenizer, but I'm relatively new to this, so could you elaborate? I thought that any text can be fed into their tokenizers and it just works?

I'm not sure; I think I need to implement a BPE tokenizer from scratch to keep things easy to use... You may like Karpathy's minbpe.
