
Huggingface already has an efficient implementation of this? #58

Open
laurislopata opened this issue Mar 19, 2024 · 3 comments

Comments

@laurislopata

When Karpathy claimed that an efficient implementation of the BPE training algorithm doesn't exist, I did some research and found this in Hugging Face's tokenizers library: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs

Isn't this exactly what Karpathy was creating?

laurislopata changed the title from "Huggingface already has an efficient implementation of this" to "Huggingface already has an efficient implementation of this?" Mar 19, 2024
@NLPV2011

Hugging Face's tokenizers don't support all non-English languages.

@AugustasMacijauskas

> Hugging Face's tokenizers don't support all non-English languages.

I too am convinced that HF already supports training a BPE tokenizer, but I'm relatively new to this, so could you elaborate? I thought that any text can be fed into their tokenizers and it just works?

@NLPV2011

> I too am convinced that HF already supports training a BPE tokenizer, but I'm relatively new to this, so could you elaborate? I thought that any text can be fed into their tokenizers and it just works?

I'm not sure; I think I need to implement a BPE tokenizer from scratch to keep things easy to use... You may like Karpathy's minbpe.
