
How to preprocess a large text dataset (approximately 80 GB) #51

Open
SandyPanda-MLDL opened this issue Jun 13, 2024 · 1 comment
@SandyPanda-MLDL

I am trying to preprocess a huge (non-English) text dataset following the code in preprocess.ipynb provided in the repo. To do so, I split the large dataset into smaller chunks of approximately 1.26 GB each and then tried to preprocess them. However, I am getting errors (segmentation faults, etc.) and am unable to complete the preprocessing for all the chunks. Can anyone suggest anything regarding this?

@SoshyHayami

You don't have to do it the way the author pre-processed their dataset. Just use the regular .map() and set num_proc to whatever your CPU can handle.
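A minimal sketch of that suggestion, assuming the corpus is loaded with the Hugging Face `datasets` library (as in preprocess.ipynb); `phonemize_example` is a hypothetical stand-in for whatever per-example preprocessing the notebook actually applies, and the file paths and `num_proc` value are placeholders:

```python
from datasets import load_dataset

def phonemize_example(example):
    # Replace this with the actual per-example preprocessing from preprocess.ipynb.
    example["processed_text"] = example["text"].lower()
    return example

# Load the raw text files as a single dataset (no manual chunking needed).
dataset = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]

# .map() with num_proc parallelizes the work across CPU processes and caches
# results to disk, so the full ~80 GB corpus never has to fit in memory at once.
processed = dataset.map(
    phonemize_example,
    num_proc=8,                 # set to the number of cores your machine can spare
    remove_columns=["text"],    # drop the raw column once it has been processed
)

processed.save_to_disk("processed_dataset")
```

If a worker still crashes on a malformed example, lowering num_proc or adding per-example error handling inside the map function usually makes the failure easier to localize than preprocessing hand-split chunks.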
