Option to remove cached dataset files on large runs #984

Open
isaac-chung opened this issue Jun 25, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@isaac-chung
Collaborator

isaac-chung commented Jun 25, 2024

Running all retrieval tasks can easily exhaust a machine's disk space, because loading a dataset stores its files in a cache directory (usually ~/.cache/huggingface/datasets). For example:

import mteb

# Loading every retrieval task downloads each dataset into the cache directory.
all_retrieval_tasks = mteb.get_tasks(task_types=["Retrieval"])
for task in all_retrieval_tasks:
    task.load_data()
...

Suggestion

  1. Add an option within evaluate to call the dataset's cleanup_cache_files method once a task finishes, or
  2. Implement __enter__() and __exit__() on AbsTask, with __exit__() calling cleanup_cache_files, so that a task can be used as a context manager.
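The second suggestion could look roughly like the sketch below. `CacheCleanupMixin` is a hypothetical name, and the code assumes the task keeps its loaded data in a `dataset` attribute that exposes `cleanup_cache_files()`, as Hugging Face `Dataset` and `DatasetDict` objects do. Note that, as far as the `datasets` documentation describes it, `cleanup_cache_files()` removes stale cache files but keeps the one currently in use, so it may not free all of the space a dataset occupies.

```python
class CacheCleanupMixin:
    """Hypothetical mixin that lets a task be used in a `with` block."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Assumption: the task stores its loaded data in `self.dataset`.
        dataset = getattr(self, "dataset", None)
        cleanup = getattr(dataset, "cleanup_cache_files", None)
        if callable(cleanup):
            cleanup()
        # Returning False lets any exception raised inside the block propagate.
        return False
```

A task class that inherits from this mixin could then be used as `with task: task.load_data()`, and the cleanup would run even if evaluation fails partway through.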

CC @imenelydiaker (related to the script we have)

isaac-chung added the enhancement label on Jun 25, 2024
@KennethEnevoldsen
Contributor

A simple option would be to add a CLI flag:

mteb run ... --disable-datasets-caching

which would call the following:

from datasets import disable_caching
disable_caching()
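One way such a flag could be wired up is sketched below. The parser here is a hypothetical stand-in, not the real `mteb run` parser, and `apply_caching_option` is an illustrative helper name. It is also worth verifying what the flag would actually save: as far as I know, `disable_caching()` governs the cache files written by transforms such as `map`, and may not stop `load_dataset` from writing downloaded files to the cache directory.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real `mteb run` argument parser.
    parser = argparse.ArgumentParser(prog="mteb-run-sketch")
    parser.add_argument(
        "--disable-datasets-caching",
        action="store_true",
        help="Call datasets.disable_caching() before running tasks.",
    )
    return parser


def apply_caching_option(args: argparse.Namespace) -> bool:
    """Disable datasets caching if the flag is set; return whether it was disabled."""
    if not args.disable_datasets_caching:
        return False
    # Imported lazily so that the default code path is unaffected.
    from datasets import disable_caching

    disable_caching()
    return True
```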

We might additionally add an argument to run:

evaluation = mteb.MTEB(...)

evaluation.run(..., automatically_clean_up_cache=True)  # on or off by default? On would be more stable but also more invasive

This would automatically clean up the cache when there is not enough disk space.
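The "clean up only when space is short" behaviour could be sketched as follows. The function name, the `min_free_bytes` threshold, and the `clean_up` callback are all assumptions for illustration; nothing like this exists in mteb as far as this thread shows. The callback could be, for example, a function that calls `cleanup_cache_files()` on the datasets of tasks that have already finished.

```python
import shutil
from pathlib import Path
from typing import Callable


def clean_up_if_low_on_space(
    cache_dir: Path,
    min_free_bytes: int,
    clean_up: Callable[[], None],
) -> bool:
    """Run `clean_up` if the disk holding `cache_dir` has less than `min_free_bytes` free.

    Returns True when cleanup was triggered, False otherwise.
    """
    # `cache_dir` must exist; disk_usage reports on the filesystem that contains it.
    free_bytes = shutil.disk_usage(cache_dir).free
    if free_bytes >= min_free_bytes:
        return False
    clean_up()
    return True
```

Checking between tasks rather than after every task keeps the default behaviour non-invasive: caches are left alone unless the run is actually at risk of failing.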

@imenelydiaker
Contributor

I would also go for an option in the CLI!
