
Implements check on existing and new datasets #1049

Open
KennethEnevoldsen opened this issue Jul 5, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@KennethEnevoldsen
Contributor

We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036).

We can naturally fix these as they arise, but it would be ideal to have a test that, for each dataset, checks whether it is of high quality. These checks could e.g. include:

  1. Checking that there are no empty documents
  2. Checking that the task contains no duplicates
  3. Checking leakage between train and test sets
  4. Optionally we could add the existing computed metrics here as well (e.g. avg. length)
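A minimal sketch of what checks 1–4 might look like (all function names are hypothetical, not from the mteb codebase):

```python
def find_empty_documents(docs: list[str]) -> list[int]:
    """Check 1: return indices of empty or whitespace-only documents."""
    return [i for i, d in enumerate(docs) if not d or not d.strip()]


def find_duplicates(docs: list[str]) -> set[str]:
    """Check 2: return documents that appear more than once."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for d in docs:
        if d in seen:
            dupes.add(d)
        seen.add(d)
    return dupes


def find_train_test_leakage(train_docs: list[str], test_docs: list[str]) -> set[str]:
    """Check 3: return documents present in both the train and test splits."""
    return set(train_docs) & set(test_docs)


def avg_length(docs: list[str]) -> float:
    """Check 4: average document length in characters."""
    return sum(len(d) for d in docs) / len(docs) if docs else 0.0
```

Each check returns the offending items rather than raising, so a test harness could report all problems for a dataset in one pass.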

For a specific dataset/revision, we can then compute these metrics and write them to a file.
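One way to persist the results could be a JSON file per dataset/revision (the layout and function name below are assumptions, not an existing convention):

```python
import json
from pathlib import Path


def write_quality_report(
    dataset_name: str,
    revision: str,
    metrics: dict,
    out_dir: str = "quality_reports",  # hypothetical output directory
) -> Path:
    """Write computed quality metrics to <out_dir>/<dataset_name>/<revision>.json."""
    path = Path(out_dir) / dataset_name / f"{revision}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))
    return path
```

Keying the file on the revision means a re-run only needs to recompute metrics when the dataset revision changes.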

Other tests, such as checking that the languages match, could also be added in the future.
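A language check could be kept library-agnostic by injecting the detector as a callable (e.g. langdetect.detect); this sketch and its parameters are an assumption, not an existing API:

```python
from typing import Callable


def find_language_mismatches(
    docs: list[str],
    expected_lang: str,
    detect: Callable[[str], str],  # any text -> ISO language code function
    sample_size: int = 100,  # only sample a subset to keep the check cheap
) -> list[int]:
    """Return indices of sampled documents whose detected language
    differs from the dataset's declared language."""
    return [
        i
        for i, d in enumerate(docs[:sample_size])
        if detect(d) != expected_lang
    ]
```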

@Muennighoff
Contributor

Agreed that such tests would be great!
