
Implements check on existing and new datasets #1049

Open
KennethEnevoldsen opened this issue Jul 5, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@KennethEnevoldsen
Contributor

We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036).

We can naturally fix these as they arise, but it would be ideal to have a test that, for each dataset, checks whether it is of high quality. These checks could e.g. include:

  1. Checking that there are no empty documents
  2. Checking that the task contains no duplicates
  3. Checking leakage between train and test sets
  4. Optionally we could add the existing computed metrics here as well (e.g. avg. length)
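A minimal sketch of what checks 1–4 might look like (all function names are hypothetical, not from the mteb codebase):

```python
def find_empty_documents(docs: list[str]) -> list[int]:
    """Check 1: return indices of empty or whitespace-only documents."""
    return [i for i, d in enumerate(docs) if not d or not d.strip()]


def find_duplicates(docs: list[str]) -> set[str]:
    """Check 2: return documents that appear more than once."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for d in docs:
        if d in seen:
            dupes.add(d)
        seen.add(d)
    return dupes


def find_train_test_leakage(train_docs: list[str], test_docs: list[str]) -> set[str]:
    """Check 3: return documents present in both the train and test splits."""
    return set(train_docs) & set(test_docs)


def avg_length(docs: list[str]) -> float:
    """Check 4: average document length in characters."""
    return sum(len(d) for d in docs) / len(docs) if docs else 0.0
```

Each check returns the offending items rather than raising, so a test harness could report all problems for a dataset in one pass.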

For a specific dataset/revision, we can then compute these metrics and write them to a file.
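One way to persist the results could be a JSON file per dataset/revision (the layout and function name below are assumptions, not an existing convention):

```python
import json
from pathlib import Path


def write_quality_report(
    dataset_name: str,
    revision: str,
    metrics: dict,
    out_dir: str = "quality_reports",  # hypothetical output directory
) -> Path:
    """Write computed quality metrics to <out_dir>/<dataset_name>/<revision>.json."""
    path = Path(out_dir) / dataset_name / f"{revision}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))
    return path
```

Keying the file on the revision means a re-run only needs to recompute metrics when the dataset revision changes.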

Other tests, such as checking that the languages match, could also be added in the future.
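A language check could be kept library-agnostic by injecting the detector as a callable (e.g. langdetect.detect); this sketch and its parameters are an assumption, not an existing API:

```python
from typing import Callable


def find_language_mismatches(
    docs: list[str],
    expected_lang: str,
    detect: Callable[[str], str],  # any text -> ISO language code function
    sample_size: int = 100,  # only sample a subset to keep the check cheap
) -> list[int]:
    """Return indices of sampled documents whose detected language
    differs from the dataset's declared language."""
    return [
        i
        for i, d in enumerate(docs[:sample_size])
        if detect(d) != expected_lang
    ]
```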

@Muennighoff
Contributor

Agreed that such tests would be great!
