We currently find a lot of inconsistencies in added datasets (e.g. #1043, #407, #1036)
We can naturally fix these as they arise, but it would be ideal to have a test that checks, for each dataset, whether it is of high quality. These checks could e.g. include:
- Checking that there are no empty documents
- Checking that the task contains no duplicates
- Checking for leakage between train and test sets
- Optionally, we could add the existing computed metrics here as well (e.g. avg. length)
We can then write a file for a specific dataset / revision to compute these metrics.
Other tests, such as checking whether the language matches, could also be added in the future.
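A minimal sketch of what such checks could look like, assuming a dataset is a dict of splits where each document has a `"text"` field (the data layout, function names, and toy data below are all hypothetical, not the actual mteb dataset format):

```python
# Hypothetical per-dataset quality checks; the dataset structure
# (splits as lists of {"text": ...} dicts) is an assumption.

def find_empty_documents(split):
    """Return indices of documents whose text is empty or whitespace-only."""
    return [i for i, doc in enumerate(split) if not doc["text"].strip()]

def find_duplicates(split):
    """Return the set of texts that appear more than once in a split."""
    seen, dupes = set(), set()
    for doc in split:
        if doc["text"] in seen:
            dupes.add(doc["text"])
        seen.add(doc["text"])
    return dupes

def find_train_test_leakage(train, test):
    """Return texts that occur in both the train and the test split."""
    train_texts = {doc["text"] for doc in train}
    return {doc["text"] for doc in test if doc["text"] in train_texts}

def avg_length(split):
    """Average character length of documents (an example computed metric)."""
    return sum(len(doc["text"]) for doc in split) / len(split)

# Toy example illustrating the checks on a deliberately flawed dataset
dataset = {
    "train": [{"text": "alpha"}, {"text": "beta"}, {"text": "beta"}],
    "test": [{"text": "beta"}, {"text": "gamma"}, {"text": "  "}],
}
print(find_empty_documents(dataset["test"]))                      # [2]
print(find_duplicates(dataset["train"]))                          # {'beta'}
print(find_train_test_leakage(dataset["train"], dataset["test"])) # {'beta'}
```

A test could then run these checks for a given dataset and revision and fail (or warn) when any of them flags a problem.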