BERT eval set contains 60 empty articles #473

Open

matthew-frank opened this issue May 8, 2021 · 3 comments
@matthew-frank (Contributor)

PR #435 contains a script, cleanup_scripts/separate_test_set.py, that is used to randomly extract articles from the training set for use as an evaluation set. A total of 10000 articles are extracted from the training set into the eval set. Unfortunately there's a bug in seperate_test_set.py that causes 60 of the 10000 extracted articles to be empty.

The seperate_test_set.py script was later adopted in PR #470 as input_preprocessing/seperate_test_set.py, so it needs to be fixed there as well.

The problem is that on line 75 (see PR #435 and PR #470) the boundaries between articles are found using Python's split('\n\n'). But this produces an empty entry at the end of the resulting array. Then, when a random array entry is selected on line 79, there's approximately a 0.5% chance of selecting that last (empty) entry.
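
For illustration, the trailing-delimiter behavior can be reproduced with a toy string (the corpus below is hypothetical, not taken from the actual data):

```python
# Minimal reproduction of the split behavior described above; the corpus
# string is a hypothetical stand-in for the real training-set file.
corpus = "First article text.\n\nSecond article text.\n\n"

articles = corpus.split('\n\n')
print(articles)       # ['First article text.', 'Second article text.', '']
print(len(articles))  # 3 -- the trailing delimiter yields an empty final entry
```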

Two ways of fixing the problem would be to (a) call pop() on line 75, or (b) change num_articles on line 79 to (num_articles - 1), so that the last entry can't be selected.
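
A rough sketch of both options, assuming the script samples an index with something like random.randrange; the variable names here are illustrative and won't match the script line-for-line (see the linked PRs for the real code):

```python
import random

# Illustrative sketch only; variable names and the exact sampling call in
# seperate_test_set.py may differ from what is shown here.
corpus = "First article text.\n\nSecond article text.\n\n"  # hypothetical stand-in
articles = corpus.split('\n\n')   # trailing delimiter leaves an empty last entry
num_articles = len(articles)

# Option (a): drop the trailing empty entry before sampling.
cleaned = articles[:-1]                        # equivalently, articles.pop()
chosen_a = cleaned[random.randrange(len(cleaned))]

# Option (b): leave the list alone but exclude the last index from sampling.
chosen_b = articles[random.randrange(num_articles - 1)]

print(repr(chosen_a), repr(chosen_b))          # both are always non-empty
```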

@johntran-nv added the language_model BERT NLP label on Nov 8, 2022
@johntran-nv (Contributor)

@sgpyc what do you think?

@itayhubara (Contributor)

The fix is straightforward, but recreating the eval set would require: (1) updating Google Drive and (2) checking that RCPs are not affected. Since this benchmark is pretty old, I think we should keep it as is and add this as a known bug to the documentation.

@matthew-frank (Contributor, Author)

Yes, unfortunately this was left unfixed in the churn before the v1.0 submission 1.5 years ago, so it is what it is, and it would not be productive to change the benchmark at this point. I'll submit a PR adding a small note about this to the bottom of the benchmark's README.md.

@matthew-frank assigned matthew-frank and unassigned sgpyc on Dec 7, 2022