BERT eval set contains 60 empty articles #473

Open

matthew-frank opened this issue May 8, 2021 · 3 comments
@matthew-frank (Contributor)

PR #435 contains a script, cleanup_scripts/separate_test_set.py, that is used to randomly extract articles from the training set for use as an evaluation set. A total of 10000 articles are extracted from the training set into the eval set. Unfortunately there's a bug in seperate_test_set.py that causes 60 of the 10000 extracted articles to be empty.

The seperate_test_set.py script was later adopted in PR #470 as input_preprocessing/seperate_test_set.py, so it needs to be fixed there as well.

The problem is that on line 75 (see PR #435 and PR #470) the boundaries between articles are found using Python's split('\n\n'). But this produces an empty entry at the end of the resulting array. Then, when a random array entry is selected on line 79, there's approximately a 0.5% chance of selecting that last (empty) entry.
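
For illustration, the trailing-delimiter behavior can be reproduced with a toy string (the corpus below is hypothetical, not taken from the actual data):

```python
# Minimal reproduction of the split behavior described above; the corpus
# string is a hypothetical stand-in for the real training-set file.
corpus = "First article text.\n\nSecond article text.\n\n"

articles = corpus.split('\n\n')
print(articles)       # ['First article text.', 'Second article text.', '']
print(len(articles))  # 3 -- the trailing delimiter yields an empty final entry
```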

Two ways of fixing the problem would be to (a) call pop() on line 75, or (b) change num_articles on line 79 to (num_articles - 1), so that the last entry can't be selected.
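
A rough sketch of both options, assuming the script samples an index with something like random.randrange; the variable names here are illustrative and won't match the script line-for-line (see the linked PRs for the real code):

```python
import random

# Illustrative sketch only; variable names and the exact sampling call in
# seperate_test_set.py may differ from what is shown here.
corpus = "First article text.\n\nSecond article text.\n\n"  # hypothetical stand-in
articles = corpus.split('\n\n')   # trailing delimiter leaves an empty last entry
num_articles = len(articles)

# Option (a): drop the trailing empty entry before sampling.
cleaned = articles[:-1]                        # equivalently, articles.pop()
chosen_a = cleaned[random.randrange(len(cleaned))]

# Option (b): leave the list alone but exclude the last index from sampling.
chosen_b = articles[random.randrange(num_articles - 1)]

print(repr(chosen_a), repr(chosen_b))          # both are always non-empty
```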

@johntran-nv added the language_model BERT NLP label on Nov 8, 2022
@johntran-nv (Contributor)

@sgpyc what do you think?

@itayhubara (Contributor)

The fix is straightforward, but recreating the eval set would require: (1) updating Google Drive and (2) checking that RCPs are not affected. Since this benchmark is pretty old, I think we should keep it as is and add this as a known bug to the documentation.

@matthew-frank (Contributor, Author)

Yes, unfortunately this was left unfixed in the churn before the v1.0 submission 1.5 years ago, so it is what it is, and it would not be productive to change the benchmark at this point. I'll submit a PR adding a small note about this to the bottom of the benchmark's README.md.

@matthew-frank assigned matthew-frank and unassigned sgpyc on Dec 7, 2022