In our GitHub Actions CI, the MOM_file_parser unit tests will intermittently produce the following error when run over two PEs:
$ mpirun -n 2 ../../build/unit/MOM_unit_tests
<... output ...>
=== test_open_param_file_no_doc
=== test_open_param_file_no_doc
NOTE from PE 0: open_param_file: TEST_input has been opened successfully.
NOTE from PE 0: close_param_file: TEST_input has been closed successfully.
=== test_read_param_int
=== test_read_param_int
NOTE from PE 0: open_param_file: TEST_input has been opened successfully.
NOTE from PE 0: close_param_file: TEST_input has been closed successfully.
WARNING from PE 0: open_param_file: file TEST_input has already been opened. This should NOT happen! Did you specify the same file twice in a namelist?
FATAL from PE 1: open_param_file: Input file 'TEST_input' does not exist.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
This suggests a race condition related to either this specific test (test_read_param_int), the prior test (test_open_param_file_no_doc), or something more fundamental inside of open_param_file.
" is registered as open but has the wrong unit number!")
call MOM_error(WARNING, &
"open_param_file: file "//trim(filename)// &
" has already been opened. This should NOT happen!"// &
" Did you specify the same file twice in a namelist?")
reopened_file = .true.
endif ! unit numbers
enddo ! i
endif
if (any_across_PEs(reopened_file)) return
endif
The code block raising this issue should only trigger if CS%nfiles is positive. That should not be possible here, since param is a new local variable on the stack of test_read_param_int and the function is only called once. (Each rank does call the function, but CS should be local to the rank.)
There are potential issues inside the code block, since inquire() could detect a file created on the other rank, or an IO unit could be left open from a previous test. But since these checks should only run when nfiles is nonzero, it is confusing that they are running at all.
I don't yet know how to replicate this error, but would like to start tracking this issue as it happens in our CI.
Still no idea on what is causing this, but I can now replicate it on my home machine. If I launch an endless stream of jobs using all of the cores (6 in my case), then it eventually fails with the same error. The first attempt took 70 tries and the second took 170. Whatever this problem is, it's not isolated to the GitHub Actions nodes, although it is very infrequent.
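For anyone else trying to reproduce this, the rerun-until-failure approach can be sketched roughly as below. This is only an illustration, not the exact loop I used: the function name run_until_failure, the RUN_LOG variable, and the log file name are made up for this sketch, and it assumes mpirun exits with a nonzero status when a rank calls MPI_Abort.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it exits nonzero, then report
# how many attempts that took.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
# Returns 0 if a failure was observed, 1 if every attempt passed.
run_until_failure() {
    max_attempts=$1
    shift
    # Log file for the most recent run; override with RUN_LOG if needed.
    log=${RUN_LOG:-last_run.log}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Each run overwrites the log, so only the failing run's output remains.
        if ! "$@" > "$log" 2>&1; then
            echo "failed on attempt $attempt (output in $log)"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
    return 1
}
```

From the unit test directory this could be called as `run_until_failure 1000 mpirun -n 2 ../../build/unit/MOM_unit_tests`. The sketch runs attempts one after another; to keep every core busy, several copies can be started in parallel, each with its own RUN_LOG so the logs do not overwrite each other.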
The code quoted in the issue is from MOM6/src/framework/MOM_file_parser.F90, lines 152 to 171 in 53fdbc0.