Possible race condition in file parser and/or unit tests #189

Closed
marshallward opened this issue Aug 8, 2022 · 3 comments


@marshallward
Member

In our GitHub Actions CI, the MOM_file_parser unit tests will intermittently produce the following error when run over two PEs:

$ mpirun -n 2 ../../build/unit/MOM_unit_tests
<... output ...>

=== test_open_param_file_no_doc

=== test_open_param_file_no_doc
NOTE from PE     0: open_param_file: TEST_input has been opened successfully.
NOTE from PE     0: close_param_file: TEST_input has been closed successfully.

=== test_read_param_int

=== test_read_param_int
NOTE from PE     0: open_param_file: TEST_input has been opened successfully.
NOTE from PE     0: close_param_file: TEST_input has been closed successfully.

WARNING from PE     0: open_param_file: file TEST_input has already been opened. This should NOT happen! Did you specify the same file twice in a namelist?


FATAL from PE     1: open_param_file: Input file 'TEST_input' does not exist.

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

This suggests a race condition related to either this specific test (test_read_param_int), the prior test (test_open_param_file_no_doc), or something more fundamental inside open_param_file.

! Check that this file has not already been opened
if (CS%nfiles > 0) then
  reopened_file = .false.
  inquire(file=trim(filename), number=iounit)
  if (iounit /= -1) then
    do i = 1, CS%nfiles
      if (CS%iounit(i) == iounit) then
        call assert(trim(CS%filename(1)) == trim(filename), &
            "open_param_file: internal inconsistency! "//trim(filename)// &
            " is registered as open but has the wrong unit number!")
        call MOM_error(WARNING, &
            "open_param_file: file "//trim(filename)// &
            " has already been opened. This should NOT happen!"// &
            " Did you specify the same file twice in a namelist?")
        reopened_file = .true.
      endif ! unit numbers
    enddo ! i
  endif
  if (any_across_PEs(reopened_file)) return
endif

The code block that emits this warning should only execute if CS%nfiles is positive. This should not be possible, since param is a new local variable on the stack of test_read_param_int and the function is only called once. (Each rank does call the function, but CS should be local to the rank.)

There are potential issues inside the code block, since inquire() could detect a file created on the other rank, or an IO unit could be left open from a previous test. But since these checks should only run when nfiles is nonzero, it is confusing that they run at all.

I don't yet know how to replicate this error, but would like to start tracking this issue as it happens in our CI.

@marshallward
Member Author

Still no idea on what is causing this, but I can now replicate it on my home machine. If I launch an endless stream of jobs using all of the cores (6 in my case), then it eventually fails with the same error. The first attempt took 70 tries and the second took 170. Whatever this problem is, it's not isolated to the GitHub Actions nodes, although it is very infrequent.
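
For anyone who wants to try the same approach, here is a minimal sketch of a rerun-until-failure loop. The helper name `run_until_failure`, the `LOG_FILE` variable, and the default `run.log` path are all hypothetical, not part of MOM6 or its test scripts, and the attempt limit and rank count in the example invocation are only guesses to adapt to your machine.

```shell
#!/bin/sh
# Hypothetical helper (not part of MOM6): rerun a command until it fails
# or the attempt limit is reached.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
run_until_failure() {
    max_attempts=$1
    shift
    # Assumed log location; each run overwrites it, so after a failure it
    # holds the output of the failing run only.
    log_file=${LOG_FILE:-run.log}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if ! "$@" > "$log_file" 2>&1; then
            echo "failed on attempt $attempt"
            return 1
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
    return 0
}

# Example invocation with the command from the original report
# (left commented out; the limit of 500 is arbitrary):
# run_until_failure 500 mpirun -n 2 ../../build/unit/MOM_unit_tests
```

Because the failure took 70 and 170 tries in the two runs above, the attempt limit probably needs to be in the hundreds to have a reasonable chance of catching it.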

@Hallberg-NOAA
Member

We believe that this issue has been corrected by #419. Please reopen this issue if this behavior is found to re-occur.
