Inconsistent Results When "OpenMP Num Threads" is Greater Than 1 #19

Open
Calebsakhtar opened this issue Jul 17, 2024 · 12 comments

@Calebsakhtar (Contributor)

There seems to be an issue with threads reading and writing the same parts of memory at the same time.

Here are APCEMM outputs with two consecutive runs when using 8 threads:
Run1
Run2

I was not able to recreate the random jumps with OpenMP Num Threads set to 1, but I was with more threads.
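
For illustration only (this is not APCEMM code, and the field name is hypothetical), the kind of unsynchronised read-modify-write that could produce run-to-run jumps with more than one thread looks something like this:

```cpp
// Minimal sketch of the suspected failure mode: several OpenMP threads
// update shared memory without synchronisation, so the result depends on
// how their reads and writes happen to interleave on a given run.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> depth(4, 0.0);  // hypothetical shared field, stand-in only

    #pragma omp parallel for
    for (int step = 0; step < 1000000; ++step) {
        // Unsynchronised read-modify-write of shared memory: with more than
        // one thread, updates can interleave and be lost, so the printed
        // values change between otherwise identical runs.
        depth[step % 4] += 1.0e-6;
    }

    for (double d : depth)
        std::printf("%.9f\n", d);  // reproducible only with a single thread
    return 0;
}
```

With a single thread the loop is deterministic; with several threads, a reduction, an atomic update, or thread-private storage would be needed to make the result reproducible again.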

@sdeastham (Collaborator)

@Calebsakhtar - was this fixed by commit 85a56a3? More generally, do you still see this bug when running with >1 thread?

@Calebsakhtar (Contributor Author) commented Aug 21, 2024

@sdeastham Just to report that compiling APCEMM on commit 85a56a3 still results in the above bug. I will now attempt compilation on the latest commit, 618f20f.

@Calebsakhtar (Contributor Author)

Here are the instructions to replicate the behaviour reported above:

  1. Clone the APCEMM git repo
  2. Follow the README installation instructions from the repo
  3. Run example 3

Please note that this behaviour has been observed both in Docker on Windows 11 and on the Linux system of the Cambridge HPC.

@Calebsakhtar (Contributor Author)

@sdeastham Just to report that compiling APCEMM on the latest commit 618f20f still results in the above bug.

@sdeastham (Collaborator)

Thanks @Calebsakhtar! To confirm, is that the result when outputting the standard "depth" variable directly, or are you calculating a different kind of depth?

@Calebsakhtar (Contributor Author)

@sdeastham The standard depth variable straight from APCEMM!

@sdeastham (Collaborator)

Got it! OK - the issue is reproducible on our HPC (in fact, it looks much worse):

[image]

The largest effect seems to be on these diagnostic variables. Prognostic variables like ice mass show only very small differences (although these should still be nailed down: they shouldn't happen here, since there is in theory no randomness - the temperature perturbation is disabled for example 3):

[image]
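
(For illustration, not APCEMM code: one plausible source of such small differences is that even a correctly synchronised parallel sum can vary slightly between runs, because floating-point addition is not associative and the grouping of partial sums depends on how work is distributed across threads.)

```cpp
// Sketch of run-to-run variation from reduction order alone, with no data
// race and no explicit randomness. Values and names are hypothetical.
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical per-cell values, stand-in only.
    std::vector<double> ice_mass(1000000);
    for (std::size_t i = 0; i < ice_mass.size(); ++i)
        ice_mass[i] = 1.0e-12 * static_cast<double>(i % 977) + 1.0e-20;

    double total = 0.0;
    // Each thread accumulates a private partial sum; which iterations each
    // thread receives (and hence the grouping of the additions) can vary
    // from run to run under dynamic scheduling.
    #pragma omp parallel for reduction(+:total) schedule(dynamic, 1000)
    for (long i = 0; i < static_cast<long>(ice_mass.size()); ++i)
        total += ice_mass[i];

    // With more than one thread the last digits can differ between runs;
    // a single-threaded run is bit-for-bit reproducible.
    std::printf("%.17g\n", total);
    return 0;
}
```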

@michaelxu3 any thoughts you might have on the origin would be appreciated! In any case, I'll try to drill down and see if there's an obvious cause of this behaviour.

@Calebsakhtar - can you confirm that this behaviour remains/disappears when (see the sketch after this list for how these settings interact):

  • The number of threads is set to 1 in input.yaml (but OMP_NUM_THREADS remains at 8)?
  • The number of threads in the environment is set to 1 (export OMP_NUM_THREADS=1) (but input.yaml still lists 8)?
  • Both the core number in input.yaml and OMP_NUM_THREADS are set to 1?
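
For reference, here is a minimal sketch of the relevant OpenMP precedence rule, assuming (not confirmed) that APCEMM forwards the "OpenMP Num Threads" value from input.yaml to omp_set_num_threads; a runtime call overrides the environment variable, which would make the YAML setting win whenever the two disagree:

```cpp
// Sketch of OpenMP precedence (assumption: the input.yaml value is passed
// to omp_set_num_threads somewhere in the code). The variable name below
// is a hypothetical stand-in.
#include <cstdio>
#include <omp.h>

int main() {
    const int yaml_threads = 8;        // stand-in for the input.yaml value

    // A runtime call to omp_set_num_threads() takes precedence over the
    // OMP_NUM_THREADS environment variable for subsequent parallel regions.
    omp_set_num_threads(yaml_threads);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}
```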

@sdeastham (Collaborator)

@Calebsakhtar Also, was the profile you showed in the original post for Example 3 or for a different case? If it's example 3, that raises the question of why our profiles are so different (even setting aside the noise).

@Calebsakhtar (Contributor Author)

@sdeastham The profile I showed was for one of the cases with my custom met conditions, not any of the examples. Sorry for not specifying this sooner.

@Calebsakhtar (Contributor Author)

@sdeastham It will take me a while to confirm the other two cases, but at this time I can confirm that setting export OMP_NUM_THREADS=1 and specifying one core in the input.yaml file makes the bug disappear.

@Calebsakhtar (Contributor Author) commented Sep 7, 2024

@sdeastham Finally got around to finishing the HPC runs.

Here are the results:

  • When OpenMP Num Threads (positive int) is set to 8 in input.yaml, the bug appears regardless of the value of OMP_NUM_THREADS and --cpus-per-task in my SLURM script.
  • When OpenMP Num Threads (positive int) is set to 1 in input.yaml, the bug disappears regardless of the value of OMP_NUM_THREADS and --cpus-per-task in my SLURM script.

@sdeastham (Collaborator)

Well, that is odd... thanks @Calebsakhtar! I'll see if I can figure out what is going on.
