Polaris thread test failing on Perlmutter-CPU with Intel #6515

Open
xylar opened this issue Jul 19, 2024 · 5 comments

@xylar
Contributor

xylar commented Jul 19, 2024

As reported in E3SM-Project/polaris#205, we are seeing failures in the Polaris test:

baroclinic_channel/10km/threads

when running on Perlmutter-CPU with Intel. Differences between runs with 1 and 2 threads are at machine precision but not zero.
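To make "at machine precision but not zero" concrete, here is a minimal sketch of a bit-for-bit check between two sets of output values. It is not how Polaris does its comparison; `compare_fields` and the sample values are hypothetical and only illustrate the distinction between an exact match and a tiny nonzero difference.

```python
import struct


def compare_fields(a, b):
    """Compare two equal-length sequences of floats.

    Returns (bit_for_bit, max_abs_diff). The bit-for-bit flag compares the
    raw 64-bit patterns, so it is stricter than a tolerance-based check.
    """
    if len(a) != len(b):
        raise ValueError("fields must have the same length")
    bit_for_bit = all(
        struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(a, b)
    )
    max_abs_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return bit_for_bit, max_abs_diff


# Hypothetical values standing in for output from 1-thread and 2-thread runs.
one_thread = [1.0, 2.0, 3.0]
# Perturb one entry by a single unit in the last place (2**-51 at 2.0).
two_threads = [1.0, 2.0 + 2.0**-51, 3.0]

bit_for_bit, max_abs_diff = compare_fields(one_thread, two_threads)
print(bit_for_bit, max_abs_diff)
```

A tolerance-based comparison would accept these two fields as equal, while a bit-for-bit test like the thread test would not.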

I used git bisect to determine that the PR that causes this to emerge is #6035, which set new weights for the split-explicit time stepping.

This thread test still passes with the previous PR, #5989, which introduced the Adams-Bashforth 2nd order time stepping scheme.

@xylar
Contributor Author

xylar commented Jul 19, 2024

@mark-petersen and @hyungyukang, do you have any ideas about why #6035 could be causing threading to be non-BFB? My hunch would be that it's some weird order of operations difference or something like that.
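As a generic illustration of the order-of-operations hunch (not a diagnosis of #6035), the sketch below shows how regrouping the same floating-point sum changes the last bit of the result. The `sum_partials` helper and the chunk layouts are hypothetical stand-ins for how a reduction might be split across threads.

```python
def sum_partials(chunks):
    """Sum each chunk left to right, then add the partial sums in order."""
    total = 0.0
    for chunk in chunks:
        partial = 0.0
        for value in chunk:
            partial += value
        total += partial
    return total


# Same three values, grouped two different ways.
serial = sum_partials([[0.1, 0.2, 0.3]])   # (0.1 + 0.2) + 0.3
split = sum_partials([[0.1], [0.2, 0.3]])  # 0.1 + (0.2 + 0.3)

print(serial == split, abs(serial - split))
```

Floating-point addition is not associative, so any change that alters how terms are grouped, whether from threading or from compiler optimizations, can produce differences of this size.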

While I know things are busy with Omega, this seems worth tracking down sooner or later because this could also affect production runs with E3SM on Perlmutter with Intel. I will run some thread testing with E3SM to see.

@hyungyukang
Contributor

@xylar, I'll take a look and run some tests. I agree with your intuition, but just to be sure, have you had a chance to run the same tests with the GNU (or NVIDIA or Cray) compiler?

@xylar
Contributor Author

xylar commented Jul 19, 2024

The test passes with gnu. Nvidia isn't yet supported.

@xylar
Contributor Author

xylar commented Jul 20, 2024

I tried PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-mach-pet with master and it passes. This suggests either an MPAS-Ocean standalone issue or a difference between the baroclinic channel configuration and production E3SM.

@xylar
Contributor Author

xylar commented Jul 21, 2024

I'm seeing the same issue in Polaris on Compy with Intel and Intel-MPI (so it seems to be a problem with Intel, but not with Intel on every machine).
