PIO_INTERNAL_ERROR with ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion #6486

Open
ndkeen opened this issue Jun 25, 2024 · 2 comments
Labels: Machine Files, PIO, pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments


ndkeen commented Jun 25, 2024

With ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion, the test uses 128 tasks on 1 node and was a little slow. I wanted to try using 2 nodes (256 tasks), but hit the error described here.

Note that, to still improve the speed of these tests, I went ahead with a PR to increase the task count to 192 (still using 2 nodes): after #6484, we are now using 192 tasks for all components.

To reproduce the error:

ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion

Note I don't see the error with an SMS test, so it seems to be related to writing restarts.
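
For reference, a minimal sketch of how I launch this from an E3SM checkout on pm-cpu (assuming the usual cime/scripts location; adjust project/account settings for your environment):

```
# Hypothetical reproduction steps, run from the E3SM source tree on pm-cpu.
cd cime/scripts

# The 256-task (2-node) variant that hits the PIO_INTERNAL_ERROR:
./create_test ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion

# The SMS variant does not hit it, which is what points at the restart writes:
# ./create_test SMS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion
```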

I was seeing:

128: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t.elm.r.1850-01-07-00000.nc, ncid=54) failed (Number of pending requests on file = 23, Number of variables with pending requests = 23, Number of request blocks = 3, Current block being waited on = 1, Number of requests in current block = 11).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-pelayout-minor-adjustment/externals/scorpio/src/clib/pio_darray_int.c: 2189)
128: Obtained 10 stack frames.
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e301c]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e325e]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e27f5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833a24]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1817bb5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833cd2]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1818b55]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17c5d60]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10ca3d5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10e2643]
128: MPICH ERROR [Rank 128] [job id 27028879.0] [Fri Jun 21 11:00:33 2024] [nid006900] - Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128

Jayesh asked: "Does increasing the number of I/O processes (./xmlchange PIO_NUMTASKS=16) fix the issue with ERS.hcru_hcru.I20TRGSWCNPRDCTCBC? Looks like an error from PnetCDF on the total size of the pending writes from a single process being > INT_MAX."
I've not tried that, but I don't think we would want to use it going forward.
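
If anyone wants to try it, applying that suggestion would look roughly like this (untried here; $CASEDIR is a placeholder for wherever create_test put the case):

```
# Hypothetical sketch: set more PIO aggregator tasks in the case created above.
cd $CASEDIR

# Ask PIO to use 16 I/O tasks so each one buffers a smaller share of the writes.
./xmlchange PIO_NUMTASKS=16

# Re-submit; a PIO layout change should not require a rebuild.
./case.submit
```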

@ndkeen
Copy link
Contributor Author

ndkeen commented Sep 3, 2024

I verified that changing PIO_STRIDE=64 (i.e., using more I/O writers: from 1 to 2 per pm-cpu node) is a potential work-around here.

There may be something about this case such that it wants to send a large message between I/O writers, and while there may be plenty of memory, the message is simply too large for the current algorithm. Breaking it up with more I/O writers, or simply using more MPI tasks for the case, may be the best option.
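
For anyone else hitting this, the work-around as I applied it looks roughly like the following (from the case directory; treat the exact stride as an example, not a recommendation):

```
# Hypothetical sketch of the work-around; $CASEDIR is a placeholder for the case directory.
cd $CASEDIR

# pm-cpu nodes have 128 cores and the default layout gave one I/O writer per node.
# A stride of 64 gives two writers per node, so each writer aggregates roughly
# half as much data, which appears to keep the per-writer request under INT_MAX.
./xmlchange PIO_STRIDE=64

./case.submit
```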

ndkeen added a commit that referenced this issue Sep 4, 2024
…next (PR #6581)

Currently, the tests for this resolution use 192 MPI's on pm-cpu which is an odd value (1.5 nodes).
Here it's being changed to use -3 (or 384 MPI's).

Example of test that would use this layout: SMS.hcru_hcru.IELM

This change is an effective work-around (but not a fix) for #6521, with #6486 in mind as noted below.

[bfb]

ndkeen commented Sep 4, 2024

#6581 was merged to master and the test above still passes.

It could be a simple case of needing to run with a minimum number of tasks.
It's not a memory issue; we just hit a problem in PIO with the size of some variable being over INT_MAX.
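
A rough back-of-the-envelope on the threshold (my estimate, not a measurement): INT_MAX is 2^31 - 1, about 2.1 GB. With 256 tasks and one I/O writer per 128-core pm-cpu node, there are only 2 writers, so each writer's aggregated pending restart writes need only exceed roughly 2.1 GB to hit err=-237. Doubling the writers (PIO_STRIDE=64) or running with more tasks roughly halves each writer's share, which is consistent with the work-arounds above.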
