PIO_INTERNAL_ERROR with ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion #6486

Open
ndkeen opened this issue Jun 25, 2024 · 2 comments
Labels: Machine Files, PIO, pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments


ndkeen commented Jun 25, 2024

With ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion, the test uses 128 tasks on 1 node and was a little slow. I wanted to try using 2 nodes (256 tasks), but hit the error described here.

Note that, to still improve the speed of these tests, I went ahead with a PR to increase the task count to 192 (still using 2 nodes): after #6484, we are now using 192 tasks for all components.

To reproduce the error:

ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion

Note I don't see the error with an SMS test, so it seems to be related to writing restarts.
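
For reference, a minimal sketch of how I launch this from an E3SM checkout on pm-cpu (assuming the usual cime/scripts location; adjust project/account settings for your environment):

```
# Hypothetical reproduction steps, run from the E3SM source tree on pm-cpu.
cd cime/scripts

# The 256-task (2-node) variant that hits the PIO_INTERNAL_ERROR:
./create_test ERS_P256.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion

# The SMS variant does not hit it, which is what points at the restart writes:
# ./create_test SMS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion
```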

I was seeing:

128: PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (./ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t.elm.r.1850-01-07-00000.nc, ncid=54) failed (Number of pending requests on file = 23, Number of variables with pending requests = 23, Number of request blocks = 3, Current block being waited on = 1, Number of requests in current block = 11).. Size of I/O request exceeds INT_MAX (err=-237). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-pelayout-minor-adjustment/externals/scorpio/src/clib/pio_darray_int.c: 2189)
128: Obtained 10 stack frames.
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e301c]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e325e]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17e27f5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833a24]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1817bb5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1833cd2]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x1818b55]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x17c5d60]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10ca3d5]
128: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/ERS.hcru_hcru.I20TRGSWCNPRDCTCBC.pm-cpu_intel.elm-erosion.C.20240621_095913_j1wl0t/bld/e3sm.exe() [0x10e2643]
128: MPICH ERROR [Rank 128] [job id 27028879.0] [Fri Jun 21 11:00:33 2024] [nid006900] - Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128

Jayesh asked: "Does increasing the number of I/O processes (./xmlchange PIO_NUMTASKS=16) fix the issue with ERS.hcru_hcru.I20TRGSWCNPRDCTCBC? Looks like an error from PnetCDF on the total size of the pending writes from a single process being > INT_MAX."
I've not tried that, but I don't think we would want to use it going forward.
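
If anyone wants to try it, applying that suggestion would look roughly like this (untried here; $CASEDIR is a placeholder for wherever create_test put the case):

```
# Hypothetical sketch: set more PIO aggregator tasks in the case created above.
cd $CASEDIR

# Ask PIO to use 16 I/O tasks so each one buffers a smaller share of the writes.
./xmlchange PIO_NUMTASKS=16

# Re-submit; a PIO layout change should not require a rebuild.
./case.submit
```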

@ndkeen
Copy link
Contributor Author

ndkeen commented Sep 3, 2024

I verified that changing PIO_STRIDE=64 (i.e., using more I/O writers: from 1 to 2 per pm-cpu node) is a potential work-around here.

There may be something about this case such that it wants to send a large message between I/O writers, and while there may be plenty of memory, the message is simply too large for the current algorithm. Breaking it up with more I/O writers, or simply using more MPI tasks for the case, may be the best option.
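
For anyone else hitting this, the work-around as I applied it looks roughly like the following (from the case directory; treat the exact stride as an example, not a recommendation):

```
# Hypothetical sketch of the work-around; $CASEDIR is a placeholder for the case directory.
cd $CASEDIR

# pm-cpu nodes have 128 cores and the default layout gave one I/O writer per node.
# A stride of 64 gives two writers per node, so each writer aggregates roughly
# half as much data, which appears to keep the per-writer request under INT_MAX.
./xmlchange PIO_STRIDE=64

./case.submit
```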

ndkeen added a commit that referenced this issue Sep 4, 2024
…next (PR #6581)

Currently, the tests for this resolution use 192 MPI's on pm-cpu which is an odd value (1.5 nodes).
Here it's being changed to use -3 (or 384 MPI's).

Example of test that would use this layout: SMS.hcru_hcru.IELM

This change is an effective work-around (but not a fix) for #6521, with #6486 in mind as noted below.

[bfb]

ndkeen commented Sep 4, 2024

#6581 was merged to master and the test above still passes.

It could be a simple case of needing to run with a minimum number of tasks.
It's not a memory issue; we just hit a problem in PIO with the size of some variable being over INT_MAX.
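
A rough back-of-the-envelope on the threshold (my estimate, not a measurement): INT_MAX is 2^31 - 1, about 2.1 GB. With 256 tasks and one I/O writer per 128-core pm-cpu node, there are only 2 writers, so each writer's aggregated pending restart writes need only exceed roughly 2.1 GB to hit err=-237. Doubling the writers (PIO_STRIDE=64) or running with more tasks roughly halves each writer's share, which is consistent with the work-arounds above.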
