
summit jsrun


JSRUN Issues

The following issues affect the execution of RCT on Summit.


Open

[CCS #401235] jsrun fails on concurrent execution

  • DESCRIPTION
  • CONTACT: Jack Morrison
  • PRIORITY: critical. This disables any reasonable RP execution on Summit.

[CCS #398323] jsrun stability issues

  • DESCRIPTION: We see a failure rate of about 1% for jsrun when it is used in quick succession over many tasks. That failure rate seems to increase with shorter tasks, so this might be a concurrency issue. The action on this one is on me to follow up with some experiments.
  • CONTACT: Jack Morrison
  • PRIORITY: high. This affects the reliability of workload executions. Failures are recoverable.
  • NOTES: possibly related to [CCS #399015]
  • TODO: plot failure rate against workload size (see the experiment sketch after this list).
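
A minimal sketch of such an experiment, assuming an active LSF allocation on Summit. The script name, task count, concurrency limit, and the trivial /bin/true payload are placeholders and not part of the actual RP test harness:

    #!/bin/bash
    # stress_jsrun.sh (hypothetical name): launch N short jsrun tasks in quick
    # succession and report the observed failure rate.
    # Assumes it runs inside an active LSF allocation on Summit.

    N=${1:-1000}       # number of jsrun invocations (placeholder)
    CONC=${2:-16}      # maximum number of concurrent invocations (placeholder)

    : > jsrun_failures.log

    for i in $(seq 1 "$N"); do
        # one resource set, one core, trivial payload
        ( jsrun -n 1 -c 1 /bin/true || echo "$i" >> jsrun_failures.log ) &

        # throttle to at most CONC concurrent jsrun instances
        while [ "$(jobs -rp | wc -l)" -ge "$CONC" ]; do
            sleep 0.1
        done
    done
    wait

    FAILED=$(wc -l < jsrun_failures.log)
    echo "failures: $FAILED / $N"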

[CCS #398579] clarification on jsrun resource files

  • DESCRIPTION: this issue with jsrun resource files has been acknowledged as a bug: -a is not evaluated in their context. Apparently the issue has been forwarded to IBM.
  • CONTACT: Thomas Papatheodore
  • PRIORITY: low. For the time being, we put a workaround in place.
  • RESOLUTION: RP switched to ERF and thus avoids -a (see the sketch below). We left the ticket open.
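
As a reference for that workaround, a minimal sketch: instead of combining a resource file with -a, the placement is expressed in an explicit resource file (ERF) and passed via --erf_input, as in the example further down this page. The file name, executable, and CPU indices are placeholders:

    # a two-rank ERF, one physical core (4 hardware threads) per rank
    echo '2 : {host: * ; cpu: {0-3},{4-7}}' > my_task.erf

    # launch without -a; the ERF fully describes the placement
    jsrun --erf_input my_task.erf ./my_executable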

[no ticket] jsrun core index is off

jsrun explicit resource file (ERF) allocates incorrect resources:
When using an ERF that requests cores on a compute node’s second socket (hardware threads 88-171), the set of cores allocated on the second socket is shifted upwards by one (1) physical core.

For example, the following ERF requests the first physical core on each socket:

    2 : {host: * ; cpu: {0-3},{88-91}}

jsrun currently shifts the second socket’s allocation by one (1) physical core, allocating 92-95 instead of the specified 88-91:

    $ jsrun --erf_input ERF_filename js_task_info | sort
    Task 0 ( 0/2, 0/2 ) is bound to cpu[s] 0-3 on host h36n03 with OMP_NUM_THREADS=1 and with OMP_PLACES={0:4}
    Task 1 ( 1/2, 1/2 ) is bound to cpu[s] 92-95 on host h36n03 with OMP_NUM_THREADS=1 and with OMP_PLACES={92:4}

  • DESCRIPTION: we hit an issue similar to this: specifying the first physical core in an ERF spec causes an error (sometimes fatal to the session). We currently work around this by marking the affected core as down, which limits the set of workloads we can run, but otherwise works as expected (see the sketch after this list).
  • PRIORITY: low. For the time being, we put a workaround in place. Should we open a ticket? The workaround wastes two physical cores per node (~2.5%).
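
For illustration, a sketch of the effect of that workaround, mirroring the ERF example above: with the first physical core of each socket marked as down, placements start at the second physical core of each socket (hardware threads 4-7 and 92-95), so the problematic indices are never requested. The file name is a placeholder:

    # placement with the first physical core of each socket blacklisted:
    # hardware threads 0-3 and 88-91 are never requested
    echo '2 : {host: * ; cpu: {4-7},{92-95}}' > shifted.erf

    jsrun --erf_input shifted.erf js_task_info | sort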

Closed

[CCS #399015] jsrun segfault and failure

  • DESCRIPTION: jsrun dumped core on a unit; all subsequent jsruns fail, even in new pilot instances, as they fail to contact the PMIx layer.
  • NOTES: a reproducer was provided.
  • CONTACT: George Markomanolis
  • PRIORITY: very high. This affects the reliability of workload executions. Workloads cannot recover from the failure.
  • DONE: test reproducibility with multiple users - issue was confirmed, reproducer is viable.
  • RESOLUTION: this disappeared with the switch to ERF. The issue was closed.

[CCS #398324] jsrun limits (PID limits on batch nodes)

  • DESCRIPTION: We can only run a certain number of jsruns (~1k) before hitting a process limit (~4k; each jsrun instance needs multiple processes). Jack will look into the limit and also investigate whether jsrun can be used from compute nodes, which should not have that limit. We would prefer the second option, as it also makes it easier to load-balance our software framework (see the back-of-the-envelope check after this list).
  • CONTACT: Jack Morrison
  • PRIORITY: high. This affects the scale at which RP can run workflows.
  • DONE: testing deployment of jsrun on nodes.
  • RESOLUTION: jsrun works on the compute nodes, we can scale out! This is solved.
  • REOPEN: this still works in general, but breaks for consecutive jsrun invocations.
  • RESOLUTION: this turned out to be unrelated to the execution of jsrun on compute nodes; it now also happens on batch nodes. The issue was closed.
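
A back-of-the-envelope check of the numbers above; the per-instance process count of 4 is an assumption used only to reproduce the ~1k estimate:

    # max user processes on the launch node (observed to be ~4k on batch nodes)
    NPROC_LIMIT=$(ulimit -u)

    # assumed number of processes each jsrun instance needs (placeholder)
    PROCS_PER_JSRUN=4

    # with a ~4k limit and ~4 processes per jsrun this yields ~1k concurrent
    # jsruns, matching the observed ceiling
    echo "max concurrent jsruns: $(( NPROC_LIMIT / PROCS_PER_JSRUN ))"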

[CCS #398187] non-purged shared filesystem

  • DESCRIPTION: We lack a world-accessible shared filesystem on the compute nodes to use for software deployment (e.g., ZMQ) - all shared file systems are regularly purged.
  • CONTACT: Brian Smith
  • RESOLUTION: Brian installed a software dependency as system module (i.e., ZMQ), which helps for now - but this is likely to pop up again. For the time being, we consider this issue closed.

[CCS #399012] clarification on jsrun resource files (2)

  • DESCRIPTION: resource files require uniform resources (cpu and gpu) over all resource sets, which limits the set of workloads we can execute.
  • REPLY: yes, needs to be uniform
  • CONTACT: George Markomanolis
  • RESOLUTION: look into ERF as an alternative resource specification format (see the sketch below). This issue is closed.
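
A sketch of why ERF helps here, following the ERF syntax shown earlier on this page: each rank can get a differently sized CPU set, which the uniform resource-file format cannot express. Whether mixed GPU counts can be expressed the same way is not shown here; indices and file name are placeholders:

    # two ranks with non-uniform CPU counts: 4 hardware threads for rank 0,
    # 8 hardware threads for rank 1
    echo '2 : {host: * ; cpu: {0-3},{88-95}}' > nonuniform.erf

    jsrun --erf_input nonuniform.erf js_task_info | sort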

[CCS #398578] request for jsrun clarification

  • DESCRIPTION: another issue with jsrun resource files: requesting 0 GPUs leads to the allocation of all GPUs on the target node. This has also been accepted as a bug and is apparently being worked on by IBM.
  • CONTACT: Thomas Papatheodore
  • PRIORITY: medium. It might affect running CPU/GPU-only CUs.
  • RESOLUTION: this appears to be resolved.

JSRUN Stress testing

  • purpose: determine error rate dependencies
  • parameters: pilot size, unit size, unit runtime, unit concurrency, spawn rate
  • test matrix
  • test script (sketched below)
  • analysis
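
A sketch of how the test script might sweep that matrix, assuming the stress loop sketched under [CCS #398323] is available as stress_jsrun.sh and extended to accept a task runtime and spawn rate; the script name and parameter values are placeholders:

    #!/bin/bash
    # sweep unit concurrency x unit runtime x spawn rate; each cell runs the
    # stress loop and records the observed failure rate for later analysis
    for CONC in 8 16 32 64; do              # unit concurrency
        for RUNTIME in 0 1 10; do           # unit runtime (seconds)
            for RATE in 1 10 100; do        # jsrun spawn rate (per second)
                ./stress_jsrun.sh 1000 "$CONC" "$RUNTIME" "$RATE" \
                    > "stress_c${CONC}_t${RUNTIME}_r${RATE}.out"
            done
        done
    done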
