
ticket_CCS_401235

Andre Merzky edited this page Mar 29, 2019 · 5 revisions

Dear Summit support,

Unfortunately, we seem to have hit a new jsrun problem. I am fairly confident that it only appeared very recently, about a week ago, since we have seen successful runs in that mode in the past.

A demonstrator for the problem is provided in /gpfs/alpine/bip178/scratch/merzky1/issue_399015/runme3/. You will see a test.bsub script there, which, when submitted, runs two shell scripts: test_fg.sh runs twice, followed by test_bg.sh. Both scripts run the same set of jsrun commands with a set of ERF resource files. Those ERF files result from a trace of our test workloads, and represent a heterogeneous bag of tasks. In this reproducer we just run sleep 1 to speed things up, but the failure mode seems independent of the application workload.

The first script (test_fg.sh) executes the jsruns sequentially, and that succeeds as expected. We run it a second time to show that the script itself leaves the allocation in a usable state.

The second script (test_bg.sh) runs some of the jsruns concurrently, then waits for them to finish, then runs a couple more concurrently, and so on. This mode fails.
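To make the pattern concrete, here is a minimal sketch of the batched background mode described above. It is not the actual test_bg.sh: the names LAUNCH, BATCH, wait_all and run_batched, as well as the ERF file names, are placeholders chosen for illustration, and the jsrun invocation mentioned in the comment is an assumption about how the ERF file is passed. By default the sketch only echoes each command, so it can be tried outside of an allocation.

```shell
#!/bin/sh
# Illustrative sketch only; this is NOT the actual test_bg.sh.
# LAUNCH, BATCH and the ERF file names are placeholders. On Summit one
# would presumably set LAUNCH="jsrun --erf_input" (assumed invocation);
# the default only echoes the command instead of launching anything.
LAUNCH=${LAUNCH:-echo}
BATCH=${BATCH:-4}

# Wait for every PID passed as an argument.
# Returns non-zero if any of the waited-for tasks failed.
wait_all() {
    rc=0
    for pid in "$@"; do
        wait "$pid" || rc=1
    done
    return "$rc"
}

# Launch one task per ERF file in the background, at most $BATCH at a
# time, and wait for each batch to finish before starting the next one.
run_batched() {
    pids=""
    count=0
    failed=0
    for erf in "$@"; do
        $LAUNCH "$erf" sleep 1 &
        pids="$pids $!"
        count=$((count + 1))
        if [ "$count" -ge "$BATCH" ]; then
            wait_all $pids || failed=1
            pids=""
            count=0
        fi
    done
    # Collect the last, possibly partial, batch.
    wait_all $pids || failed=1
    return "$failed"
}
```

Calling, for example, `run_batched task_000.erf task_001.erf task_002.erf` with BATCH=2 would start the first two tasks concurrently, wait for both, and then start the third. Waiting on each PID individually, instead of using a bare `wait`, also keeps the exit status of every task.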

We see two different failure modes. In the first mode, one of the jsruns hangs, so the wait never returns. This is shown in the outputs test.301459.err and test.301460.err: in both cases the job is killed when it times out.

The second error mode results in jsrun raising the error:

Error: Remote JSM server is not responding on host batch4
03-18-2019 08:24:20:151 30425 main: Error initializing RM connection. Exiting.

After that error, no subsequent jsrun succeeds; they all raise the same error message. An example of this mode is left in test.301458.err.

Note that the errors occur on a random subset of the jsruns, which leads us to believe that they are not related to the specifics of the respective ERF files. I did not attempt to investigate that more closely, though; please let me know if you want me to simplify the ERF files to narrow down the setup.

We would be happy to get advice on how to proceed. Please let us know if you have additional questions about the workload or our jsrun usage.

Best regards, Andre.
