summit jsrun

Andre Merzky edited this page Feb 14, 2019 · 28 revisions

[CCS #398324] jsrun limits (PID limits on batch nodes)

  • contact: Jack Morrison
  • We can only run a limited number of jsrun instances (~1k) before hitting a process limit (~4k), since each jsrun instance needs multiple processes. Jack will look into the limit and also investigate whether jsrun can be used from compute nodes, which should not have that limit. We would prefer the second option, as it also makes it easier to load-balance our software framework.
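The two figures above imply roughly four processes per jsrun instance (~4k / ~1k). That per-instance figure is an inference from this page, not a measured value. The following is a minimal sketch of that arithmetic, assuming the limit applies to simultaneously running processes; the function name and the `reserved` parameter are hypothetical and only serve the illustration.

```python
def max_launches(process_limit: int, procs_per_launch: int, reserved: int = 0) -> int:
    """Estimate how many launcher instances fit under a process limit.

    process_limit    -- process (PID) limit on the node, e.g. ~4000
    procs_per_launch -- processes each launcher instance needs (assumed, e.g. ~4)
    reserved         -- processes set aside for everything else (hypothetical knob)
    """
    if procs_per_launch <= 0:
        raise ValueError("procs_per_launch must be positive")
    # Integer division: a partially fitting instance does not count.
    return max(0, (process_limit - reserved) // procs_per_launch)


if __name__ == "__main__":
    # Rough figures from the ticket: ~4k processes, ~4 processes per jsrun.
    print(max_launches(4000, 4))
```

A framework could use such an estimate to throttle how many jsrun instances it keeps alive at once, staying below the limit instead of running into it.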

[CCS #398187] non-purged shared filesystem

  • contact: Brian Smith
  • We lack a world-accessible shared filesystem on the compute nodes for software deployment: all shared filesystems are regularly purged. Brian installed a software dependency as a system module, which helps for now, but this is likely to come up again.
  • For the time being, we consider this issue closed.

[CCS #398579] clarification on jsrun resource files

  • contact: Thomas Papatheodore
  • this issue with jsrun resource files has been acknowledged as a bug: -a is not evaluated in their context. Apparently the issue has been forwarded to IBM.

[CCS #398578] request for jsrun clarification

  • contact: Thomas Papatheodore
  • another issue with jsrun resource files, where requesting 0 GPUs leads to the allocation of all GPUs on the target node. This has also been accepted as a bug and is apparently being worked on by IBM.

[CCS #398323] jsrun stability issues

  • contact: Jack Morrison
  • We see a failure rate of about 1% for jsrun when invoking it in quick succession over many tasks. The failure rate seems to increase with shorter tasks, so this might be a concurrency issue. The action item is on me to follow up with some experiments.
  • possibly related to [CCS #399015]
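A follow-up experiment of the kind mentioned above could be sketched as follows. This is only an illustration: it launches an arbitrary command back to back and reports the fraction of non-zero exit codes. The function name is hypothetical, and the command used in the example is a placeholder that would be replaced by the actual jsrun invocation under study.

```python
import subprocess
import sys


def failure_rate(cmd, runs):
    """Run `cmd` `runs` times in quick succession; return the failed fraction."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        # Output is discarded; only the exit code is inspected.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures / runs


if __name__ == "__main__":
    # Placeholder command (a no-op Python process), not a jsrun call.
    print(failure_rate([sys.executable, "-c", "pass"], runs=5))
```

Repeating the measurement with tasks of different durations would show whether the rate really grows as tasks get shorter.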

[CCS #399012] clarification on jsrun resource files (2)

  • contact: George Markomanolis
  • resource files require uniform resources (CPU and GPU) across all resource sets, which limits the set of workloads we can execute.
  • reply: yes, resources need to be uniform (per the IBM documentation)
  • we consider this issue closed.
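Given that constraint, a workload can be checked for uniformity before a resource file is written for it. The sketch below assumes each resource set is described by a hypothetical `(cpus, gpus)` tuple; it does not parse or produce the actual jsrun resource file format.

```python
from typing import Iterable, Tuple


def is_uniform(resource_sets: Iterable[Tuple[int, int]]) -> bool:
    """Return True if all resource sets request the same (cpus, gpus) shape."""
    # Collect the distinct shapes; a uniform workload has at most one.
    shapes = set(resource_sets)
    return len(shapes) <= 1


if __name__ == "__main__":
    print(is_uniform([(4, 1), (4, 1)]))  # same shape everywhere
    print(is_uniform([(4, 1), (2, 0)]))  # mixed shapes
```

Workloads that fail this check would have to be split into separate uniform batches or launched in some other way.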

[CCS #399015] jsrun segfault and failure

  • contact: George Markomanolis
  • jsrun dumped core on a unit; all subsequent jsruns then fail, even in new pilot instances, because they cannot contact the PMIx layer.
  • reproducer provided