
Abstract

RP currently starts exactly one pilot agent instance per pilot job. There is no conceptual reason for this (a pilot could in principle host any number of agents, concurrently or sequentially); it is an implementation choice made for generality and simplicity. That choice now limits some use cases which (a) need to run different types of workload on the same resource with defined concurrency (co-scheduling pilot jobs would also solve this, but is not widely available), or (b) have workload requirements which vary over the lifetime of a pilot. Additionally (c), this design impacts scalability due to constraints on the pilot agent size.

This RFC proposes a generalization of this design by introducing pilot partitions: dynamic subdivisions of a pilot job, where each partition hosts its own pilot agent.

Pilot Partitions

A pilot partition is defined by (a) a subset of the resources acquired by a pilot job, and (b) exactly one pilot agent which manages those resources. Multiple partitions can co-exist on a pilot job's resources, but no part of those resources is shared between partitions (at least not explicitly). Partitions can further be created and destroyed on demand -- their lifetime is thus bound by, but otherwise independent from, the pilot job lifetime.

Configuration

RP's resource configuration files currently serve two distinct purposes: (a) configuration of resource access and system stack, and (b) configuration of the agent. RP does, however, also have a separate notion of an agent configuration, which is where the structure of the agent is defined (number of sub-agents, placement of agent components, etc.).

Introducing partitions as an intermediate layer between pilot job and pilot agent will benefit from a clearer separation of resource and agent configuration. The partition itself should not need static configuration files of its own, but only the dynamic, application-specified partition size (and potentially layout), as illustrated below.
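
For illustration, the separation could look as follows. This is a sketch only: the keys are hypothetical and do not reflect the current RP configuration schema.

resource_cfg = {                  # static: resource access and system stack
    'job_manager' : 'slurm',
    'filesystem'  : '/scratch',
}

agent_cfg = {                     # static: structure of the agent
    'sub_agents'  : 2,
    'components'  : ['scheduler', 'executor'],
}

partition_spec = {                # dynamic: specified by the application
    'cores'       : 16,
    'gpus'        : 4,
}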

API

Since partitions are entities with an independent lifetime, which can be created and destroyed dynamically and on demand, and on whose availability other components, such as the unit manager, will base decisions, it seems prudent to equip them with a state model.

(An agent, on the other hand, would not benefit from a state model of its own, as it is always bound to exactly one partition, and its lifetime is defined and constrained by that partition: the agent is created when the partition is created, and destroyed when the partition is destroyed. The agent's state transitions are thus never independent from the partition states, and it effectively shares the partition's state model.)

The partition states are as follows (a transition sketch follows the list):

  • NEW: initial state: the application requested the partition to be created
  • PENDING: the partition request is published, RP is waiting for the pilot to become active and to pick up that request
  • STARTING: the pilot is creating the partition and starts the agent on it
  • ACTIVE: the partition is alive, the agent can execute units
  • DONE | FAILED | CANCELED: final states
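
A minimal sketch of the resulting state model, encoding the transitions implied by the list above (illustrative only, not RP code):

# forward transitions between partition states; FAILED and CANCELED can
# additionally be reached from any non-final state
TRANSITIONS = {
    'NEW'      : ['PENDING'],
    'PENDING'  : ['STARTING'],
    'STARTING' : ['ACTIVE'],
    'ACTIVE'   : ['DONE', 'FAILED', 'CANCELED'],
}
FINAL_STATES = ['DONE', 'FAILED', 'CANCELED']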

Agent Requirements

RP users struggle to remember to add agent resource requirements to pilot job requests, and calculating partition sizes will only aggravate that problem. The addition of partitions thus presents an opportunity, but also a need, to reconsider that scheme. The obvious alternative is to add the agent requirements to partition requests automatically. A corollary of this approach is that the initial pilot job size is determined by RP, based on the initial partition specifications and their respective pilot configurations.

Consider though the following case: a pilot with two 30-node partitions is submitted to a Cray and becomes ACTIVE. On a Cray, each agent requires at least one node of its own, so the pilot job will run with 62 nodes. After some time, the application requests to terminate both partitions, and instead requests 3 partitions with 20 nodes each -- the pilot would now need 63 nodes in order to serve this request.
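
The node counts in this example derive as follows (a back-of-the-envelope sketch; the one-agent-node-per-partition overhead is specific to the Cray case described above):

AGENT_NODES = 1                      # per-partition agent overhead on a Cray

def pilot_size(partition_nodes):
    # pilot job size: requested partition nodes plus one agent node each
    return sum(n + AGENT_NODES for n in partition_nodes)

print(pilot_size([30, 30]))          # 62 nodes for the initial two partitions
print(pilot_size([20, 20, 20]))      # 63 nodes for the reconfigured request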

There does not seem to be a solution to this problem which completely avoids leaking the agent requirements through the partition abstraction. One way to alleviate the problem is to additionally allow more abstract specifications, like 50%, all GPUs, or remaining (meaning all resources which are not yet used by any other partition), where the RP-internal calculation of the actual partition size can again hide the agent requirements; a possible resolution is sketched below. Note that this problem will likely only surface in a minority of cases, namely those where (i) the number of partitions changes over the pilot lifetime, and (ii) the agent indeed has additional resource requirements. Upper-layer libraries like EnTK will also be able to reduce the scope of this problem.
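
For illustration, abstract size specifications might resolve to concrete node counts along these lines (the helper and its arguments are hypothetical; nodes_total and nodes_free stand for RM-level bookkeeping):

def resolve_partition_size(spec, nodes_total, nodes_free):
    # map an abstract partition specification to a concrete node count
    if spec == 'remaining':              # all nodes not yet used elsewhere
        return nodes_free
    if isinstance(spec, str) and spec.endswith('%'):
        return nodes_total * int(spec[:-1]) // 100
    return int(spec)                     # already a concrete node count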

API Proposal

Below is a proposal for how partitions are created and managed at the API level. It covers exemplary calls to:

  • define partitions via a rp.ComputePartitionDescription()
  • request partitions during pilot submission (initial configuration)
  • manage partition destruction and creation during pilot lifetime
  • inspect pilots and units for their partition relationships

Note that the changed relationship chain (pilot to partition to unit) may also have implications for radical.analytics.

pd1 = rp.ComputePartitionDescription()
pd1.cores   = 16
pd1.gpus    =  4
pd1.config  = ['orte']  # this selects an agent config

pd2 = rp.ComputePartitionDescription()
pd2.cores   = 8
pd2.gpus    = 2
pd2.config  = ['aprun']

pd = rp.ComputePilotDescription()
pd.resource = "local.localhost"
pd.parts    = [pd1, pd2]   # initial configuration, determines pilot job size
pd.runtime  = 60

pm    = rp.PilotManager()
pilot = pm.submit_pilots(pd)
parts = pilot.partitions   # inspection of defined partitions

for p in parts:
    print('%10s [%3d : %3d]: %s' % (p.uid, p.cores, p.gpus, p.state))

parts[0].wait(rp.ACTIVE)   # partition can serve units

# different ways to reconfigure the pilot into other partition setups
pilot.reconfig(stop=parts[0].uid)
pilot.reconfig(start=[pd2, pd2]) 

pilot.reconfig(stop=parts[0].uid, 
               start=[pd2, pd2]) 

pilot.reconfig(stop=parts[0].uid, 
               start=[rp.ComputePartitionDescription("50%")]) 
pilot.reconfig(start=[rp.ComputePartitionDescription(rp.FILL)]) 

pilot.reconfig(stop=rp.ALL, start=[pd2, pd2])            # WARNING (underutilized)
pilot.reconfig(stop=rp.ALL, start=[pd2, pd2, pd2])       # OK
pilot.reconfig(stop=rp.ALL, start=[pd2, pd2, pd2, pd2])  # ERROR   (overutilized)
                                                         # new partitions go in `FAILED` state

assert len(pilot.partitions) == 18    # most of these partitions are in `CANCELED` state

...

units.wait()
for unit in units:
    print('%s: %s' % (unit.uid, unit.part))  # print partition IDs

Implementation

An additional layer is needed between bootstrap_1 (first level in Python) and bootstrap_2 (the process which owns agent_0). That layer will receive and enact the respective partition management commands.

We already have a communication channel in place which carries commands from the PMGR to the pilot agent. That channel can also be used to communicate partition management commands to the partition management layer of the pilot job.
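
For illustration, the partition management layer could dispatch such commands roughly as follows (the command names, message layout, and data structures are placeholders, not the actual RP command vocabulary):

def handle_command(msg, partitions, node_pool):
    # dispatch a partition management command received via the PMGR channel
    if msg['cmd'] == 'part_start':
        size  = msg['arg']['nodes']
        nodes = [node_pool.pop() for _ in range(size)]  # carve out resources
        partitions[msg['arg']['uid']] = nodes           # then start an agent on them
    elif msg['cmd'] == 'part_stop':
        node_pool.extend(partitions.pop(msg['arg']['uid']))  # return resources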

The agent's RM layer will need an additional method which can subdivide a pilot job's resource set into partitions, according to the partition descriptions contained in the above requests. Since the RM proper is owned by the agent (which does not exist at that point), this functionality will have to be exposed as a static method. It can borrow large parts of the code from the RM's initialize() method, but that initialize() method will in turn have to be adapted to pick up partition configurations instead of pilot job configurations.
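
A static partitioning method could look roughly as follows (a minimal sketch; the real method would reuse the node-list handling currently performed in initialize()):

class RM(object):
    # hypothetical stand-in for the agent's resource manager class

    @staticmethod
    def partition_nodes(node_list, sizes):
        # subdivide the pilot job's node list into disjoint partitions of
        # the requested sizes; static, so it can run before any agent exists
        if sum(sizes) > len(node_list):
            raise ValueError('pilot too small for requested partitions')
        parts, idx = [], 0
        for size in sizes:
            parts.append(node_list[idx:idx + size])
            idx += size
        return parts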

The PMGR needs to be extended to derive the pilot job requirements from the initial set of partitions. It further needs to route partition management and inspection requests onto the respective communication channels.
