User Requirements

ashleyz edited this page Aug 8, 2013 · 19 revisions

Please answer the following questions in as much detail as possible:

  • how saga-pilot will be used in your research?
  • which features of saga-pilot will be most critical for your research?
  • which type of jobs are you planning to run with saga-pilot?
  • what are the performance expectations you have?
  • how critical is support for data handling capabilities?
  • (NEW) if you had to design and implement saga-pilot by yourself, guided only by your specific requirements, how would you do it?

Ashley

  • how saga-pilot will be used in your research?

For now, mostly with regards to figuring out scheduling algorithms -- may be application-level for SCIHM/rock physics/etc later on.

  • which features of saga-pilot will be most critical for your research?

Anything relating to the scheduling -- the ability to refine scheduling decisions iteratively with response to external information service + internal state changes would be nice.

  • which type of jobs are you planning to run with saga-pilot?

Mostly interested in theoretical scheduling at this point in time; may end up doing SCIHM/rock physics/etc applications, but for the most part fairly short-running "heterogeneous" jobs of varying length w/ data dependencies to start.
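
As an illustration of what "jobs with data dependencies" means for a scheduler, here is a minimal sketch that groups compute units into batches that can run concurrently. The CU names and the `ready_batches` helper are hypothetical and not part of saga-pilot; the sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workload: each CU maps to the set of CUs whose output it needs.
dependencies = {
    "prepare": set(),
    "sim_a": {"prepare"},
    "sim_b": {"prepare"},
    "analyze": {"sim_a", "sim_b"},
}


def ready_batches(deps):
    """Return lists of CUs; every CU in a batch has all its inputs available."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # pretend the whole batch finished
    return batches


for number, batch in enumerate(ready_batches(dependencies), start=1):
    print(number, batch)
```

A real scheduler would mark CUs as done one at a time as they finish, so that short jobs unblock their dependents without waiting for longer jobs in the same batch.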

  • what are the performance expectations you have?

Whatever is needed in order to schedule a "reasonable" application workload in terms of CUs -- still coming up with target applications for this. If submission/etc are slow, the planned ability to account for time spent will be useful.

  • how critical is support for data handling capabilities?

Extremely, at least with regards to scheduling data

  • (NEW) if you had to design and implement saga-pilot by yourself, guided only by your specific requirements, how would you do it?
  1. Start from BigJob (because I am just one person! :D), clean code, add programmatically accessible variables to control internal wait times/# of threads/etc
  2. Add bulk operations for submitting/executing CUs on pilots, removing the need for (several) roundtrips per CU execution
  3. Add more timing code to BigJob
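
The effect of bulk operations on round trips, together with the kind of timing code asked for above, can be sketched as follows. `InMemoryStore` is a hypothetical stand-in for the coordination service and is not BigJob's actual interface; it only counts how often the client would have to talk to the service.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for the coordination service; counts round trips."""

    def __init__(self):
        self.queue = []
        self.round_trips = 0

    def push(self, cu):
        self.round_trips += 1  # one round trip per CU
        self.queue.append(cu)

    def push_bulk(self, cus):
        self.round_trips += 1  # one round trip for the whole batch
        self.queue.extend(cus)


def timed(func, *args):
    """Call func(*args) and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def submit_each(store, cus):
    for cu in cus:
        store.push(cu)


def submit_bulk(store, cus):
    store.push_bulk(cus)


cus = [{"id": i, "executable": "/bin/date"} for i in range(100)]
per_cu_store, bulk_store = InMemoryStore(), InMemoryStore()
t_each = timed(submit_each, per_cu_store, cus)
t_bulk = timed(submit_bulk, bulk_store, cus)
print(f"per-CU: {per_cu_store.round_trips} round trips in {t_each:.6f}s")
print(f"bulk:   {bulk_store.round_trips} round trip in {t_bulk:.6f}s")
```

With an in-memory store the measured times are negligible; against a remote coordination service each round trip adds network latency, which is where the bulk variant would be expected to pay off.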

If I were to make it entirely from scratch, while keeping the above considerations in mind, I would develop the coordination layer first, possibly with distributed coordination and with support for bulk operations. This seems to be the major constraining factor on CU submission/throughput, so it sounds like the best place to start. I would next develop the pilot manager + agent components in tandem, with limited/small amounts of functionality (support to grab jobs, execute, add to local simple queue), but with all states/functions accessible via TROY (e.g. grab full queue from TROY, change CU execution order/remove CUs/add CUs -- is this fully covered under P* semantics?). After that, it would depend on what TROY would need with regard to scheduling/etc.

Large amounts of debugging output would be appreciated for the entire process, especially where control is given to SAGA-Pilot for extended periods of time e.g. pilot.wait().

What I would not want to do is use our own SAGA-Pilot-specific adaptors, which would introduce confusion along the lines of BJ's EC2 adaptor causing problems because it doesn't use SAGA-Python's; the same goes for SLURM, etc.

Mark

how saga-pilot will be used in your research?

Primary vehicle to put application workload on infrastructure.

which features of saga-pilot will be most critical for your research?

Semantically equivalent to P*. Support for all current CI, although that will come through saga-python hopefully. Support for stderr/stdout on CU level.

which type of jobs are you planning to run with saga-pilot?

All kinds. Given that I'm not an application owner, application workloads come and go.

what are the performance expectations you have?

In general, the pilot abstraction should not be the bottleneck for the scale of the infrastructure we work with and the typical application workload we support.

how critical is support for data handling capabilities?

Essential.

(NEW) if you had to design and implement saga-pilot by yourself, guided only by your specific requirements, how would you do it?

  • Would definitely start from scratch.

  • At a very high level, I would say that saga-pilot is a very narrowly scoped piece of functionality: being able to get an agent onto a resource (obviously using SAGA) and being able to phone home.

  • We would have a central, but not necessarily centralized "master".

  • Agents would be DCI-specific, of course making use of a common set of agent functionality. If the communication libraries allow it, this would in principle let us write agents in different languages.

  • The communication between agent(s) and master(s) would happen through a well-defined communication protocol, but the implementation should be abstracted away. The main reason for this is that different DCIs will have different communication limitations, and there will not be a one-size-fits-all solution.

  • There would be no notion of push/pull per se, we would have hooks where workers and agents would come together, and possibly pluggable "schedulers" to do the placement of CUs. So a queue would not be a fundamental pattern, per se.

  • I would take provenance as a principal design principle, as it would be critical for real production usage.

  • The core package would be a simple Python library that exposes the Pilot-API. Additional "richer" functionality would be built on top, as either services or software like TROY.

  • Agents can represent either a slice of a resource or a whole resource. Every "physical" resource would have at least one agent associated with it, to have a real hook into every system.
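
The idea of hooks plus pluggable "schedulers" instead of a fixed push/pull queue can be sketched as below. This is only an illustration under assumed names: `Master`, `register_agent`, `round_robin`, and `most_free_cores` are hypothetical and do not exist in saga-pilot. Here a scheduler is any callable that maps units and agents to a placement.

```python
from typing import Callable, Dict, List

# A scheduler is any callable: (units, agents) -> {unit id: agent id}.
Scheduler = Callable[[List[dict], List[dict]], Dict[str, str]]


def round_robin(units, agents):
    """Spread units over agents in registration order."""
    return {u["id"]: agents[i % len(agents)]["id"] for i, u in enumerate(units)}


def most_free_cores(units, agents):
    """Greedily place each unit on the agent with the most free cores."""
    free = {a["id"]: a["cores"] for a in agents}
    placement = {}
    for unit in units:
        target = max(free, key=free.get)
        placement[unit["id"]] = target
        free[target] -= unit.get("cores", 1)
    return placement


class Master:
    """Hypothetical master: agents phone home, a scheduler places the CUs."""

    def __init__(self, scheduler: Scheduler):
        self.scheduler = scheduler
        self.agents: List[dict] = []

    def register_agent(self, agent: dict) -> None:
        self.agents.append(agent)  # the "phone home" hook

    def place(self, units: List[dict]) -> Dict[str, str]:
        return self.scheduler(units, self.agents)


units = [{"id": "u1"}, {"id": "u2"}, {"id": "u3"}]
for scheduler in (round_robin, most_free_cores):
    master = Master(scheduler)
    master.register_agent({"id": "a1", "cores": 2})
    master.register_agent({"id": "a2", "cores": 8})
    print(scheduler.__name__, master.place(units))
```

Because the master only depends on the callable's signature, swapping the placement policy does not touch the agent or communication code.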

Melissa

Matteo

how saga-pilot will be used in your research?

Saga-pilot will be used by a workload manager within AIMES and TROY, and as the pilot layer for F*. The workload manager will offer capabilities to automatically define a pilot framework. The framework will be tailored to run the tasks of a given workload. The requirements for the pilot framework will be derived mainly by inspecting the characteristics of the tasks. Tasks will be grouped in 'stages' and, where applicable, will be related temporally and spatially. More information about the workload manager can be found in the AIMES and TROY wikis:

which features of saga-pilot will be most critical for your research?

Clean separation and free composition of the following functionalities:

  • Framework:
    • Describe pilot;
    • describe *unit;
    • bind *unit;
    • instantiate pilot;
    • submit *unit;
    • execute *unit.
  • *Unit control:
    • suspend *unit execution;
    • restart *unit execution.
  • *Unit inspection:
    • retrieve *unit status;
    • retrieve partial *unit output;
    • retrieve *unit output.
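
One way to read the list above is as a small state machine per unit, with the framework, control, and inspection operations as separate methods that can be composed freely. The following is a minimal sketch under that assumption: the state names, the `Unit` class, and the `write`/`finish` helpers are hypothetical, and "restart" is interpreted here as resuming a suspended unit.

```python
from enum import Enum


class State(Enum):
    DESCRIBED = "described"
    BOUND = "bound"
    SUBMITTED = "submitted"
    RUNNING = "running"
    SUSPENDED = "suspended"
    DONE = "done"


# operation -> {state it may be applied in: resulting state}
TRANSITIONS = {
    "bind": {State.DESCRIBED: State.BOUND},
    "submit": {State.BOUND: State.SUBMITTED},
    "execute": {State.SUBMITTED: State.RUNNING},
    "suspend": {State.RUNNING: State.SUSPENDED},
    "restart": {State.SUSPENDED: State.RUNNING},
    "finish": {State.RUNNING: State.DONE},
}


class Unit:
    def __init__(self, description):
        self.description = description  # "describe *unit"
        self.state = State.DESCRIBED
        self.pilot_id = None
        self._lines = []

    def _apply(self, operation):
        allowed = TRANSITIONS[operation]
        if self.state not in allowed:
            raise RuntimeError(f"cannot {operation} in state {self.state.name}")
        self.state = allowed[self.state]

    # Framework and control operations
    def bind(self, pilot_id):
        self._apply("bind")
        self.pilot_id = pilot_id

    def submit(self):
        self._apply("submit")

    def execute(self):
        self._apply("execute")

    def suspend(self):
        self._apply("suspend")

    def restart(self):
        self._apply("restart")

    def finish(self):
        self._apply("finish")

    def write(self, line):
        """Record a line of output; only a running unit produces output."""
        if self.state is not State.RUNNING:
            raise RuntimeError("unit is not running")
        self._lines.append(line)

    # Inspection operations
    def status(self):
        return self.state

    def partial_output(self):
        return list(self._lines)

    def output(self):
        if self.state is not State.DONE:
            raise RuntimeError("final output is only available once DONE")
        return "\n".join(self._lines)


unit = Unit({"executable": "/bin/hostname"})
unit.bind("pilot-1")
unit.submit()
unit.execute()
unit.write("step 1")
unit.suspend()
print(unit.status().name, unit.partial_output())
```

Keeping the transition table separate from the methods makes it easy to check a proposed operation set against P* semantics without touching the rest of the code.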

Possibly relevant even if, as discussed, we might want to implement a queue system as a 'service' separated from saga-pilot:

  • Describe a queue;
  • bind a queue to a pilot;
  • 'typical' queue operations (add, delete, list, suspend, etc).
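
A queue implemented as a service separate from the pilot layer could look roughly like the sketch below. `UnitQueue` and all of its methods are hypothetical names chosen for illustration; the sketch only covers the operations listed above (describe, bind, add, delete, list, suspend).

```python
from collections import deque


class UnitQueue:
    """Hypothetical queue service that lives outside the pilot layer."""

    def __init__(self, name):
        self.name = name  # "describe a queue"
        self.pilot_id = None
        self.suspended = False
        self._units = deque()

    def bind(self, pilot_id):
        self.pilot_id = pilot_id

    def add(self, unit_id):
        self._units.append(unit_id)

    def delete(self, unit_id):
        self._units.remove(unit_id)  # raises ValueError if not queued

    def list_units(self):
        return list(self._units)

    def suspend(self):
        self.suspended = True

    def resume(self):
        self.suspended = False

    def next_unit(self):
        """Hand the next unit to the bound pilot, or None if nothing is due."""
        if self.suspended or self.pilot_id is None or not self._units:
            return None
        return self._units.popleft()


queue = UnitQueue("stage-1")
queue.bind("pilot-1")
for unit_id in ("u1", "u2", "u3"):
    queue.add(unit_id)
queue.delete("u2")
print(queue.list_units())
```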

Multiple types of interface:

  • REST;
  • python API;
  • command line.

which type of jobs are you planning to run with saga-pilot?

Both synthetic and real-life workloads of the following type:

  • Bag of tasks;
  • replicas;
  • chained/coupled ensemble;
  • workflows.

what are the performance expectations you have?

Too soon to say from an AIMES/TROY/F* point of view? We know already that scalability will be a big deal - 100K tasks?

how critical is support for data handling capabilities?

Not critical at the moment, but basic data transfer capabilities are needed.

Antons
