Performance Tests

Architecture and implementation need to be performance-test driven so that the final product satisfies its performance requirements. Performance is measured in terms of scalability along three axes: number of application instances, number of tasks, and number of agents.

We define three test categories:

- Large number of cores owned by one agent, represented by one task queue (single HPC scenarios)
- Large number of cores owned by a small number of agents (multiple HPC scenarios)
  - through one aggregate task queue (saga-pilot only scenario)
  - through one task queue per agent (saga-pilot + TROY)
- Small number of cores owned by a large number of agents (OSG / Cloud scenarios)
  - through one aggregate task queue (saga-pilot only scenario)
  - through one task queue per agent (saga-pilot + TROY)

Each of these tests is to be fed by either one application instance or by many.
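The combinations of category, task queue layout, and number of application instances can be enumerated mechanically. The following is a minimal sketch, not part of saga-pilot: the identifiers (`SCENARIOS`, `APP_INSTANCES`, `test_matrix`) and the short labels are hypothetical names chosen to mirror the categories above.

```python
from itertools import product

# Hypothetical labels paraphrasing the test categories; not saga-pilot identifiers.
SCENARIOS = {
    "single_hpc": ["one_queue"],
    "multi_hpc": ["aggregate_queue", "queue_per_agent"],
    "osg_cloud": ["aggregate_queue", "queue_per_agent"],
}
APP_INSTANCES = ["one", "many"]


def test_matrix():
    """Return every (scenario, queue layout, application instances) combination."""
    configs = []
    for scenario, layouts in SCENARIOS.items():
        for layout, apps in product(layouts, APP_INSTANCES):
            configs.append((scenario, layout, apps))
    return configs


if __name__ == "__main__":
    for config in test_matrix():
        print(config)
```

Under these assumptions the matrix has ten entries: two for the single HPC category and four each for the other two.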

Tests shall be defined as soon as the user-facing REST API has been defined, and shall be run periodically during all stages of the implementation period. This ensures performance QoS and gives an early handle on overall scalability and on performance numbers and limitations.

Performance metrics are:

- time to bootstrap the saga-pilot service layer
- time to bootstrap the saga-pilot agent layer
- early binding: time to schedule 100k CUs
- late binding: time to schedule 100k CUs
- time to stage input files for 100k CUs
- time to execute (NEW->DONE) all 100k CUs
- time to stage output files for 100k CUs
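One way to collect these timings is to wrap each phase in a small wall-clock timer. This is a sketch under stated assumptions: `timed` and `timings` are hypothetical helpers, the metric name is made up for illustration, and the `time.sleep` call stands in for the real saga-pilot call that a test would make.

```python
import time
from contextlib import contextmanager

timings = {}  # metric name -> list of elapsed seconds, one entry per run


@contextmanager
def timed(metric):
    """Record the wall-clock duration of the enclosed block under `metric`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs remain visible.
        timings.setdefault(metric, []).append(time.perf_counter() - start)


if __name__ == "__main__":
    # Placeholder workload; a real test would call into saga-pilot here.
    with timed("schedule_cus_early_binding"):
        time.sleep(0.01)
    print(timings)
```

Keeping a list per metric makes it straightforward to repeat a test several times and inspect the spread of the results afterwards.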

For these metrics:

- measure the average and the variation
- understand the minimum, the maximum, and the causes of the variation
- determine the overhead, i.e. the time saga-pilot spends on anything other than CU execution
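The statistics listed above can be computed from repeated measurements with the Python standard library. The sketch below makes one assumption that the text does not spell out: overhead is taken to be the end-to-end time of a run minus the time attributed to CU execution. Both function names are hypothetical.

```python
import statistics


def summarize(samples):
    """Return mean, sample standard deviation, minimum and maximum of repeated runs."""
    return {
        "mean": statistics.mean(samples),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


def overhead(total_time, execution_time):
    """Time not spent executing CUs (assumed definition: total minus execution)."""
    return total_time - execution_time


if __name__ == "__main__":
    print(summarize([10.0, 12.0, 14.0]))
    print(overhead(120.0, 100.0))
```

For the three sample values used here the mean is 12.0 and the sample standard deviation is 2.0; the numbers are illustrative only and are not measurements.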

Scenario 1

Scale up the number of cores owned by a single agent.

The largest HPC cluster we have access to is STAMPEDE:
- normal queue: 256 nodes (4k cores)
- large queue: 1024 nodes (16k cores) (on request)
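A scale-up series for this scenario could double the pilot size until the queue limit is reached. The sketch below is illustrative only: `node_sweep` is a hypothetical helper, and the figure of 16 cores per node is an assumption derived from the normal queue numbers above, reading "4K cores" as 4096.

```python
# Assumption: 4096 cores / 256 nodes on the normal queue.
CORES_PER_NODE = 4096 // 256


def node_sweep(max_nodes):
    """Pilot sizes in nodes, doubling from 1 up to and including max_nodes."""
    sizes = []
    n = 1
    while n < max_nodes:
        sizes.append(n)
        n *= 2
    # Always finish on the queue limit, even if it is not a power of two.
    sizes.append(max_nodes)
    return sizes


if __name__ == "__main__":
    for nodes in node_sweep(256):
        print(nodes, "nodes =", nodes * CORES_PER_NODE, "cores")
```

Doubling keeps the number of test runs small while still covering several orders of magnitude in pilot size.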

Scenario 2

Similar to Scenario 1, but distribute the total number of cores over 4 distinct HPC resources (XSEDE + FutureGrid).
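Distributing a fixed core count over several resources can be sketched as an even split. This assumes an even distribution, which the text does not prescribe; `split_cores` and the resource names are hypothetical placeholders.

```python
def split_cores(total_cores, resources):
    """Divide total_cores as evenly as possible; the first resources absorb any remainder."""
    base, extra = divmod(total_cores, len(resources))
    return {
        name: base + (1 if i < extra else 0)
        for i, name in enumerate(resources)
    }


if __name__ == "__main__":
    # Placeholder names; substitute the actual XSEDE and FutureGrid resources.
    resources = ["resource_a", "resource_b", "resource_c", "resource_d"]
    print(split_cores(4096, resources))
```

The same helper could be called with a much longer resource list to approximate Scenario 3, where each resource contributes only a few cores.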

Scenario 3

Similar to Scenario 2, but distribute the total number of cores over many OSG/Cloud resources. This essentially inverts the ratio of pilot size to number of CUs.
