Add Fake TPU e2e Autoscaling Test Cases #2279

ryanaoleary · 2024-07-31T02:52:44Z

Why are these changes needed?

This PR adds a fake TPU test case, similar to the existing fake GPU test case for autoscaling, that uses detached actors to verify that single-host and multi-host TPU autoscaling behave as expected. The behaviors tested include:

(1) Creating a detached actor that requests resources: {"TPU": 4} will scale up a Ray TPU worker
(2) For a multi-host worker group, the number of workers created should equal replicas * numOfHosts
(3) Terminating detached actors scheduled on a multi-host worker group replica will cause the entire replica to be scaled down

Edit: Removed the test for idle nodes being scaled down, since it requires a much higher timeout value, and scale-down of multi-host replicas is still tested.

Related issue number

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

Signed-off-by: Ryan O'Leary <[email protected]>

andrewsykim · 2024-08-07T21:40:36Z

ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go

+		ExecPodCmd(test, headPod, common.RayHeadContainer, []string{"python", "/home/ray/test_scripts/create_detached_actor.py", "tpu_actor_4", "--custom-resource-name=\"TPU\"", "--num-custom-resources=4"})
+
+		// Each new TPU detached actor should get scheduled to an existing scaled-up worker, so we check that there are still 4 pods in 'tpu-group'.
+		test.Expect(GetGroupPods(test, rayCluster, "tpu-group")).To(HaveLen(4))


This assertion runs right after the commands above and is not wrapped in a test.Eventually. Could it pass falsely here, before a scale-up happens?

Added a test.Eventually to verify the number of replicas stays the same before checking the Pod count in 0c6bb58.

andrewsykim · 2024-08-07T21:41:41Z

ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go

+		test.Expect(GetGroupPods(test, rayCluster, "tpu-group")).To(HaveLen(4))
+
+		// Terminating one TPU detached actor will result in the Ray node becoming idle, causing Ray to scale down the entire multi-host
+		// worker group. A new multi-host worker group will then be scaled back up since the remaining detached actors are running.


This behavior seems a bit unexpected to me. What's the reason we expect a scale down and a scale up again in this scenario?

I might just be misunderstanding this comment. Should there be an assertion for this part?

> A new multi-host worker group will then be scaled back up since the remaining detached actors are running.

Detached actors keep running when the Ray node they're scheduled on is scaled down, so the autoscaler sees the request for TPUs and scales a multi-host worker group back up to meet the unmet demand. In a regular scenario (i.e., non-detached actors), the actors would be terminated along with their respective nodes when the replica scales down.

I can add an assertion that checks that the pod list length becomes 0 before becoming 4 again

Ah I see, I missed the behavior specific to detached actors.

> I can add an assertion that checks that the pod list length becomes 0 before becoming 4 again

sgtm!

I ended up removing this section in 0c6bb58, because getting the node to become idle requires setting the timeout to 5+ minutes, which I'd imagine would slow down the presubmit too much. The behavior to scale down a multi-host replica is still tested by deleting the detached actors.

Signed-off-by: Ryan O'Leary <[email protected]>

ryanaoleary · 2024-08-12T09:24:24Z

cc: @kevin85421

kevin85421 · 2024-08-12T15:03:22Z

I plan to include this PR in v1.3.0 instead.

ryanaoleary and others added 7 commits July 31, 2024 02:33

Fake TPU test initial commit

aa2b12f

Signed-off-by: Ryan O'Leary <[email protected]>

Add single host test

9bc92a6

Signed-off-by: Ryan O'Leary <[email protected]>

remove comment

81fce4c

Signed-off-by: Ryan O'Leary <[email protected]>

Lint changes

01d5978

Signed-off-by: Ryan O'Leary <[email protected]>

Fix build errors

dd96c66

Signed-off-by: Ryan O'Leary <[email protected]>

Merge branch 'ray-project:master' into tpu-fake-test

eece329

Fix unparam lint error

0bab892

Signed-off-by: Ryan O'Leary <[email protected]>

andrewsykim reviewed Aug 7, 2024


ryanaoleary added 2 commits August 9, 2024 08:29

Change to 2x2x2 topology and remove idle node behavior

0c6bb58

Signed-off-by: Ryan O'Leary <[email protected]>

Add back in new line

2649e1f

Signed-off-by: Ryan O'Leary <[email protected]>

kevin85421 self-assigned this Aug 12, 2024

kevin85421 added the 1.3.0 label Aug 12, 2024
