Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Should --num-cpu be based on CPU requests instead of limits? #2361

Open
1 of 2 tasks
andrewsykim opened this issue Sep 6, 2024 · 14 comments
Open
1 of 2 tasks
Labels
enhancement New feature or request triage

Comments

@andrewsykim
Copy link
Collaborator

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Opening this issue to gather feedback on whether --num-cpus set on Ray workers should be based on requests instead of limits. From my personal experience, there are few reasons to ever set CPU limits and setting CPU requests is often good enough. Today it's possible to exclude limits but it requires also setting startParams["num-cpu"] to match the CPU requests.

Would it be beneifical for KubeRay to set num-cpu based on requests instead of limits? Would this be considered a breaking change or would it be no-op in most cases since most people configure requests == limits?

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@andrewsykim andrewsykim added enhancement New feature or request triage labels Sep 6, 2024
@anovv
Copy link
Contributor

anovv commented Sep 7, 2024

Hey @andrewsykim, funny enough I was just debugging major perf degradation running a Ray workload on Kubernetes and it seems like having a CPU limit is the problem (due to throttling), started googling and came across this issue.

I tried starting a cluster without CPU limits but pods do not show up. My feedback is that it would make much more sense to use CPU requests as num-cpus.

Also is there a documented behaviour with regards of using KubeRay without CPU limits (you mentioned setting startParams["num-cpu"])?

@anovv
Copy link
Contributor

anovv commented Sep 7, 2024

I'm deploying via ray-cluster helm chart and tried setting num-cpus in values file like this

worker:
  rayStartParams:
    num-cpus: 6

and removed resources.limits.cpu. The worker pod does not show up. What am I doing wrong?

I tried both KubeRay 1.1.1 and 1.2.1

@andrewsykim
Copy link
Collaborator Author

I'm not sure it matters, but the value needs to be string:

num-cpus: "6"

@andrewsykim
Copy link
Collaborator Author

I tried starting a cluster without CPU limits but pods do not show up. My feedback is that it would make much more sense to use CPU requests as num-cpus.

I generally agree with this feedback, but I'm worried it's a breaking change. Maybe it's not if we assume requests == limits?

@andrewsykim
Copy link
Collaborator Author

Also is there a documented behaviour with regards of using KubeRay without CPU limits (you mentioned setting startParams["num-cpu"])?

See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources

@anovv
Copy link
Contributor

anovv commented Sep 8, 2024

I tried starting a cluster without CPU limits but pods do not show up. My feedback is that it would make much more sense to use CPU requests as num-cpus.

I generally agree with this feedback, but I'm worried it's a breaking change. Maybe it's not if we assume requests == limits?

Yes, it can be a breaking change, given I'm now trying to simply get rid of limits and it just does not create pods.
What can be done: if users sets limits, we use limits as num-cpus (default behaviour), if no limits - fallback to requests. In both cases we can add a warning/update docs stating that setting CPU limits on a pod most likely is not a good idea.

@anovv
Copy link
Contributor

anovv commented Sep 8, 2024

I'm not sure it matters, but the value needs to be string:

num-cpus: "6"

I tried setting it as a string, does not help. I checked kuberay-operator logs, it throws an error

Invalid value: \"6\": must be less than or equal to cpu limit of 1

Full trace:

{"level":"error","ts":"2024-09-08T04:43:59.528Z","logger":"controllers.RayCluster","msg":"Error reconcile resources","RayCluster":{"name":"ray-cluster-kuberay","namespace":"ray-system"},"reconcileID":"e280cb05-f017-49bc-a4ce-a4eaf5dbde19","function name":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcilePods-fm","error":"FailedCreateWorkerPod\nPod \"ray-cluster-kuberay-test-group-worker-g2p9j\" is invalid: spec.containers[0].resources.requests: Invalid value: \"6\": must be less than or equal to cpu limit of 1","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:320\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller.go:171\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

I believe the operator sets a default value of 1 cpu if limits are not present, so essentially it enforces setting limits. @andrewsykim can you confirm?

@andrewsykim
Copy link
Collaborator Author

I believe the operator sets a default value of 1 cpu if limits are not present

I don't think the operator does this. Would you happen to have a LimitRange or other admission policy in your cluster that might be defaulting CPU limits in your cluste?

@dcela
Copy link

dcela commented Sep 9, 2024

From my personal experience, there are few reasons to ever set CPU limits and setting CPU requests is often good enough.

Yeah I always thought it was weird that I have to set CPU limits in kuberay (and head pod), which can end up throttling the deployment.

I general I have followed past Google GKE recommendations I have read to not set a CPU limit, and to always match memory requests and limits.

https://www.numeratorengineering.com/requests-are-all-you-need-cpu-limits-and-throttling-in-kubernetes/

@andrewsykim
Copy link
Collaborator Author

I think there's a way to add this without introducing breaking changes, by only defaulting to requests if the limit is not specified. I started a PR here: #2365

@anovv
Copy link
Contributor

anovv commented Sep 11, 2024

I believe the operator sets a default value of 1 cpu if limits are not present

I don't think the operator does this. Would you happen to have a LimitRange or other admission policy in your cluster that might be defaulting CPU limits in your cluste?

I did not set any limitranges. Ran kubectl get limitrange in my namespace, returned nothing. Any other ideas?

@andrewsykim
Copy link
Collaborator Author

Any mutating webhooks in your cluster that may be defaulting limits?

@anovv
Copy link
Contributor

anovv commented Sep 11, 2024

Any mutating webhooks in your cluster that may be defaulting limits?

None that I'm aware of. I'm running on minikube, can it be the reason somehow?

@andrewsykim
Copy link
Collaborator Author

@anovv mind opening a new issue on it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

3 participants