[WIP] Use HTTP probes for Ray readiness and liveness probes #2360

Open
wants to merge 1 commit into
base: master

andrewsykim
Collaborator

Why are these changes needed?

HTTP probes are considered lighter-weight than exec probes. However, exec probes have the advantage of being able to run multiple health checks in a single probe. In KubeRay, we use exec probes to execute "wget" commands against multiple endpoints. The use of exec probes seems to be causing some issues, as shown in #2264 and in KubeRay scalability testing.

This PR explores using HTTP probes instead. It needs more consideration, because using HTTP probes means we can only health check one endpoint per probe. Marking as WIP for now until that question is resolved.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -271,7 +256,7 @@ func initLivenessAndReadinessProbe(rayContainer *corev1.Container, rayNodeType r
 		SuccessThreshold: utils.DefaultLivenessProbeSuccessThreshold,
 		FailureThreshold: utils.DefaultLivenessProbeFailureThreshold,
 	}
-	rayContainer.LivenessProbe.Exec = &corev1.ExecAction{Command: []string{"bash", "-c", strings.Join(commands, " && ")}}
+	rayContainer.LivenessProbe.HTTPGet = &corev1.HTTPGetAction{Path: healthCheckPath, Port: healthCheckPort}
Collaborator Author


Using HTTP probes means we can only query one endpoint per probe now. For the head pod this would be /api/gcs_healthz, and for the worker pod it would be api/local_raylet_healthz. I'm not sure whether skipping the api/local_raylet_healthz check in the head pod is problematic; it depends on whether /api/gcs_healthz incorporates raylet health in some way as well.
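
The per-node-type choice described above can be sketched as follows. `nodeType` and `healthCheckPathFor` are hypothetical names introduced for illustration, and the paths are written with a leading slash on the assumption that they are used as HTTP request paths.

```go
package main

import "fmt"

// nodeType is a hypothetical stand-in for KubeRay's node-type value.
type nodeType string

const (
	headNode   nodeType = "head"
	workerNode nodeType = "worker"
)

// healthCheckPathFor returns the single endpoint an HTTP probe would query
// for the given node type. With one endpoint per probe, the head pod no
// longer checks the raylet endpoint.
func healthCheckPathFor(t nodeType) string {
	if t == headNode {
		return "/api/gcs_healthz"
	}
	return "/api/local_raylet_healthz"
}

func main() {
	for _, t := range []nodeType{headNode, workerNode} {
		fmt.Printf("%s -> %s\n", t, healthCheckPathFor(t))
	}
}
```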

@YQ-Wang

YQ-Wang commented Sep 10, 2024

We also face this issue when the workload is high.

@kevin85421
Member

@andrewsykim do we still need this PR after #2353 has been merged?

@andrewsykim
Collaborator Author

I think we should still consider using HTTP probes; they are significantly lighter weight. I haven't root-caused the issue yet, but increasing the timeout did not fully resolve the high load I'm seeing from exec probes.

3 participants