kuberay: [Bug][High Availability] 502 Errors while Head Node in Recovery
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
What you expected to happen
I expect that while the head node is terminated there will be no drop in request availability when sending requests to the Kubernetes service for a RayService.
What Happened
Following the HA guide for Ray Serve + KubeRay, I tested deleting the head node pod while issuing requests to the Kubernetes Service `{cluster_name}-serve-svc`.
Intermittently, I received 502 errors: roughly 1 in 5 requests failed for a few seconds while the head pod recovered.
However, if I follow the guide and port-forward to a worker pod instead, I do not receive any 502 errors.
Hypothesis
Since the Kubernetes service (`{cluster_name}-serve-svc`) points only at worker pods (not the head pod), this leads me to believe the 502 errors happen during some transient state induced by a reaction in KubeRay or Ray Serve.
Reproduction script
Run a simple request loop while tearing down the head node pod.
```python
import time
import requests

# Hit the Serve endpoint in a loop (here via a local port-forward) and print
# each status code so 502s during head-pod recovery are visible.
url = "http://127.0.0.1:8000/dummy"

while True:
    resp = requests.post(url=url, json={"test": "test"})
    print(resp.status_code)
    time.sleep(0.1)
```
Anything else
Using Ray v2.4.0 and KubeRay nightly @ bc6be0ee3b513648ea929961fed3288164c9fc46
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: open
- Created a year ago
- Reactions: 1
- Comments: 15 (7 by maintainers)
Hey @shrekris-anyscale,
Unfortunately, I'm already using `num-cpus: 0` on the head node. I'll try to post a minimal example. In the Ray dashboard, for example, I see all the workloads on the worker node and none on the head node. The deployments' `numCpus` settings are as follows, if it matters: `Dummy` (hello world) -> `numCpus: 0.4` and `DAGDriver` -> `numCpus: 0.1`.
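For context, a minimal sketch of how those per-deployment CPU settings might look in a RayService `serveConfig`, assuming the serveConfig schema KubeRay used around Ray 2.4; the import path and other fields are illustrative placeholders, and only the `numCpus` values come from this thread:

```yaml
# Hypothetical serveConfig excerpt for the setup described above.
# Only the numCpus values are taken from this thread; everything else is a placeholder.
serveConfig:
  importPath: dummy_app.deployment_graph   # placeholder import path
  deployments:
    - name: Dummy            # hello-world deployment
      numReplicas: 1
      rayActorOptions:
        numCpus: 0.4
    - name: DAGDriver        # ingress driver for the deployment graph
      numReplicas: 1
      routePrefix: "/"
      rayActorOptions:
        numCpus: 0.1
```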
You're right that `num-cpus: 0` really should be recommended as best practice for HA. We should update the sample yaml file and add a comment.
https://github.com/ray-project/ray/blob/93c05d1d4a19d423acfc8671251a95221e6e0980/doc/source/serve/doc_code/fault_tolerance/k8s_config.yaml#L87
Do you mind retrying that experiment by first killing the head node pod and then starting the while loop? Do you still see ~10 seconds of 502’s in that case?
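To illustrate the `num-cpus: 0` recommendation above, here is a minimal sketch of a head group spec with that setting; the image, resource values, and the `dashboard-host` flag are assumptions for illustration, not taken from the linked config:

```yaml
# Sketch of a headGroupSpec that keeps Serve replicas off the head pod by
# advertising zero CPUs to Ray. Image and resource values are illustrative only.
headGroupSpec:
  rayStartParams:
    num-cpus: "0"               # head node schedules no replicas/actors (HA best practice)
    dashboard-host: "0.0.0.0"
  template:
    spec:
      containers:
        - name: ray-head
          image: rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
```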