kuberay: [Bug][High Availability] 502 Errors While Head Node Is in Recovery

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

What you expected to happen

I expect that while the head node is terminated and recovering, there will be no drop in request availability when issuing requests to the Kubernetes Service for a RayService.

What Happened

Following the HA Guide for Ray Serve + KubeRay, I tested deleting the head node pod while issuing requests to the Kubernetes Service ({cluster_name}-serve-svc).

Intermittently, I received 502 errors: roughly one in five requests failed for a few seconds while the head pod recovered.

However, if I follow the guide and port-forward to a Worker pod, I do not receive any 502 errors.

Hypothesis

Since the Kubernetes Service ({cluster_name}-serve-svc) selects only worker pods (never the head pod), the 502 errors presumably arise from some transient state induced by KubeRay or Ray Serve reacting to the head pod going down.
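One quick way to probe the transient-state theory is to watch the Service's endpoints while the head pod is deleted. Below is a minimal sketch using the official kubernetes Python client; the service name raycluster-example-serve-svc and the default namespace are assumptions, so substitute your own. If worker endpoints flip to not-ready while the head pod restarts, that would explain the 502s.

from kubernetes import client, config

# Assumed names for illustration -- adjust to your cluster and namespace.
SERVICE_NAME = "raycluster-example-serve-svc"
NAMESPACE = "default"

config.load_kube_config()
v1 = client.CoreV1Api()

endpoints = v1.read_namespaced_endpoints(name=SERVICE_NAME, namespace=NAMESPACE)
for subset in endpoints.subsets or []:
    for addr in subset.addresses or []:
        ref = addr.target_ref
        print("ready:", addr.ip, ref.name if ref else "<unknown pod>")
    for addr in subset.not_ready_addresses or []:
        ref = addr.target_ref
        print("not ready:", addr.ip, ref.name if ref else "<unknown pod>")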

Reproduction script

Run a simple request loop while tearing down the head node pod:

import time

import requests

# Hammer the Serve endpoint and print each status code so 502s
# are visible as the head pod is deleted and recovers.
url = "http://127.0.0.1:8000/dummy"

while True:
    resp = requests.post(url=url, json={"test": "test"})
    print(resp.status_code)
    time.sleep(0.1)
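To put a number on "roughly one in five requests for a few seconds", here is a sketch of a variant of the same loop that logs each failure with a timestamp; run it, delete the head pod, and the printed offsets show how long the outage window lasts. The two-minute run length and two-second timeout are arbitrary choices, not from the original report.

import time

import requests

url = "http://127.0.0.1:8000/dummy"
failures = []

start = time.monotonic()
while time.monotonic() - start < 120:  # sample for two minutes
    try:
        resp = requests.post(url=url, json={"test": "test"}, timeout=2)
        if resp.status_code != 200:
            failures.append((time.monotonic() - start, resp.status_code))
    except requests.RequestException as exc:
        # Connection resets during pod churn land here.
        failures.append((time.monotonic() - start, type(exc).__name__))
    time.sleep(0.1)

for offset, status in failures:
    print(f"{offset:7.2f}s  {status}")
print(f"{len(failures)} failures total")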

Anything else

Using Ray v2.4.0 and KubeRay nightly @ bc6be0ee3b513648ea929961fed3288164c9fc46

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Hey @shrekris-anyscale,

Unfortunately, I’m already using num-cpus: 0 on the head node. I’ll try to post a minimal example.

In the Ray dashboard, for example, I see all the workloads on the worker node and none on the head node:

ray::HTTPProxyActor
ray::ServeReplica:Dummy
ray::ServeReplica:DAGDriver
ray::ServeController

In case it matters, the deployment numCpus settings are: Dummy (hello world) -> numCpus: 0.4, and DAGDriver -> numCpus: 0.1.
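For reference, a minimal sketch of how a per-deployment CPU request like that is expressed in the Ray 2.4-era Serve API; the Dummy class here is a hypothetical stand-in for the actual app:

from ray import serve
from starlette.requests import Request

# ray_actor_options carries the per-replica resource request that
# corresponds to numCpus in the Serve config.
@serve.deployment(ray_actor_options={"num_cpus": 0.4})
class Dummy:
    async def __call__(self, request: Request) -> str:
        return "hello world"

app = Dummy.bind()
# serve.run(app)  # in the KubeRay setup this is deployed via the RayService spec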

You’re right that num-cpus: 0 really should be recommended as best practice for HA. We should update the sample YAML file and add a comment:

https://github.com/ray-project/ray/blob/93c05d1d4a19d423acfc8671251a95221e6e0980/doc/source/serve/doc_code/fault_tolerance/k8s_config.yaml#L87

Do you mind retrying that experiment by first killing the head node pod and then starting the while loop? Do you still see ~10 seconds of 502s in that case?