ray: [Bug] [Serve] Ray hangs on API methods
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Serve
What happened + What you expected to happen
After connecting to Ray and Ray Serve on a remote Ray cluster (running on Kubernetes), running a job, and then waiting for a while, subsequent Serve/Ray API calls appear to block indefinitely.
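For illustration only (this helper is not part of the original report): the symptom is that the first Serve call after the idle period never returns, so a small watchdog like the sketch below surfaces the stall as a TimeoutError instead of blocking forever. In the repro script further down, serve.get_deployment is the call that stalls.

import threading

def call_with_watchdog(fn, timeout_s=300):
    # Run fn() in a daemon thread; raise instead of blocking forever if it hangs.
    result = {}
    done = threading.Event()

    def target():
        result["value"] = fn()
        done.set()

    threading.Thread(target=target, daemon=True).start()
    if not done.wait(timeout_s):
        raise TimeoutError(f"Call did not return within {timeout_s}s (likely hung)")
    return result["value"]

# e.g. call_with_watchdog(lambda: serve.get_deployment("DeployClass"))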
Versions / Dependencies
ray[serve]==1.9.0, Python 3.7.12
Reproduction script
Repro script with the experiment results recorded in comments (note: you must edit the remote cluster URL):
import logging
import time

import ray
from ray import serve
from tqdm import tqdm

logger = logging.getLogger("ray")


def init_ray(use_remote: bool = True, verbose: bool = True):
    logger.info("Entering init_ray")
    if ray.is_initialized():
        logger.info("Ray is initialized")
        # NOTE: If you put `ray.shutdown()` here and remove the return, the script will also hang on that.
        return
    if use_remote:
        # This should be a remote ray cluster connected to with the Ray Client
        address = "ray://<your Ray client URL>:10001"
        logger.info("Running ray.init")
        ray.init(address=address, namespace="serve", log_to_driver=verbose)
    # Start Ray Serve for model serving
    # Bind on 0.0.0.0 to expose the HTTP server on external IPs.
    logger.info("Running serve.start")
    serve.start(detached=True, http_options={"host": "0.0.0.0"})


DEPLOYMENT_NAME = "DeployClass"

ray_autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 100,
    "target_num_ongoing_requests_per_replica": 5,
}


@serve.deployment(
    name=DEPLOYMENT_NAME,
    version="v1",  # required for autoscaling at the moment
    max_concurrent_queries=10,
    _autoscaling_config=ray_autoscaling_config,
)
class DeployClass:
    def f(self, i: int):
        logger.info(f"Handling {i}")
        time.sleep(2)
        return i


def deploy_deployment():
    try:
        # NOTE: This is the line it stalls on! The first `serve.` line
        logger.info("Trying to get existing deployment")
        return serve.get_deployment(DEPLOYMENT_NAME)
    except KeyError:
        logger.info("DeployClass is not currently deployed, deploying...")
        DeployClass.deploy()
        return DeployClass


inputs = list(range(10))

for i in range(5):
    logger.info("Starting ray init")
    init_ray(True, True)
    logger.info("Deploying deployment")
    deployment = deploy_deployment()
    logger.info("Getting handle")
    handle = deployment.get_handle()
    logger.info("Making method calls")
    futures = [handle.f.remote(i) for i in inputs]
    logger.info("Getting results")
    results = ray.get(futures)
    logger.info(f"Results: {results}")

    # simulate doing lots of other work...
    # Confirmed to not work:
    # 1) 10m (waited 5m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 2) 2m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 3) 1m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 4) 30s (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # Confirmed to work sometimes:
    # 5) 15s (worked 2x, then stalled out on iteration #3)
    # 6) 30s (worked 1x, then stalled out on iteration #2)
    logger.info("Waiting for a while...")
    for minute in tqdm(range(1)):
        logger.info(f"Waiting a minute (already waited {minute})")
        time.sleep(60)
Anything else
The hang occurs every time for certain wait periods; see the "Confirmed to work" / "Confirmed to not work" experiments in the comments at the bottom of the repro script.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (17 by maintainers)
This issue is actually also reproducible on a laptop. The key is to force the script through the Ray Client by starting a local head node with ray start --head and then using ray://127.0.0.1:10001 as the address. The symptom on the laptop is identical to the remote cluster. PR is up: #21104
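For completeness, a sketch of that local reproduction, assuming ray[serve]==1.9.0 is installed and a head node was already started out of band with ray start --head; the ray:// address is what forces the script through the Ray Client code path:

import ray
from ray import serve

# Connect through the Ray Client (ray://) rather than to a local instance directly.
ray.init(address="ray://127.0.0.1:10001", namespace="serve")
serve.start(detached=True, http_options={"host": "0.0.0.0"})
# ...then run the loop from the repro script above, waiting between iterations.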
After more digging, it looks like the hang comes from a deadlock:
- callback1 is registered via self._current_ref._on_completed(lambda update: self._process_update(update)).
- callback1 (self._process_update) sees that a timeout error has occurred.
- callback1 calls self._poll_next, which creates a new object ref with a new listen_for_change.remote(). This is asynchronous, i.e. the object ref isn't populated yet.
- callback1 tries to register callback2 onto the new object ref.
- To register callback2, we need to wait for the new object ref to be populated, which requires the data client to process the response from the server first.
- But the data client is waiting for callback1 to return, causing a deadlock.

cc @ckw017, this looks to be a Ray Client-specific issue.
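To make the shape of that cycle concrete, here is a minimal, self-contained sketch; the names are illustrative only, not the real Ray Client internals. A single "data client" thread runs the completion callback, and that callback blocks waiting for something only the same thread can provide:

import threading

def demo_deadlock():
    populated = threading.Event()

    def register_callback2():
        # Registering callback2 requires the new object ref to be populated,
        # which only the data-client thread can do...
        if not populated.wait(timeout=5):
            print("deadlock: ref never populated while callback1 is still running")

    def data_client_thread():
        # ...but the data-client thread is stuck inside callback1 (this function)
        # and only gets to populate the ref after callback1 returns.
        register_callback2()  # callback1 re-registers from inside itself
        populated.set()       # reached only after the wait above has already timed out

    t = threading.Thread(target=data_client_thread)
    t.start()
    t.join()

demo_deadlock()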
@jiaodong
Hmm… okay, I think I found a way for Serve to get around this. I'll make a PR by EOD. @ckw017 can you create a separate issue for the Ray Client to track this?
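Purely as an illustration of one way to break that cycle (a guess at the general approach, not the actual change in the PR): hand the re-registration off to a separate thread so the completion callback returns immediately and the data client is free to populate the new object ref. _on_completed, _poll_next, and _process_update are the names from the analysis above; saw_timeout is a hypothetical predicate for the "Polling request timed out" case.

import threading

def make_callback(process_update, poll_next, saw_timeout):
    def callback1(update):
        process_update(update)
        if saw_timeout(update):
            # Start a new listen_for_change poll; the returned ref is not populated yet.
            new_ref = poll_next()
            # Register the next callback from a separate thread so callback1 can
            # return right away and unblock the data client.
            threading.Thread(
                target=lambda: new_ref._on_completed(callback1), daemon=True
            ).start()
    return callback1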
Hi @spolcyn, I've confirmed I can reproduce this on my remote cluster as well. I will mark it as a P0 issue and release blocker. Thanks for filing this issue with great context!
Logs from start to stuck: https://gist.github.com/jiaodong/e0d29b79f0ee735140e4550e6cea0369