ray: [Bug] [Serve] Ray hangs on API methods

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Serve

What happened + What you expected to happen

After connecting to Ray and Ray Serve on a remote Ray cluster (running on Kubernetes), running a job, and then waiting for a little while, subsequent Serve/Ray API calls block indefinitely.

Versions / Dependencies

ray[serve]==1.9.0, Python 3.7.12

Reproduction script

Repro script with experiment results in comments (note: you must edit the remote cluster URL):

import logging
import time

import ray
from ray import serve
from tqdm import tqdm

logger = logging.getLogger("ray")


def init_ray(use_remote: bool = True, verbose: bool = True):
    logger.info("Entering init_ray")
    if ray.is_initialized():
        logger.info("Ray is initialized")
        # NOTE: If you put `ray.shutdown()` here and remove the return, the script will also hang on that.
        return

    if use_remote:
        # This should be a remote ray cluster connected to with the Ray Client
        address = "ray://<your Ray client URL>:10001"
        logger.info("Running ray.init")
        ray.init(address=address, namespace="serve", log_to_driver=verbose)

        # Start Ray Serve for model serving
        # Bind on 0.0.0.0 to expose the HTTP server on external IPs.
        logger.info("Running serve.start")
        serve.start(detached=True, http_options={"host": "0.0.0.0"})


DEPLOYMENT_NAME = "DeployClass"
ray_autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 100,
    "target_num_ongoing_requests_per_replica": 5,
}


@serve.deployment(
    name=DEPLOYMENT_NAME,
    version="v1",  # required for autoscaling at the moment
    max_concurrent_queries=10,
    _autoscaling_config=ray_autoscaling_config,
)
class DeployClass:
    def f(self, i: int):
        logger.info(f"Handling {i}")
        time.sleep(2)
        return i


def deploy_deployment():
    try:
        # NOTE: This is the line it stalls on! It is the first `serve.` call.
        logger.info("Trying to get existing deployment")
        return serve.get_deployment(DEPLOYMENT_NAME)
    except KeyError:
        logger.info("DeployClass is not currently deployed, deploying...")
        DeployClass.deploy()
        return DeployClass


inputs = list(range(10))

for i in range(5):
    logger.info("Starting ray init")
    init_ray(True, True)
    logger.info("Deploying deployment")
    deployment = deploy_deployment()
    logger.info("Getting handle")
    handle = deployment.get_handle()

    logger.info("Making method calls")
    futures = [handle.f.remote(i) for i in inputs]
    logger.info("Getting results")
    results = ray.get(futures)
    logger.info(f"Results: {results}")

    # simulate doing lots of other work...
    # Confirmed to not work:
    # 1) 10m (waited 5m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 2) 2m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 3) 1m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 4) 30s (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # Confirmed to work sometimes:
    # 5) 15s (worked 2x, then stalled out on iteration #3)
    # 6) 30s (worked 1x, then stalled out on iteration #2)
    logger.info("Waiting for a while...")
    for minute in tqdm(range(1)):
        logger.info(f"Waiting a minute (already waited {minute})")
        time.sleep(60)

Anything else

The hang reproduces every time for certain wait periods. See the "Confirmed to work" / "Confirmed to not work" experiments at the bottom of the repro script.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

This issue is also reproducible on a laptop. The key is to force the use of the Ray Client by starting a local head node:

ray start --head

Then use ray://127.0.0.1:10001 as the address. The symptom on a laptop is identical to the remote cluster.
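
A minimal sketch of the local setup, assuming the default Ray Client port (10001) and reusing the same init calls as the repro script above:

import ray
from ray import serve

# Assumes `ray start --head` has already been run locally and the Ray Client
# server is listening on the default port 10001.
ray.init(address="ray://127.0.0.1:10001", namespace="serve", log_to_driver=True)
serve.start(detached=True, http_options={"host": "0.0.0.0"})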

PR is up #21104

After more digging, it looks like the hang comes from a deadlock (a minimal sketch of this pattern follows the list):

  1. A remote call to listen_for_change is made.
  2. When the response (a timeout error) from listen_for_change comes back after ~1 minute, the dataclient invokes callback1: self._current_ref._on_completed(lambda update: self._process_update(update)).
  3. callback1 (self._process_update) sees that a timeout error occurred.
  4. callback1 calls self._poll_next and creates a new object ref with a new listen_for_change.remote(). This is asynchronous, i.e. the object ref isn't populated yet.
  5. callback1 tries to register callback2 onto the new object ref.
  6. To register callback2, we need to wait for the new object ref to be populated, which requires the dataclient to process the response from the server first.
  7. We can't process that response (or any future responses) because we're still waiting for callback1 to return, causing the deadlock.
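
For illustration only, a standalone sketch of the pattern (the names and structure below are made up, not the actual Ray Client code): a single response-processing thread runs callback1, and callback1 blocks waiting for work that only that same thread can do, so the event that would populate the new ref is only processed after callback1 gives up.

import queue
import threading

# One thread processes all "server" responses and runs callbacks, standing in
# for the dataclient's single response-processing loop.
responses = queue.Queue()
new_ref_populated = threading.Event()

def process_responses():
    while True:
        kind, payload = responses.get()
        if kind == "callback":
            payload()                       # steps 2-3: run callback1 inline
        elif kind == "populate_ref":
            new_ref_populated.set()         # would populate the new object ref

def callback1():
    # Steps 4-6: issue a new poll and try to register callback2 on the new
    # object ref, which first requires the ref to be populated.
    responses.put(("populate_ref", None))   # the server's reply is queued ...
    ok = new_ref_populated.wait(timeout=5)  # ... but only the worker thread can
    print("new ref populated:", ok)         # process it, and it is stuck here.

worker = threading.Thread(target=process_responses, daemon=True)
worker.start()

responses.put(("callback", callback1))      # steps 1-2: timeout response arrives
worker.join(timeout=10)                     # prints "new ref populated: False"

With the timeout, the sketch prints "new ref populated: False" after about five seconds; without the timeout it would block forever, mirroring the hang described above.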

cc @ckw017, this looks to be a Ray Client-specific issue.

Hmm… okay, I think I found a way for Serve to get around this. I'll make a PR by EOD. @ckw017, can you create a separate issue for the Ray Client to track this?

Hi @spolcyn, I've confirmed I can reproduce this on my remote cluster as well. I will mark it as a P0 issue and a release blocker. Thanks for filing this issue with great context!


Logs from startup until the hang: https://gist.github.com/jiaodong/e0d29b79f0ee735140e4550e6cea0369