ray: [Bug] [Serve] Ray hangs on API methods

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Serve

What happened + What you expected to happen

After connecting to Ray and Ray Serve on a remote Ray cluster (running on Kubernetes), running a job, and then waiting for a little while, subsequent Serve/Ray API calls block indefinitely.

Versions / Dependencies

ray[serve]==1.9.0, Python 3.7.12

Reproduction script

Repro script with experiment results in comments (note: you must edit the remote cluster URL):

import logging
import time

import ray
from ray import serve
from tqdm import tqdm

logger = logging.getLogger("ray")


def init_ray(use_remote: bool = True, verbose: bool = True):
    logger.info("Entering init_ray")
    if ray.is_initialized():
        logger.info("Ray is initialized")
        # NOTE: If you put `ray.shutdown()` here and remove the return, the script will also hang on that.
        return

    if use_remote:
        # This should be a remote ray cluster connected to with the Ray Client
        address = "ray://<your Ray client URL>:10001"
        logger.info("Running ray.init")
        ray.init(address=address, namespace="serve", log_to_driver=verbose)

        # Start Ray Serve for model serving
        # Bind on 0.0.0.0 to expose the HTTP server on external IPs.
        logger.info("Running serve.start")
        serve.start(detached=True, http_options={"host": "0.0.0.0"})


DEPLOYMENT_NAME = "DeployClass"
ray_autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 100,
    "target_num_ongoing_requests_per_replica": 5,
}


@serve.deployment(
    name=DEPLOYMENT_NAME,
    version="v1",  # required for autoscaling at the moment
    max_concurrent_queries=10,
    _autoscaling_config=ray_autoscaling_config,
)
class DeployClass:
    def f(self, i: int):
        logger.info(f"Handling {i}")
        time.sleep(2)
        return i


def deploy_deployment():
    try:
        # NOTE: This is the line it stalls on! It is the first `serve.` call.
        logger.info("Trying to get existing deployment")
        return serve.get_deployment(DEPLOYMENT_NAME)
    except KeyError:
        logger.info("DeployClass is not currently deployed, deploying...")
        DeployClass.deploy()
        return DeployClass


inputs = list(range(10))

for i in range(5):
    logger.info("Starting ray init")
    init_ray(True, True)
    logger.info("Deploying deployment")
    deployment = deploy_deployment()
    logger.info("Getting handle")
    handle = deployment.get_handle()

    logger.info("Making method calls")
    futures = [handle.f.remote(i) for i in inputs]
    logger.info("Getting results")
    results = ray.get(futures)
    logger.info(f"Results: {results}")

    # simulate doing lots of other work...
    # Confirmed to not work:
    # 1) 10m (waited 5m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 2) 2m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 3) 1m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 4) 30s (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # Confirmed to work sometimes:
    # 5) 15s (worked 2x, then stalled out on iteration #3)
    # 6) 30s (worked 1x, then stalled out on iteration #2)
    logger.info("Waiting for a while...")
    for minute in tqdm(range(1)):
        logger.info(f"Waiting a minute (already waited {minute})")
        time.sleep(60)

Anything else

The hang reproduces every time for certain wait periods. See the "Confirmed to work" / "Confirmed to not work" experiments at the bottom of the repro script.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

This issue is also reproducible on a laptop. The key is to force the use of the Ray Client by starting a local head node:

ray start --head

Then use ray://127.0.0.1:10001 as the address. The symptom on a laptop is identical to the remote cluster.
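
A minimal sketch of the local setup, assuming the default Ray Client port (10001) and reusing the same init calls as the repro script above:

import ray
from ray import serve

# Assumes `ray start --head` has already been run locally and the Ray Client
# server is listening on the default port 10001.
ray.init(address="ray://127.0.0.1:10001", namespace="serve", log_to_driver=True)
serve.start(detached=True, http_options={"host": "0.0.0.0"})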

PR is up #21104

After more digging, it looks like the hang comes from a deadlock (a minimal sketch of this pattern follows the list):

  1. A remote call to listen_for_change is made.
  2. When the response (a timeout error) from listen_for_change comes back after ~1 minute, the dataclient invokes callback1: self._current_ref._on_completed(lambda update: self._process_update(update)).
  3. callback1 (self._process_update) sees that a timeout error occurred.
  4. callback1 calls self._poll_next and creates a new object ref with a new listen_for_change.remote(). This is asynchronous, i.e. the object ref isn't populated yet.
  5. callback1 tries to register callback2 onto the new object ref.
  6. To register callback2, we need to wait for the new object ref to be populated, which requires the dataclient to process the response from the server first.
  7. We can't process that response (or any future responses) because we're still waiting for callback1 to return, causing the deadlock.
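
For illustration only, a standalone sketch of the pattern (the names and structure below are made up, not the actual Ray Client code): a single response-processing thread runs callback1, and callback1 blocks waiting for work that only that same thread can do, so the event that would populate the new ref is only processed after callback1 gives up.

import queue
import threading

# One thread processes all "server" responses and runs callbacks, standing in
# for the dataclient's single response-processing loop.
responses = queue.Queue()
new_ref_populated = threading.Event()

def process_responses():
    while True:
        kind, payload = responses.get()
        if kind == "callback":
            payload()                       # steps 2-3: run callback1 inline
        elif kind == "populate_ref":
            new_ref_populated.set()         # would populate the new object ref

def callback1():
    # Steps 4-6: issue a new poll and try to register callback2 on the new
    # object ref, which first requires the ref to be populated.
    responses.put(("populate_ref", None))   # the server's reply is queued ...
    ok = new_ref_populated.wait(timeout=5)  # ... but only the worker thread can
    print("new ref populated:", ok)         # process it, and it is stuck here.

worker = threading.Thread(target=process_responses, daemon=True)
worker.start()

responses.put(("callback", callback1))      # steps 1-2: timeout response arrives
worker.join(timeout=10)                     # prints "new ref populated: False"

With the timeout, the sketch prints "new ref populated: False" after about five seconds; without the timeout it would block forever, mirroring the hang described above.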

cc @ckw017, this looks to be a Ray Client-specific issue.

Hmm… okay, I think I found a way for Serve to get around this. I'll make a PR by EOD. @ckw017, can you create a separate issue for the Ray Client to track this?

Hi @spolcyn, I've confirmed I can reproduce this on my remote cluster as well. I will mark it as a P0 issue and a release blocker. Thanks for filing this issue with great context!


Logs from startup until the hang: https://gist.github.com/jiaodong/e0d29b79f0ee735140e4550e6cea0369