ray: [raylet] Raylet crashes unexpectedly or has lagging heartbeats

What happened + What you expected to happen

Users of our Ray platform are experiencing intermittent failures with their Ray prediction jobs. They request a placement group of 400 actors with 2 CPUs per actor, and the placement group seemingly never becomes ready. The relevant snippet is:

        import ray
        from ray.util import placement_group
        ...
        resource_per_actor = {"CPU": cpus_per_actor}
        resource_bundles = [resource_per_actor for _ in range(actor_count)]
        pg = placement_group(resource_bundles, strategy="SPREAD")
        logger.info("Waiting for placement group to be ready")

        ray.get(pg.ready())

We see the logged message above but no logs downstream of ray.get. The raylet is apparently timing out, with this log entry:

[2022-06-10 07:27:51,402 I 125 125] (raylet) main.cc:301: Raylet received SIGTERM, shutting down...

And the application logs show this entry:

WARNING worker.py:1382 -- The node with node id: xxxxxxxxxxxxxxxxxx and ip: xx.x.xx.xxx has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

The job sometimes works and sometimes fails with the errors above. With the information at hand, we are stuck and cannot help our users debug their job any further. Is there a known cause for this problem? Where could we dig further to uncover the underlying issue?

Would reducing the number of actors be a band-aid solution? From our investigation it does not seem to be related to a lack of memory.
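
One way to make the hang easier to diagnose is to bound the wait rather than block on ray.get(pg.ready()) indefinitely. The sketch below polls the placement group's wait(timeout_seconds=...) method, which Ray's PlacementGroup exposes and which returns a bool. The helper name wait_for_placement_group and its default timeouts are illustrative assumptions, not part of Ray.

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_for_placement_group(pg, total_timeout_s=600.0, poll_s=30.0):
    """Poll pg.wait() until the group is ready or total_timeout_s elapses.

    `pg` only needs a wait(timeout_seconds) -> bool method, which is what
    ray.util.placement_group.PlacementGroup provides. The timeouts are
    hypothetical defaults; tune them for your cluster.
    """
    deadline = time.monotonic() + total_timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never wait past the overall deadline.
        if pg.wait(timeout_seconds=min(poll_s, remaining)):
            return True
        logger.warning(
            "Placement group not ready; %.0f s left before giving up",
            max(0.0, deadline - time.monotonic()),
        )
```

If the helper returns False, the job can fail fast with a clear error, and the state of the group can be inspected (for example with ray.util.placement_group_table(pg)) instead of hanging silently.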

Versions / Dependencies

Ray 1.12.0

Reproduction script

        import ray
        from ray.util import placement_group
        ...
        resource_per_actor = {"CPU": cpus_per_actor}
        resource_bundles = [resource_per_actor for _ in range(actor_count)]
        pg = placement_group(resource_bundles, strategy="SPREAD")
        logger.info("Waiting for placement group to be ready")

        ray.get(pg.ready())
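
Before waiting on the placement group, it may help to rule out the simplest cause: the live nodes cannot hold 400 bundles of 2 CPUs each. The sketch below computes an upper bound from a node list shaped like the output of ray.nodes() (dicts with "Alive" and "Resources" keys). The function name max_placeable_bundles is hypothetical. It uses total rather than currently available CPUs, so it can only show that the request is infeasible, not that it will succeed.

```python
def max_placeable_bundles(nodes, cpus_per_actor):
    """Upper bound on how many {"CPU": cpus_per_actor} bundles fit on live nodes.

    `nodes` is assumed to look like ray.nodes(): a list of dicts with an
    "Alive" flag and a "Resources" mapping. A bundle must fit on one node,
    so each node contributes floor(node_cpus / cpus_per_actor) bundles.
    """
    if cpus_per_actor <= 0:
        raise ValueError("cpus_per_actor must be positive")
    total = 0
    for node in nodes:
        if not node.get("Alive", False):
            continue  # dead nodes cannot host bundles
        node_cpus = node.get("Resources", {}).get("CPU", 0)
        total += int(node_cpus // cpus_per_actor)
    return total
```

With a running cluster, a check such as max_placeable_bundles(ray.nodes(), cpus_per_actor) >= actor_count could be logged just before creating the placement group.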

Issue Severity

No response

About this issue

State: closed
Created 2 years ago
Comments: 15 (10 by maintainers)

Most upvoted comments

We improved the heartbeat mechanism on master. If this issue still happens on master, please ping me! I will add the repro-required tag until then.

rkooo567 on Dec 8, 2022

Sounds good! Let me know if the user wants to talk to me directly. I’d love to do pair debugging or share more debugging tips in person. Also, as a short-term quick fix, you can try increasing the heartbeat timeout. I’ve seen it work for some users before.

Currently, a heartbeat is sent once per second and a node times out after 30 missed heartbeats (about 30 seconds). You can control this by setting the env var:

        # 30 by default
        RAY_num_heartbeats_timeout=30

        # Start the head node with a longer timeout
        RAY_num_heartbeats_timeout=120 ray start --head
        # Same for the workers
        RAY_num_heartbeats_timeout=120 ray start --address=<head_ip>

rkooo567 on Oct 28, 2022
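
To make the arithmetic in the comment above concrete, the following sketch computes how long a node can go silent before being marked dead, assuming one heartbeat per second and the default of 30 as stated in that comment. The helper heartbeat_timeout_seconds is hypothetical. It only reads the RAY_num_heartbeats_timeout environment variable and does not query Ray for its effective configuration.

```python
import os


def heartbeat_timeout_seconds(env=None, heartbeat_period_s=1.0,
                              default_num_heartbeats=30):
    """Return the approximate window before a silent node is marked dead.

    Assumes a heartbeat every `heartbeat_period_s` seconds and a default of
    30 missed heartbeats, as described in the comment above. `env` defaults
    to os.environ; pass a dict to evaluate a planned setting.
    """
    env = os.environ if env is None else env
    num_heartbeats = int(env.get("RAY_num_heartbeats_timeout",
                                 default_num_heartbeats))
    return num_heartbeats * heartbeat_period_s
```

For example, heartbeat_timeout_seconds({"RAY_num_heartbeats_timeout": "120"}) evaluates the suggested setting and yields a window of roughly two minutes under these assumptions.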