ray: [gcp] Node mistakenly marked dead: increase heartbeat timeout?

I’m using Ray on a GCP GPU cluster for hyperparameter tuning, training, and prediction. For any of these use cases, Ray crashes about 25% of the time with the following message:

The node with node id: xxx and ip: xxx has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

On nodes that are marked dead, I go to raylet.out, and I see many messages like

Last resource report was sent 612 ms ago. There might be resource pressure on this node. If resource reports keep lagging, scheduling decisions of other nodes may become stale

and

Last heartbeat was sent 515 ms ago. There might be resource pressure on this node. If heartbeat keeps lagging, this node can be marked as dead mistakenly

Most warnings are in the 500 ms range, but I do see a few as high as 20 seconds. When they hit 30 seconds, the node gets marked dead. If it helps, most of the warnings tend to appear when I’m moving data between nodes and cloud storage. To move data, I’m using multithreaded rsync, i.e. `gsutil -m rsync -r ...`. I’ve tried reducing the number of threads for rsync, but that hasn’t helped. Also, nodes are nowhere near max memory/CPU usage when the error occurs.

Any suggestions? Given that the error occurs on only 25% of runs and the logs indicate that long lags are rare, I feel like increasing the heartbeat timeout would solve the issue, but I don’t see an option for that anywhere.

Thanks.

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

No, I think we shouldn’t increase the default.

Instead, we should just document this particular issue (and link it in the GCP docs).

Hmm, unfortunately this is a real issue, but it’s good to know this has a real workaround. Let’s leave it open if possible 😃

Glad to hear that it was resolved though @ahah-figure !

I used `--system-config={"num_heartbeats_timeout":300}` and ran the script I provided many times on many nodes, and I didn’t get a single error.

Thanks everyone!
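For reference, `--system-config` is an argument to `ray start` on the head node, so in an autoscaler cluster config the workaround would go into `head_start_ray_commands`. A minimal sketch is below; the port and autoscaling-config options are just the usual defaults from Ray’s example YAMLs, and the single quotes are there so the shell passes the JSON through intact. Note that `num_heartbeats_timeout` counts heartbeat periods, not seconds.

```yaml
# Sketch of the relevant section of an autoscaler cluster YAML.
# Only the --system-config argument is the workaround from this thread;
# the other ray start options are the standard defaults.
head_start_ray_commands:
    - ray stop
    - >-
      ray start --head --port=6379
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --system-config='{"num_heartbeats_timeout":300}'
```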

Would you like me to close this issue?

I think it is `--system-config`

Also, @ahah-figure do you mind sharing your cluster config?

cc @DmitriGekhtman and @ijrsvt