ray: [gcp] Node mistakenly marked dead: increase heartbeat timeout?

I’m using Ray on a GCP GPU cluster for hyperparameter tuning, training, and prediction. For any of these use cases, Ray crashes about 25% of the time with the following message:

The node with node id: xxx and ip: xxx has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

On nodes that are marked dead, I go to raylet.out, and I see many messages like

Last resource report was sent 612 ms ago. There might be resource pressure on this node. If resource reports keep lagging, scheduling decisions of other nodes may become stale

and

Last heartbeat was sent 515 ms ago. There might be resource pressure on this node. If heartbeat keeps lagging, this node can be marked as dead mistakenly

Most warnings are in the 500 ms range, but I do see a few as high as 20 seconds. When they hit 30 seconds, the node gets marked dead. If it helps, most of the warnings tend to appear when I’m moving data between nodes and cloud storage. To move data, I’m using multithreaded rsync, i.e. `gsutil -m rsync -r ...`. I’ve tried reducing the number of threads for rsync, but that hasn’t helped. Also, nodes are nowhere near max memory/CPU usage when the error occurs.

Any suggestions? Given that the error occurs on only 25% of runs and the logs indicate that long lags are rare, I feel like increasing the heartbeat timeout would solve the issue, but I don’t see an option for that anywhere.

Thanks.

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

No, I think we shouldn’t increase the default.

Instead, we should just document this particular issue (and link it in the GCP docs).

Hmm, unfortunately this is a real issue, but it’s good to know this has a real workaround. Let’s leave it open if possible 😃

Glad to hear that it was resolved though @ahah-figure !

I used `--system-config={"num_heartbeats_timeout":300}` and ran the script I provided many times on many nodes, and I didn’t get a single error.

Thanks everyone!
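For reference, `--system-config` is an argument to `ray start` on the head node, so in an autoscaler cluster config the workaround would go into `head_start_ray_commands`. A minimal sketch is below; the port and autoscaling-config options are just the usual defaults from Ray’s example YAMLs, and the single quotes are there so the shell passes the JSON through intact. Note that `num_heartbeats_timeout` counts heartbeat periods, not seconds.

```yaml
# Sketch of the relevant section of an autoscaler cluster YAML.
# Only the --system-config argument is the workaround from this thread;
# the other ray start options are the standard defaults.
head_start_ray_commands:
    - ray stop
    - >-
      ray start --head --port=6379
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --system-config='{"num_heartbeats_timeout":300}'
```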

Would you like me to close this issue?

I think it is `--system-config`

Also, @ahah-figure do you mind sharing your cluster config?

cc @DmitriGekhtman and @ijrsvt