ray: [Ray Core] Ray agent getting killed unexpectedly
What happened + What you expected to happen
We are writing a module for data-parallel training using Ray for our machine learning engine. We are currently trying to scale our computation to 64 nodes on an in-house cluster, but during communication between the nodes, the Ray agent on some nodes fails unexpectedly. (It does not happen right after startup; all the nodes run for a while, and then some nodes start failing.) I am fairly sure the program is not running out of memory, as I don't see any OOM killer log in dmesg. The program keeps running if a node that is not running a worker fails, but it terminates as soon as a node with a worker running on it fails. We are using the Ray Collective Communication Lib (Gloo through pygloo), but we see the same failures even when using Ray Core for communication.
Error Log(Click to expand)
2022-10-17 14:23:59,720 WARNING worker.py:1839 -- Raylet is terminated: ip=172.29.58.176, id=c28cee44b5cd67e730b3a9f729ca772f9bfd3f4b936ae1999d38cf36. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] NodeManagerService.grpc_server.GetNodeStats - 475 total (0 active), CPU time: mean = 1.077 ms, total = 511.665 ms
[state-dump] NodeManager.deadline_timer.record_metrics - 432 total (1 active), CPU time: mean = 546.906 us, total = 236.263 ms
[state-dump] NodeManager.deadline_timer.debug_state_dump - 108 total (1 active), CPU time: mean = 1.300 ms, total = 140.400 ms
[state-dump] NodeResourceInfoGcsService.grpc_client.GetResources - 87 total (0 active), CPU time: mean = 13.138 us, total = 1.143 ms
[state-dump] NodeManager.deadline_timer.print_event_loop_stats - 18 total (1 active, 1 running), CPU time: mean = 2.067 ms, total = 37.203 ms
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 15 total (1 active), CPU time: mean = 225.370 us, total = 3.381 ms
[state-dump] Subscriber.HandlePublishedMessage_GCS_NODE_INFO_CHANNEL - 11 total (0 active), CPU time: mean = 124.656 us, total = 1.371 ms
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (0 active), CPU time: mean = 357.759 us, total = 2.862 ms
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 88.669 us, total = 177.338 us
[state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 341.472 us, total = 341.472 us
[state-dump] Subscriber.HandlePublishedMessage_GCS_WORKER_DELTA_CHANNEL - 1 total (0 active), CPU time: mean = 3.499 us, total = 3.499 us
[state-dump] RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 1 total (0 active), CPU time: mean = 52.743 us, total = 52.743 us
[state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), CPU time: mean = 893.418 us, total = 893.418 us
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 13.287 ms, total = 13.287 ms
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 345.265 us, total = 345.265 us
[state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 2.833 ms, total = 2.833 ms
[state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 6.246 us, total = 6.246 us
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
(raylet, ip=172.29.58.176) [2022-10-17 14:24:00,156 E 692949 692988] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
2022-10-17 15:47:23,758 WARNING worker.py:1839 -- The node with node id: c28cee44b5cd67e730b3a9f729ca772f9bfd3f4b936ae1999d38cf36 and address: 172.29.58.176 and node name: 172.29.58.176 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
(scheduler +1h42m22s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +1h42m22s) Restarting 1 nodes of type local.cluster.node (lost contact with raylet).
(raylet, ip=172.29.58.176) [2022-10-17 14:24:00,156 E 692949 692988] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
2022-10-17 15:57:44,909 WARNING worker.py:1839 -- Raylet is terminated: ip=172.29.58.107, id=d495158e712947f2e6fb8b3fc4a1ddad79adac63589745919c5083ab. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] RuntimeEnvService.grpc_client.GetOrCreateRuntimeEnv - 2 total (0 active), CPU time: mean = 20.050 ms, total = 40.101 ms
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 85.026 us, total = 170.052 us
[state-dump] ObjectManager.ObjectAdded - 2 total (0 active), CPU time: mean = 233.389 us, total = 466.779 us
[state-dump] NodeManagerService.grpc_server.RequestWorkerLease - 1 total (0 active), CPU time: mean = 1.180 ms, total = 1.180 ms
[state-dump] Subscriber.HandlePublishedMessage_GCS_WORKER_DELTA_CHANNEL - 1 total (0 active), CPU time: mean = 5.108 us, total = 5.108 us
[state-dump] ObjectManager.HandlePull - 1 total (0 active), CPU time: mean = 1.457 ms, total = 1.457 ms
[state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 103.737 us, total = 103.737 us
[state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.774 us, total = 4.774 us
[state-dump] ObjectManagerService.grpc_client.Pull - 1 total (0 active), CPU time: mean = 21.685 us, total = 21.685 us
[state-dump] Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 1 total (0 active), CPU time: mean = 486.567 us, total = 486.567 us
[state-dump] NodeManagerService.grpc_server.GetSystemConfig - 1 total (0 active), CPU time: mean = 187.941 us, total = 187.941 us
[state-dump] ClientConnection.async_write.DoAsyncWrites - 1 total (0 active), CPU time: mean = 39.085 us, total = 39.085 us
[state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 287.678 us, total = 287.678 us
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 11.784 ms, total = 11.784 ms
[state-dump] CoreWorkerService.grpc_client.UpdateObjectLocationBatch - 1 total (0 active), CPU time: mean = 8.580 us, total = 8.580 us
[state-dump] ObjectManager.ObjectDeleted - 1 total (0 active), CPU time: mean = 22.651 us, total = 22.651 us
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 300.936 us, total = 300.936 us
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
(raylet, ip=172.29.58.107) [2022-10-17 15:57:45,682 E 286759 286803] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
Traceback (most recent call last):
  File "mlm_training.py", line 35, in <module>
    wrapped_model.train()
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/distributed.py", line 204, in train
    train_state_manager.train_batch(epoch, batch_id)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 92, in train_batch
    self._compute_and_store_batch_gradients(batch_id)
  File "/usr/local/lib/python3.8/dist-packages/thirdai/_distributed_bolt/backend/train_state_manager.py", line 105, in _compute_and_store_batch_gradients
    ray.get(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2291, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Log Files from Head Node: logs_head.zip Log Files from Node 107: logs_107.zip Log Files from Node 176: logs_176.zip
Could it be that we are hitting the object store limits measured in ray-benchmarks?
Ray Discuss Link: https://discuss.ray.io/t/ray-actor-dying-unexpectedly/7797/6
Versions / Dependencies
Ray version: using the daily release (Ray collective communication (for pygloo) only works after https://github.com/ray-project/ray/issues/29036). OS: Ubuntu. Python: 3.8.10
Cluster Info: Number of training nodes: 64; vCPUs per node: 4; RAM per node: 32 GB
Reproduction script
This code doesn't exactly reproduce the error, but it fails in almost the same manner as the issue described above, and it runs very similarly to the main script.
Code to Reproduce Error
import os
import ray
import numpy as np
import ray.util.collective as col
from ray.util.collective.types import Backend, ReduceOp


@ray.remote(num_cpus=4, max_restarts=-1)
class communicating_actor:
    def __init__(self, rank, world_size, group_name, init_data):
        self.init_data = init_data
        col.init_collective_group(world_size, rank, Backend.GLOO, group_name)

    def test_allreduce(self, group_name):
        '''
        rank            # Rank of this process within the list of participating processes
        world_size      # Number of participating processes
        fileStore_path  # The path to create the filestore
        '''
        self.sendbuf = np.ones((4096, 1024, 256), dtype=np.float32)
        col.allreduce(self.sendbuf, group_name, ReduceOp.SUM)


if __name__ == "__main__":
    ray.init(address='auto')
    world_size = 64
    init_data = np.ones((4096, 1024, 256), dtype=np.float32)
    ref = ray.put(init_data)
    communicating_actors = [
        communicating_actor.remote(rank, world_size, "default", ref)
        for rank in range(world_size)
    ]
    for i in range(1000):
        ray.get([actor.test_allreduce.remote("default") for actor in communicating_actors])
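For context on the object-store question above, a quick back-of-the-envelope size check for the arrays used in the script (illustrative only, not part of the original repro):

import numpy as np

# Shape used in the repro script above; each allreduce buffer (and the
# ray.put'd init_data object) holds 4096 * 1024 * 256 float32 values.
shape = (4096, 1024, 256)
n_bytes = int(np.prod(shape)) * np.dtype(np.float32).itemsize
print(n_bytes / 2**30)  # -> 4.0 (GiB)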
Cluster Configuration File
auth:
  ssh_user: root
cluster_name: default
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_setup_commands: []
head_start_ray_commands:
- ray stop
- ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config='{"num_heartbeats_timeout":5000,"worker_register_timeout_seconds":500}'
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 87
min_workers: 87
provider:
  head_ip: 172.29.58.24
  type: local
  worker_ips:
  - 172.29.58.102
  - 172.29.58.103
  - 172.29.58.104
  - 172.29.58.105
  - 172.29.58.106
  - 172.29.58.107
  - 172.29.58.108
  - 172.29.58.109
  - 172.29.58.110
  - 172.29.58.111
  - 172.29.58.112
  - 172.29.58.113
  - 172.29.58.114
  - 172.29.58.115
  - 172.29.58.116
  - 172.29.58.117
  - 172.29.58.118
  - 172.29.58.119
  - 172.29.58.120
  - 172.29.58.121
  - 172.29.58.122
  - 172.29.58.123
  - 172.29.58.124
  - 172.29.58.125
  - 172.29.58.126
  - 172.29.58.127
  - 172.29.58.128
  - 172.29.58.129
  - 172.29.58.130
  - 172.29.58.131
  - 172.29.58.132
  - 172.29.58.133
  - 172.29.58.134
  - 172.29.58.135
  - 172.29.58.136
  - 172.29.58.137
  - 172.29.58.138
  - 172.29.58.139
  - 172.29.58.140
  - 172.29.58.141
  - 172.29.58.142
  - 172.29.58.143
  - 172.29.58.144
  - 172.29.58.145
  - 172.29.58.146
  - 172.29.58.147
  - 172.29.58.148
  - 172.29.58.149
  - 172.29.58.150
  - 172.29.58.151
  - 172.29.58.152
  - 172.29.58.153
  - 172.29.58.154
  - 172.29.58.155
  - 172.29.58.156
  - 172.29.58.157
  - 172.29.58.158
  - 172.29.58.159
  - 172.29.58.160
  - 172.29.58.161
  - 172.29.58.162
  - 172.29.58.163
  - 172.29.58.164
  - 172.29.58.165
  - 172.29.58.166
  - 172.29.58.167
  - 172.29.58.168
  - 172.29.58.169
  - 172.29.58.170
  - 172.29.58.171
  - 172.29.58.172
  - 172.29.58.173
  - 172.29.58.174
  - 172.29.58.175
  - 172.29.58.176
  - 172.29.58.177
  - 172.29.58.178
  - 172.29.58.179
  - 172.29.58.180
  - 172.29.58.181
  - 172.29.58.182
  - 172.29.58.183
  - 172.29.58.184
  - 172.29.58.185
  - 172.29.58.186
  - 172.29.58.187
  - 172.29.58.188
rsync_exclude:
- '**/.git'
- '**/.git/**'
rsync_filter:
- .gitignore
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379
Issue Severity
High: It blocks me from completing my task.
Got it. I am not sure what causes this, but psutil definitely seems broken in your env…
I think you can get around the issue by setting RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=100000000 for now. I will discuss the fix plan and get back to you…
Do you guys see the issue in the latest version? (Ray 2.6)
Sounds good to me! It'd also be great if you could test with a shorter time like 60 seconds (RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60)! Please note that what you did means you disabled the health check, so the agent may not be terminated properly even though the raylet is killed. In the medium term, we will see if it's possible to find a better health checking mechanism…
Hmm… it's interesting. So we are using psutil's Process.parent() (https://psutil.readthedocs.io/en/latest/) to see if the parent process is alive, and if it returns None, we consider the parent dead (which means the parent pid is not known anymore). The issue is that although the parent (raylet) is alive, psutil's parent() returns None, so we mistakenly conclude the raylet is dead; the agent kills itself, which kills the raylet, because they fate share.
I assume there may be a subtle bug in psutil that returns an incorrect result. This is their parent implementation: https://github.com/giampaolo/psutil/blob/08964b08700fcfc5a4917a01107009a3ee913341/psutil/__init__.py#L464. While I discuss the potential solution with my colleague, do you mind trying one more time with a higher RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S (maybe a longer value like 30~60)? It looks like this env var is not set, based on your current error message. You should set it on every node (e.g., RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60 ray start --address=<HEAD> & RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=60 ray start --head).
Also, as a temporary workaround, if you don't mind having a process leak (agent.py), you can set a really, really long value like RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=10000000. This effectively disables the parent health checking. It may solve your issue temporarily, but then when the raylet is actually dead, the agent cannot detect that its parent is dead.
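A minimal probe of the parent lookup described above (hypothetical helper, not Ray code; assumes psutil is installed) could look like this. Running it on a failing node with the dashboard agent's PID should print the raylet process while both are alive:

# probe_parent.py -- hypothetical debugging helper, not part of Ray.
# Usage: python probe_parent.py <pid-of-dashboard-agent>
import sys

import psutil

def parent_of(pid):
    # Return the parent psutil.Process of `pid`, or None if it cannot be resolved.
    try:
        return psutil.Process(pid).parent()
    except psutil.NoSuchProcess:
        return None

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else psutil.Process().pid
    print(f"parent of {pid}: {parent_of(pid)}")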
@rkooo567, It failed again.
Error Output
logs_head.zip logs_162.zip (node that failed) logs_133.zip (node that failed, but no actor was running on it)
There are 2 potential solutions here;
@pratkpranav Can you try this one more time with the latest master? I think this will be the last time before I decide how to fix the issues. Thanks again for your help!!
I will let you know after this PR is merged: https://github.com/ray-project/ray/pull/29802
Yes. Let me tell you in a bit more detail what is happening here. On every Ray node there's a raylet, and the raylet spawns a child process called the "dashboard agent". Although the name says "dashboard", it actually does more than dashboard work, so you can just think of it as an "agent".
When the agent is dead, the raylet is dead. They fate share.
Also, in Ray every raylet sends a heartbeat to the central control plane (GCS). If a raylet's heartbeat is not delivered to the GCS within 30 seconds, this warning is printed.
Your workloads consistently fail in this order:
agent thinks the raylet is dead when it is not -> agent kills itself because it thinks the raylet is dead -> raylet kills itself because the agent is dead. In this case, you are expected to see the above warning because the raylet died ungracefully and its heartbeat won't be sent to the GCS.
And this is the code the agent uses to detect the parent's status (it keeps checking the parent's pid): https://github.com/ray-project/ray/blob/d1662d68b96803c629ff982268563a14688646d3/dashboard/agent.py#L174. Apparently this detection code doesn't work in your environment for some unknown reason, and I am trying to figure out why.
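Roughly, the check amounts to something like the following simplified sketch (not Ray's actual agent code; the interval default shown here is illustrative):

import os
import sys
import time

import psutil

# Interval is controlled by the env var discussed above; the default used here is illustrative.
CHECK_INTERVAL_S = float(os.environ.get("RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S", "2"))

def watch_parent():
    me = psutil.Process()
    while True:
        if me.parent() is None:
            # Parent pid can no longer be resolved -> assume the raylet is gone,
            # so the agent exits (and the raylet fate-shares with the agent).
            sys.exit(1)
        time.sleep(CHECK_INTERVAL_S)

if __name__ == "__main__":
    watch_parent()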
It looks like the agent died because of a different exception in your latest run. Unfortunately, the exception is not logged from our code…
I think we may need a couple more iterations of "merge code with more debugging info -> you run it and tell me how it goes", since it seems hard to reproduce in my env… Would this be okay? I will create another PR to add more information in this case.
Okay, I merged the PR. The wheel will be available in 2~3 hours. You can get the download link from https://docs.ray.io/en/master/ray-overview/installation.html#daily-releases-nightlies. Please make sure the commit is after 98da294f0b8fbf0ab9d94ebd552c4d78b4d3aaf8.
You can try 2 things.
Set RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2. You can start Ray like this to apply the env var: RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2 ray start --head and RAY_DASHBOARD_AGENT_CHECK_PARENT_INTERVAL_S=2 ray start --address=<head_ip>. I will ping you after the PR is merged!