ray: [Core] `ray.init()` errors inside docker container on M1
What happened + What you expected to happen
Inside a clean rayproject/ray:1.12.1 container on M1, ray.init() fails with:
[2022-05-31 00:25:34,784 E 915 915] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
raylet.err:
[2022-05-30 17:46:13,392 E 145 202] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
dashboard_agent.log:
2022-05-30 17:46:13,079 INFO agent.py:83 -- Parent pid is 145
2022-05-30 17:46:13,120 INFO agent.py:109 -- Dashboard agent grpc address: 0.0.0.0:65046
2022-05-30 17:46:13,125 ERROR agent.py:150 -- Raylet is dead, exiting.
dashboard.log:
2022-05-30 17:46:05,029 INFO head.py:122 -- Dashboard head grpc address: 0.0.0.0:34177
2022-05-30 17:46:05,084 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-30 17:46:07,199 WARNING tune_head.py:23 -- tune module is not available: No module named 'tensorboard'
2022-05-30 17:46:07,204 INFO utils.py:132 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.log.log_head.LogHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.serve.serve_head.ServeHead'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.tune.tune_head.TuneController'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
2022-05-30 17:46:07,204 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
2022-05-30 17:46:07,204 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
2022-05-30 17:46:07,205 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>
2022-05-30 17:46:07,205 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.log.log_head.LogHead'>
2022-05-30 17:46:07,206 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>
2022-05-30 17:46:07,207 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>
2022-05-30 17:46:07,208 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.serve.serve_head.ServeHead'>
2022-05-30 17:46:07,208 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>
2022-05-30 17:46:07,208 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.tune.tune_head.TuneController'>
2022-05-30 17:46:07,208 INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>
2022-05-30 17:46:07,216 INFO head.py:188 -- Loaded 10 modules.
2022-05-30 17:46:07,222 INFO http_server_head.py:61 -- Setup static dir for dashboard: /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/client/build
2022-05-30 17:46:07,240 INFO http_server_head.py:132 -- Dashboard head http address: 127.0.0.1:8265
2022-05-30 17:46:07,241 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /logical/actor_groups> -> <function ActorHead.get_actor_groups at 0x404db8d4d0>
2022-05-30 17:46:07,241 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /logical/actors> -> <function ActorHead.get_all_actors[cache ttl=2, max_size=128] at 0x404db8d680>
2022-05-30 17:46:07,242 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /logical/kill_actor> -> <function ActorHead.kill_actor at 0x404db8d830>
2022-05-30 17:46:07,242 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /events> -> <function EventHead.get_event[cache ttl=2, max_size=128] at 0x404dbb2b90>
2022-05-30 17:46:07,242 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/version> -> <function JobHead.get_version at 0x404dc0ee60>
2022-05-30 17:46:07,242 INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource /api/packages/{protocol}/{package_name}> -> <function JobHead.get_package at 0x404dc15b90>
2022-05-30 17:46:07,243 INFO http_server_head.py:137 -- <ResourceRoute [PUT] <DynamicResource /api/packages/{protocol}/{package_name}> -> <function JobHead.upload_package at 0x404dc15d40>
2022-05-30 17:46:07,243 INFO http_server_head.py:137 -- <ResourceRoute [POST] <PlainResource /api/jobs/> -> <function JobHead.submit_job at 0x404dc15ef0>
2022-05-30 17:46:07,243 INFO http_server_head.py:137 -- <ResourceRoute [POST] <DynamicResource /api/jobs/{job_id}/stop> -> <function JobHead.stop_job at 0x404dc180e0>
2022-05-30 17:46:07,243 INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource /api/jobs/{job_id}> -> <function JobHead.get_job_info at 0x404dc18290>
2022-05-30 17:46:07,244 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/jobs/> -> <function JobHead.list_jobs at 0x404dc18440>
2022-05-30 17:46:07,244 INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource /api/jobs/{job_id}/logs> -> <function JobHead.get_job_logs at 0x404dc185f0>
2022-05-30 17:46:07,244 INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource /api/jobs/{job_id}/logs/tail> -> <function JobHead.tail_job_logs at 0x404dc187a0>
2022-05-30 17:46:07,244 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /log_index> -> <function LogHead.get_log_index at 0x404dc1e9e0>
2022-05-30 17:46:07,245 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /log_proxy> -> <function LogHead.get_log_from_proxy at 0x404dc1eb00>
2022-05-30 17:46:07,245 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /nodes> -> <function NodeHead.get_all_nodes[cache ttl=2, max_size=128] at 0x404dc29320>
2022-05-30 17:46:07,245 INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource /nodes/{node_id}> -> <function NodeHead.get_node[cache ttl=2, max_size=128] at 0x404dc29560>
2022-05-30 17:46:07,245 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /memory/memory_table> -> <function NodeHead.get_memory_table at 0x404dc29710>
2022-05-30 17:46:07,246 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /memory/set_fetch> -> <function NodeHead.set_fetch_memory_info at 0x404dc29830>
2022-05-30 17:46:07,246 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /node_logs> -> <function NodeHead.get_logs at 0x404dc29950>
2022-05-30 17:46:07,246 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /node_errors> -> <function NodeHead.get_errors at 0x404dc29a70>
2022-05-30 17:46:07,246 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/launch_profiling> -> <function ReportHead.launch_profiling at 0x404e023830>
2022-05-30 17:46:07,247 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/ray_config> -> <function ReportHead.get_ray_config at 0x404e023950>
2022-05-30 17:46:07,247 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/cluster_status> -> <function ReportHead.get_cluster_status at 0x404e023a70>
2022-05-30 17:46:07,247 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/serve/deployments/> -> <function ServeHead.get_all_deployments at 0x404e039290>
2022-05-30 17:46:07,247 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/serve/deployments/status> -> <function ServeHead.get_all_deployment_statuses at 0x404e039440>
2022-05-30 17:46:07,248 INFO http_server_head.py:137 -- <ResourceRoute [DELETE] <PlainResource /api/serve/deployments/> -> <function ServeHead.delete_serve_application at 0x404e0395f0>
2022-05-30 17:46:07,248 INFO http_server_head.py:137 -- <ResourceRoute [PUT] <PlainResource /api/serve/deployments/> -> <function ServeHead.put_all_deployments at 0x404e0397a0>
2022-05-30 17:46:07,248 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/actors/kill> -> <function APIHead.kill_actor_gcs at 0x404ddfe830>
2022-05-30 17:46:07,248 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /api/snapshot> -> <function APIHead.snapshot at 0x404ddfe950>
2022-05-30 17:46:07,249 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /tune/info> -> <function TuneController.tune_info at 0x4051acd560>
2022-05-30 17:46:07,249 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /tune/availability> -> <function TuneController.get_availability at 0x4051acd680>
2022-05-30 17:46:07,249 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /tune/set_experiment> -> <function TuneController.set_tune_experiment at 0x4051acd7a0>
2022-05-30 17:46:07,249 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /tune/enable_tensorboard> -> <function TuneController.enable_tensorboard at 0x4051acd8c0>
2022-05-30 17:46:07,250 INFO http_server_head.py:137 -- <ResourceRoute [GET] <StaticResource /logs -> PosixPath('/tmp/ray/session_2022-05-30_17-46-00_406068_65/logs')> -> <bound method StaticResource._handle of <StaticResource /logs -> PosixPath('/tmp/ray/session_2022-05-30_17-46-00_406068_65/logs')>>
2022-05-30 17:46:07,250 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /> -> <function HttpServerDashboardHead.get_index at 0x4051adab90>
2022-05-30 17:46:07,250 INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource /favicon.ico> -> <function HttpServerDashboardHead.get_favicon at 0x4051ade4d0>
2022-05-30 17:46:07,251 INFO http_server_head.py:137 -- <ResourceRoute [GET] <StaticResource /static -> PosixPath('/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/client/build/static')> -> <bound method StaticResource._handle of <StaticResource /static -> PosixPath('/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/client/build/static')>>
2022-05-30 17:46:07,251 INFO http_server_head.py:138 -- Registered 38 routes.
2022-05-30 17:46:07,255 INFO datacenter.py:70 -- Purge data.
2022-05-30 17:46:07,258 INFO event_utils.py:127 -- Monitor events logs modified after 1653956165.531906 on /tmp/ray/session_2022-05-30_17-46-00_406068_65/logs/events, the source types are ['GCS'].
2022-05-30 17:46:07,268 INFO usage_stats_head.py:89 -- Usage reporting is disabled.
2022-05-30 17:46:07,269 INFO actor_head.py:105 -- Getting all actor info from GCS.
2022-05-30 17:46:07,282 INFO actor_head.py:131 -- Received 0 actor info from GCS.
2022-05-30 17:46:14,344 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957974.336987089","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957974.336881964","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:15,367 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957975.364356756","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957975.364337922","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:16,387 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957976.379970548","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957976.379939173","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:17,402 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957977.399386965","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957977.399369548","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:18,419 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957978.414999591","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957978.414942966","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:19,440 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957979.435969383","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957979.435943299","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:20,456 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957980.452296841","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957980.452258591","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:21,473 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957981.468743842","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957981.468706634","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
2022-05-30 17:46:22,488 ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
timeout=2,
File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1653957982.485163634","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957982.485120467","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
Versions / Dependencies
ray: 1.12.1
docker: 20.10.14
macOS: 12.4
Reproduction script
import ray
ray.init()
Issue Severity
High: It blocks me from completing my task.
About this issue
- State: closed
- Created 2 years ago
- Comments: 20 (18 by maintainers)
Can we maybe close this, since the docker image is not actually supported on non-amd64 systems?
I’ve managed to narrow down the issue to these lines. It turns out there’s an issue with psutil (giampaolo/psutil#2112) that makes `curr_proc.parent()` return `None` in this setup. The dashboard agent then thinks the raylet is dead and exits, and the raylet exits soon after because of fate-sharing. I don’t think there’s much we can do here other than wait for the upstream bug to be resolved.
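To make the failure mode concrete, here is a minimal sketch of the kind of parent-liveness check described above. It is not Ray's actual code: the helper name `raylet_looks_dead`, its parameters, and the "re-parented to pid 1" rule are assumptions made for illustration. The point is only that a check which treats "no parent found" the same as "parent died" will report a dead raylet whenever the parent lookup itself fails.

```python
from typing import Optional


def raylet_looks_dead(parent_pid: Optional[int], expected_ppid: int) -> bool:
    """Hypothetical helper: decide whether the parent process (the raylet) is gone.

    parent_pid is whatever a process-inspection call reported for the current
    parent, or None if no parent could be determined.
    expected_ppid is the parent pid recorded when the agent started.
    """
    if parent_pid is None:
        # The case from this issue: the lookup returned nothing, so a failed
        # lookup is indistinguishable from a dead raylet.
        return True
    if parent_pid == 1:
        # Assumption: being re-parented to init means the original parent died.
        return True
    # Any parent other than the one recorded at startup also counts as dead.
    return parent_pid != expected_ppid


# 145 is the parent pid reported in dashboard_agent.log above.
print(raylet_looks_dead(145, 145))   # False: parent is still the raylet
print(raylet_looks_dead(None, 145))  # True: lookup failed, agent would exit
```

Under these assumptions, the second call mirrors the sequence in the logs: the agent records parent pid 145, the lookup yields nothing, and the agent logs "Raylet is dead, exiting." even though the raylet is still running.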