ray: [Core] `ray.init()` errors inside docker container on M1

What happened + What you expected to happen

Inside a clean rayproject/ray:1.12.1 container on an M1 Mac, ray.init() fails with:

[2022-05-31 00:25:34,784 E 915 915] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

raylet.err:

[2022-05-30 17:46:13,392 E 145 202] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

dashboard_agent.log:

2022-05-30 17:46:13,079	INFO agent.py:83 -- Parent pid is 145
2022-05-30 17:46:13,120	INFO agent.py:109 -- Dashboard agent grpc address: 0.0.0.0:65046
2022-05-30 17:46:13,125	ERROR agent.py:150 -- Raylet is dead, exiting.

dashboard.log:

2022-05-30 17:46:05,029	INFO head.py:122 -- Dashboard head grpc address: 0.0.0.0:34177
2022-05-30 17:46:05,084	INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-30 17:46:07,199	WARNING tune_head.py:23 -- tune module is not available: No module named 'tensorboard'
2022-05-30 17:46:07,204	INFO utils.py:132 -- Available modules: [<class 'ray.dashboard.modules.actor.actor_head.ActorHead'>, <class 'ray.dashboard.modules.event.event_head.EventHead'>, <class 'ray.dashboard.modules.job.job_head.JobHead'>, <class 'ray.dashboard.modules.log.log_head.LogHead'>, <class 'ray.dashboard.modules.node.node_head.NodeHead'>, <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>, <class 'ray.dashboard.modules.serve.serve_head.ServeHead'>, <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>, <class 'ray.dashboard.modules.tune.tune_head.TuneController'>, <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>]
2022-05-30 17:46:07,204	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
2022-05-30 17:46:07,204	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
2022-05-30 17:46:07,205	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>
2022-05-30 17:46:07,205	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.log.log_head.LogHead'>
2022-05-30 17:46:07,206	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>
2022-05-30 17:46:07,207	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>
2022-05-30 17:46:07,208	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.serve.serve_head.ServeHead'>
2022-05-30 17:46:07,208	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>
2022-05-30 17:46:07,208	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.tune.tune_head.TuneController'>
2022-05-30 17:46:07,208	INFO head.py:184 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.usage_stats.usage_stats_head.UsageStatsHead'>
2022-05-30 17:46:07,216	INFO head.py:188 -- Loaded 10 modules.
2022-05-30 17:46:07,222	INFO http_server_head.py:61 -- Setup static dir for dashboard: /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/client/build
2022-05-30 17:46:07,240	INFO http_server_head.py:132 -- Dashboard head http address: 127.0.0.1:8265
2022-05-30 17:46:07,241	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /logical/actor_groups> -> <function ActorHead.get_actor_groups at 0x404db8d4d0>
2022-05-30 17:46:07,241	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /logical/actors> -> <function ActorHead.get_all_actors[cache ttl=2, max_size=128] at 0x404db8d680>
2022-05-30 17:46:07,242	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /logical/kill_actor> -> <function ActorHead.kill_actor at 0x404db8d830>
2022-05-30 17:46:07,242	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /events> -> <function EventHead.get_event[cache ttl=2, max_size=128] at 0x404dbb2b90>
2022-05-30 17:46:07,242	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/version> -> <function JobHead.get_version at 0x404dc0ee60>
2022-05-30 17:46:07,242	INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.get_package at 0x404dc15b90>
2022-05-30 17:46:07,243	INFO http_server_head.py:137 -- <ResourceRoute [PUT] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.upload_package at 0x404dc15d40>
2022-05-30 17:46:07,243	INFO http_server_head.py:137 -- <ResourceRoute [POST] <PlainResource  /api/jobs/> -> <function JobHead.submit_job at 0x404dc15ef0>
2022-05-30 17:46:07,243	INFO http_server_head.py:137 -- <ResourceRoute [POST] <DynamicResource  /api/jobs/{job_id}/stop> -> <function JobHead.stop_job at 0x404dc180e0>
2022-05-30 17:46:07,243	INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}> -> <function JobHead.get_job_info at 0x404dc18290>
2022-05-30 17:46:07,244	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/jobs/> -> <function JobHead.list_jobs at 0x404dc18440>
2022-05-30 17:46:07,244	INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}/logs> -> <function JobHead.get_job_logs at 0x404dc185f0>
2022-05-30 17:46:07,244	INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}/logs/tail> -> <function JobHead.tail_job_logs at 0x404dc187a0>
2022-05-30 17:46:07,244	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /log_index> -> <function LogHead.get_log_index at 0x404dc1e9e0>
2022-05-30 17:46:07,245	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /log_proxy> -> <function LogHead.get_log_from_proxy at 0x404dc1eb00>
2022-05-30 17:46:07,245	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /nodes> -> <function NodeHead.get_all_nodes[cache ttl=2, max_size=128] at 0x404dc29320>
2022-05-30 17:46:07,245	INFO http_server_head.py:137 -- <ResourceRoute [GET] <DynamicResource  /nodes/{node_id}> -> <function NodeHead.get_node[cache ttl=2, max_size=128] at 0x404dc29560>
2022-05-30 17:46:07,245	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /memory/memory_table> -> <function NodeHead.get_memory_table at 0x404dc29710>
2022-05-30 17:46:07,246	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /memory/set_fetch> -> <function NodeHead.set_fetch_memory_info at 0x404dc29830>
2022-05-30 17:46:07,246	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /node_logs> -> <function NodeHead.get_logs at 0x404dc29950>
2022-05-30 17:46:07,246	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /node_errors> -> <function NodeHead.get_errors at 0x404dc29a70>
2022-05-30 17:46:07,246	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/launch_profiling> -> <function ReportHead.launch_profiling at 0x404e023830>
2022-05-30 17:46:07,247	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/ray_config> -> <function ReportHead.get_ray_config at 0x404e023950>
2022-05-30 17:46:07,247	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/cluster_status> -> <function ReportHead.get_cluster_status at 0x404e023a70>
2022-05-30 17:46:07,247	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/serve/deployments/> -> <function ServeHead.get_all_deployments at 0x404e039290>
2022-05-30 17:46:07,247	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/serve/deployments/status> -> <function ServeHead.get_all_deployment_statuses at 0x404e039440>
2022-05-30 17:46:07,248	INFO http_server_head.py:137 -- <ResourceRoute [DELETE] <PlainResource  /api/serve/deployments/> -> <function ServeHead.delete_serve_application at 0x404e0395f0>
2022-05-30 17:46:07,248	INFO http_server_head.py:137 -- <ResourceRoute [PUT] <PlainResource  /api/serve/deployments/> -> <function ServeHead.put_all_deployments at 0x404e0397a0>
2022-05-30 17:46:07,248	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/actors/kill> -> <function APIHead.kill_actor_gcs at 0x404ddfe830>
2022-05-30 17:46:07,248	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /api/snapshot> -> <function APIHead.snapshot at 0x404ddfe950>
2022-05-30 17:46:07,249	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /tune/info> -> <function TuneController.tune_info at 0x4051acd560>
2022-05-30 17:46:07,249	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /tune/availability> -> <function TuneController.get_availability at 0x4051acd680>
2022-05-30 17:46:07,249	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /tune/set_experiment> -> <function TuneController.set_tune_experiment at 0x4051acd7a0>
2022-05-30 17:46:07,249	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /tune/enable_tensorboard> -> <function TuneController.enable_tensorboard at 0x4051acd8c0>
2022-05-30 17:46:07,250	INFO http_server_head.py:137 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-05-30_17-46-00_406068_65/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-05-30_17-46-00_406068_65/logs')>>
2022-05-30 17:46:07,250	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /> -> <function HttpServerDashboardHead.get_index at 0x4051adab90>
2022-05-30 17:46:07,250	INFO http_server_head.py:137 -- <ResourceRoute [GET] <PlainResource  /favicon.ico> -> <function HttpServerDashboardHead.get_favicon at 0x4051ade4d0>
2022-05-30 17:46:07,251	INFO http_server_head.py:137 -- <ResourceRoute [GET] <StaticResource  /static -> PosixPath('/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/client/build/static')> -> <bound method StaticResource._handle of <StaticResource  /static -> PosixPath('/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/client/build/static')>>
2022-05-30 17:46:07,251	INFO http_server_head.py:138 -- Registered 38 routes.
2022-05-30 17:46:07,255	INFO datacenter.py:70 -- Purge data.
2022-05-30 17:46:07,258	INFO event_utils.py:127 -- Monitor events logs modified after 1653956165.531906 on /tmp/ray/session_2022-05-30_17-46-00_406068_65/logs/events, the source types are ['GCS'].
2022-05-30 17:46:07,268	INFO usage_stats_head.py:89 -- Usage reporting is disabled.
2022-05-30 17:46:07,269	INFO actor_head.py:105 -- Getting all actor info from GCS.
2022-05-30 17:46:07,282	INFO actor_head.py:131 -- Received 0 actor info from GCS.
2022-05-30 17:46:14,344	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957974.336987089","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957974.336881964","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:15,367	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957975.364356756","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957975.364337922","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:16,387	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957976.379970548","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957976.379939173","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:17,402	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957977.399386965","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957977.399369548","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:18,419	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957978.414999591","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957978.414942966","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:19,440	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957979.435969383","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957979.435943299","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:20,456	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957980.452296841","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957980.452258591","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:21,473	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957981.468743842","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957981.468706634","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

2022-05-30 17:46:22,488	ERROR node_head.py:259 -- Error updating node stats of 871aa1d2a051174425cfcde498e5d2ddd74afc210dd0a09b389cdd1b.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 254, in _update_node_stats
    timeout=2,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/aio/_call.py", line 291, in __await__
    self._cython_call._status)
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1653957982.485163634","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1653957982.485120467","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"

Versions / Dependencies

ray: 1.12.1
docker: 20.10.14
macOS: 12.4

Reproduction script

import ray
ray.init()

Issue Severity

High: It blocks me from completing my task.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 20 (18 by maintainers)

Most upvoted comments

Can we maybe close this, since the Docker image is not actually supported on non-amd64 systems?

I’ve managed to narrow down the issue to these lines. It turns out there’s an issue with psutil (giampaolo/psutil#2112) that makes curr_proc.parent() return None in this setup. The dashboard agent then thinks the raylet is dead and exits, and the raylet exits soon after because of fate-sharing. I don’t think there’s much we can do here other than wait for the upstream bug to be resolved.
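
To make the failure mode concrete, here is a minimal sketch of a parent-liveness check of the kind described above. It is not Ray's actual agent code: the function name `parent_looks_dead` and its parameters are hypothetical, and the only behavior taken from this thread is that a parent lookup returning None gets treated as "raylet is dead". The pid 145 in the usage note is the "Parent pid" value from dashboard_agent.log.

```python
from typing import Optional


def parent_looks_dead(expected_ppid: int, observed_ppid: Optional[int]) -> bool:
    """Return True when a watcher would conclude that its parent is gone.

    expected_ppid: the parent pid recorded at startup (hypothetical parameter).
    observed_ppid: the pid reported by a later parent lookup, or None when the
        lookup returns nothing, as reported above for curr_proc.parent().
    """
    if observed_ppid is None:
        # A failed lookup is indistinguishable from a dead parent here,
        # so the watcher exits even though the parent is still running.
        return True
    # Assumption: a different parent pid means the original parent exited.
    return observed_ppid != expected_ppid
```

With a healthy lookup, `parent_looks_dead(145, 145)` is False and the agent keeps running. When the lookup yields None, as in the emulated container, `parent_looks_dead(145, None)` is True even though the raylet is alive, which matches the "Raylet is dead, exiting." line in dashboard_agent.log.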