ray: Dashboard failures with include_dashboard set to false
Running Tune with A3C fails right at the beginning with the following traceback:
2020-11-11 14:13:37,114 WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
loop.run_until_complete(agent.run())
File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
agent_ip_address=self.ip))
File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
**This is clearly a dashboard-related exception, which is unexpected since `include_dashboard` is set to False. It might be related to https://github.com/ray-project/ray/issues/11943, but that shouldn't happen when this flag is set to False, so it seems to be a different issue.**
Ray version and other system information (Python version, TensorFlow version, OS): Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only on both latest master and releases/1.0.1
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
import ray
from ray import tune
from ray.rllib.agents.a3c import A3CTrainer

ray.init(include_dashboard=False)
tune.run(
    A3CTrainer,
    config=<any config>,  # placeholder from the original report
    stop={
        "timesteps_total": 50e6,
    },
)
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 63 (61 by maintainers)
When I run a simple actor script (the test_actors.py referred to later) on a machine with http_proxy and https_proxy set, it spits out an error. This looks like the same error, right? (The sleeps are necessary so that the script doesn't exit before the error appears; the length of sleep needed presumably varies by machine.)
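For context, a hypothetical reconstruction of the kind of script being discussed — the original was not preserved in this thread, so the actor, its name, and the sleep length are purely illustrative:

```python
import time

import ray


@ray.remote
class Counter:
    """Trivial actor, just to force Ray to start worker processes."""

    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n


ray.init()
counter = Counter.remote()
print(ray.get(counter.increment.remote()))

# Keep the driver alive long enough for the dashboard agent's error to show up.
time.sleep(30)
```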
Obviously, on any machine with http_proxy and https_proxy set, no_proxy is also going to be set, presumably with localhost and 127.0.0.1… but no_proxy usually won't include the machine's external IP address. Ray is using that external IP address from `get_node_ip_address()`. On my machine, at least, adding the external IP address to no_proxy makes everything go through without that error message.
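As an illustration only (this is one reading of the workaround above, not an official recommendation), the same thing can be done from Python before Ray starts any child processes; `get_node_ip_address` is the function already referenced in this thread:

```python
import os

import ray
from ray._private.services import get_node_ip_address

# Add this node's externally visible IP to no_proxy so gRPC connections to it
# bypass http_proxy/https_proxy.
node_ip = get_node_ip_address()
os.environ["no_proxy"] = ",".join(
    filter(None, [os.environ.get("no_proxy", ""), "localhost", "127.0.0.1", node_ip])
)

ray.init(include_dashboard=False)
```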
I think @fyrestone hit the nail on the head.
Unfortunately, the problem being diagnosed is not the same thing as the problem being solved. Setting no_proxy that way works for a simple standalone script like that one, but for more complicated operations such as `ray start` and tune, the new processes don't get started with the new value of no_proxy, even if you `export no_proxy`. The new processes must pull the values of the variables from some deeper level when they get started up, and I'm not sure where. Not .bashrc, I assume, since these new processes aren't starting in shells as such. ~~Looking at https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1438, the dashboard process doesn't have shell=True, so I'm really not sure where it's pulling the proxy information from. And yet setting no_proxy on the command line works when running a simple ray.init() script…? https://stackoverflow.com/questions/12060863/python-subprocess-call-a-bash-alias~~
It's sort of baffling, because actors are also separate processes, but apparently the actors started from that script do somehow inherit the value of no_proxy. (`export no_proxy="$(hostname -i),$no_proxy"` makes that script go through just fine; it doesn't matter whether no_proxy is set on the same line, no_proxy just needs to be set.) Yet other workers created by `ray start` do not, so that still results in workers spitting out that error. All the various processes started by Ray inherit no_proxy. I dunno how, but they do. You do need to set no_proxy on all machines involved, though, with the numerical IP addresses of all machines involved (comma-separated), including the machine's own address. Remember that the IP address by which one machine can find another machine is not necessarily the same IP address that `hostname -i` brings up on the target machine. You might not know in advance the IP address of every machine that will be joining, but you could probably brute-force that by just adding every IP address ending in a number to no_proxy.
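One possible reading of that brute-force idea, sketched below purely as an illustration: it relies on no_proxy entries being suffix-matched (which, as far as I can tell, gRPC's proxy handling does, though not every HTTP client follows the same rule), so listing the ten digits would match any raw numerical address.

```python
import os

# Caution: this is a blunt instrument. If no_proxy is suffix-matched, the entries
# "0".."9" match any host that ends in a digit, i.e. effectively every numeric IP.
digits = ",".join(str(d) for d in range(10))
os.environ["no_proxy"] = ",".join(
    filter(None, [os.environ.get("no_proxy", ""), digits])
)
```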
(Presumably, the problem only happens because we're using raw numerical IP addresses; presumably, no_proxy is already set to cover all relevant domains.) We could add that to the documentation as a blanket recommendation.
I'm not sure what to do about this with respect to the port-checking documentation. netcat and nmap (naturally, I think?) completely ignore http_proxy and https_proxy for non-HTTP traffic. (This isn't HTTP traffic, is it? This is a metric-export thing, and that's why it happens even with the dashboard disabled? I'm not sure why gRPC is using the proxy settings. I'm guessing there are some kind of gRPC-over-HTTP shenanigans going on, for some reason?)
(Okay, I guess they just always ignore proxy settings.)
I still don't get why gRPC is using the proxy. I assume this is dashboard-specific somehow, since nothing else goes wrong if you run `http_proxy=http://some.imaginary.proxy:80 https_proxy=http://some.imaginary.proxy:80 python test_actors.py`; only the dashboard thing fails. You even still get the correct answer, despite the error messages the dashboard is spitting out. (To be clear, if you literally use an imaginary proxy like http://some.random.proxy:80, you'll get a different error message. But the computation will still go through, so it's only the dashboard gRPC thing that's looking at http_proxy.) In any case, we could have an error message that gives the IP and port that could not be reached, possibly with a suggestion to add them to no_proxy.
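For what it's worth, gRPC channels do consult http_proxy/https_proxy by default, even for plain gRPC-over-HTTP/2 traffic, and that behaviour can be turned off per channel with the `grpc.enable_http_proxy` channel argument. A minimal sketch of that option (the address is illustrative; this is a generic gRPC knob, not something Ray currently exposes):

```python
import grpc

# Create a channel that ignores any configured HTTP proxy. Ray's agent would be
# connecting to <node ip>:<node-manager-port>; 10.0.0.5:6007 is just an example.
channel = grpc.insecure_channel(
    "10.0.0.5:6007",
    options=[("grpc.enable_http_proxy", 0)],
)
```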
This looks like a bad bug. @mfitton can you take a look at it?
I was having a similar issue for a couple of hours yesterday and your comments enlightened me a lot. Thanks! I am using a GPU server at my university, and the dashboard was not a priority for me, so I tried setting it to False, but as you have already experienced, that didn't work. Setting `http_proxy` and `https_proxy` to correct values didn't do the trick either. In the end, my tmux pane was spammed with dashboard-related warnings even though the dashboard was set to False. To get rid of these nasty warning messages, I modified the file at `envs/myenv/lib/python3.7/site-packages/grpc/aio/_call.py` so it stops printing warnings; specifically, I commented out the part in the `else` branch at line 285. It resolved the cluttered tmux pane problem, and I doubt I will suffer any consequences from this dirty workaround. Could you confirm whether that is actually the case?
We should fix this ASAP
@fyrestone @dHannasch - while this works great now when started with `ray start`, it seems that when started with `ray.init(…)` only, the arguments aren't passed through to the dashboard properly, which creates a different issue at startup. Did you happen to test that setup as well…?
It's not a blocker for me at the moment, although I think it's low-hanging fruit to make this fix complete. Otherwise, this issue can be closed.
According to the logs, I found two problems:
1. `OSError: [Errno 98] Address already in use`.
2. The agent connects with `ip = ray._private.services.get_node_ip_address()` and `port = node_manager_port`, then exits with `grpc.experimental.aio._call.AioRpcError`.
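As a quick way to check which address the agent will pick on a given node, the function named in (2) can be called directly; a small diagnostic sketch, using the import path referenced in this thread:

```python
import ray._private.services as services

# Compare this with the --node-ip-address passed to `ray start`. If they differ
# (e.g. on a machine with several network interfaces), the agent may try to reach
# the raylet on an address your proxy settings do not bypass.
print(services.get_node_ip_address())
```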
@mfitton - I'll try to provide some more information in the meantime.
This is how I set up the Ray cluster (handshakes between nodes):
Head (inside docker):
ray start --block --head --port=$redis_port --redis-password=$redis_password --node-ip-address=$head_node_ip \
  --gcs-server-port=6005 --dashboard-port=6006 --node-manager-port=6007 --object-manager-port=6008 \
  --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 \
  --min-worker-port=6100 --max-worker-port=6299 --include-dashboard=false
Worker Nodes (inside docker, different machine(s)):
ray start --block --address=$head_node_ip:$redis_port --redis-password=$redis_password --node-ip-address=$worker_node_ip \
  --node-manager-port=6007 --object-manager-port=6008 --min-worker-port=6100 --max-worker-port=6299
After I do that, I call:
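The exact driver call wasn't preserved in this thread; a plausible sketch of connecting to the already-running cluster and launching Tune might look like the following (the config values and the use of address="auto" are illustrative assumptions, and the Redis password would also need to be supplied if one was set):

```python
import ray
from ray import tune
from ray.rllib.agents.a3c import A3CTrainer

# Attach to the cluster started with `ray start` above (the dashboard flag was
# already passed there), instead of starting a new local Ray instance.
ray.init(address="auto")

tune.run(
    A3CTrainer,
    config={"env": "CartPole-v0", "num_workers": 2},  # illustrative config only
    stop={"timesteps_total": 50e6},
)
```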
The tune.run() part you can find in the examples, including environment implementations. Alternatively, this also reproduces without setting up a Ray cluster, as described in the body of this issue (above).
Observing the console produces a flow of exceptions that all look similar (note that this time I captured a more informative one than the one attached to the body of this issue; the *** parts are redacted for security reasons):
@roireshef Can you help confirm whether the environment variables http_proxy or https_proxy exist?
The GCS logs:
The Dashboard head logs:
It seems that the gRPC Python client uses the correct address but can't connect to the gRPC server.
Thanks. I will create a fix PR for (2) by passing the `--node-ip-address` value to the agent.
@roireshef Does your worker node have multiple network interface cards? I guess the second problem is caused by this: the agent connects to the raylet with `ip = ray._private.services.get_node_ip_address()`, and that IP is different from the IP in your worker command, `--node-ip-address=$worker_node_ip`.
If anyone's interested, a temporary fix to disable the dashboard is commenting out these 2 lines:
https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1447 https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1448
Also, about the solution: we could probably collect stats only when include_dashboard is set to True. Otherwise, start only the agents and stop collecting stats from the endpoints.
I think that's not ideal, though. I can imagine users who want to export metrics even though they don't have the dashboard enabled.
I've noticed this is because in the new dashboard architecture we start up the dashboard agent regardless of whether `include_dashboard` is specified. This could be because the dashboard agent is the entity that receives Ray stats via gRPC for export to Prometheus.
@fyrestone I'm planning on creating a PR to make the dashboard agent not start when `include_dashboard` is false. Am I missing any issues that doing this could cause?
This actually happens even if I run `ray start` with `--include-dashboard=false`.