ray: Dashboard failures with include_dashboard set to false

Running Tune with A3C fails straight at the beginning with the following traceback:

2020-11-11 14:13:37,114	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False. It might be related to https://github.com/ray-project/ray/issues/11943 but it shouldn’t happen if this flag is set to False, so it’s a different issue.

Ray version and other system information (Python version, TensorFlow version, OS): Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only on both latest master and releases/1.0.1

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

    import ray
    from ray import tune
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init(include_dashboard=False)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 63 (61 by maintainers)

Most upvoted comments

environment variables http_proxy or https_proxy

When I run the following script:

import time
import ray
import ray.services

@ray.remote
def f():
    time.sleep(8)
    return ray.services.get_node_ip_address()

if __name__ == "__main__":
    ray.init(num_cpus=1)
    IPaddresses = set(ray.get([f.remote() for _ in range(4)]))
    print('IPaddresses =', IPaddresses)
    ray.shutdown()

on a machine with http_proxy and https_proxy set, it spits out

Traceback (most recent call last):
  File "ray/new_dashboard/agent.py", line 305, in <module>
    loop.run_until_complete(agent.run())
  File "python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "python3.8/site-packages/ray/new_dashboard/agent.py", line 169, in run
    await raylet_stub.RegisterAgent(
  File "python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4165,"referenced_errors":[{"description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"

This looks like the same error, right? (The sleeps are necessary so that the script doesn’t exit before the error appears; the length of sleep necessary presumably varies by machine.)

Obviously, on any machine with http_proxy and https_proxy set, no_proxy is also going to be set, presumably with localhost and 127.0.0.1…but no_proxy usually won’t include the machine’s external IP address. Ray is using that external IP address from get_node_ip_address().

For my machine, at least, adding the external IP address to no_proxy makes everything go through without that error message.

$ no_proxy="$(hostname -i),$no_proxy" python test_actors.py
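
The same override can be built from inside a script before Ray starts. This is a minimal sketch, not part of Ray: `prepend_no_proxy` is a hypothetical helper, and the caller is assumed to already know the node address.

```python
import os


def prepend_no_proxy(ips, env=None):
    """Return a no_proxy value with `ips` placed ahead of the existing entries."""
    env = os.environ if env is None else env
    existing = [e.strip() for e in env.get("no_proxy", "").split(",") if e.strip()]
    merged = []
    for entry in list(ips) + existing:
        if entry not in merged:  # drop duplicates, keep the first occurrence
            merged.append(entry)
    return ",".join(merged)
```

Assigning the result to `os.environ["no_proxy"]` before calling `ray.init()` should have the same effect as the shell one-liner, since child processes inherit the environment.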

I think @fyrestone hit the nail on the head.

Unfortunately, the problem being diagnosed is not the same thing as the problem being solved. Setting no_proxy that way works for a simple standalone script like that one, but for the more complicated operations such as ray start and tune, the new processes don’t get started with the new value of no_proxy, even if you export no_proxy. The new processes must pull the values of the variables from some deeper level when they get started up, and I’m not sure where. Not .bashrc, I assume, since these new processes aren’t starting in shells as such.

~~Looking at https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1438, the dashboard process doesn’t have shell=True, so I’m really not sure where it’s pulling the proxy information from. And yet setting no_proxy on the command line works when running a simple ray.init() script…? https://stackoverflow.com/questions/12060863/python-subprocess-call-a-bash-alias~~

It’s sort of baffling, because actors are also separate processes, but apparently those actors started from that script do somehow inherit the value of no_proxy. (export no_proxy="$(hostname -i),$no_proxy" makes that script go through just fine; it doesn’t matter whether no_proxy is set on the same line, no_proxy just needs to be set.) Yet other workers created by ray start do not, so that

$ export no_proxy="$(hostname -i),$no_proxy"
$ ray start

still results in workers spitting out that error.

All the various processes started by Ray inherit no_proxy. I dunno how, but they do. You do need to set no_proxy on all machines involved, though, with the numerical IP addresses of all machines involved (comma-separated), including its own. Remember that the IP address by which one machine can find another machine is not necessarily the same IP address that hostname -i brings up on the target machine.

You might not know in advance the IP address of every machine that will be joining. But you could probably brute-force that by just adding every IP address ending in a number to no_proxy:

no_proxy="0,1,2,3,4,5,6,7,8,9,$no_proxy" python test_actors.py
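
The trick relies on no_proxy entries being matched as suffixes of the target host, so the single character "8" would cover any address ending in 8. The following is a simplified model of that matching. It is an assumption about client behavior, not a copy of gRPC's actual logic, and `bypasses_proxy` is a hypothetical name.

```python
def bypasses_proxy(host, no_proxy):
    """Return True if `host` ends with any comma-separated entry in `no_proxy`."""
    entries = [e.strip() for e in no_proxy.split(",") if e.strip()]
    return any(host.endswith(entry) for entry in entries)
```

Under this model every dotted-quad IPv4 address is exempt, while hostnames ending in a letter still go through the proxy.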

(Presumably, the problem only happens because we’re using raw numerical IP addresses; presumably, no_proxy is already set to cover all relevant domains.) We could add that to the documentation as a recommendation to just always do.

I’m not sure what to do about this with respect to the port-checking documentation. netcat and nmap (naturally, I think?) completely ignore http_proxy and https_proxy for non-HTTP traffic. (This isn’t HTTP traffic, is it? This is a metric-export thing and that’s why it happens even with the dashboard disabled? I’m not sure why gRPC is using the proxy settings. I’m guessing there are some kind of gRPC-over-HTTP shenanigans going on, for some reason?)

(Okay, I guess they just always ignore proxy settings.

$ http_proxy=http://some.random.proxy:80 https_proxy=http://some.random.proxy:80 nc -vv -z www.google.com 80
Connection to www.google.com 80 port [tcp/http] succeeded!
$ http_proxy=http://some.random.proxy:80 https_proxy=http://some.random.proxy:80 nmap -p 80 www.google.com
PORT   STATE SERVICE
80/tcp open  http

I still don’t get why gRPC is using the proxy. I assume this is dashboard-specific somehow, since nothing else goes wrong if you run http_proxy=http://some.imaginary.proxy:80 https_proxy=http://some.imaginary.proxy:80 python test_actors.py, just the dashboard thing. You even still get the correct answer, despite the error messages the dashboard is spitting out.)

(To be clear, if you literally use an imaginary proxy like http://some.random.proxy:80, you’ll get a different error message. But the computation will still go through, so it’s only the dashboard gRPC thing that’s looking at http_proxy.)
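
A plain TCP probe from Python behaves like nc and nmap here: the standard `socket` module does not consult http_proxy or https_proxy. A small sketch, where `can_connect` is a hypothetical helper:

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Try a direct TCP connection; proxy environment variables play no part."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

A probe like this succeeding while the gRPC call fails would point at the proxy settings rather than at a closed port.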

In any case, we could have an error message that gives the IP and port that failed to reach, possibly with a suggestion to add them to no_proxy.
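
Such a message could look roughly like the sketch below. The function name and wording are hypothetical, and it checks no_proxy by exact match only, which is stricter than what a real client may do.

```python
def unreachable_hint(ip, port, env):
    """Build an error message for a failed connection, with a no_proxy hint if relevant."""
    message = f"Failed to connect to {ip}:{port}."
    proxy_set = any(env.get(name) for name in ("http_proxy", "https_proxy"))
    exempt = [e.strip() for e in env.get("no_proxy", "").split(",") if e.strip()]
    if proxy_set and ip not in exempt:
        message += f" A proxy is configured; consider adding {ip} to no_proxy."
    return message
```

Passing `os.environ` as `env` would produce the hint only on machines where a proxy is actually configured.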

This looks like a bad bug. @mfitton can you take a look at it?

I was having a similar issue for a couple of hours yesterday and your comments enlightened me a lot. Thanks! I am using a GPU server at the university and the dashboard was not a priority for me, so I tried to set it to False, but as you already experienced, it didn’t work. Setting http_proxy and https_proxy to correct values didn’t do the trick either. In the end, my tmux pane was spammed with dashboard-related warnings even though it was set to False. In order to get rid of these nasty warning messages, I made a modification to the file at envs/myenv/lib/python3.7/site-packages/grpc/aio/_call.py to stop printing out warnings. Precisely, I commented out the else branch at line 285, as below:

        if response is cygrpc.EOF:
            if self._cython_call.is_locally_cancelled():
                raise asyncio.CancelledError()
            #else:
                #raise _create_rpc_error(self._cython_call._initial_metadata, self._cython_call._status)
        else:
            return response

It resolved the cluttered tmux pane problem, and I doubt that I will have any consequences due to this dirty workaround. Could you confirm whether that is actually the case?

We should fix this ASAP

@fyrestone @dHannasch - while this works great now when started with ray start, it seems that when started with ray.init(…) only, the arguments aren’t passed correctly to the dashboard, creating a different issue at startup. Did you happen to test that setup as well…?

It’s not a blocker for me at the moment, although I think it’s a low hanging fruit to make this fix complete. Otherwise, this issue can be closed.

According to the logs, I found two problems:

  1. The Prometheus exporter has a port conflict, so the agent exits with OSError: [Errno 98] Address already in use.
  2. The agent can’t register with the raylet using ip = ray._private.services.get_node_ip_address(), port = node_manager_port. The agent then exits with grpc.experimental.aio._call.AioRpcError.
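
The first failure can be checked for before a node is started. This is a rough sketch that assumes the metrics export port is known in advance. `port_is_free` is a hypothetical helper, and the check is inherently racy because another process may grab the port right afterwards.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:  # e.g. [Errno 98] Address already in use
            return False
    return True
```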

@mfitton - I’ll try to provide some more information in the meantime:

This is how I setup ray cluster (handshakes between nodes):

Head (inside docker):

ray start --block --head --port=$redis_port --redis-password=$redis_password --node-ip-address=$head_node_ip \
    --gcs-server-port=6005 --dashboard-port=6006 --node-manager-port=6007 --object-manager-port=6008 \
    --redis-shard-ports=6400,6401,6402,6403,6404,6405,6406,6407,6408,6409 --min-worker-port=6100 --max-worker-port=6299 --include-dashboard=false

Worker Nodes (inside docker, different machine(s)):

ray start --block --address=$head_node_ip:$redis_port --redis-password=$redis_password --node-ip-address=$worker_node_ip \
    --node-manager-port=6007 --object-manager-port=6008 --min-worker-port=6100 --max-worker-port=6299

After I do that, I call:

ray.init(address=f"{head_node_ip}:{redis_port}", _redis_password=redis_password)
tune.run(
    A3CTrainer,
    config=<any config>,
    stop={
        "timesteps_total": 50e6,
    },
)

The tune.run() part can be found in the examples, including environment implementations. Alternatively, this also reproduces without setting up a Ray cluster, as I described in the body of this issue (above).

The console shows a stream of exceptions that all look similar (note that this time I captured a more informative one than the one attached to the body of this issue; the *** parts are redacted for security reasons):

2020-11-12 16:53:56,179	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
    c = cls(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    options=option, gatherer=option.registry, collector=collector)
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
    self.serve_http()
  File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
    port=self.options.port, addr=str(self.options.address))
  File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
    httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
    server = server_class((host, port), handler_class)
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use

(pid=raylet, ip=***) Traceback (most recent call last):
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 308, in <module>
(pid=raylet, ip=***)     raise e
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
(pid=raylet, ip=***)     loop.run_until_complete(agent.run())
(pid=raylet, ip=***)   File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
(pid=raylet, ip=***)     return future.result()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 123, in run
(pid=raylet, ip=***)     modules = self._load_modules()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 82, in _load_modules
(pid=raylet, ip=***)     c = cls(self)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
(pid=raylet, ip=***)     self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/metrics_agent.py", line 42, in __init__
(pid=raylet, ip=***)     namespace="ray", port=metrics_export_port)))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
(pid=raylet, ip=***)     options=option, gatherer=option.registry, collector=collector)
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 266, in __init__
(pid=raylet, ip=***)     self.serve_http()
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/ray/prometheus_exporter.py", line 321, in serve_http
(pid=raylet, ip=***)     port=self.options.port, addr=str(self.options.address))
(pid=raylet, ip=***)   File "/usr/local/lib/python3.6/dist-packages/prometheus_client/exposition.py", line 78, in start_wsgi_server
(pid=raylet, ip=***)     httpd = make_server(addr, port, app, ThreadingWSGIServer, handler_class=_SilentHandler)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 153, in make_server
(pid=raylet, ip=***)     server = server_class((host, port), handler_class)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
(pid=raylet, ip=***)     self.server_bind()
(pid=raylet, ip=***)   File "/usr/lib/python3.6/wsgiref/simple_server.py", line 50, in server_bind
(pid=raylet, ip=***)     HTTPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
(pid=raylet, ip=***)     socketserver.TCPServer.server_bind(self)
(pid=raylet, ip=***)   File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
(pid=raylet, ip=***)     self.socket.bind(self.server_address)
(pid=raylet, ip=***) OSError: [Errno 98] Address already in use
2020-11-12 16:53:56,392	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605218036.477366833","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605218036.477361267","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"

@roireshef Can you help confirm whether the environment variables http_proxy or https_proxy are set?

The GCS logs:

[2020-11-19 15:33:29,110 I 7616 7616] grpc_server.cc:74: GcsServer server started, listening on port 6005.
[2020-11-19 15:33:29,118 I 7616 7616] gcs_server.cc:273: Gcs server address = 10.67.34.148:6005
[2020-11-19 15:33:29,118 I 7616 7616] gcs_server.cc:277: Finished setting gcs server address: 10.67.34.148:6005

The Dashboard head logs:

2020-11-19 15:33:29,615	INFO head.py:161 -- Connect to GCS at b'10.67.34.148:6005'

2020-11-19 15:33:29,940	ERROR head.py:108 -- Got AioRpcError when updating nodes.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/head.py", line 74, in _update_nodes
    nodes = await self._get_nodes()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/head.py", line 61, in _get_nodes
    request, timeout=2)
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605792809.939876173","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605792809.939868059","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

It seems that the GRPC Python client uses the correct address, but can’t connect to the GRPC server.

@roireshef Does your worker node have multiple network interface cards? I guess the second problem arises because the agent connects to the raylet with ip = ray._private.services.get_node_ip_address(), and that IP differs from the one in your worker command, --node-ip-address=$worker_node_ip.

As said, I’m working in docker containers, which abstract away the node’s external IP (the only one that is valid to use if you were to connect from another machine). For applications running inside a docker container and asking for an IP (and I’m assuming what you’re doing there is similar to running “ifconfig” or “hostname” in the bash shell), the docker will provide a different “virtual” IP that is accessible only from that same machine (or even only from within the same docker container, I’m not entirely sure).
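
For illustration, one common way for a process to discover "its own" address is to ask the OS which source address it would use for an outbound route. Whether Ray's get_node_ip_address() works exactly like this is an assumption here, and `detect_local_ip` is a hypothetical name.

```python
import socket


def detect_local_ip(probe=("8.8.8.8", 53)):
    """Return the source address the OS would use to reach `probe`."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends nothing; it only selects a route.
        s.connect(probe)
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no usable route, fall back to loopback
    finally:
        s.close()
```

Inside a container with its own network namespace, a lookup like this returns the container's virtual interface address rather than the host's external one, which matches the mismatch described above.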

Since the valid node’s IP is already passed in --node-ip-address=$worker_node_ip, why isn’t that the only IP used across all services? In case the user has already provided the application with the “right IP” of the machine, wouldn’t propagating it across all services be the right thing to do here?

Thanks. I will create a PR to fix (2) by passing the --node-ip-address value to the agent.
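
The idea of preferring a user-supplied address over auto-detection can be sketched as follows. This is not the actual patch: `resolve_agent_ip` and its `detect` callback are hypothetical, and the real agent's argument handling may differ.

```python
import argparse


def resolve_agent_ip(argv, detect):
    """Prefer an explicitly passed --node-ip-address over auto-detection."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--node-ip-address", default=None)
    args, _ = parser.parse_known_args(argv)  # ignore unrelated flags
    return args.node_ip_address or detect()
```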

Also, about the solution: we can probably collect stats only when include_dashboard is set to True. Otherwise, we start only the agents and stop collecting stats from the endpoints.

I think that’s not ideal though. I can imagine users who want to export metrics without having the dashboard.

I’ve noticed this is because in the new dashboard architecture we start up the dashboard agent regardless of whether include_dashboard is specified. This could be because the dashboard agent is the entity that receives Ray stats via gRPC for export to Prometheus.

@fyrestone I’m planning on creating a PR to make the dashboard agent not start when include_dashboard is false. Am I missing any issues that doing this could cause?
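
One possible way to reconcile this with the concern about users who export metrics without the dashboard is to gate the agent on either setting. The sketch below is purely illustrative: the function is hypothetical, and treating an unset metrics_export_port as "no export requested" is an assumption, not Ray's actual behavior.

```python
def should_start_agent(include_dashboard, metrics_export_port=None):
    """Start the agent if the dashboard is on or metrics export was requested."""
    return bool(include_dashboard) or metrics_export_port is not None
```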

This actually happens even if I run “ray start” with “--include-dashboard=false”