ray: Ray node fails to connect to head, claims redis error despite redis connection working
What is the problem?
Problem summary: We can connect to a redis server located on a head node from a non-head computer via Python, but Ray throws a redis connection error when it tries to connect from the same non-head computer.
Related thread: https://github.com/ray-project/ray/issues/6900
The downgraded version of psutil mentioned does not solve our issue.
(commands on Head and Worker denoted by H$ and W$ respectively)
H$ ray start --head
2020-07-01 11:32:35,976 INFO scripts.py:394 -- Using IP address 192.168.1.13 for this node.
2020-07-01 11:32:36,011 INFO resource_spec.py:204 -- Starting Ray with 6.84 GiB memory available for workers and up to 3.44 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-01 11:32:36,646 INFO services.py:1163 -- View the Ray dashboard at localhost:8265
2020-07-01 11:32:36,733 INFO scripts.py:410 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --address='192.168.1.13:6379' --redis-password='5241590000000000'
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(address='auto', redis_password='5241590000000000')
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
W$ ray start --address='192.168.1.13:6379' --redis-password='5241590000000000'
2020-07-01 11:35:45,744 INFO scripts.py:467 -- Using IP address 192.168.1.216 for this node.
2020-07-01 11:35:45,816 INFO resource_spec.py:204 -- Starting Ray with 6.64 GiB memory available for workers and up to 2.85 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-01 11:35:45,911 INFO scripts.py:477 --
Started Ray on this node. If you wish to terminate the processes that have been started, run
ray stop
W$ nc -vz 192.168.1.13 6379
Connection to 192.168.1.13 6379 port [tcp/*] succeeded! (<-- this is a redis test)
W$ ray timeline
2020-07-01 19:00:24,445 INFO scripts.py:1036 -- Connecting to Ray instance at 192.168.1.13:6379.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0701 19:00:24.472828 2152 2152 global_state_accessor.cc:25] Redis server address = 192.168.1.13:6379, is test flag = 0
W0701 19:00:24.478263 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
...
W0701 19:00:24.483278 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.483487 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.483750 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484005 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484279 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484525 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484755 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484956 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485133 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485313 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485491 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485673 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485865 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486061 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486239 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486409 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486563 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486718 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486873 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487048 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487211 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487366 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487530 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487686 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487840 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487994 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.488168 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
F0701 19:00:24.488487 2152 2152 redis_context.cc:302] Could not establish connection to redis 192.168.1.13:6379 (context.err = 1)
*** Check failure stack trace: ***
Aborted (core dumped)
Some notes:
- All important firewall ports are open.
- We can successfully set & get entries on a non-head node using the redis client object from create_redis_client() in ray/services.py
- We’ve traced the issue down to ray/state.py:87 self.global_state_accessor.connect() failing to return. Raylet code is called from this point on.
- When W connects to H, the ray dashboard on H stops working and throws the following javascript error twice:
react-dom.production.min.js:209:194
TypeError: "e is undefined"
rt Errors.tsx:42
React 6
unstable_runWithPriority scheduler.production.min.js:19
React 4
Redux 6
t Dashboard.tsx:80
u runtime.js:45
_invoke runtime.js:274
t runtime.js:97
Babel 2
r
l
Issue https://github.com/ray-project/ray/issues/9135 has this same javascript error.
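For anyone who wants to repeat the set & get check from the notes above without going through Ray internals, here is a minimal sketch. It uses the redis-py package directly rather than create_redis_client() from ray/services.py, so it is only an approximation of what we ran; the helper names redis_roundtrip and make_client and the probe key are made up for illustration.

```python
def redis_roundtrip(client, key="ray_connectivity_probe", value="ok"):
    """Write a key, read it back, and report whether the value survived."""
    client.set(key, value)
    stored = client.get(key)
    # redis-py returns bytes unless decode_responses=True was requested.
    if isinstance(stored, bytes):
        stored = stored.decode()
    return stored == value


def make_client(host, port, password):
    """Build a plain redis-py client (assumes the `redis` package is installed)."""
    import redis  # imported lazily so the probe itself has no hard dependency

    return redis.Redis(host=host, port=port, password=password)
```

On W this could be called as redis_roundtrip(make_client("192.168.1.13", 6379, "5241590000000000")). A True result only shows that plain redis traffic from W to H works; it says nothing about the connection that global_state_accessor.connect() attempts.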
Ray version and other system information (Python version, TensorFlow version, OS):
Two computers:
Server2019 (HEAD, H):
OS: Windows Server 2019 Version 1809 Build 17763.1282
WSL- Ubuntu 20.04
ray 0.8.6
Python 3.8.2
Tensorflow 2.2.0
Desktop-GTPUF8 (WORKER, W):
OS: Windows 10 Version 1909 Build 18363.900
WSL- Ubuntu 20.04
ray 0.8.6
Python 3.8.2
Tensorflow 2.2.0
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
(commands on Head and Worker denoted by H$ and W$ respectively)
H$ ray start --head
W$ ray start --address='REDIS_ADDR_FROM_PREV_COMMAND' --redis-password='5241590000000000'
W$ ray timeline
- [✔] I have verified my script runs in a clean environment and reproduces the issue.
- [✔] I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 12
- Comments: 25 (15 by maintainers)
Has anybody managed to solve this issue on a Linux machine?
To add to this point: I’m also facing this issue, but on a Linux system. A Ray cluster was started inside a Kubernetes cluster, and connecting to it from another machine on the same network via the Python client fails with W0701 19:00:24.483278 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.
@mehrdadn : Just a brief note on the envs.
I get this message in the console: W0629 17:55:35.521113 31524 253148608 redis_context.cc:307] Failed to connect to Redis, retrying.
But accessing redis via the Python client works.
There is a similar issue on GitHub, where the reporter came up with a workaround by making network changes to make the cluster accessible: https://github.com/ray-project/ray/issues/6108
Just a thought: maybe it is a connection issue where Ray cannot establish a connection back to the client machine, even though the client can communicate with the Ray server/head.
@mehrdadn, it’s the same issue as above; both of us are facing it, and we have discussed it on the Ray Slack channel. It would be fair to see the fix for this issue resolve it on Linux as well. We will wait until the issue is fixed.
Disabling my firewall completely worked for me too (for testing only, of course), so 6379 isn’t the only port that needs to be opened.
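Since more than one port seems to be involved, a quick way to see which ones are reachable from the worker is a small TCP probe, roughly what nc -vz does for a single port. This is only a sketch using the Python standard library; the function name check_ports is made up, and which ports Ray actually needs besides 6379 is not stated in this thread, so the port list is something you have to fill in yourself.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool} saying whether a TCP connect to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, check_ports("192.168.1.13", [6379]) run on W covers the same ground as the nc test in the original report. Note that a successful connect only proves reachability in one direction; it would not catch the case where the head cannot connect back to the worker.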
This is expected as we’ve only worked on single-node support so far on Windows, but I’ll add this to #9114 to track. Thanks for reporting!