ray: Ray node fails to connect to head, claims redis error despite redis connection working

What is the problem?

Problem summary: We can connect to a redis server located on a head node from a non-head computer via python, but ray throws a redis connection error when it tries to connect from said non-head computer.

Related thread: https://github.com/ray-project/ray/issues/6900 The downgraded version of psutil mentioned does not solve our issue.

(commands on Head and Worker denoted by H$ and W$ respectively)

H$ ray start --head
2020-07-01 11:32:35,976 INFO scripts.py:394 -- Using IP address 192.168.1.13 for this node.
2020-07-01 11:32:36,011 INFO resource_spec.py:204 -- Starting Ray with 6.84 GiB memory available for workers and up to 3.44 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-01 11:32:36,646 INFO services.py:1163 -- View the Ray dashboard at localhost:8265
2020-07-01 11:32:36,733 INFO scripts.py:410 --
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --address='192.168.1.13:6379' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop
 
W$ ray start --address='192.168.1.13:6379' --redis-password='5241590000000000'
2020-07-01 11:35:45,744 INFO scripts.py:467 -- Using IP address 192.168.1.216 for this node.
2020-07-01 11:35:45,816 INFO resource_spec.py:204 -- Starting Ray with 6.64 GiB memory available for workers and up to 2.85 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-01 11:35:45,911 INFO scripts.py:477 --
Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop

W$ nc -vz 192.168.1.13 6379
Connection to 192.168.1.13 6379 port [tcp/*] succeeded!   (<-- this is a redis test)

W$ ray timeline
2020-07-01 19:00:24,445 INFO scripts.py:1036 -- Connecting to Ray instance at 192.168.1.13:6379.                                                                                                                                             
WARNING: Logging before InitGoogleLogging() is written to STDERR                                                                                                                                                                             
I0701 19:00:24.472828  2152  2152 global_state_accessor.cc:25] Redis server address = 192.168.1.13:6379, is test flag = 0
W0701 19:00:24.478263  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
...
W0701 19:00:24.483278  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.483487  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.483750  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484005  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484279  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484525  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484755  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.484956  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485133  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485313  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485491  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485673  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.485865  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486061  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486239  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486409  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486563  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486718  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.486873  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487048  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487211  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487366  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487530  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487686  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487840  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.487994  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
W0701 19:00:24.488168  2152  2152 redis_context.cc:307] Failed to connect to Redis, retrying.
F0701 19:00:24.488487  2152  2152 redis_context.cc:302] Could not establish connection to redis 192.168.1.13:6379 (context.err = 1)                                                                                                          
*** Check failure stack trace: ***                                           
Aborted (core dumped) 

Some notes:

  • All important firewall ports are open.
  • We can successfully set & get entries on a non-head node using the redis client object from create_redis_client() in ray/services.py
  • We’ve traced the issue down to ray/state.py:87 self.global_state_accessor.connect() failing to return. Raylet code is called from this point on.
  • When W connects to H, the ray dashboard on H stops working and throws the following javascript error twice:
react-dom.production.min.js:209:194
TypeError: "e is undefined"
    rt Errors.tsx:42
    React 6
    unstable_runWithPriority scheduler.production.min.js:19
    React 4
    Redux 6
    t Dashboard.tsx:80
    u runtime.js:45
    _invoke runtime.js:274
    t runtime.js:97
    Babel 2
        r
        l 

Issue https://github.com/ray-project/ray/issues/9135 has this same javascript error.
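
For anyone wanting to repeat the Redis check from the notes above without Ray or the redis package installed, here is a minimal sketch using only the standard library. Unlike `nc -vz`, which only shows that the TCP port accepts connections, this sends an actual AUTH and PING in the Redis wire protocol. The function name `redis_ping` is hypothetical (it is not part of Ray or redis-py), and the sketch assumes the server answers each command with a single-line reply, which is the case for AUTH and PING.

```python
import socket


def redis_ping(host, port, password=None, timeout=3.0):
    """AUTH (optionally) and PING a Redis server over a plain TCP socket.

    Returns the raw single-line replies, e.g. "+OK" and "+PONG" on success
    or a line starting with "-" on an error reply.
    """

    def command(*parts):
        # Encode a command as a RESP array of bulk strings.
        out = [b"*%d\r\n" % len(parts)]
        for part in parts:
            data = part.encode()
            out.append(b"$%d\r\n%s\r\n" % (len(data), data))
        return b"".join(out)

    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb") as reader:
            if password is not None:
                sock.sendall(command("AUTH", password))
                replies.append(reader.readline().decode().strip())
            sock.sendall(command("PING"))
            replies.append(reader.readline().decode().strip())
    return replies


# Example call from the worker, using the address and password shown above:
# print(redis_ping("192.168.1.13", 6379, password="5241590000000000"))
```

If this succeeds from W while `ray timeline` still fails, that would be consistent with the problem lying somewhere other than basic Redis reachability.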

Ray version and other system information (Python version, TensorFlow version, OS):

Two computers:

Server2019 (HEAD, H):
  OS: Windows Server 2019 Version 1809 Build 17763.1282
  WSL- Ubuntu 20.04
  ray 0.8.6
  Python 3.8.2
  Tensorflow 2.2.0

Desktop-GTPUF8 (WORKER, W):
  OS: Windows 10 Version 1909 Build 18363.900
  WSL- Ubuntu 20.04
  ray 0.8.6 
  Python 3.8.2
  Tensorflow 2.2.0

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

(commands on Head and Worker denoted by H$ and W$ respectively)

H$ ray start --head
W$ ray start --address='REDIS_ADDR_FROM_PREV_COMMAND' --redis-password='5241590000000000'
W$ ray timeline
  • [✔] I have verified my script runs in a clean environment and reproduces the issue.
  • [✔] I have verified the issue also occurs with the latest wheels.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 12
  • Comments: 25 (15 by maintainers)

Most upvoted comments

Has anybody managed to solve this issue on a Linux machine?

To add to this point: I’m also facing this issue, but on a Linux system. The Ray cluster was started inside a Kubernetes cluster, and connecting to it via the Python client from another machine on the same network fails with W0701 19:00:24.483278 2152 2152 redis_context.cc:307] Failed to connect to Redis, retrying.

@mehrdadn : Just a brief summary of the environments.

  1. Ray 0.8.6 - Autoscaler - example-full.yaml in Kubernetes cluster. (Linux)
  2. Trying to access it via the Python client from another VM (Linux, not part of the k8s cluster) on the same network: import ray; ray.init(address='<server_ip>:<exposed_redis_port>', redis_password='5241590000000000'), where the exposed Redis port is 6379.

This message appears in the console: W0629 17:55:35.521113 31524 253148608 redis_context.cc:307] Failed to connect to Redis, retrying.

But accessing Redis directly via a Python client works.

There is a similar issue on GitHub, where the reporter came up with a workaround by changing the network configuration to make it accessible: https://github.com/ray-project/ray/issues/6108

Just a thought: maybe it is a connection issue where Ray can’t establish a connection back to the client machine, even though the client can communicate with the Ray server/head.

@mehrdadn , it’s the same issue as above; both of us are facing it, and we have communicated on the Ray Slack channel. It would be good if the fix for this issue also resolved it on Linux. I’ll wait until the issue is fixed.

Disabling my firewall completely worked for me also (for testing only, of course), so 6379 isn’t the only port that needs to be opened.
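
To narrow down which ports are blocked without turning the firewall off entirely, a small probe like the following may help. It is only a sketch: `check_tcp_ports` is a hypothetical helper, and the port numbers in the example call are placeholders. Besides Redis, Ray also opens ports for the raylet and the object manager, which are chosen at startup unless they are pinned (as far as I know, `ray start` accepts `--node-manager-port` and `--object-manager-port` for this), so substitute the ports your cluster actually uses.

```python
import socket


def check_tcp_ports(host, ports, timeout=2.0):
    """Return {port: bool}, True when a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Example call (placeholder ports other than 6379; adjust to your setup):
# print(check_tcp_ports("192.168.1.13", [6379, 12345, 12346]))
```

Running it in both directions (worker to head, and head to worker) also tests the earlier suggestion that the head cannot connect back to the client machine.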

This is expected as we’ve only worked on single-node support so far on Windows, but I’ll add this to #9114 to track. Thanks for reporting!