ray: [release/core] scalability envelope distributed test throws `max number of clients reached`

Running with `ray submit --start config.yaml test_distributed.py` throws hundreds of these:

(raylet, ip=172.31.29.217) Traceback (most recent call last):
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py", line 186, in <module>
(raylet, ip=172.31.29.217)     connect_only=True)
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/node.py", line 164, in __init__
(raylet, ip=172.31.29.217)     session_name = _get_with_retry(redis_client, "session_name")
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/ray/node.py", line 41, in _get_with_retry
(raylet, ip=172.31.29.217)     result = redis_client.get(key)
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/client.py", line 1606, in get
(raylet, ip=172.31.29.217)     return self.execute_command('GET', name)
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/client.py", line 898, in execute_command
(raylet, ip=172.31.29.217)     conn = self.connection or pool.get_connection(command_name, **options)
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
(raylet, ip=172.31.29.217)     connection.connect()
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 567, in connect
(raylet, ip=172.31.29.217)     self.on_connect()
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 643, in on_connect
(raylet, ip=172.31.29.217)     auth_response = self.read_response()
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 739, in read_response
(raylet, ip=172.31.29.217)     response = self._parser.read_response()
(raylet, ip=172.31.29.217)   File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/redis/connection.py", line 484, in read_response
(raylet, ip=172.31.29.217)     raise response
(raylet, ip=172.31.29.217) redis.exceptions.ConnectionError: max number of clients reached
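
The traceback shows each worker opening its own Redis connection at startup and Redis refusing it once the server's `maxclients` limit is reached. A minimal sketch for checking how close a Redis server is to that limit, assuming a redis-py style client that exposes `info()` and `config_get()`. The helper name `redis_client_headroom`, the `FakeRedis` stand-in, and the numbers are hypothetical and are not part of Ray or of this test:

```python
def redis_client_headroom(client):
    """Return (connected, max_clients, remaining) for a Redis server.

    `client` is any object exposing redis-py's `info()` and `config_get()`.
    """
    # INFO clients reports the number of currently connected clients.
    connected = int(client.info("clients")["connected_clients"])
    # CONFIG GET maxclients returns the configured connection limit.
    max_clients = int(client.config_get("maxclients")["maxclients"])
    return connected, max_clients, max_clients - connected


class FakeRedis:
    """Hypothetical stand-in so the sketch runs without a Redis server."""

    def __init__(self, connected, max_clients):
        self._connected = connected
        self._max_clients = max_clients

    def info(self, section=None):
        return {"connected_clients": self._connected}

    def config_get(self, pattern="*"):
        # redis-py returns config values as strings.
        return {"maxclients": str(self._max_clients)}


# Made-up numbers, for illustration only.
connected, max_clients, remaining = redis_client_headroom(FakeRedis(9990, 10000))
print(f"{connected}/{max_clients} clients connected, {remaining} remaining")
```

Against a live cluster one would pass a real `redis.Redis(host=..., port=..., password=...)` instance instead of `FakeRedis`. Once the limit has already been hit, this check itself may fail to connect, so it is more useful while the cluster is scaling up.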

Wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/releases/1.3.0/cb3661e547662f309a0cc55c5495b3adb779a309/ray-1.3.0-cp37-cp37m-manylinux2014_x86_64.whl

cc @wuisawesome @ericl

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 59 (59 by maintainers)

Most upvoted comments

@amogkam can you set it to 1 so that you can continue running the release test? We shouldn’t commit a change that sets it to 1 though. We should just fix the bug.

WTH, this should not happen. CC @DmitriGekhtman, can you please help investigate this? (This is a release blocker.)

Running this now with updated cluster config

OK, the node scalability test ended up failing:

Traceback (most recent call last):
  File "/home/ubuntu/test_distributed.py", line 188, in <module>
    test_nodes()
  File "/home/ubuntu/test_distributed.py", line 36, in test_nodes
    test_max_running_tasks()
  File "/home/ubuntu/test_distributed.py", line 85, in test_max_running_tasks
    assert max_cpus - min_cpus_available > 2000, err_str
AssertionError: Only 540.5/7837.0 cpus used.
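
To make the failed check concrete, here is the arithmetic behind it as a small sketch. It assumes the first number in the error message is `max_cpus - min_cpus_available`. That is inferred from the assertion and the message, since the test source is not shown here, and `enough_cpus_used` is a hypothetical helper name:

```python
# Values taken from the AssertionError message above.
max_cpus = 7837.0
cpus_used = 540.5  # assumption: equals max_cpus - min_cpus_available

# Under that assumption, the fewest CPUs ever observed as available.
min_cpus_available = max_cpus - cpus_used


def enough_cpus_used(max_cpus, min_cpus_available, threshold=2000):
    """Mirror of the condition asserted in test_max_running_tasks."""
    return max_cpus - min_cpus_available > threshold


print(min_cpus_available)                              # 7296.5
print(enough_cpus_used(max_cpus, min_cpus_available))  # False
print(f"{cpus_used / max_cpus:.1%} of CPUs in use")    # 6.9% of CPUs in use
```

So the run peaked at roughly 7% of the cluster's CPUs, well short of the 2000 busy CPUs the test requires.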

@wuisawesome

I’m running it again rn

No I’m replacing it with the 1.3 wheels.

@rkooo567 yep I’m trying that now