ray: No logging of SSH failures of Ray nodes?

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): RHEL 6 HPC environment
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.3
  • Python version: 3.7
  • Exact command to reproduce:

I’ve written a custom NodeProvider to add autoscaling support for Univa Grid Engine HPC environments. I’m having a very hard time, though, debugging nodes that get stuck permanently in the “waiting-for-ssh” state. At one point I did get a Ray head node properly qsub’d and running via ray up blarg.yml -y (which then went on to spawn the two initial workers as configured), but the workers were unable to complete the dummy SSH step back to the head for whatever reason. Unfortunately, I’ve gone through all the logs that Ray and the job script produce and have had no luck finding the specific reason the workers’ SSH attempts failed. Is there a way to get the reason(s) behind an SSH failure exported to the log files?

Now that I’ve been tinkering with this a bit (e.g., creating new SSH keys), I’m having trouble even getting the qsub’d ray start to connect back to the ray up process (same “waiting-for-ssh” problem). I’m sure it’s a configuration issue on my end, but because I get so little feedback on the failure mode, this has been painful to debug, to say the least.
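In the meantime, one way to get the missing detail is to reproduce the updater’s dummy SSH step by hand with full verbosity, so the handshake failure reason is actually visible. A sketch (the helper below and its flag choices are my own, not Ray’s; swap in your real user/key/host, and the `ssh_cmd` parameter exists only so the wrapper can be exercised locally):

```python
import subprocess

def check_ssh(host, key_path, user, timeout=10, ssh_cmd="ssh"):
    """Run an 'uptime'-style dummy command over SSH, like the autoscaler's
    readiness check, but keep stderr so the failure reason is visible.

    Returns (returncode, stderr). ssh_cmd can be swapped out (e.g. 'echo')
    to test the wrapper itself without a live SSH target.
    """
    cmd = [
        ssh_cmd, "-vvv",                         # verbose: show each handshake step
        "-i", key_path,                          # the key ray's auth config points at
        "-o", "ConnectTimeout={}".format(timeout),
        "-o", "StrictHostKeyChecking=no",
        "{}@{}".format(user, host),
        "uptime",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr
```

With `-vvv`, a key rejection, a host-key mismatch, or a plain network timeout each produce a distinct trace on stderr, which is exactly what the autoscaler’s “Unable to SSH to node” assertion swallows.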

Command to start the head node that my NodeProvider qsubs from ray up:

ray start --block --num-cpus 1 --temp-dir /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head --head --redis-port 31236 --autoscaling-config /home/myusername/ray_cluster//autoscaling_config.yml

Note that I can see that the head node job and the two worker jobs have been submitted and are running on the cluster.

Log outputs:

monitor.out

This is the output from the head node’s NodeProvider (my custom debugging output). It’s too long to paste in this issue, so it’s available here:

https://gist.github.com/zbarry/0282655b01bc51f247a7a8fa6477e9e9

Cluster / job log

[INFO] [1550510728] [Mon Feb 18 12:25:28 EST 2019] [16366] [18509843] Starting the workflow
/home/myusername/miniconda3/envs/ray/bin/python
/home/myusername/miniconda3/envs/ray/bin/ray
2019-02-18 12:25:32,108	INFO scripts.py:288 -- Using IP address <redacted> for this node.
2019-02-18 12:25:32,113	INFO node.py:278 -- Process STDOUT and STDERR is being redirected to /home/myusername/ray_cluster//clusters/pipeline/ray-pipeline-head/logs.
2019-02-18 12:25:32,231	INFO services.py:396 -- Waiting for redis server at 127.0.0.1:31236 to respond...
2019-02-18 12:25:32,348	INFO services.py:396 -- Waiting for redis server at 127.0.0.1:48029 to respond...
2019-02-18 12:25:32,349	INFO services.py:798 -- Starting Redis shard with 10.0 GB max memory.
2019-02-18 12:25:32,384	WARNING services.py:1206 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-02-18 12:25:32,385	INFO services.py:1360 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-02-18 12:25:32,433	INFO scripts.py:319 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address redacted:31236

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(redis_address="redacted:31236")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

debug_state.txt

NodeManager:
InitialConfigResources: {GPU,0.000000},{CPU,1.000000}
ClusterResources:
9314824bc109cd85b897aacc8d6b917b2bcc80d2: 
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
499b4c9a4193e9825cc616cf2051285f576b6dc0: 
- total: {CPU,1.000000},{GPU,0.000000}
- avail: {CPU,1.000000},{GPU,0.000000}
8e0ad117daf498c80775ec6bf916f9ab098b0969: 
- total: {GPU,0.000000},{CPU,1.000000}
- avail: {GPU,0.000000},{CPU,1.000000}
ObjectManager:
- num local objects: 0
- num active wait requests: 0
- num unfulfilled push requests: 0
- num pull requests: 0
- num buffered profile events: 0
ObjectDirectory:
- num listeners: 0
- num eviction entries: 0
ObjectStoreNotificationManager:
- num adds processed: 0
- num removes processed: 0
BufferPool:
- get buffer state map size: 0
- create buffer state map size: 0
ConnectionPool:
- num message send connections: 0
- num transfer send connections: 0
- num avail message send connections: 0
- num avail transfer send connections: 0
- num message receive connections: 0
- num transfer receive connections: 0
AsyncGcsClient:
- TaskTable: num lookups: 0, num adds: 0
- ActorTable: num lookups: 0, num appends: 0
- TaskReconstructionLog: num lookups: 0, num appends: 0
- TaskLeaseTable: num lookups: 0, num adds: 0
- HeartbeatTable: num lookups: 0, num adds: 5901
- ErrorTable: num lookups: 0, num appends: 0
- ProfileTable: num lookups: 0, num appends: 0
- ClientTable: num lookups: 0, num appends: 1, cache size: 3, num removed: 0
- DriverTable: num lookups: 0, num appends: 0
WorkerPool:
- num workers: 1
- num drivers: 0
SchedulingQueue:
- num placeable tasks: 0
- num waiting tasks: 0
- num ready tasks: 0
- num running tasks: 0
- num infeasible tasks: 0
- num waiting for actor creation tasks: 0
- num tasks blocked: 0
ReconstructionPolicy:
- num reconstructing: 0
TaskDependencyManager:
- task dep map size: 0
- task req map size: 0
- req objects map size: 0
- local objects map size: 0
- pending tasks map size: 0
LineageCache:
- committed tasks: 0
- child map size: 0
- num subscribed tasks: 0
- lineage size: 0
ActorRegistry:
- num live actors: 0
- num reconstructing actors: 0
- num dead actors: 0
- max num handles: 0
RemoteConnections:
499b4c9a4193e9825cc616cf2051285f576b6dc0: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
9314824bc109cd85b897aacc8d6b917b2bcc80d2: 
- bytes read: 0
- bytes written: 0
- num async writes: 0
- num sync writes: 0
- writing: 0
- pending async bytes: 0
DebugString() time ms: 1

monitor.err

2019-02-18 12:25:33,473	INFO autoscaler.py:351 -- StandardAutoscaler: {'cluster_name': 'pipeline', 'min_workers': 2, 'max_workers': 10, 'initial_workers': 2, 'docker': {'image': '', 'container_name': ''}, 'target_utilization_fraction': 0.8, 'idle_timeout_minutes': 10, 'provider': {'type': 'external', 'module': 'pipeline.hpc.UGENodeProvider'}, 'auth': {'ssh_user': 'myusername', 'ssh_private_key': '/home/myusername/.ssh/id_dsa'}, 'head_node': {}, 'worker_nodes': {}, 'file_mounts': {}, 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': []}
/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/paramiko/ecdsakey.py:164: CryptographyDeprecationWarning: Support for unsafe construction of public numbers from encoded data will be removed in a future version. Please use EllipticCurvePublicKey.from_encoded_point
  self.ecdsa_curve.curve_class(), pointinfo
2019-02-18 12:25:34,188	INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:34,189	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:34,189	INFO autoscaler.py:588 -- StandardAutoscaler: Launching 2 new nodes
2019-02-18 12:25:34,798	INFO autoscaler.py:597 -- StandardAutoscaler: 0/2 target nodes (2 pending) (bringup=True)
2019-02-18 12:25:34,798	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=-1 Mean=-1 Max=-1, NumNodesConnected=0, NumNodesUsed=0.0, ResourceUsage=, TimeSinceLastHeartbeat=Min=-1 Mean=-1 Max=-1
2019-02-18 12:25:39,122	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (bringup=True)
2019-02-18 12:25:39,122	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=4 Mean=4 Max=4, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:43,406	INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_hbfty_8z: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,613	INFO autoscaler.py:512 -- StandardAutoscaler: ray-pipeline-worker_lclafqgd: Runtime state is None, want 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614	INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:44,614	INFO updater.py:126 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Updating to 0c8a44b274b693301ee007458fceccd7c210c4b4
2019-02-18 12:25:45,373	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:45,373	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=10 Mean=10 Max=10, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:46,386	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:46,481	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:50,393	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:50,393	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=15 Mean=15 Max=15, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:55,470	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:25:55,470	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=20 Mean=20 Max=20, NumNodesConnected=1, NumNodesUsed=0.0, ResourceUsage=0.0/1.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:25:57,619	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_lclafqgd...
2019-02-18 12:25:57,657	INFO updater.py:88 -- NodeUpdater: Waiting for IP of ray-pipeline-worker_hbfty_8z...
2019-02-18 12:25:58,199	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got IP [LogTimer=12396ms]
2019-02-18 12:25:58,210	INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Waiting for SSH...
2019-02-18 12:25:58,222	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got IP [LogTimer=12404ms]
2019-02-18 12:25:58,251	INFO updater.py:153 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Waiting for SSH...
Monitor: could not find ip for client 9314824bc109cd85b897aacc8d6b917b2bcc80d2
Monitor: could not find ip for client 9314824bc109cd85b897aacc8d6b917b2bcc80d2
Monitor: could not find ip for client 499b4c9a4193e9825cc616cf2051285f576b6dc0

~~ snip (the two messages above repeat many times, alternating) ~~
2019-02-18 12:26:00,507	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:00,507	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=1 Mean=9 Max=25, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:05,560	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:05,560	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=6 Mean=14 Max=30, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:10,580	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:10,580	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=11 Mean=19 Max=35, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:15,686	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:15,687	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=16 Mean=24 Max=40, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:26:20,706	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 updating)
2019-02-18 12:26:20,706	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=21 Mean=29 Max=45, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

~~ snip ~~

2019-02-18 12:30:51,015	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Got SSH [LogTimer=292805ms]
2019-02-18 12:30:51,015	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306401ms]
2019-02-18 12:30:51,015	ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_lclafqgd: Error updating Unable to SSH to node
2019-02-18 12:30:51,086	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Got SSH [LogTimer=292835ms]
2019-02-18 12:30:51,086	INFO log_timer.py:21 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Applied config 0c8a44b274b693301ee007458fceccd7c210c4b4 [LogTimer=306472ms]
2019-02-18 12:30:51,086	ERROR updater.py:138 -- NodeUpdater: ray-pipeline-worker_hbfty_8z: Error updating Unable to SSH to node
Exception in thread Thread-48:
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
    raise e
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
    self.do_update()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 141, in run
    raise e
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 130, in run
    self.do_update()
  File "/home/myusername/miniconda3/envs/ray/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 188, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

2019-02-18 12:30:54,114	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:54,114	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=295 Mean=303 Max=319, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=3 Max=5
2019-02-18 12:30:56,944	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:30:56,944	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=297 Mean=305 Max=321, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:03,112	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:03,112	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=304 Mean=312 Max=328, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-18 12:31:09,433	INFO autoscaler.py:597 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2019-02-18 12:31:09,434	INFO autoscaler.py:598 -- LoadMetrics: NodeIdleSeconds=Min=310 Mean=318 Max=334, NumNodesConnected=3, NumNodesUsed=0.0, ResourceUsage=0.0/3.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

~~ snip ~~

redis-shard_0.out

16540:C 18 Feb 2019 12:25:32.248 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16540:C 18 Feb 2019 12:25:32.248 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16540, just started
16540:C 18 Feb 2019 12:25:32.248 # Configuration loaded
16540:M 18 Feb 2019 12:25:32.249 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16540:M 18 Feb 2019 12:25:32.249 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16540:M 18 Feb 2019 12:25:32.249 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16540:M 18 Feb 2019 12:25:32.249 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16540:M 18 Feb 2019 12:25:32.249 # Server initialized
16540:M 18 Feb 2019 12:25:32.249 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

redis.out

16536:C 18 Feb 2019 12:25:32.135 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
16536:C 18 Feb 2019 12:25:32.135 # Redis version=5.0.3, bits=64, commit=d2b6db3d, modified=0, pid=16536, just started
16536:C 18 Feb 2019 12:25:32.135 # Configuration loaded
16536:M 18 Feb 2019 12:25:32.136 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
16536:M 18 Feb 2019 12:25:32.136 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
16536:M 18 Feb 2019 12:25:32.136 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
16536:M 18 Feb 2019 12:25:32.136 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
16536:M 18 Feb 2019 12:25:32.136 # Server initialized
16536:M 18 Feb 2019 12:25:32.136 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 15 (13 by maintainers)

Most upvoted comments

Not sure if this is the issue, but there are multiple ports that need to be open between the different machines.

For example, by default there are two Redis servers (one “primary” and one or more “non-primary” shards). The “object manager” and “node manager” on each machine also need their own port.

These ports can be specified in the ray start command with flags like --redis-port=6379, --redis-shard-ports=6380, --node-manager-port=12345, and --object-manager-port=12346. Note that the Redis ones are only relevant on the head node.
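A quick way to check whether those ports are actually reachable between machines (a common blocker on locked-down HPC networks) is a plain TCP probe from one node to another. A minimal sketch, using the port numbers from the flags above purely as placeholders:

```python
import socket

def probe(host, ports, timeout=2.0):
    """Return {port: bool} indicating TCP reachability of each port on host."""
    results = {}
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising
            results[port] = s.connect_ex((host, port)) == 0
    return results

# e.g. from a worker node, against the head node's IP:
# probe("HEAD_NODE_IP", [6379, 6380, 12345, 12346])
```

If any of these come back False from a worker while the corresponding process is running on the head, the firewall (or the scheduler’s network isolation) is the place to look, not SSH.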

For ray.init(), only the “primary” Redis address is relevant. That’s the first one that is printed.

When you get the error message Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?, can you check the logs on that machine under /tmp/ray/session*/ (just the most recent session) and see if anything looks like it crashed?
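Picking out that most recent session directory by hand gets tedious; a small sketch for doing it programmatically (the session*/logs layout is assumed from the paths in this thread, and may differ across Ray versions or a custom --temp-dir):

```python
import glob
import os

def latest_session_logs(root="/tmp/ray"):
    """Return sorted log file paths from the most recently modified session dir."""
    sessions = glob.glob(os.path.join(root, "session*"))
    if not sessions:
        return []
    latest = max(sessions, key=os.path.getmtime)
    return sorted(glob.glob(os.path.join(latest, "logs", "*")))
```

Grepping those files for tracebacks (raylet, plasma_store, and worker logs in particular) is usually the fastest way to spot which process died.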