ray: [autoscaler] "AssertionError: Unable to SSH to node", when starting large cluster
Running
ray create_or_update ~/Workspace/ray/python/ray/autoscaler/aws/example.yaml
where example.yaml is modified to start 300 nodes, I monitor the auto-scaling activity with
ssh -i /Users/rkn/.ssh/ray-autoscaler_us-west-2.pem ubuntu@54.200.2.70 'tail -f /tmp/raylogs/monitor-*'
After a while (after around 64 nodes are in the cluster), I see
==> /tmp/raylogs/monitor-2018-02-04_07-20-44-04701.err <==
Process NodeUpdaterProcess-53:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Inspecting the monitor logs on the head node, I see
Process NodeUpdaterProcess-28:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Process NodeUpdaterProcess-53:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
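The tracebacks only show that the updater gave up on SSH; they do not show how long it waited or how it retried. As a rough illustration, and not the code in ray/autoscaler/updater.py, a reachability check with a deadline might look like the sketch below. The names wait_for_ssh and tcp_port_open, and all timeout values, are hypothetical.

```python
import socket
import time
from typing import Callable


def tcp_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def wait_for_ssh(
    host: str,
    probe: Callable[[str], bool] = tcp_port_open,
    deadline_s: float = 300.0,   # hypothetical value, not taken from Ray
    interval_s: float = 5.0,     # hypothetical value, not taken from Ray
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe(host) until it succeeds or deadline_s has elapsed."""
    give_up_at = clock() + deadline_s
    while True:
        if probe(host):
            return True
        if clock() >= give_up_at:
            return False
        sleep(interval_s)
```

An open port 22 only means something is listening; it does not prove that key-based login works, so a check like this is a lower bound on readiness. With hundreds of nodes launching at once, a longer deadline may be needed for the slowest instances.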
cc @ericl
About this issue
- State: closed
- Created 6 years ago
- Comments: 22 (10 by maintainers)
This should be fixed if you upgrade to Ray 0.7.4. For now, try shortening the cluster name.
Thanks for being patient and responsive!
On Sat, Sep 21, 2019 at 12:25 AM Tianhe Yu notifications@github.com wrote:
Sorry; this is a command generated by the autoscaler (you may need to step through it). It should probably end with “uptime”.
Also, maybe it is a cluster-name issue (try killing cluster, set name: “test” and rerun exec).
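One plausible reason a cluster name could matter, offered here as an assumption rather than something confirmed in this thread, is that tools sometimes embed such a name in a Unix domain socket path (for example an SSH control socket), and those paths have a small length limit on most systems. The sketch below checks a candidate path against a conservative limit. The helper socket_path_fits, the path layout, and the choice of 104 as the limit are all hypothetical.

```python
import os

# Conservative limit: sun_path is 108 bytes on Linux and 104 on macOS/BSD,
# including the trailing NUL byte.
SUN_PATH_MAX = 104


def socket_path_fits(directory: str, cluster_name: str, suffix: str = "") -> bool:
    """Return True if directory/cluster_name+suffix fits in a sun_path buffer."""
    path = os.path.join(directory, cluster_name + suffix)
    # +1 accounts for the terminating NUL byte.
    return len(path.encode("utf-8")) + 1 <= SUN_PATH_MAX
```

A short name such as "test" leaves plenty of room under any reasonable directory, which is consistent with the suggestion to retry with that name.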
On Sat, Sep 21, 2019 at 12:08 AM Tianhe Yu notifications@github.com wrote:
I am having the same issue and am able to ssh into the head node without issues. I am only trying to create a very small cluster (2 nodes).
@richardliaw @robertnishihara
We are having a similar issue. We are running the command:
ray up /Users/kapleesh/cal/cs262a/angela/manifest.yaml
This gives us the error:
This is our manifest.yaml file:
We are able to SSH into the head node, but we do not see any logs anywhere:
ssh -i "~/.ssh/ray-autoscaler_us-west-2.pem" ec2-user@ec2-54-201-227-119.us-west-2.compute.amazonaws.com
Do you have any advice here?