ray: [autoscaler] "AssertionError: Unable to SSH to node", when starting large cluster

Running

ray create_or_update ~/Workspace/ray/python/ray/autoscaler/aws/example.yaml

where example.yaml has been modified to start 300 nodes, I monitor the autoscaling activity with

ssh -i /Users/rkn/.ssh/ray-autoscaler_us-west-2.pem ubuntu@54.200.2.70 'tail -f /tmp/raylogs/monitor-*'

After a while (once around 64 nodes are in the cluster), I see

==> /tmp/raylogs/monitor-2018-02-04_07-20-44-04701.err <==
Process NodeUpdaterProcess-53:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

Inspecting the monitor logs on the head node, I see

Process NodeUpdaterProcess-28:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Process NodeUpdaterProcess-53:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 68, in run
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 55, in run
    self.do_update()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 118, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node

cc @ericl

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

This should be fixed if you upgrade to Ray 0.7.4. For now, try shortening the cluster name.
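As a rough sketch of that workaround (assuming an AWS config file like the manifest.yaml posted later in this thread; "ray down" may be spelled "ray teardown" on older Ray versions):

# tear down the running cluster first
ray down manifest.yaml
# edit the config so the cluster name is short, e.g. cluster_name: test
ray up manifest.yaml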

Thanks for being patient and responsive!

On Sat, Sep 21, 2019 at 12:25 AM Tianhe Yu notifications@github.com wrote:

Thanks! Setting the cluster name to "test" seems to work. Is there anything we should be careful about when naming the cluster?


Sorry; this is a command generated by the autoscaler (you may need to step through it). It should probably end with "uptime".

Also, maybe it is a cluster-name issue (try killing the cluster, setting name: "test", and rerunning exec).
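If you want to reproduce that probe by hand, something along these lines should work (a sketch only: the key path and ssh user are the ones used elsewhere in this thread, HEAD_NODE_IP is a placeholder, and the command the autoscaler actually generates includes additional ssh options):

ssh -i ~/.ssh/ray-autoscaler_us-west-2.pem \
    -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
    ubuntu@HEAD_NODE_IP uptime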

On Sat, Sep 21, 2019 at 12:08 AM Tianhe Yu notifications@github.com wrote:

Okay, here's the list of the ssh commands being run: ['ray stop', 'ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml']


I am having the same issue, even though I can ssh into the head node manually without any problem. I am only trying to create a very small cluster (2 nodes).

@richardliaw @robertnishihara

We are having a similar issue here; we are running the command:

ray up /Users/kapleesh/cal/cs262a/angela/manifest.yaml

This gives us the error:

This will restart cluster services [y/N]: y
Updating files on head node...
NodeUpdater: Updating i-060669574ffde1697 to 2fe43d49ebaad0c0f0a39bc37474a87c1d072a43, logging to (console)
NodeUpdater: Waiting for IP of i-060669574ffde1697...
NodeUpdater: Waiting for SSH to i-060669574ffde1697...
NodeUpdater: Waiting for SSH to i-060669574ffde1697...
NodeUpdater: Waiting for SSH to i-060669574ffde1697...
[... the "Waiting for SSH" line above is repeated roughly 50 more times while the updater retries ...]
NodeUpdater: Error updating Unable to SSH to nodeSee (console) for remote logs.
Process NodeUpdaterProcess-1:
Traceback (most recent call last):
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 103, in run
    raise e
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 88, in run
    self.do_update()
  File "/Users/kapleesh/.pyenv/versions/3.7.0/lib/python3.7/site-packages/ray/autoscaler/updater.py", line 154, in do_update
    assert ssh_ok, "Unable to SSH to node"
AssertionError: Unable to SSH to node
Updating 54.201.227.119 failed
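For context, the repeated "Waiting for SSH" lines suggest the updater simply retries a short ssh probe and gives up once it runs out of attempts. A rough shell equivalent of that loop (a sketch only, not the actual updater code; KEY.pem and NODE_IP are placeholders):

ok=""
for i in $(seq 1 50); do
    # retry a short ssh probe until it succeeds or the attempts run out
    if ssh -i KEY.pem -o ConnectTimeout=5 -o StrictHostKeyChecking=no ubuntu@NODE_IP uptime; then
        ok=yes
        break
    fi
    sleep 5
done
[ -n "$ok" ] || echo "Unable to SSH to node"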

This is our manifest.yaml file:

# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 9

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 9

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "" # e.g., tensorflow/tensorflow:1.5.0-py3
    container_name: "" # e.g. ray_docker

# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
target_utilization_fraction: 1.0

# If a node is idle for this many minutes, it will be removed.
# idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b, us-west-2c

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: m5.large
    ImageId: ami-a0cfeed8  # US West Oregon, HVM (SSD) EBS-Backed 64-bit

    # You can provision additional disk space with a conf as follows
    # BlockDeviceMappings:
    #     - DeviceName: /dev/sda1
    #       Ebs:
    #           VolumeSize: 50

    # Additional options in the boto docs.

# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: m5.large
    ImageId: ami-a0cfeed8  # US West Oregon, HVM (SSD) EBS-Backed 64-bit

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

    # Additional options in the boto docs.

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
   "/": "/Users/kapleesh/cal/cs262a/angela",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of shell commands to run to set up nodes.
# setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - git clone https://github.com/ramjk/angela
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp27-cp27mu-manylinux1_x86_64.whl
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.5.3-cp35-cp35m-manylinux1_x86_64.whl
    # Consider uncommenting these if you also want to run apt-get commands during setup
    # - sudo pkill -9 apt-get || true
    # - sudo pkill -9 dpkg || true
    # - sudo dpkg --configure -a

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076

We are able to ssh into the head node, but we do not see any logs anywhere on it:

ssh -i ~/.ssh/ray-autoscaler_us-west-2.pem ec2-user@ec2-54-201-227-119.us-west-2.compute.amazonaws.com

Do you have any advice here?