ray: [ray] worker_start_ray_commands are not executed for private cluster

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.7.0
  • Python version: 3.6.7
  • Exact command to reproduce:

Describe the problem

I am following the private cluster setup instructions, but only the head node starts. A few interesting points:

Source code / logs

cluster_name: tesq_cluster
min_workers: 48
max_workers: 48
initial_workers: 48
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
auth:
    ssh_user: tesq
    ssh_private_key: /home/me/.ssh/keys/local_user
file_mounts: {}
setup_commands: []
initialization_commands: []
head_setup_commands: []
worker_setup_commands: []

head_start_ray_commands:
    - source activate py3_prod && ray stop
    - echo 'I am here' >> /home/tesq/new_file.txt
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379
worker_start_ray_commands:
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray stop
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray start --redis-address=ip1:6379

After that, only the head node starts, and new_file.txt is created only on the head node. Example output of ray.global_state.client_table():

{'ClientID': 'a7ce937ffcbece9b25a779fa126ba47edef27267',
  'IsInsertion': True,
  'NodeManagerAddress': 'ip1',
  'NodeManagerPort': 45759,
  'ObjectManagerPort': 34107,
  'ObjectStoreSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/raylet',
  'Resources': {'GPU': 3.0, 'CPU': 24.0}},
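Only the head node (ip1) appears in that output. As a quick sanity check, a hypothetical helper like the one below (not part of Ray; it only inspects the list of dicts that ray.global_state.client_table() returns) can report which expected workers never joined:

```python
def missing_workers(client_table, expected_ips):
    """Return the expected node IPs that never registered with the cluster.

    client_table: list of dicts as returned by ray.global_state.client_table()
    expected_ips: all node IPs listed in the cluster YAML (head + workers)
    """
    seen = {entry["NodeManagerAddress"]
            for entry in client_table
            if entry.get("IsInsertion")}
    return [ip for ip in expected_ips if ip not in seen]

# With the output above, only the head node has registered:
table = [{"NodeManagerAddress": "ip1", "IsInsertion": True}]
print(missing_workers(table, ["ip1", "ip2", "ip3", "ip4"]))
# → ['ip2', 'ip3', 'ip4']
```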

Update: This seems very similar to issue https://github.com/ray-project/ray/issues/3190, but the files monitor.err and monitor.out are empty.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 21 (8 by maintainers)

Most upvoted comments

@ijrsvt I’m behind the company’s firewall, so unfortunately I will not be able to post the complete YAML.

I got the YAML from here and updated the head_ip, worker_ips, and ssh_user.

When I run the command ray up config.yaml, it brings up Ray on head_ip as the head node, and also:

  • prints the command to add additional nodes to the cluster
  • prints the UI address
  • but does not bring up Ray on the worker nodes

However, when I manually run the command ray start --address=head_ip:port on each of the worker machines, the worker nodes do get added to the cluster.

So if you could share a working YAML that brings up Ray on a head node and worker nodes, I could use it as a reference. I appreciate your help – thanks.

Ray v0.8.4, Python 3.6.9, Ubuntu 18.04.4. I’m running into this same thing: none of the commands in setup_commands or worker_start_ray_commands appear to be executing.

It might not be obvious, but that includes the part about starting up the worker clients. Basically, only the head node is launched; none of the workers appear to execute any commands, ray start or otherwise.

@solacerace Make sure that min_workers == initial_workers == max_workers, and that they all equal the number of worker nodes.
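For the config at the top of this issue, that mismatch is visible: min_workers/max_workers/initial_workers are 48, but worker_ips lists only three nodes. A corrected sketch (IPs and other fields as in the original report):

```yaml
# Worker counts must match the number of entries in worker_ips
# (three in the original config, not 48).
min_workers: 3
max_workers: 3
initial_workers: 3
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
```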

@dclong @jaromrax @gimzmoe Do you have --autoscaling-config=~/ray_bootstrap_config.yaml as a flag for your ray start command on the head node?

*** Clarification: This should be specified in the head_start_ray_commands section of your YAML.
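Applied to the YAML at the top of this issue, that would look something like the following (a sketch only; the conda environment name and port are taken from the original report):

```yaml
head_start_ray_commands:
    - source activate py3_prod && ray stop
    # --autoscaling-config lets the head node's monitor launch the workers
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
```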

@sp608 I just tried this out, and this should work now; can you try the nightlies/latest master? You should install this on all nodes (put it in setup_commands as pip install -U whl).

https://ray.readthedocs.io/en/latest/installation.html#trying-snapshots-from-master

Same issue: version 0.8.5 doesn’t SSH to the workers’ IPs on ray up cluster.yaml.

I tried 0.9.0.dev0, with the same effect. https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl