ray: [ray] worker_start_ray_commands are not executed for private cluster
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
- Ray installed from (source or binary): pip
- Ray version: 0.7.0
- Python version: 3.6.7
- Exact command to reproduce:
Describe the problem
I am following the private cluster setup instructions, but only the head node starts. A few interesting points:
- Seems similar to issue https://github.com/ray-project/ray/issues/3408
- Adding initialization_commands: [] fixes the KeyError mentioned in https://github.com/ray-project/ray/issues/4559
Source code / logs
cluster_name: tesq_cluster
min_workers: 48
max_workers: 48
initial_workers: 48
provider:
    type: local
    head_ip: ip1
    worker_ips: [ip2, ip3, ip4]
auth:
    ssh_user: tesq
    ssh_private_key: /home/me/.ssh/keys/local_user
file_mounts: {}
setup_commands: []
initialization_commands: []
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - source activate py3_prod && ray stop
    - echo 'I am here' >> /home/tesq/new_file.txt
    - source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379
worker_start_ray_commands:
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray stop
    - echo 'I am there' >> /home/tesq/new_file.txt
    - source activate py3_prod && ray start --redis-address=ip1:6379
After that, only the head node starts, and the file new_file.txt is created only on the head node.
Example output of the command ray.global_state.client_table():
{'ClientID': 'a7ce937ffcbece9b25a779fa126ba47edef27267',
'IsInsertion': True,
'NodeManagerAddress': 'ip1',
'NodeManagerPort': 45759,
'ObjectManagerPort': 34107,
'ObjectStoreSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/plasma_store',
'RayletSocketName': '/tmp/ray/session_2019-05-30_15-51-46_16481/sockets/raylet',
'Resources': {'GPU': 3.0, 'CPU': 24.0}},
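
To make the symptom easier to check, here is a minimal sketch that compares a client table against the IPs from the config. It does not call Ray; it works on a list of dicts shaped like the output above. The helper names (live_node_addresses, missing_workers) are hypothetical, and treating IsInsertion: True as "node is connected" is an assumption.

```python
def live_node_addresses(client_table):
    """Return the sorted, de-duplicated addresses of entries marked as inserted."""
    return sorted({
        entry["NodeManagerAddress"]
        for entry in client_table
        if entry.get("IsInsertion", False)  # assumption: True means connected
    })


def missing_workers(client_table, expected_ips):
    """Return the expected IPs that have no live entry in the table."""
    live = set(live_node_addresses(client_table))
    return [ip for ip in expected_ips if ip not in live]


# Sample shaped like the output above; only the head node (ip1) is present.
sample_table = [
    {
        "ClientID": "a7ce937ffcbece9b25a779fa126ba47edef27267",
        "IsInsertion": True,
        "NodeManagerAddress": "ip1",
        "Resources": {"GPU": 3.0, "CPU": 24.0},
    },
]

print(missing_workers(sample_table, ["ip1", "ip2", "ip3", "ip4"]))
```

With the table from this report, the helper lists ip2, ip3 and ip4 as missing, which matches the observation that only the head node joined.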
Update:
This seems very similar to issue https://github.com/ray-project/ray/issues/3190
However, the files monitor.err and monitor.out are empty.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 21 (8 by maintainers)
@ijrsvt I’m behind the company’s firewall, so sorry, I will not be able to post the complete YAML.
I got the YAML from here and updated head_ip, worker_ips and ssh_user.
When I run the command ray up config.yaml, it brings up Ray on the head_ip as the head node and also
Whereas upon manually running the command ray start --address=head_ip:port on each of the worker machines, the worker nodes get added to the cluster.
So maybe if you could share a working YAML which can bring up Ray on a head node and worker nodes, I could use that as a reference. I appreciate your help. Thanks.
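
The manual workaround described in this comment can be sketched as a small helper that only builds the ssh command strings, one per worker, without running anything. The function name manual_worker_commands is hypothetical, and the values ip1, ip2..ip4, tesq and port 6379 are taken from the config in the original report as an illustration. Any environment activation (such as source activate py3_prod) would still need to be added by hand.

```python
import shlex


def manual_worker_commands(ssh_user, worker_ips, head_ip, port):
    """Build one ssh command string per worker that joins it to the head node."""
    # The flag spelling follows the comment above; Ray 0.7.0 used --redis-address.
    start = f"ray start --address={head_ip}:{port}"
    return [f"ssh {ssh_user}@{ip} {shlex.quote(start)}" for ip in worker_ips]


for command in manual_worker_commands("tesq", ["ip2", "ip3", "ip4"], "ip1", 6379):
    print(command)
```

Printing the commands instead of executing them keeps the sketch free of side effects; they can be reviewed and pasted into a shell.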
ray v0.8.4, Python 3.6.9, Ubuntu 18.04.4. I’m running into this same thing: none of the commands in setup_commands or worker_start_ray_commands appear to be executing.
I guess it might not be obvious, but that includes the bit about starting up the worker clients. Basically, only the head node is launched; none of the workers appear to be executing any commands, “ray start” or otherwise.
@solacerace Make sure that min_workers == initial_workers == max_workers and that those are all equal to the number of worker nodes.
@dclong @jaromrax @gimzmoe Do you have --autoscaling-config=~/ray_bootstrap_config.yaml as a flag for your ray start command on the head node?
*** Clarification: This should be specified in the head_start_ray_commands section of your YAML.
Same issue, version 0.8.5 doesn’t ssh to the workers’ IPs on ray up cluster.yaml.
I tried 0.9.0.dev0, with the same effect. https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
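
The two suggestions from the maintainers in this thread (matching worker counts, and passing --autoscaling-config in head_start_ray_commands) can be turned into a quick sanity check. The sketch below is a hypothetical helper, not part of Ray, and it works on a plain dict rather than parsing the YAML file. Whether passing these checks resolves the problem is not confirmed in this thread.

```python
def check_local_cluster_config(config):
    """Return a list of human-readable problems; the list is empty if none are found."""
    problems = []
    worker_ips = config.get("provider", {}).get("worker_ips", [])
    expected = len(worker_ips)

    # Suggestion 1: min_workers == initial_workers == max_workers == number of worker nodes.
    for key in ("min_workers", "initial_workers", "max_workers"):
        if config.get(key) != expected:
            problems.append(
                f"{key}={config.get(key)} but {expected} worker_ips are listed"
            )

    # Suggestion 2: the head start commands should pass --autoscaling-config.
    head_cmds = config.get("head_start_ray_commands", [])
    if not any("--autoscaling-config" in cmd for cmd in head_cmds):
        problems.append("no head_start_ray_commands entry passes --autoscaling-config")

    return problems


# Values copied from the config in the original report.
reported_config = {
    "min_workers": 48,
    "max_workers": 48,
    "initial_workers": 48,
    "provider": {"type": "local", "head_ip": "ip1", "worker_ips": ["ip2", "ip3", "ip4"]},
    "head_start_ray_commands": [
        "source activate py3_prod && ray stop",
        "source activate py3_prod && ulimit -c unlimited && ray start --head --redis-port=6379",
    ],
}

for problem in check_local_cluster_config(reported_config):
    print(problem)
```

Run against the reported config, the check flags all three worker counts (48 versus 3 listed IPs) and the missing flag.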