ray: Local cluster YAML no longer working in 0.9.0.dev0
What is the problem?
With my previous version of Ray (0.7.7), I had a cluster.yaml file that worked well, but it stopped working after I upgraded to 0.9.0.dev0 to pick up a recent Tune bug fix for PAUSED trials. When I run a test script after running `ray up cluster.yaml`, only the head node is visible and I get this warning:
2020-03-16 19:48:44,344 WARNING worker.py:802 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
There is a firewall between my machines, so I previously had to open specific ports and force Ray to use them in my cluster YAML file. Could some new port requirements in 0.9.0 be blocking communication between the nodes?
Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.9.0.dev0
OS: CentOS 7
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
My cluster.yaml is:
cluster_name: asedler_nesu

## NOTE: Typically for local clusters, min_workers == initial_workers == max_workers.

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == initial_workers == max_workers.
min_workers: 1

# The initial number of worker nodes to launch in addition to the head node.
# Typically, min_workers == initial_workers == max_workers.
initial_workers: 1

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == initial_workers == max_workers.
max_workers: 1

# Autoscaling parameters.
# Ignore this if min_workers == initial_workers == max_workers.
autoscaling_mode: default
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "" # e.g., tensorflow/tensorflow:1.5.0-py3
    container_name: "" # e.g. ray_docker
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: [] # Extra options to pass into "docker run"

# Local specific configuration.
provider:
    type: local
    head_ip: neuron.bme.emory.edu
    worker_ips:
        - sulcus.bme.emory.edu

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: asedler
    ssh_private_key: ~/.ssh/id_rsa

# Leave this empty.
head_node: {}

# Leave this empty.
worker_nodes: {}

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# NOTE: Modified the following commands to use the tf2-gpu environment
# and to use specific ports that have been opened for this purpose
# by Andrew Sedler (asedler3@gatech.edu)

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    - conda activate tf2-gpu && ulimit -c unlimited && ray start --head --redis-port=6379 --redis-shard-ports=59519 --node-manager-port=19580 --object-manager-port=39066 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    - conda activate tf2-gpu && ray start --redis-address=$RAY_HEAD_IP:6379 --node-manager-port=19580 --object-manager-port=39066
The test script is:
import time
from pprint import pprint

import ray

ray.init(address="localhost:6379")

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
pprint(set(ray.get([f.remote() for _ in range(1000)])))
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 35 (33 by maintainers)
Sorry for the delay @rkooo567, but I just ran my test script from earlier after adding `--gcs-server-port` and it seems to work! (see below) Thanks so much. One thing I noticed is that my Ray Dashboard seems to be having issues now - when I load localhost:8265 I see it flash on briefly in my browser, but then it disappears. Is that a known issue? Thanks again!

@mfitton for @arsedler9’s issue. Sorry again I didn’t see this message!
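For reference, a minimal sketch of how the `--gcs-server-port` flag mentioned above could be added to the head start command from the cluster YAML earlier in this issue. The port number 40000 is only a placeholder; use whichever port has been opened through the firewall:

head_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    # 40000 below is a placeholder; pick a port that is reachable through the firewall.
    - conda activate tf2-gpu && ulimit -c unlimited && ray start --head --redis-port=6379 --redis-shard-ports=59519 --node-manager-port=19580 --object-manager-port=39066 --gcs-server-port=40000 --autoscaling-config=~/ray_bootstrap_config.yaml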
@chanshing It is actually in progress! Here is the PR. Please read it and give me some review 😃 https://github.com/ray-project/ray/pull/10281
@arsedler9 this is now merged into master. You only need to set `--min-worker-port` and `--max-worker-port`, and then worker and driver ports will be selected from that range. By default the range is 10000-10999. Sorry for the delay, had some extremely annoying CI issues to deal with…

Other flags will stay the same; this will only apply to workers’ gRPC servers. You’ll need to specify at least the min port (max will default to 65535).
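A sketch of how the worker start command from the YAML above might look with these new flags; the 10000-10999 range here is just the default mentioned above and should be replaced with whatever range the firewall permits:

worker_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    # Worker and driver gRPC servers will pick ports from the min/max range below,
    # so that range must be open through the firewall.
    - conda activate tf2-gpu && ray start --redis-address=$RAY_HEAD_IP:6379 --node-manager-port=19580 --object-manager-port=39066 --min-worker-port=10000 --max-worker-port=10999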