ray: Local cluster YAML no longer working in 0.9.0.dev0
What is the problem?
With my previous version of Ray (0.7.7), I had a cluster.yaml file that worked well, but it stopped working after I upgraded to 0.9.0.dev0 to pick up a recent Tune bug fix for PAUSED trials. When I run a test script after running `ray up cluster.yaml`, only the head node is visible and I get this warning:
2020-03-16 19:48:44,344 WARNING worker.py:802 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
There is a firewall between my machines, so I previously had to open specific ports and force Ray to use them in my cluster YAML file. Could some new port requirements in 0.9.0 be blocking communication between the nodes?
Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 0.9.0.dev0
OS: CentOS 7
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
My cluster.yaml is:
cluster_name: asedler_nesu

## NOTE: Typically for local clusters, min_workers == initial_workers == max_workers.

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == initial_workers == max_workers.
min_workers: 1

# The initial number of worker nodes to launch in addition to the head node.
# Typically, min_workers == initial_workers == max_workers.
initial_workers: 1

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == initial_workers == max_workers.
max_workers: 1

# Autoscaling parameters.
# Ignore this if min_workers == initial_workers == max_workers.
autoscaling_mode: default
target_utilization_fraction: 0.8
idle_timeout_minutes: 5

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "" # e.g., tensorflow/tensorflow:1.5.0-py3
    container_name: "" # e.g. ray_docker
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: [] # Extra options to pass into "docker run"

# Local specific configuration.
provider:
    type: local
    head_ip: neuron.bme.emory.edu
    worker_ips:
        - sulcus.bme.emory.edu

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: asedler
    ssh_private_key: ~/.ssh/id_rsa

# Leave this empty.
head_node: {}

# Leave this empty.
worker_nodes: {}

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# NOTE: Modified the following commands to use the tf2-gpu environment
# and to use specific ports that have been opened for this purpose
# by Andrew Sedler (asedler3@gatech.edu)

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    - conda activate tf2-gpu && ulimit -c unlimited && ray start --head --redis-port=6379 --redis-shard-ports=59519 --node-manager-port=19580 --object-manager-port=39066 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    - conda activate tf2-gpu && ray start --redis-address=$RAY_HEAD_IP:6379 --node-manager-port=19580 --object-manager-port=39066
The test script is:
import time
from pprint import pprint

import ray

ray.init(address="localhost:6379")

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
pprint(set(ray.get([f.remote() for _ in range(1000)])))
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 35 (33 by maintainers)
Sorry for the delay @rkooo567, but I just ran my test script from earlier after adding `--gcs-server-port` and it seems to work! (see below) Thanks so much. One thing I noticed is that my Ray Dashboard seems to be having issues now - when I load localhost:8265 I see it flash on briefly in my browser, but then it disappears. Is that a known issue? Thanks again!

@mfitton for @arsedler9’s issue. Sorry again I didn’t see this message!
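For reference, a minimal sketch of how the `--gcs-server-port` flag mentioned above could be added to the head start command from the cluster YAML earlier in this issue. The port number 40000 is only a placeholder; use whichever port has been opened through the firewall:

head_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    # 40000 below is a placeholder; pick a port that is reachable through the firewall.
    - conda activate tf2-gpu && ulimit -c unlimited && ray start --head --redis-port=6379 --redis-shard-ports=59519 --node-manager-port=19580 --object-manager-port=39066 --gcs-server-port=40000 --autoscaling-config=~/ray_bootstrap_config.yaml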
@chanshing It is actually in progress! Here is the PR. Please read it and give me some review 😃 https://github.com/ray-project/ray/pull/10281
@arsedler9 this is now merged into master. You only need to set `--min-worker-port` and `--max-worker-port`, and then worker and driver ports will be selected from that range. By default the range is 10000-10999. Sorry for the delay, had some extremely annoying CI issues to deal with…

Other flags will stay the same; this will only apply to workers’ gRPC servers. You’ll need to specify at least the min port (max will default to 65535).
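A sketch of how the worker start command from the YAML above might look with these new flags; the 10000-10999 range here is just the default mentioned above and should be replaced with whatever range the firewall permits:

worker_start_ray_commands:
    - conda activate tf2-gpu && ray stop
    # Worker and driver gRPC servers will pick ports from the min/max range below,
    # so that range must be open through the firewall.
    - conda activate tf2-gpu && ray start --redis-address=$RAY_HEAD_IP:6379 --node-manager-port=19580 --object-manager-port=39066 --min-worker-port=10000 --max-worker-port=10999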