ray: Initializing cluster on SLURM causes "we can not found the matched Raylet address" warning
What is the problem?
Running Ray 1.5.1
Reproduction (REQUIRED)
I’m using the following SLURM script, which is essentially copy-pasted from the documentation:
```bash
#!/bin/bash
#SBATCH --job-name="experiment"
#SBATCH --output=experiment.out
#SBATCH -N 5
#SBATCH --mem=32gb
#SBATCH --tasks-per-node=1
#SBATCH -p class -C gpu2080
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=2
#SBATCH --time=36:00:00
echo "Loading modules..."
module swap intel gcc
module load cuda/10.1
source ~/miniconda3/etc/profile.d/conda.sh
conda activate [mycondaenv]
# __doc_head_address_start__
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# __doc_head_ray_start__
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 --gres=gpu:2 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" \
--dashboard-host "$head_node_ip" --block &
# __doc_head_ray_end__
# __doc_worker_ray_start__
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 --gres=gpu:2 -w "$node_i" \
ray start --address "$ip_head" \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
sleep 1
done
# __doc_worker_ray_end__
# __doc_script_start__
# Run experiment
# "sbatch -p class run_experiment.slurm [path/to/config]"
python run_experiment.py ${SLURM_CPUS_PER_TASK} $1
```
run_experiment.py can contain essentially anything, since the warnings appear before ray.init() is even called.
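The actual run_experiment.py is not attached; a hypothetical minimal driver along these lines is enough to reproduce the setup (the only Ray-specific step is connecting to the already-running cluster):

```python
# Hypothetical minimal stand-in for run_experiment.py -- the real script is not
# part of this report, and its body does not matter for the warning.
import sys
import ray

# Attach to the cluster started by `ray start` in the sbatch script.
ray.init(address="auto")

@ray.remote
def noop(i):
    # Trivial task, just to exercise the workers.
    return i

if __name__ == "__main__":
    cpus = int(sys.argv[1])  # ${SLURM_CPUS_PER_TASK} passed by the sbatch script
    print(ray.get([noop.remote(i) for i in range(cpus)]))
```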
The warnings are as follows:
[2021-08-01 20:06:41,407 I 28552 28552] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-08-01 20:06:41,435 I 7652 7652] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-08-01 20:06:41,435 I 14548 14548] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-08-01 20:06:41,437 I 4763 4763] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Is this something we should be concerned about?
About this issue
- State: open
- Created 3 years ago
- Reactions: 3
- Comments: 19 (14 by maintainers)
Assigning to @tupui for now since our documentation page indicates they are the maintainer of the module, but we will get back with a plan for this issue.
@richardliaw Was there any conclusion from the investigation of this issue?
Hi! I’m getting the same error on an HPC SLURM cluster. However, the experiments start running after this error and finish without problems, so can this error be safely ignored?
cc @tupui
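For anyone hitting the same warning, one way to check whether it can be safely ignored is to confirm from the driver that every SLURM node actually registered with the head. This is only a sketch using Ray's public ray.nodes() and ray.cluster_resources() APIs, not something taken from this issue:

```python
import ray

# Attach to the already-running cluster (same as the experiment driver would).
ray.init(address="auto")

# Each SLURM node should appear as an alive raylet with the expected resources.
alive = [n for n in ray.nodes() if n["Alive"]]
print(f"{len(alive)} alive node(s):")
for n in alive:
    res = n["Resources"]
    print(" ", n["NodeManagerAddress"], "CPU:", res.get("CPU"), "GPU:", res.get("GPU"))

# Aggregate view -- with the script above this should report 5 x 8 CPUs and 5 x 2 GPUs.
print(ray.cluster_resources())
```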