ray: Initializing cluster on SLURM causes "we can not found the matched Raylet address" warning
What is the problem?
Running Ray 1.5.1
Reproduction (REQUIRED)
I’m using the following SLURM script, which is essentially copy-pasted from the documentation:
```bash
#!/bin/bash
#SBATCH --job-name="experiment"
#SBATCH --output=experiment.out
#SBATCH -N 5
#SBATCH --mem=32gb
#SBATCH --tasks-per-node=1
#SBATCH -p class -C gpu2080
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=2
#SBATCH --time=36:00:00
echo "Loading modules..."
module swap intel gcc
module load cuda/10.1
source ~/miniconda3/etc/profile.d/conda.sh
conda activate [mycondaenv]
# __doc_head_address_start__
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# __doc_head_ray_start__
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 --gres=gpu:2 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" \
--dashboard-host "$head_node_ip" --block &
# __doc_head_ray_end__
# __doc_worker_ray_start__
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 --gres=gpu:2 -w "$node_i" \
ray start --address "$ip_head" \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
sleep 1
done
# __doc_worker_ray_end__
# __doc_script_start__
# Run experiment
# "sbatch -p class run_experiment.slurm [path/to/config]"
python run_experiment.py ${SLURM_CPUS_PER_TASK} $1
```
run_experiment.py can contain essentially anything, since the warnings appear before ray.init() is even called.
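The actual run_experiment.py is not attached; a hypothetical minimal driver along these lines is enough to reproduce the setup (the only Ray-specific step is connecting to the already-running cluster):

```python
# Hypothetical minimal stand-in for run_experiment.py -- the real script is not
# part of this report, and its body does not matter for the warning.
import sys
import ray

# Attach to the cluster started by `ray start` in the sbatch script.
ray.init(address="auto")

@ray.remote
def noop(i):
    # Trivial task, just to exercise the workers.
    return i

if __name__ == "__main__":
    cpus = int(sys.argv[1])  # ${SLURM_CPUS_PER_TASK} passed by the sbatch script
    print(ray.get([noop.remote(i) for i in range(cpus)]))
```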
The warnings are as follows:
[2021-08-01 20:06:41,407 I 28552 28552] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-08-01 20:06:41,435 I 7652 7652] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-08-01 20:06:41,435 I 14548 14548] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
[2021-08-01 20:06:41,437 I 4763 4763] global_state_accessor.cc:326: This node has an IP address of ***, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Is this something we should be concerned about?
About this issue
- State: open
- Created 3 years ago
- Reactions: 3
- Comments: 19 (14 by maintainers)
Assigning to @tupui for now since our documentation page indicates they are the maintainer of the module, but we will get back with a plan for this issue.
@richardliaw Was there any conclusion from the investigation of this issue?
Hi! I’m getting the same error on an HPC SLURM cluster. However, the experiments start running after this error and finish without problems, so can this error be safely ignored?
cc @tupui
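For anyone hitting the same warning, one way to check whether it can be safely ignored is to confirm from the driver that every SLURM node actually registered with the head. This is only a sketch using Ray's public ray.nodes() and ray.cluster_resources() APIs, not something taken from this issue:

```python
import ray

# Attach to the already-running cluster (same as the experiment driver would).
ray.init(address="auto")

# Each SLURM node should appear as an alive raylet with the expected resources.
alive = [n for n in ray.nodes() if n["Alive"]]
print(f"{len(alive)} alive node(s):")
for n in alive:
    res = n["Resources"]
    print(" ", n["NodeManagerAddress"], "CPU:", res.get("CPU"), "GPU:", res.get("GPU"))

# Aggregate view -- with the script above this should report 5 x 8 CPUs and 5 x 2 GPUs.
print(ray.cluster_resources())
```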