ray: Ray on a cluster: ConnectionError: Could not find any running Ray instance

I’m trying to test Ray on a university cluster with the code below:

import time
import ray

ray.init(address="auto")

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))

But it fails with the error below. Am I using Ray incorrectly?

  File "<stdin>", line 2, in <module>
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/worker.py", line 643, in init
    address, redis_address)
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 273, in validate_redis_address
    address = find_redis_address_or_die()
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 165, in find_redis_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.
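For context, the traceback ends in find_redis_address_or_die because address="auto" asks Ray to attach to an instance that is already running on the machine, for example one started with ray start --head. A minimal sketch of handling that case, assuming the Ray version in the traceback, where the failure surfaces as the built-in ConnectionError. The helper name init_ray and the fallback to a local instance are illustrative choices, not part of Ray:

```python
def init_ray(ray_module, address="auto"):
    """Attach to a running Ray instance, or start a local one if none exists.

    `ray_module` is the imported `ray` package; it is passed in so this
    hypothetical helper has no hard dependency on Ray itself.
    """
    try:
        # Succeeds only if a Ray instance (e.g. `ray start --head`) is running.
        return ray_module.init(address=address)
    except ConnectionError:
        # Same error as in the traceback above: nothing to attach to.
        # Assumption: a single-node instance is acceptable for a quick test.
        print("No running Ray instance found; starting a local one instead.")
        return ray_module.init()
```

Usage would be `import ray` followed by `init_ray(ray)`. On a real cluster job you would more likely want the error to stay fatal, since a silent local fallback hides a head node that never started.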

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 33 (5 by maintainers)

Most upvoted comments

I managed to get Ray running on a PBS cluster using the following script:

#!/bin/bash
#PBS -l ncpus=192
#PBS -l mem=600GB
#PBS -l walltime=48:00:00
#PBS -l wd

module load python3/3.7.4

ip_prefix=`hostname -i`
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)

echo parameters: $ip_head $redis_password

/path/to/ray start --head --port=6379 \
--redis-password=$redis_password \
--num-cpus 48 --num-gpus 0
sleep 10

for (( n=48; n<$PBS_NCPUS; n+=48 ))
do
  pbsdsh -n $n -v /path/to/startWorkerNode.sh \
  $ip_head $redis_password &
  sleep 10
done

cd /path/to/working/directory || exit
./Script.py --pw $redis_password

/path/to/ray stop

with startWorkerNode.sh being:

#!/bin/bash -l

module load python3/3.7.4

/path/to/ray start --block --address=$1 \
--redis-password=$2 --num-cpus 48 --num-gpus 0

/path/to/ray stop
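The loop bounds in the job script follow from the resource request. A small sketch of that arithmetic, assuming $PBS_NCPUS equals the requested 192 and every node contributes 48 CPUs; worker_offsets is a hypothetical helper that only mirrors the bash for loop:

```python
def worker_offsets(total_cpus, cpus_per_node):
    """Return the values of n visited by `for (( n=c; n<total; n+=c ))`."""
    return list(range(cpus_per_node, total_cpus, cpus_per_node))


offsets = worker_offsets(192, 48)
print(offsets)           # [48, 96, 144]
print(len(offsets) + 1)  # 4 nodes in total, counting the head node
```

So pbsdsh is called three times, and the head node started by the job script itself accounts for the first 48 CPUs.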

Within Script.py, I have:

ray.init(address='auto', redis_password=args.pw)

where the Redis password is retrieved through argparse.
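A minimal sketch of what that part of Script.py could look like. The --pw flag and the ray.init call are taken from the snippets above; the function names parse_args and main are assumptions, and the redis_password keyword is the one used in this comment, so other Ray versions may spell it differently:

```python
import argparse


def parse_args(argv=None):
    """Parse the command line used by the job script: ./Script.py --pw <password>."""
    parser = argparse.ArgumentParser(description="Connect to a running Ray cluster.")
    parser.add_argument("--pw", required=True,
                        help="Redis password generated by the job script")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported here so the argument parsing can be tested without Ray installed.
    import ray
    # Attach to the head node started earlier by `ray start --head`.
    ray.init(address="auto", redis_password=args.pw)
    # ... submit remote tasks here ...
```

Script.py would then call main() at the bottom of the file.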

Hope that helps. 😃

Great news: here is a final solution that works for Ray 1.0+. For the PBS cluster, we have one .sub script for job submission and one shell script to start the worker nodes. The job.sub script:

#!/bin/bash

#PBS -N pythoncpu_testray
#PBS -l select=2:ncpus=10:mpiprocs=10
#PBS -q five_day
#PBS -m abe
#PBS -M xxx@xxx.xx  
#PBS -j oe
#PBS -W sandbox=PRIVATE
#PBS -k n

ln -s $PWD $PBS_O_WORKDIR/$PBS_JOBID

cd $PBS_O_WORKDIR

jobnodes=`uniq -c ${PBS_NODEFILE} | awk -F. '{print $1 }' | awk '{print $2}' | paste -s -d " "`
 
thishost=`uname -n | awk -F. '{print $1}'`
thishostip=`hostname -i`
rayport=6379
 
thishostNport="${thishostip}:${rayport}"
echo "Allocate Nodes = <$jobnodes>"
 
echo "set up ray cluster..." 
for n in `echo ${jobnodes}`
do
        if [[ ${n} == "${thishost}" ]]
        then
                echo "first allocate node - use as headnode ..."
                module load PyTorch
                ray start --head
                sleep 5
        else
                ssh ${n}  $PBS_O_WORKDIR/startWorkerNode.sh ${thishostNport}
                sleep 10
        fi
done 
 
python Main.py

rm $PBS_O_WORKDIR/$PBS_JOBID
#
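The jobnodes pipeline in job.sub reduces the PBS node file to a list of short hostnames, and the loop then treats the current host as head and every other host as a worker. The same logic as a Python sketch; the function names and the example hostnames are hypothetical. One difference: uniq only collapses adjacent duplicates, while this version removes duplicates anywhere in the file.

```python
def short_hostnames(nodefile_lines):
    """Unique short hostnames (text before the first dot), in first-seen order."""
    names = (line.strip().split(".")[0] for line in nodefile_lines if line.strip())
    return list(dict.fromkeys(names))


def split_head_and_workers(hosts, this_host):
    """Treat the current host as head node and all other hosts as workers."""
    head = this_host.split(".")[0]
    workers = [h for h in hosts if h != head]
    return head, workers


# Example node file content with made-up hostnames, one line per allocated CPU.
lines = ["node01.cluster.edu\n", "node01.cluster.edu\n", "node02.cluster.edu\n"]
hosts = short_hostnames(lines)
print(hosts)                                                      # ['node01', 'node02']
print(split_head_and_workers(hosts, "node01.cluster.edu"))        # ('node01', ['node02'])
```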

The startWorkerNode.sh script:

#!/bin/bash -l
source $HOME/.bashrc
cd $PBS_O_WORKDIR
param1=$1
destnode=`uname -n`
echo "destnode is = [$destnode]"
module load PyTorch
ray start --address="${param1}" --redis-password='5241590000000000'

Note that on the PBS cluster I’m using, before submitting the .sub file, I need to go into the directory and make the .sh file executable:

chmod +x startWorkerNode.sh

I hope this is a general solution for everyone. I finally made it work with huge help from my uni’s HPC specialist.
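The fixed sleep 5 and sleep 10 pauses in job.sub only guess how long the workers need to join. A hedged alternative is to poll from the driver until the expected number of nodes is alive. This sketch assumes that ray.nodes() returns a list of dicts with an "Alive" key, which is how I understand the API; count_alive and wait_for_nodes are hypothetical helpers:

```python
import time


def count_alive(nodes):
    """Count entries that report themselves as alive."""
    return sum(1 for node in nodes if node.get("Alive"))


def wait_for_nodes(get_nodes, expected, timeout_s=60.0, poll_s=1.0):
    """Poll `get_nodes()` until `expected` nodes are alive or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        alive = count_alive(get_nodes())
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {alive} of {expected} nodes joined")
        time.sleep(poll_s)
```

In Main.py this could be called as `wait_for_nodes(ray.nodes, expected=2)` right after `ray.init(address="auto")`, where 2 matches the select=2 request in job.sub, assuming the two chunks land on different nodes.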

Hey @Lewisracing,

In my case, having #!/bin/bash -l in startWorkerNode.sh was necessary; I think environment variables were not being loaded otherwise. Additionally, I think you need to add --block to the ray start command in that same file; otherwise the pbsdsh process will simply finish and the worker will not stay active. To go with that change, you also need to add an & at the end of the pbsdsh command within the for loop; without it, the job script would hang forever, waiting for the blocking call to complete. This should, hopefully, fix your issue.

Another possibility is to install the nightly version of Ray; I remember that I did not have any problem with Ray 0.8.7, but I could not manage to make Ray 1.0 work. Let me know how you go. 😃
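The interplay between --block and the trailing & can be illustrated in Python: a blocking command has to be started without waiting for it, otherwise the launcher never reaches its next line. This is only an analogy, with a sleeping child process standing in for ray start --block; launch_in_background is a hypothetical helper:

```python
import subprocess
import sys


def launch_in_background(cmd):
    """Start a blocking command without waiting for it (like `cmd &` in bash)."""
    return subprocess.Popen(cmd)


# Stand-in for a blocking `ray start --block ...`: a process that just sleeps.
blocking_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]

proc = launch_in_background(blocking_cmd)
print(proc.poll())  # None while the stand-in worker is still running
# ... the job script would run the driver program here ...
proc.terminate()    # roughly what stopping the worker at the end of the job does
proc.wait()
```

Using subprocess.run(blocking_cmd) instead would wait for the full 30 seconds, which corresponds to the hang described above when the & is missing.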