ray: Ray on a cluster: ConnectionError: Could not find any running Ray instance
I’m trying to test Ray on a university cluster with the code below:
import ray
ray.init(address="auto")
import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))
But it returns an error like this. Am I using Ray the wrong way?
  File "<stdin>", line 2, in <module>
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/worker.py", line 643, in init
    address, redis_address)
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 273, in validate_redis_address
    address = find_redis_address_or_die()
  File "/apps/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/site-packages/ray/services.py", line 165, in find_redis_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.
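For context, address="auto" asks ray.init to attach to a Ray instance that is already running, and the ConnectionError above is what it raises when it finds none. A minimal sketch of guarding against that, assuming a fallback to a local single-node instance is acceptable; the helper connect_or_start_local and its init_fn parameter are hypothetical names, not part of Ray:

```python
def connect_or_start_local(init_fn=None):
    """Try to attach to a running Ray cluster; fall back to a local instance.

    `init_fn` is a hypothetical injection point so the logic can be exercised
    without Ray installed; by default it is `ray.init`.
    """
    if init_fn is None:
        import ray  # imported lazily so the helper can be tested without Ray
        init_fn = ray.init
    try:
        # Attaches to an instance started earlier, e.g. with `ray start --head`.
        init_fn(address="auto")
        return "cluster"
    except ConnectionError:
        # No running instance was found: start a local one instead.
        init_fn()
        return "local"
```

On a batch cluster the fallback only hides the problem; the fix discussed in the replies is to start Ray on the nodes inside the job before the script calls ray.init.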
About this issue
- State: closed
- Created 4 years ago
- Comments: 33 (5 by maintainers)
I have managed to get Ray running on a PBS cluster using the following script:
with startWorkerNode.sh being:
Within Script.py, I have:
where the Redis password is retrieved through argparse.
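A sketch of what such a Script.py might look like. This is an illustration, not the poster's actual code: the flag name --redis-password, the helper names, and the _redis_password keyword (the spelling I believe Ray 1.x accepts; earlier releases used redis_password) are all assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the Redis password handed over by the job script."""
    parser = argparse.ArgumentParser(
        description="Connect to an already running Ray cluster")
    # Hypothetical flag name; use whatever the job script actually passes.
    parser.add_argument("--redis-password", required=True,
                        help="password given to `ray start --redis-password`")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    import ray  # imported here so argument parsing works without Ray installed
    # Assumption: Ray 1.x keyword name; adjust for the installed version.
    ray.init(address="auto", _redis_password=args.redis_password)
    # Print what the cluster offers, as a quick check that workers joined.
    print(ray.cluster_resources())

# In a real script, finish with a call to main().
```

The job script would then run something like `python Script.py --redis-password "$redis_password"`, where the variable name is again only an example.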
Hope that helps. 😃
Great news: a final solution that works for Ray 1.0+. For the PBS cluster, we have one .sub script for job submission and one shell script to start the worker nodes. The scripts are as follows.
The job.sub script:
The startWorkerNode.sh script:
Note that for the PBS cluster I’m using, before submitting the .sub file, I need to go into the directory and run the chmod command on the .sh file.
I hope this is a general solution for everyone. I finally made it work with huge help from my uni’s HPC specialist.
Hey @Lewisracing,
In my case, having #!/bin/bash -l in startWorkerNode.sh was necessary; I think it had something to do with environment variables not being loaded otherwise. Additionally, I think you need to pass --block to the ray start command in that same file; otherwise, the pbsdsh process will simply finish and the worker will not be active. To go with that change, you also need to add an & at the end of the pbsdsh command within the for loop. Without it, the program would hang forever, waiting for the call to complete. This should, hopefully, fix your issue. Another possibility is to install the nightly version of Ray; I remember that I did not have any problem with Ray 0.8.7, but I could not manage to make Ray 1.0 work. Let me know how you go. 😃
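The two points above (a blocking ray start on each worker, launched in the background so the loop can move on) can be sketched in Python rather than shell. This is only an illustration under assumptions: the helper names, the arguments passed to the worker script, and the use of subprocess.Popen as the analogue of a trailing & are mine, not taken from the scripts in this thread.

```python
import subprocess


def worker_start_command(head_address, redis_password):
    """Build the `ray start` command a worker script would run."""
    return [
        "ray", "start",
        f"--address={head_address}",
        f"--redis-password={redis_password}",
        "--block",  # keep the process alive so the worker stays registered
    ]


def launch_workers(node_indices, script, head_address, redis_password,
                   popen=subprocess.Popen):
    """Launch the worker script on each node without waiting for it.

    Popen returns immediately, which plays the role of `&` in the shell loop.
    Assumption: the script takes the head address and password as arguments.
    """
    procs = []
    for n in node_indices:
        # `pbsdsh -n <index>` runs a program on one node of the PBS job.
        cmd = ["pbsdsh", "-n", str(n), script, head_address, redis_password]
        procs.append(popen(cmd))
    return procs
```

Because each worker command blocks, waiting on any of the returned processes would stall the launcher in the same way the shell loop hangs without &.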