ray: [Core] Failed to register worker (Slurm, srun)

What happened + What you expected to happen

I can’t start ray.

I allocate a node on a Slurm cluster using:

srun -n 1 --exclusive -G 1 --pty bash

This allocates a node with 112 CPUs and 4 GPUs.

Then, within python:

import ray
ray.init(num_cpus=20)

2022-11-03 21:17:31,752 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2022-11-03 21:18:32,436 E 251378 251378] core_worker.cc:149: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

On a different test:

import ray
ray.init(ignore_reinit_error=True, num_cpus=10)

2022-11-03 21:19:01,734 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.13', ray_version='2.0.1', ray_commit='03b6bc7b5a305877501110ec04710a9c57011479', address_info={'node_ip_address': '172.20.6.24', 'raylet_ip_address': '172.20.6.24', 'redis_address': None, 'object_store_address': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/plasma_store', 'raylet_socket_name': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/scratch/fast/6920449/ray/session_2022-11-03_21-18-49_765770_252630', 'metrics_export_port': 62537, 'gcs_address': '172.20.6.24:49967', 'address': '172.20.6.24:49967', 'dashboard_agent_listen_port': 52365, 'node_id': '0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78'})

(raylet) [2022-11-03 21:19:31,639 E 252725 252765] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.
2022-11-03 21:20:00,798 WARNING worker.py:1829 -- The node with node id: 0debcceedbef73619ccc8347450f5086693743e005ba9e907ae98c78 and address: 172.20.6.24 and node name: 172.20.6.24 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.) (2) raylet has lagging heartbeats due to slow network or busy workload.

Versions / Dependencies

Python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0] on linux
Ray version: 2.0.1
Installation: pip install -U "ray[default]"
grpcio: 1.43.0

Reproduction script

import ray
ray.init(num_cpus=20)

Issue Severity

High: It blocks me from completing my task.

Most upvoted comments

This issue has collected a number of different reports; I think I saw these:

* the head node dies
* worker nodes fail to register with the proper head node when more than one is running
* worker nodes die when starting up

All of these can apparently lead to the log message "Failed to register worker".

When commenting “same issue”, please be more specific: what exactly did you try, on what hardware, and what happened.

Would it make sense and be possible to have Ray emit a more detailed error message here? One thing that makes it hard for me to report the problem in more detail is that the main log only shows the “Failed to register worker” and “IOError: [RayletClient] Unable to register worker with raylet. No such file or directory” messages. And it’s impossible for me to figure out what other logs or information could be relevant.

At the very least, could Ray log which file the “no such file or directory” message refers to?
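In the meantime, one rough way to see which path the message is likely complaining about is to print the socket and session paths Ray reports and check whether they exist. A minimal sketch, assuming ray.init() gets far enough to return a RayContext with address_info (as in the output above):

import os
import ray

# Sketch: print the socket/session paths Ray reports so the
# "No such file or directory" can be cross-checked against them.
ctx = ray.init(num_cpus=4)
info = ctx.address_info  # the same dict shown in the RayContext output above
for key in ("raylet_socket_name", "object_store_address", "session_dir"):
    print(key, info[key], "exists:", os.path.exists(info[key]))
# The raylet and dashboard agent logs live under <session_dir>/logs.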

I have exactly the same problem in Ray 2.3.0. What solved the problem is manually running the following command:

 ray start --head 

then the script runs fine.

Without doing that manually, tune.fit() will start a local Ray instance and somehow that leads to the above error. I have this problem only on one new machine; other machines run fine. On the other hand, Slurm jobs run fine without the manual intervention. It sounds like an issue related to the head IP address.
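For reference, if the head is started manually with ray start --head as above, the script then needs to attach to that existing instance instead of starting a new local one. A minimal sketch (ray.init(address="auto") is the standard way to attach; the rest is just illustration):

import ray

# Attach to the cluster started by `ray start --head` on this node,
# rather than letting ray.init()/tune.fit() spin up a fresh local instance.
ray.init(address="auto", ignore_reinit_error=True)
print(ray.cluster_resources())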

The top solution on Stack Overflow solved this issue for me:

Limit the number of CPUs

Ray will launch as many worker processes as your execution node has CPUs (or CPU cores). If that's more than you reserved, Slurm will start killing processes.

You can limit the number of worker processes as such:

import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
print("success")

(though I don't think it solves the OP's problem, as they had already set num_cpus)
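A related pattern that may help on Slurm is to derive num_cpus from the allocation instead of hard-coding it. A sketch, not from this thread; it assumes srun/sbatch exports SLURM_CPUS_PER_TASK:

import os
import ray

# Sketch: size Ray to the Slurm allocation rather than the whole node.
# Assumes SLURM_CPUS_PER_TASK is set by srun/sbatch; falls back to 1 otherwise.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
ray.init(ignore_reinit_error=True, num_cpus=num_cpus)
print(f"started Ray with {num_cpus} CPUs")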

Following up here: after digging through the logs, it looks like, at least in my case, this might be related to too many OpenBLAS threads. (This makes sense, because it was happening on a machine with a large number of CPUs, 96 to be exact.)

In my case, I could see messages in the log files similar to the ones described here. I'm guessing that when running Ray jobs that use many CPUs, there is some kind of issue with too many threads being used that prevents the workers from registering properly.

The solution in my case was to raise the per-user process limit (what ulimit -u controls) by running

ulimit -u 127590

This resolved the error and at least allowed running the full Ray pipeline. I can't say whether this is a good idea at the system level (maybe someone else can advise about the pros and cons of this approach), but it worked for me; YMMV.
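For anyone who wants to check these limits from Python before starting Ray, a small sketch (Linux-only; RLIMIT_NPROC is the limit that ulimit -u reports, and capping OPENBLAS_NUM_THREADS is an assumption about the workload, not something confirmed in this thread):

import os
import resource

# Sketch: inspect the per-user process/thread limit that `ulimit -u` controls.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft}, hard={hard}")

# Alternative mitigation: cap OpenBLAS threads before numpy / Ray workers start,
# so fewer threads are spawned on many-core nodes.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")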

@mgerstgrasser when you say "at least 2 cores for the slurm job", do you mean "#SBATCH --cpus-per-task=2" or the @ray.remote(num_cpus=2) decorator for the task inside the code itself? Thank you!

@Pkulyte The former; I don't recall if it was --cpus-per-task or one of the equivalent Slurm options, but it shouldn't make a difference. Note that it still wasn't 100% for me, it just greatly reduced the frequency of failures.

this solves it for me.

Just in case this is helpful to anyone else running into this, for me it seems I’ve been able to work around this problem by putting a try-except block around ray.init() and re-trying to start Ray multiple times if it fails, with exponential backoff. So something like the following, and call that instead of ray.init() directly. (Exponential backoff because it still seems to me that this might be related to two instances starting at the same time on the same physical machine, although I’ve never been able to figure that out with certainty.)

Since I’ve started doing this I’ve not seen any failed slurm jobs. But I did see in my logs that the except-block was triggered on occasion, so I think the underlying issue still occurs sometimes.

import ray
import numpy as np
from time import sleep

def try_start_ray(num_cpus, local_mode):
    """Start Ray, retrying with randomized exponential backoff if it fails."""
    depth = 0
    while True:
        try:
            print("Trying to start ray.")
            ray.init(num_cpus=num_cpus, local_mode=local_mode, include_dashboard=False)
            break
        except Exception:
            # Back off for a random interval that grows with each failed attempt,
            # in case two instances on the same machine are racing each other.
            waittime = np.random.randint(1, 10 * 2**depth)
            print(f"Failed to start ray on attempt {depth+1}. Retrying in {waittime} seconds...")
            sleep(waittime)
            depth += 1

1. Post what version of Ray you are using and how you installed it.
2. Post what version of grpcio you are using and how you installed it.
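For reference, both can be printed from Python; a minimal sketch (ray.__version__ and grpc.__version__ are the standard version attributes, and the grpcio package imports as grpc):

import ray
import grpc  # the grpcio package imports as `grpc`

print("ray:", ray.__version__)
print("grpcio:", grpc.__version__)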

Hey people,

I had the same error as mentioned earlier; however, I did "pip uninstall grpcio" and then reinstalled it with conda ("conda install grpcio").

The error is gone and it's working fine for me now! Peace.

Thanks, it works for me after changing the call to ray.init(num_cpus=1).

I have the same issue, running Ubuntu 20.04 in a Singularity container on Slurm, with ray.__version__ in [2.2.0, 2.3.1]. This seems to be a show-stopper for many people here. I tried all the proposed workarounds, such as _temp_dir=f"/scratch/dx4/tmp", object_store_memory=78643200, and specifying num_cpus=32, num_gpus=1, and nothing worked.

The only thing that worked for me was downgrading Ray to 1.10.0, but this is certainly just a temporary solution.

I’m also experiencing this issue with ray 2.2.0, Python 3.8 on Linux. I can reproduce it just by running

>>> import ray
>>> ray.init()
2022-12-24 21:59:47,962	INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
[2022-12-24 21:59:50,597 E 73199 73199] core_worker.cc:179: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

I'll note that this is within a conda environment, and the same command, run from within the same conda environment, works fine (does not raise this error) on two other machines (macOS 11.7.2, and interactive nodes on a Slurm cluster with GPUs).