ray: ray.init() does not detect local resources correctly on SLURM

What is the problem?

When running Ray inside a SLURM allocation, ray.init() does not detect the allocated resources correctly. In comparison, joblib correctly detects at least the number of CPU cores.

Ray version and other system information (Python version, TensorFlow version, OS):

  • ray v1.0.1 installed via conda
  • python 3.7
  • CentOS 7

Reproduction (REQUIRED)

  1. Start a SLURM job with 16 cores and 64 GB of memory:
srun -c16 --mem=64G --pty python3
  2. Run the following snippet:
import ray
import joblib

ray.init()

print(joblib.cpu_count())
# 16

print(ray.cluster_resources())
# {'GPU': 1.0, 
#  'memory': 6900.0, 
#  'node:192.168.16.15': 1.0, 
#  'accelerator_type:RTX': 1.0, 
#  'CPU': 128.0, 
#  'object_store_memory': 2097.0}

What I would expect:

  • 16 CPU cores

  • 64GB memory

  • No GPU, since I did not request one from SLURM. Is this changeable via CUDA_VISIBLE_DEVICES? (See the sketch after this list.)

  • I have verified my script runs in a clean environment and reproduces the issue.

  • I have verified the issue also occurs with the latest wheels.
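Regarding the CUDA_VISIBLE_DEVICES question above: as far as I know, Ray's GPU autodetection respects CUDA_VISIBLE_DEVICES, so hiding all GPUs before calling ray.init() should leave the GPU count at 0. A minimal sketch, treating this as an assumption rather than confirmed behaviour:

import os
import ray

# Hide all GPUs from this process before Ray autodetects resources
# (assumption: Ray honours CUDA_VISIBLE_DEVICES during detection).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

ray.init()
print(ray.cluster_resources().get("GPU", 0.0))
# expected: 0.0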

Most upvoted comments

  • In Ray, memory and object_store_memory are different concepts: object store memory is backed by shared memory.
  • By default, usually 30% is allocated to the object store, 10% of memory is reserved for Redis (only on the head node), and everything else goes to memory (the workers' heap memory).
  • Given your original memory was 6900, that works out to 50 MB * 6900 / 1024 ≈ 336 GB (see the conversion sketch after this list). So I guess we definitely have a bug here.
  • Object store memory is for shared memory (it is Ray's distributed object store). There is no mechanism to detect the spilling directory's size (but it is in the backlog). Please create an issue if this is important to you!
  • Also, as Richard said, from 1.3.0 (not 1.2.0) object spilling will be turned on by default: we start spilling objects once the object store memory hits OOM. There are still cases where you may face OOM even with this feature on; if that happens, please create an issue and I can explain the scenarios in detail. (Object spilling is a relatively new feature, so I am glad you are trying it out! Please send me a DM through Slack or create an issue and tag me whenever you see any problem.)
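The arithmetic in that comment implies that, in this Ray version, cluster_resources() reports memory in units of 50 MiB blocks. A small conversion sketch under that assumption (the 50 MiB unit is taken from the comment above, not verified against the source):

import ray

ray.init()
resources = ray.cluster_resources()

UNIT_GIB = 50 / 1024  # assumption: one reported unit corresponds to 50 MiB

print("heap memory:", resources["memory"] * UNIT_GIB, "GiB")                # 6900 units -> ~337 GiB
print("object store:", resources["object_store_memory"] * UNIT_GIB, "GiB")  # 2097 units -> ~102 GiB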

Hm, thanks for raising this @Hoeze - as a workaround, you can use ray.init(num_cpus=..., num_gpus=...)
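Expanding on that workaround, here is a sketch that takes the values from SLURM's environment. SLURM_CPUS_PER_TASK and SLURM_MEM_PER_NODE are standard SLURM variables (set by -c and --mem respectively), but whether they are exported depends on how the job was launched, so treat this as a starting point rather than a drop-in fix:

import os
import ray

# -c sets SLURM_CPUS_PER_TASK; --mem sets SLURM_MEM_PER_NODE (in MB).
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
mem_mb = int(os.environ.get("SLURM_MEM_PER_NODE", 0))

init_kwargs = {"num_cpus": num_cpus, "num_gpus": 0}
if mem_mb:
    # object_store_memory is given in bytes; reserve roughly 30% of the
    # allocation for it, mirroring the default split described above.
    init_kwargs["object_store_memory"] = int(mem_mb * 1024 * 1024 * 0.3)

ray.init(**init_kwargs)
print(ray.cluster_resources())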