ray: ray.init() does not detect local resources correctly on SLURM
What is the problem?
When running Ray inside a SLURM job, it does not detect the allocated resources correctly. In comparison, joblib correctly detects at least the number of allocated CPU cores.
Ray version and other system information (Python version, TensorFlow version, OS):
- ray v1.0.1 installed via conda
- python 3.7
- CentOS 7
Reproduction (REQUIRED)
- Start a SLURM job with 16 cores and 64GB of memory:

```bash
srun -c16 --mem=64G --pty python3
```
- Run the following snippet:

```python
import ray
import joblib

ray.init()

print(joblib.cpu_count())
# 16

print(ray.cluster_resources())
# {'GPU': 1.0,
#  'memory': 6900.0,
#  'node:192.168.16.15': 1.0,
#  'accelerator_type:RTX': 1.0,
#  'CPU': 128.0,
#  'object_store_memory': 2097.0}
```
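My reading of the mismatch (an assumption, not confirmed from Ray's source): joblib appears to respect the CPU affinity mask that SLURM sets for the job, while Ray seems to report the machine's total logical CPU count. The following standard-library check illustrates the difference on Linux:

```python
import os

# Total logical CPUs on the machine; matches the 128 CPUs Ray reports above.
print(os.cpu_count())

# CPUs actually available to this process under the SLURM affinity mask;
# matches the 16 cores requested with `srun -c16`.
print(len(os.sched_getaffinity(0)))
```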
What I would expect:

- 16 CPU cores
- 64GB of memory
- No GPU, since I did not request one with SLURM. Is this changeable via CUDA_VISIBLE_DEVICES? (A sketch of this follows the checklist below.)

- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
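On the GPU question: as far as I can tell, Ray's GPU autodetection respects CUDA_VISIBLE_DEVICES, so hiding the devices before calling `ray.init()` should keep the GPU out of the cluster resources. A minimal sketch, unverified on this cluster:

```python
import os

# Hide all CUDA devices from this process *before* Ray starts,
# so Ray's GPU autodetection finds nothing.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import ray

ray.init()
print(ray.cluster_resources())  # expected: no 'GPU' entry
```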
From the comments (maintainer replies):

> […] memory (meaning the worker's heap memory) by default.

> Hm, thanks for raising this @Hoeze - as a workaround, you can use `ray.init(num_cpus=..., num_gpus=...)`
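Applying that workaround inside the SLURM allocation might look like the sketch below. Reading SLURM_CPUS_PER_TASK and falling back to the affinity mask are my assumptions about the cluster setup, not something taken from the issue:

```python
import os
import ray

# SLURM typically exports the per-task CPU count when -c is given; fall back
# to the process's affinity mask if the variable is not set (assumption).
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK",
                              len(os.sched_getaffinity(0))))

# No GPUs were requested from SLURM, so tell Ray explicitly.
ray.init(num_cpus=num_cpus, num_gpus=0)

print(ray.cluster_resources())  # expected: 'CPU': 16.0 and no 'GPU' entry
```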