ignite: Can't get `Parallel.run` with NCCL in SLURM environment to work
🐛 Bug description
Dispatching a distributed multi-node/multi-GPU script via SLURM sbatch raises a RuntimeError (Address already in use).
To reproduce:
Slurm invocation:
OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "python -u test_dist.py run --nnodes=1 --nproc_per_node=2"
Python script (test_dist.py):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import subprocess

import fire
import torch
import ignite
import ignite.distributed as idist


def run_diagnostic(local_rank):
    prefix = f"{local_rank}) "
    print(f"{prefix}Rank={idist.get_rank()}")
    print(f"{prefix}torch version: {torch.version.__version__}")
    print(f"{prefix}torch git version: {torch.version.git_version}")

    if torch.cuda.is_available():
        print(f"{prefix}torch version cuda: {torch.version.cuda}")
        print(f"{prefix}number of cuda devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"{prefix}\t- device {i}: {torch.cuda.get_device_properties(i)}")
    else:
        print(f"{prefix}no cuda available")

    if "SLURM_JOBID" in os.environ:
        for k in ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST"]:
            print(f"{k}: {os.environ[k]}")
        if local_rank == 0:
            hostnames = subprocess.check_output(["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]])
            print(f"hostnames: {hostnames}")


def run(**spawn_kwargs):
    with idist.Parallel(backend="nccl", **spawn_kwargs) as parallel:
        parallel.run(run_diagnostic)


if __name__ == "__main__":
    fire.Fire({"run": run})
Error message (and logged output):
2021-04-25 19:06:35,565 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes:
nproc_per_node: 2
nnodes: 1
node_rank: 0
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run_diagnostic at 0x1555554741e0>' in 2 processes
Traceback (most recent call last):
File "test_dist.py", line 45, in <module>
fire.Fire({'run': run})
File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
target=component.__name__)
File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "test_dist.py", line 42, in run
parallel.run(run_diagnostic)
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/launcher.py", line 309, in run
idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/utils.py", line 324, in spawn
fn, args=args, kwargs_dict=kwargs_dict, nproc_per_node=nproc_per_node, backend=backend, **kwargs
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 380, in spawn
**spawn_kwargs,
File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 323, in _dist_worker_task_fn
backend, init_method=init_method, world_size=arg_world_size, rank=arg_rank, **kw
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 72, in create_from_backend
backend=backend, init_method=init_method, world_size=world_size, rank=rank, **kwargs
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 93, in __init__
backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs
File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 124, in _create_from_backend
dist.init_process_group(backend, init_method=init_method, **init_pg_kwargs)
File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Expected behavior
I would like the code to run run_diagnostic for each local_rank in the allocation. Right now it does not reach that point: parallel configuration seems to complete on the master process but fails in the auxiliary processes.
Environment
- PyTorch Version (e.g., 1.4): 1.8.1+cu102
- Ignite Version: was using 0.4.4, also tried on 0.5.0.dev20210423
- OS: Linux (CentOS)
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.7.3
- Any other relevant information:
  - I've checked that this isn't a problem of zombie processes lingering on the compute nodes from previous failed runs.
  - There is a dist.barrier() after init_process_group() in ignite/distributed/comp_models/native.py (Parallel._create_from_backend). Clearly init_process_group completes for at least one process, but when syncing via barrier(), the other auxiliary processes fail.
  - Not sure if this is handled elsewhere (pretty sure not) or even relevant, but in ignite.distributed.comp_models.native.Parallel.setup_env_vars, upon discovering that the environment is a SLURM environment, the system calls self._setup_env_in_slurm and returns without setting self._local_rank, self._master_addr, and self._master_port (as it would otherwise). A simplified sketch of this control flow is shown after this list.
  - Thanks!
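To make that last note concrete, here is a minimal, hypothetical sketch of the described control flow (simplified names chosen for illustration; this is not the actual ignite source): the SLURM branch returns early, so the fields assigned in the non-SLURM branch are never set.

import os

def setup_env_vars_sketch(env):
    # Hypothetical, simplified illustration of the flow described above; not ignite code.
    state = {}
    if "SLURM_JOBID" in env:
        # SLURM branch: rank and world size come from SLURM variables ...
        state["rank"] = int(env.get("SLURM_PROCID", 0))
        state["world_size"] = int(env.get("SLURM_NTASKS", 1))
        # ... and the function returns here, so local_rank / master_addr / master_port
        # are never filled in on this path.
        return state
    # Non-SLURM branch: values come from torch.distributed.launch-style variables.
    state["local_rank"] = int(env.get("LOCAL_RANK", 0))
    state["master_addr"] = env.get("MASTER_ADDR", "127.0.0.1")
    state["master_port"] = int(env.get("MASTER_PORT", 29500))
    return state

if __name__ == "__main__":
    print(setup_env_vars_sketch(dict(os.environ)))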
About this issue
- State: closed
- Created 3 years ago
- Comments: 23 (12 by maintainers)
@sdesrozis re: docs - Personally, I would have benefited greatly from a clear explanation of the interaction between SLURM and ignite and how to dispatch jobs on such a system. I'm eagerly awaiting the upcoming blog post you mentioned a few days ago. Thanks again for working on this so much over the last few days; it really has helped a lot. The why-ignite package was also very helpful!
I also think the usage you suggested is good because it has the least deviation from any other SLURM script, meaning there is a low barrier to entry, and it doesn't require as acute an understanding of the initialization process of distributed jobs as torch.distributed.[launch|spawn].

@fco-dv we definitely need the blog post about idist 😉

@djberenberg from my side, the following command works. It means torch.distributed.launch (or spawn) is replaced by srun. However, this is the command-line approach. To submit a batch job, the following command works.

@djberenberg The error you faced comes from the fact that torch.distributed.spawn spawns processes from a SLURM environment which is not well defined for ignite. Indeed, if a SLURM environment is detected, the SLURM variables are used for initialization. Using your command, only one SLURM task is spawned, which means SLURM_LOCALID=0 is defined. When the torch.distributed.launch command is triggered, 2 processes are spawned and both use the same local rank. This explains the error message (see the illustration below).

I think I can fix this error; I will do it as soon as possible. The idea is to use the local information provided by the torch.distributed.launch (or spawn) command if it exists, rather than the SLURM_ variables. @vfdev-5 what do you think?

Good! Currently we are writing a blog post to explain how ignite can help to use multiple launching methods with the same code. I hope we will release this post soon.
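A small, self-contained illustration of that collision (an assumption-level sketch, not ignite or torch code; port 29500 is simply the usual torch.distributed default, used here for the example): two workers that both believe they own the rendezvous try to bind the same address and port, and the second bind fails exactly like the TCPStore call in the traceback.

import socket

def start_rendezvous_server(port):
    # Only one process per node may own the master port.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

first = start_rendezvous_server(29500)       # the worker that wins the race
try:
    second = start_rendezvous_server(29500)  # the second worker, carrying the same local rank
except OSError as exc:
    print(f"second worker failed as expected: {exc}")  # Address already in use
finally:
    first.close()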
Yes, it does. Each worker has to sync using master_addr:master_port. Using SLURM, we build a port number from the SLURM job id (no reduction needed).
https://github.com/pytorch/ignite/blob/d90a2efded4ddaf5265f9d346be8f1ee2b94d12a/ignite/distributed/comp_models/native.py#L245
The master address is the first SLURM hostname. To get the list of hostnames, we use the scontrol command (see the sketch after this comment).
https://github.com/pytorch/ignite/blob/d90a2efded4ddaf5265f9d346be8f1ee2b94d12a/ignite/distributed/comp_models/native.py#L249
HTH
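For illustration, a hedged sketch of those two steps (the exact port-folding formula lives in the ignite source linked above; the one below is an assumption made up for the example):

import os
import subprocess

def slurm_master_addr_and_port():
    # Port: derive a deterministic value from the SLURM job id, so every task of the
    # same job computes the same port without any communication ("no reduction needed").
    job_id = int(os.environ["SLURM_JOBID"])
    master_port = 15000 + job_id % 20000  # assumed folding scheme, for illustration only
    # Address: the first hostname of the allocation, resolved via scontrol.
    out = subprocess.check_output(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
    )
    master_addr = out.decode().splitlines()[0]
    return master_addr, master_port

Because the port is a pure function of the job id, all tasks of the job agree on it without exchanging any messages.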
I rapidly checked the code and something seems weird:
https://github.com/pytorch/ignite/blob/d90a2efded4ddaf5265f9d346be8f1ee2b94d12a/ignite/distributed/comp_models/native.py#L231
When SLURM is defined, self._local_rank is never initialized. Maybe the return at L215 should be removed. I will fix that soon. Btw, it doesn't explain why the case freezes.
@sdesrozis do you think we should handle the case like that by showing a warning or runtime error ? Same question about the issue in MONAI …
@djberenberg Thank you for this report.
When using SLURM, the arguments for the parallel distribution are automatically handled from the underlying configuration (the srun environment), so there is no need to specify them on the Python command line. Could you try the following command and tell me if it solves your issue? (A sketch of the corresponding Python side is shown after this comment.)
Please see https://github.com/sdesrozis/why-ignite
HTH
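For reference, a hedged sketch of what the Python side can look like with that usage (assuming the script is started by srun, so the SLURM environment is already set for each task): no spawn parameters are passed to Parallel, which then picks everything up from the environment.

import ignite.distributed as idist

def task(local_rank):
    # Runs once in each process started by srun.
    print(f"{local_rank}) rank={idist.get_rank()} world_size={idist.get_world_size()}")

if __name__ == "__main__":
    # No nnodes / nproc_per_node arguments here: under srun they come from SLURM.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(task)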