ignite: Can't get `Parallel.run` with NCCL in SLURM environment to work

🐛 Bug description

Dispatching a distributed multi-node/multi-GPU script via SLURM sbatch raises a RuntimeError.

To reproduce:

Slurm invocation:

 OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "python -u test_dist.py run --nnodes=1 --nproc_per_node=2"

test_dist.py

Python script

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import subprocess

import fire

import torch
import ignite
import ignite.distributed as idist

def run_diagnostic(local_rank):
    prefix = f"{local_rank}) "
    print(f"{prefix}Rank={idist.get_rank()}")
    print(f"{prefix}torch version: {torch.version.__version__}")
    print(f"{prefix}torch git version: {torch.version.git_version}")
    
    if torch.cuda.is_available():
        print(f"{prefix}torch version cuda: {torch.version.cuda}")
        print(f"{prefix}number of cuda devices: {torch.cuda.device_count()}")

        for i in range(torch.cuda.device_count()):
            print(f"{prefix}\t- device {i}: {torch.cuda.get_device_properties(i)}")
    else:
        print("{prefix}no cuda available")


    if "SLURM_JOBID" in os.environ:
        for k in ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_NODELIST"]:
            print(f"{k}: {os.environ[k]}")
        
        if local_rank == 0:
            hostnames = subprocess.check_output(["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]])
            print(f"hostnames: {hostnames}")


def run(**spawn_kwargs):
    with idist.Parallel(backend='nccl', **spawn_kwargs) as parallel:
        parallel.run(run_diagnostic)

if __name__ == '__main__':
    fire.Fire({'run': run})

Error message (and logged output):

2021-04-25 19:06:35,565 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: 
	nproc_per_node: 2
	nnodes: 1
	node_rank: 0
2021-04-25 19:06:35,566 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run_diagnostic at 0x1555554741e0>' in 2 processes
Traceback (most recent call last):
  File "test_dist.py", line 45, in <module>
    fire.Fire({'run': run})
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "path_to_python_env/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "test_dist.py", line 42, in run
    parallel.run(run_diagnostic)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/launcher.py", line 309, in run
    idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/utils.py", line 324, in spawn
    fn, args=args, kwargs_dict=kwargs_dict, nproc_per_node=nproc_per_node, backend=backend, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 380, in spawn
    **spawn_kwargs,
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "path_to_python_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 323, in _dist_worker_task_fn
    backend, init_method=init_method, world_size=arg_world_size, rank=arg_rank, **kw
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 72, in create_from_backend
    backend=backend, init_method=init_method, world_size=world_size, rank=rank, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 93, in __init__
    backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs
  File "path_to_python_env/lib/python3.7/site-packages/ignite/distributed/comp_models/native.py", line 124, in _create_from_backend
    dist.init_process_group(backend, init_method=init_method, **init_pg_kwargs)
  File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "path_to_python_env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Expected behavior

I would like the code to run run_diagnostic for each local_rank in the allocation. Right now it does not reach that point: parallel configuration appears to complete on the master process but fails in the auxiliary processes.

Environment

  • PyTorch Version (e.g., 1.4): 1.8.1+cu102
  • Ignite Version: was using 0.4.4, also tried on 0.5.0.dev20210423
  • OS: Linux (CentOS)
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.7.3
  • Any other relevant information:
  1. I’ve checked that this isn’t a problem of zombie processes lingering on the compute nodes from previous failed runs.
  2. There is a dist.barrier() call after init_process_group() in ignite/distributed/comp_models/native.py (Parallel._create_from_backend). Clearly init_process_group() completes for at least one process, but the auxiliary processes fail while syncing via barrier().
  3. Not sure if this is handled elsewhere (pretty sure not) or even relevant, but in ignite.distributed.comp_models.native.Parallel.setup_env_vars, upon detecting a SLURM environment, the code calls self._setup_env_in_slurm and returns without setting self._local_rank, self._master_addr, and self._master_port (as it otherwise would); a rough paraphrase of this flow is sketched after this list.
  4. Thanks!
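
For reference, a rough paraphrase (in plain Python, not the actual ignite source) of the control flow described in item 3; the method and attribute names follow the report, and the default values are placeholders:

import os

class _EnvSetupParaphrase:
    # Not the ignite implementation; just the flow described in item 3 above.

    def _setup_env_in_slurm(self):
        pass  # the SLURM-specific setup happens here in the real code

    def _setup_env_vars(self):
        if "SLURM_JOBID" in os.environ:
            self._setup_env_in_slurm()
            return  # early return: the attributes below are never assigned on the SLURM path
        self._local_rank = int(os.environ.get("LOCAL_RANK", 0))
        self._master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
        self._master_port = int(os.environ.get("MASTER_PORT", 29500))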

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

@sdesrozis re: docs - Personally, I would have benefited greatly from a clear explanation of the interaction between slurm and ignite and how to dispatch jobs on such a system. I'm eagerly awaiting the upcoming blog post you mentioned a few days ago. Thanks again for working on this so much over the last few days; it really has helped a lot. The why-ignite package was also very helpful!

I also think the usage you suggested is good because it deviates least from any other slurm script, meaning there is a low barrier to entry, and it doesn't require as acute an understanding of the initialization process of distributed jobs as torch.distributed.[launch|spawn] does.

@fco-dv we definitely need the blog post about idist 😉

@djberenberg from my side, the following command works

OMP_NUM_THREADS=1 srun  --nodes 1 --ntasks-per-node 2 -p gpu --gres=gpu:v100-32gb:02 python -u test_dist.py run 

It means torch.distributed.launch (or spawn) is replaced by srun. However, this is the command-line approach. To submit a batch job, the following command works:

OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "srun python -u test_dist.py run"

@djberenberg The error you faced comes from the fact that torch.multiprocessing.spawn spawns processes from within a slurm environment, a case which is not well defined for ignite. Indeed, if a slurm environment is detected, the slurm variables are used for the initialization. With your command, only one slurm task is spawned, which means SLURM_LOCALID=0 is defined. When the processes are then spawned, both of them use that same local rank. This explains the error message.
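
For illustration, a tiny check (an assumption: run it inside the same kind of SLURM allocation) showing that both spawned workers inherit the parent's single SLURM_LOCALID, so they derive identical rank/port values:

import os
import torch.multiprocessing as mp

def report(local_rank):
    # Both workers print the same SLURM_LOCALID (that of the single slurm task).
    print(f"worker {local_rank}: SLURM_LOCALID={os.environ.get('SLURM_LOCALID')}")

if __name__ == "__main__":
    mp.spawn(report, nprocs=2)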

I think I can fix this error; I will do it as soon as possible. The idea is to use the local information provided by the torch.distributed.launch (or spawn) command, if it exists, rather than the SLURM_ variables. @vfdev-5 what do you think?
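
A minimal sketch of that idea, assuming the torch launcher exports per-process LOCAL_RANK/RANK/WORLD_SIZE variables (env:// style launch); this is not the actual ignite patch:

import os

def resolve_ranks():
    if "LOCAL_RANK" in os.environ:
        # Set per process by the torch launcher: trust it over SLURM_*.
        local_rank = int(os.environ["LOCAL_RANK"])
        rank = int(os.environ.get("RANK", local_rank))
        world_size = int(os.environ.get("WORLD_SIZE", 1))
    else:
        # Pure srun case: one slurm task per process.
        local_rank = int(os.environ["SLURM_LOCALID"])
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
    return local_rank, rank, world_size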

@sdesrozis Whoops, I didn’t even see that directory in the repo. Yes, it’s quite helpful.

Good! Currently we are writing a blog post to explain how ignite can help to use multiple launching methods with the same code. I hope we will release this post soon.

(Naive) question - does master_port have to be unique for each worker?

Yes, it does. Each worker has to sync using master_addr:master_port. Using slurm, we build a port number from the slurm jobid (no reduction needed).

https://github.com/pytorch/ignite/blob/d90a2efded4ddaf5265f9d346be8f1ee2b94d12a/ignite/distributed/comp_models/native.py#L245

The master address is the first slurm hostname's address. To get the hostname list, we use the scontrol command.

https://github.com/pytorch/ignite/blob/d90a2efded4ddaf5265f9d346be8f1ee2b94d12a/ignite/distributed/comp_models/native.py#L249
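
For illustration, a sketch of the mechanism described above (not the exact ignite code; the port range is an arbitrary assumption): MASTER_ADDR is the first hostname of the allocation and MASTER_PORT is derived deterministically from SLURM_JOB_ID, so every task computes the same pair without communicating.

import os
import subprocess

def slurm_master_addr_port():
    # First hostname in the allocation, obtained via scontrol.
    hostnames = subprocess.check_output(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
    ).decode().splitlines()
    master_addr = hostnames[0]
    # Same job id on every task -> same port on every task.
    master_port = 15000 + int(os.environ["SLURM_JOB_ID"]) % 10000
    return master_addr, master_port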

HTH

I quickly checked the code and something seems weird:

https://github.com/pytorch/ignite/blob/d90a2efded4ddaf5265f9d346be8f1ee2b94d12a/ignite/distributed/comp_models/native.py#L231

When slurm is detected, self._local_rank is never initialized. Maybe the return at L215 should be removed.

I will fix that soon. Btw, it doesn't explain why this case freezes.

@sdesrozis do you think we should handle a case like that by showing a warning or a runtime error? Same question about the issue in MONAI …

@djberenberg Thank you for this report.

When using SLURM, the arguments for the parallel distribution are automatically derived from the underlying configuration (the srun environment). Therefore, there is no need to specify them on the python command line.

Could you try the following command

 OMP_NUM_THREADS=1 sbatch -N1 -n2 -p gpu --gres=gpu:v100-32gb:02 --wrap "python -u test_dist.py run"

and tell me if it solves your issue?

Please, see https://github.com/sdesrozis/why-ignite HTH