ignite: Using ignite.distributed with 3 or more processes hangs indefinitely

❓ Questions/Help/Support

I'm trying to use ignite.distributed to train a model with DDP. The issue I encounter is that when spawning 3 or more processes to run my code, it seems to hang indefinitely. It works fine with 2 processes. I even tried a very basic script (similar to the tutorial) and it still hangs.

# run.py
import torch
import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():
    world_size = 4  # if this is 3 or more it hangs

    # some dummy config
    config = {}

    # run task
    idist.spawn("nccl", run, args=(config,), nproc_per_node=world_size)
    
    # the same happens even in this case
    # with idist.Parallel(backend="nccl", nproc_per_node=world_size) as parallel:
    #     parallel.run(run, config)

if __name__ == "__main__":
    main()

Executing this with:

python -m module.run

I’d be very grateful if anyone can weigh in on this.

Environment

  • PyTorch Version: 1.9.0
  • Ignite Version: 0.4.6
  • OS: Ubuntu 20.04.2 LTS
  • How you installed Ignite (conda, pip, source): conda
  • Python version: 3.9.6
  • Any other relevant information: Running on 4 A100-PCIE-40GB GPUs

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (3 by maintainers)

Most upvoted comments

Thanks for the quick feedback again. We are working on a package (aitlas) that has multiple modules in it. I am running the script from the root of the package with python -m package.test. The package/test.py script has the code I shared earlier.

The server has 4 A100-PCIE-40GB GPUs, 2 TB of RAM, and AMD EPYC 7742 64-Core Processors (256 logical CPUs).

PyTorch (1.9.0) and PyTorch-Ignite (0.4.6) are installed with conda (4.9.2).

This appears in the logs when running:

(aitlas) user@kt-gpu2:~/aitlas$ python -m aitlas.test
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes:
        nproc_per_node: 4
        nnodes: 1
        node_rank: 0
2021-09-07 18:43:57,015 ignite.distributed.launcher.Parallel INFO: Spawn function '<function run at 0x7f51227a2e50>' in 4 processes
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

And it stays in this state indefinitely.
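
As an aside, the warning text itself points at a mitigation: pin each process to its GPU before any collective call. Below is a minimal sketch, assuming one process per GPU and that local_rank equals the GPU index (the device_ids argument of barrier() applies to the NCCL backend and has been available since PyTorch 1.8):

# pinned_barrier.py (illustrative only)
import torch
import torch.distributed as dist

def pinned_barrier(local_rank):
    # Bind this process to its GPU so NCCL no longer has to "best-guess" the device.
    torch.cuda.set_device(local_rank)
    # Explicitly tell barrier() which device this rank uses (NCCL backend only).
    dist.barrier(device_ids=[local_rank])

Note that this only addresses the warning; the hang in this thread was ultimately worked around by switching backends, as described below.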

Thanks for the feedback, @ivankitanovski! I'd expect gloo to be a bit slower than nccl. For completeness' sake, could you still please run the commands above so we can understand what happened with nccl?

This did it.

@ivankitanovski Could you try using the gloo backend instead of nccl ? Thanks in advance.

Don’t know why I didn’t try this in the first place. But anyhow, it works now. Not sure if there are major drawbacks to using this over nccl, but I can see the GPUs are being used and things are being processed.

Thanks for the help @vfdev-5 @sdesrozis
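
For reference, the working variant is a one-word change to the snippet from the issue description. A minimal sketch, assuming the same single-node, 4-GPU setup:

# run_gloo.py (hypothetical name; same script as above with the backend switched)
import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():
    # some dummy config
    config = {}

    # only change from the original: backend "nccl" -> "gloo"
    with idist.Parallel(backend="gloo", nproc_per_node=4) as parallel:
        parallel.run(run, config)

if __name__ == "__main__":
    main()

Since gloo runs its collectives over CPU/TCP rather than directly between GPUs, communication-heavy steps such as the gradient all-reduce are expected to be slower than with nccl, which matches the maintainer's remark above.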

@ivankitanovski thanks for the info. I’m really interested to understand what happens in your case on your infrastructure. There are a few things which seem a bit strange to me.

  • cudatoolkit is installed from conda-forge. Technically, PyTorch recommended that for v1.8.0, but for v1.9.0 they say to use the nvidia channel.

Could you please try the following things:

  1. Run your hanging script again with NCCL_DEBUG=INFO and report the full output:
NCCL_DEBUG=INFO python -m package.test
  2. Change the backend to “gloo” and run again (without NCCL_DEBUG=INFO):
python -m package.test

I think this should work.

  3. Let’s update the pure pytorch code snippet from https://github.com/pytorch/ignite/issues/2185#issuecomment-913652792 as follows:
# main.py
# NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py

import time
import torch
import torch.distributed as dist

def pprint(rank, msg):
    # We add sleep to avoid printing clutter
    time.sleep(0.5 * rank)
    print(rank, msg)

if __name__ == "__main__":

    # you can try "gloo" as well instead of "nccl"    
    dist.init_process_group("nccl")

    rank = dist.get_rank()
    ws = dist.get_world_size()
    torch.cuda.set_device(rank)

    pprint(rank, f"Hello from process {rank} among {ws} others")
    pprint(rank, f"Group type: {type(dist.get_backend())} : {dist.get_backend()}")

    pprint(rank, "Call barrier")
    dist.barrier()

    dist.destroy_process_group()

and run it

NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py

I would expect this also to hang.

  4. Optionally, if you have Docker installed, let’s use our prebuilt Docker image to check with another software stack:
docker pull pytorchignite/base:latest
# Assuming that the current folder contains your hanging script with the NCCL backend
docker run --rm -it -v $PWD:/repro -w /repro pytorchignite/base:latest /bin/bash -c "NCCL_DEBUG=INFO python -m package.test"

Thanks a lot for helping to debug that.

@ivankitanovski thanks for the info. I was suspecting that if your server has more than 4 GPUs, then when ignite calls barrier it could somehow freeze everything… (even though I checked that on an 8-GPU server). Could you please also run this script to get more env info:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Do you have any chance to check the same code on another multi-GPU machine? I would like to understand whether it is related to the A100s (hardware) or to software, for example by running the same test in a clean conda env with pytorch and ignite only.

EDIT: Yesterday I also checked on 4 RTX GPUs with the same env as for the previous repro tests (pytorch 1.9.0 + cuda 11.1, python 3.9, nvidia drivers 460, cuda 11.2) and could not reproduce the issue either; the code snippet worked for 3 and 4 processes.

@ivankitanovski can you please share more info on your infrastructure: the total number of GPUs available on the node, how exactly you run the code, and whether any logs are printed before it hangs…

Tested all the above code snippets on 3 and 4 V100 GPUs (AWS) with pytorch 1.9.0 and python 3.9, and everything works fine. Here are the logs for the first code snippet with spawn:

Spawn on 3 GPUs
(test) root@ip-172-31-3-32:/repro/repro-ignite-2185# python -u ignite_main_spawn.py
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Running basic DDP example on rank 0.
Running basic DDP example on rank 2.
Running basic DDP example on rank 1.
Spawn on 4 GPUs
(test) root@ip-172-31-3-32:/repro/repro-ignite-2185# python -u ignite_main_spawn.py
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Running basic DDP example on rank 3.
Running basic DDP example on rank 0.
Running basic DDP example on rank 1.
Running basic DDP example on rank 2.
nvidia-smi
(test) root@ip-172-31-3-32:/repro/repro-ignite-2185# nvidia-smi
Tue Sep  7 00:11:36 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   38C    P0    53W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   37C    P0    48W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P0    53W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    56W / 300W |      0MiB / 16160MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

EDIT: same on a server with 8 V100 using 3 or 4 GPUs:

Spawn on 3 GPUs (p3.16xlarge = 8 V100)
(test) root@ip-172-31-5-246:/repro/repro-ignite-2185# python -u ignite_main_spawn.py
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)                    
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)                    
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)                    
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)                    
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.      
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Running basic DDP example on rank 0.
Running basic DDP example on rank 1.
Running basic DDP example on rank 2.
Spawn on 4 GPUs (p3.16xlarge = 8 V100)
(test) root@ip-172-31-5-246:/repro/repro-ignite-2185# python -u ignite_main_spawn.py
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
/opt/conda/envs/test/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/torch/csrc/utils/tensor_numpy.cpp:67.)
  _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Running basic DDP example on rank 1.
Running basic DDP example on rank 2.
Running basic DDP example on rank 0.
Running basic DDP example on rank 3.
nvidia-smi
(test) root@ip-172-31-5-246:/repro/repro-ignite-2185# nvidia-smi
Tue Sep  7 15:12:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:17.0 Off |                    0 |
| N/A   32C    P0    55W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:18.0 Off |                    0 |
| N/A   32C    P0    54W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:19.0 Off |                    0 |
| N/A   33C    P0    56W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   32C    P0    55W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   31C    P0    55W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   31C    P0    56W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   32C    P0    55W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    54W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I will try ASAP on another machine, without using the Slurm configuration.

Could you please check whether pytorch distributed alone works with the A100s on 4 GPUs? Please try to run this code:

# main.py
# python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py

import time
import torch
import torch.distributed as dist


def pprint(rank, msg):
    # We add sleep to avoid printing clutter
    time.sleep(0.5 * rank)
    print(rank, msg)


if __name__ == "__main__":

    # you can try "gloo" as well instead of "nccl"    
    dist.init_process_group("nccl")

    rank = dist.get_rank()
    ws = dist.get_world_size()
    torch.cuda.set_device(rank)

    pprint(rank, f"Hello from process {rank} among {ws} others")
    pprint(rank, f"Group type: {type(dist.get_backend())} : {dist.get_backend()}")
    
    dist.destroy_process_group()

@ivankitanovski Could you try using the gloo backend instead of nccl ? Thanks in advance.

@ivankitanovski thanks for reporting, we’ll investigate what happens exactly.

Meanwhile, can you try to run the training with the torch distributed launcher:

# run.py
import torch
import ignite.distributed as idist

def run(rank, config):
    print(f"Running basic DDP example on rank {rank}.")

def main():

    # some dummy config
    config = {}
    
    # no nproc_per_node here: processes are created by torch.distributed.launch
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(run, config)

if __name__ == "__main__":
    main()
# if you have 4 GPUs
python -u -m torch.distributed.launch --nproc_per_node=4 --use_env run.py