pytorch-lightning: Single node DDP: "Default process group is not initialized"

🐛 Bug

Unable to start single-node DDP training on 0.8.0.

To Reproduce

I was going to run the gpu_template, but ran into #2235. Both ways of running the template result in the same error:

$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
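
For reference, the assertion comes from torch.distributed itself: barrier() can only be used once init_process_group() has set up the default process group, so any code path that calls self.barrier() before that setup fails exactly like this. A minimal illustration of the guard (illustration only, not Lightning's actual fix):

import torch.distributed as dist

# barrier() asserts unless the default process group exists; this guard is the
# generic way to avoid the AssertionError shown in the traceback above.
if dist.is_available() and dist.is_initialized():
    dist.barrier()
# Calling dist.barrier() without a prior init_process_group() raises:
#   AssertionError: Default process group is not initialized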

Most upvoted comments

Can we re-open this issue? I am still hitting the "Default process group is not initialized" error when I call trainer.test() with ddp (with any number of GPUs, even 1). I’m using the latest release from yesterday.
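
One possible workaround (a sketch only, not an official fix; `model` is a placeholder for whatever LightningModule you trained, and the Trainer arguments are an assumption based on the 0.8.x API) is to run the test pass in a fresh single-process Trainer so that trainer.test() never reaches the distributed barrier:

import pytorch_lightning as pl

# Sketch of a possible workaround (placeholder names): train with DDP as usual,
# then build a separate single-GPU Trainer for testing so that no distributed
# process group is required during trainer.test().
trainer = pl.Trainer(gpus=4, distributed_backend="ddp")
trainer.fit(model)

test_trainer = pl.Trainer(gpus=1)   # single process, no ddp backend
test_trainer.test(model)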

+1, doesn’t look like the issue is resolved yet.

Having the same problem… I also tried downgrading PL to an older version, like 0.7.5, and running inference with that. But a model trained and saved with 0.8.x does not seem to be directly compatible with the older version.
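
In case it helps anyone stuck on the same thing: since a Lightning checkpoint is a plain torch pickle, the raw weights can usually be pulled out and loaded into a model built with the older version. A rough sketch (MyModel, hparams, the checkpoint path, and the "state_dict" key are assumptions about the checkpoint layout, not confirmed in this thread):

import torch

# Rough workaround sketch (placeholder names): extract the raw weights from a
# 0.8.x checkpoint and load them into a model constructed under 0.7.5.
ckpt = torch.load("path/to/model.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]      # assumes weights are stored under this key

model = MyModel(hparams)             # build the LightningModule as in 0.7.5
model.load_state_dict(state_dict)
model.eval()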