pytorch-lightning: Single node DDP: "Default process group is not initialized"
🐛 Bug
Unable to start single node ddp training on 0.8.0
To Reproduce
was going to run the gpu_template but… #2235
both methods of running the template result in the same error
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp_spawn
$ python -m pl_examples.basic_examples.gpu_template --gpus 4 --distributed_backend ddp
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 80, in <module>
    main(hyperparams)
  File "/opt/conda/lib/python3.6/site-packages/pl_examples/basic_examples/gpu_template.py", line 41, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in fit
    self.barrier('fit_prepare_data')
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1261, in barrier
    torch_distrib.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1484, in barrier
    _check_default_pg()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 187, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
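The traceback shows torch.distributed.barrier() being called before any default process group exists. As a rough illustration of the failure mode (not the fix the maintainers applied), a barrier can be guarded with the torch.distributed.is_available() and torch.distributed.is_initialized() checks. The helper name safe_barrier below is hypothetical and is not part of PyTorch or PyTorch Lightning; the sketch also assumes it is acceptable to skip the barrier when no process group has been set up.

```python
def safe_barrier():
    """Synchronize processes only if a default process group exists.

    Hypothetical helper for illustration. Returns True if a barrier was
    actually executed, False if it was skipped.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed, so there is nothing to synchronize.
        return False

    # is_available() is False on builds without distributed support;
    # is_initialized() is False until init_process_group() has been called.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    return False


if __name__ == "__main__":
    # In a plain single-process run no process group has been initialized,
    # so the barrier is skipped instead of raising the assertion above.
    print("barrier executed:", safe_barrier())
```

In a plain `python script.py` run the helper skips the barrier, because no process group has been initialized. Inside worker processes that have already called init_process_group(), it would synchronize as usual.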
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (7 by maintainers)
Can we re-open this issue? I am still having the "Default process group is not initialized" issue when I hit trainer.test() with ddp (with any number of gpus, even 1). I’m using the latest release from yesterday.
+1, doesn’t look like the issue is resolved yet.
Having the same problem… I also tried downgrading pl to an older version, such as 0.7.5, and using that version for inference. However, a model trained and saved with 0.8.x does not seem to be directly compatible with the older version.