pytorch-lightning: CUDA OOM when initializing DDP

šŸ› Bug

Hey everyone,

I am trying to train a model on our lab's GPU workstation (it has 10 GPUs, of which usually only one is in use) using Lightning and DDP. I have tried several models (including the BoringModel) without success: I get a CUDA OOM error when DDP initializes. I ran the BoringModel with the following Trainer configuration:

trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        accelerator="ddp",
        auto_select_gpus=True
)

And the output I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "boring_model.py", line 138, in <module>
    run_test()
  File "boring_model.py", line 133, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 138, in <module>
    run_test()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 133, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe

The BoringModel script that I run on our workstation is in this gist.

However, this doesn’t happen on Colab using your BoringModel notebook (my version can be found here).

I also tried running the same notebook locally, and on the first attempt the result is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-10-d400f0366266> in test_x(tmpdir)
     16 
     17     # Train the model ⚔
---> 18     trainer.fit(model, train, val)
     19 
     20     trainer.test(test_dataloaders=test)

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    442         self.call_hook('on_fit_start')
    443 
--> 444         results = self.accelerator_backend.train()
    445         self.accelerator_backend.teardown()
    446 

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in train(self)
    146         model = self.trainer.model
    147 
--> 148         results = self.ddp_train(process_idx=self.task_idx, model=model)
    149         if 'WORLD_SIZE' in os.environ:
    150             del os.environ['WORLD_SIZE']

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in ddp_train(self, process_idx, model)
    236         # where to store ip_table
    237         model.trainer = self.trainer
--> 238         self.init_ddp_connection(
    239             self.trainer.global_rank,
    240             self.trainer.world_size,

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks)
    213                 f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"
    214             )
--> 215             torch_distrib.init_process_group(
    216                 torch_backend, rank=global_rank, world_size=world_size
    217             )

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
    440     # process groups including global variables are updated correctly on all
    441     # ranks.
--> 442     barrier()
    443 
    444 def _new_process_group_helper(world_size,

~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in barrier(group, async_op)
   1945     if group == GroupMember.WORLD:
   1946         _check_default_pg()
-> 1947         work = _default_pg.barrier()
   1948     else:
   1949         work = group.barrier()

RuntimeError: CUDA error: out of memory

On the second attempt, though, it works as expected (i.e. the model trains without errors, even on multiple GPUs)! So in the script I tried the following to attempt the fit twice, as in the notebook:

try:
	trainer.fit(model, train_data, val_data)
except:
	trainer.fit(model, train_data, val_data)

As a result, I get this stack trace:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "boring_model.py", line 135, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "boring_model.py", line 143, in <module>
    run_test()
  File "boring_model.py", line 137, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
    model = self.ddp_plugin.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
    model = LightningDistributedDataParallel(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 135, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
    self.init_ddp_connection(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
    torch_distrib.init_process_group(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 143, in <module>
    run_test()
  File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 137, in run_test
    trainer.fit(model, train_data, val_data)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
    results = self.accelerator_backend.train()
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
    model = self.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
    model = self.ddp_plugin.configure_ddp(model, device_ids)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
    model = LightningDistributedDataParallel(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Broken pipe

Expected behavior

The models should train without issues.

Environment

  • CUDA:
    • GPU:
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
      • TITAN V
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.2
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0
    • pytorch-lightning: 1.0.6
    • tqdm: 4.52.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.5
    • version: #1 SMP Fri Oct 18 17:15:30 UTC 2019

Additional context

I tried installing torch, torchvision and pytorch-lightning with both Conda and pip in fresh environments, and the problem persists.

This also happens if I select (free) GPUs manually by specifying them in the gpus flag as a List[int], roughly as in the sketch below. Interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. Finally, with accelerator="dp" I have no issues either.
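For reference, selecting GPUs manually looks roughly like this (the indices are illustrative; I pick ones that nvidia-smi reports as free):

trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=[1, 2],  # explicit list of free GPU indices instead of gpus=2
        accelerator="ddp",
)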

Thanks in advance!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 51 (27 by maintainers)

Most upvoted comments

With CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 I get the usual OOM exception.

With CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9, instead, it works as with ddp.
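The same workaround can also be applied from inside the script instead of the shell; a minimal sketch (it must run before anything initializes CUDA, and the device list is illustrative):

import os

# Hide GPU 0 from this process; must be set before torch/Lightning create a CUDA context.
# The indices are illustrative: list whichever GPUs are actually free.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7,8,9"

import torch  # noqa: E402  (imported after setting the variable on purpose)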

Hi. Not what the OP reported, but in case you land here because you hit an OOM with PL 1.3.x: it may be because you were running a script on a GPU index > 0 (not the first GPU) while PL still allocated some memory on GPU 0. If another experiment was running on GPU 0, that could throw an OOM. This was fixed in #8165.

OK, it seems this has nothing to do with the selection of the backend. Running the examples on CPU (i.e. Trainer(gpus=None)) allocates 600 MB of memory on my GPU, and setting CUDA_VISIBLE_DEVICES="" prevents this.

If, however, I take a pure PyTorch script like this one, https://github.com/pytorch/examples/blob/master/mnist/main.py, and run it on CPU, it does not allocate any extra memory on the GPUs, so there is clearly something in Lightning that touches CUDA even in CPU mode.
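One minimal way to check this (a sketch, not the exact script from this thread): torch.cuda.is_initialized() reports whether the process has created a CUDA context, so printing it before and after a CPU-only fit shows whether Lightning touched CUDA.

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch[0]).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    print("CUDA initialized before fit:", torch.cuda.is_initialized())
    data = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)
    trainer = Trainer(gpus=None, max_epochs=1, limit_train_batches=1, weights_summary=None)
    trainer.fit(TinyModel(), data)
    # True here means something created a CUDA context even though gpus=None.
    print("CUDA initialized after fit:", torch.cuda.is_initialized())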

That’s all I have in terms of analysis for now. Don’t yet have a clue where to look…

Same issue as above: with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 I get the usual OOM exception; with CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7, instead, it works with ddp. How can I solve it?

Sorry I missed this. Yes, everything is on CPU for sure. I’ll try with the BoringModel as well.

In the meantime, I can confirm that ddp has mysterious processes running while ddp_spawn doesn't (it only has a single extra process for the CUDA context).

EDIT: I can’t reproduce it with the BoringModel. I’m trying to understand whether it depends on my custom Dataset…

Maybe related, not sure, but today I discovered that if you run on CPU but still have pin_memory=True in your dataloader, it will allocate memory on the GPU. This is what was happening in this comment: https://github.com/PyTorchLightning/pytorch-lightning/issues/4705#issuecomment-738331370

This could also result in an OOM, but I am not sure that is the case here.
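A minimal guard for this (a sketch; the dataset is a placeholder and num_gpus stands for whatever you pass to Trainer(gpus=...)):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 32))  # placeholder data

# pin_memory=True pins host pages via CUDA and creates a CUDA context even on CPU runs,
# so only enable it when the Trainer will actually use GPUs.
num_gpus = 0  # mirrors Trainer(gpus=...); 0 or None means CPU-only
loader = DataLoader(dataset, batch_size=8, pin_memory=bool(num_gpus))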

@ktrapeznikov You probably get the NCCL error because you are using PyTorch 1.7; see my answer here, maybe it helps.

I forgot to mention that it also takes memory on GPU 1, so it is taking memory on both GPUs (around 10 GiB on GPU 1 and 500 MiB on GPU 0). In the afternoon (CET) I'll do my best to create a reproducible script.

Tried with the version installed in a fresh environment via

pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python boring_model.py is still not working, while CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9 python boring_model.py does. This still happens both with ddp and ddp_spawn.