pytorch-lightning: CUDA OOM when initializing DDP
🐛 Bug
Hey everyone,
I am trying to train a model on our lab's GPU workstation (which has 10 GPUs, of which usually only one is in use) using Lightning and DDP. I have tried several models (including the BoringModel) without success: in particular, I get a CUDA OOM error when DDP initializes. I tried the BoringModel with the following Trainer configuration:
trainer = Trainer(
    default_root_dir=os.getcwd(),
    limit_train_batches=1,
    limit_val_batches=1,
    max_epochs=1,
    weights_summary=None,
    gpus=2,
    accelerator="ddp",
    auto_select_gpus=True,
)
And the output I get is the following:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "boring_model.py", line 138, in <module>
run_test()
File "boring_model.py", line 133, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 138, in <module>
run_test()
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 133, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
The script with the BoringModel that I run on our workstation is in this gist; a rough sketch of it follows below.
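For reference, here is a minimal sketch of that kind of script. It is not the gist itself: it assumes the standard BoringModel/RandomDataset from the Lightning bug-report template, combined with the Trainer configuration shown above.

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    """Random data, as in the Lightning bug-report template."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    """Single linear layer, as in the Lightning bug-report template."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return {"loss": self(batch).sum()}

    def validation_step(self, batch, batch_idx):
        return {"val_loss": self(batch).sum()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run_test():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        accelerator="ddp",
        auto_select_gpus=True,
    )
    trainer.fit(model, train_data, val_data)


if __name__ == "__main__":
    run_test()
```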
However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).
I also tried to run the same notebook locally, and on the first attempt the result is the following:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)
<ipython-input-10-d400f0366266> in test_x(tmpdir)
16
17 # Train the model ⚡
---> 18 trainer.fit(model, train, val)
19
20 trainer.test(test_dataloaders=test)
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
442 self.call_hook('on_fit_start')
443
--> 444 results = self.accelerator_backend.train()
445 self.accelerator_backend.teardown()
446
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in train(self)
146 model = self.trainer.model
147
--> 148 results = self.ddp_train(process_idx=self.task_idx, model=model)
149 if 'WORLD_SIZE' in os.environ:
150 del os.environ['WORLD_SIZE']
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py in ddp_train(self, process_idx, model)
236 # where to store ip_table
237 model.trainer = self.trainer
--> 238 self.init_ddp_connection(
239 self.trainer.global_rank,
240 self.trainer.world_size,
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py in init_ddp_connection(self, global_rank, world_size, is_slurm_managing_tasks)
213 f"initializing ddp: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}"
214 )
--> 215 torch_distrib.init_process_group(
216 torch_backend, rank=global_rank, world_size=world_size
217 )
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
440 # process groups including global variables are updated correctly on all
441 # ranks.
--> 442 barrier()
443
444 def _new_process_group_helper(world_size,
~/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py in barrier(group, async_op)
1945 if group == GroupMember.WORLD:
1946 _check_default_pg()
-> 1947 work = _default_pg.barrier()
1948 else:
1949 work = group.barrier()
RuntimeError: CUDA error: out of memory
On the second attempt, though, it works as expected (i.e. the model trains with no errors, even with multiple GPUs)! So, in the script, I tried the following to attempt the fit twice, as in the notebook:
try:
    trainer.fit(model, train_data, val_data)
except:
    trainer.fit(model, train_data, val_data)
As a result, I get this stack trace:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7,8,9]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "boring_model.py", line 135, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "boring_model.py", line 143, in <module>
run_test()
File "boring_model.py", line 137, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
model = self.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
model = self.ddp_plugin.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
model = LightningDistributedDataParallel(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 135, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 238, in ddp_train
self.init_ddp_connection(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 215, in init_ddp_connection
torch_distrib.init_process_group(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729009598/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 143, in <module>
run_test()
File "/home/edoardo.debenedetti/projects/gans-mia-unlearning/boring_model.py", line 137, in run_test
trainer.fit(model, train_data, val_data)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 444, in fit
results = self.accelerator_backend.train()
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 275, in ddp_train
model = self.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 292, in configure_ddp
model = self.ddp_plugin.configure_ddp(model, device_ids)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/ddp_plugin.py", line 59, in configure_ddp
model = LightningDistributedDataParallel(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/edoardo.debenedetti/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: Broken pipe
Expected behavior
The models should train without issues.
Environment
- CUDA:
    - GPU:
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
        - TITAN V
    - available: True
    - version: 10.1
- Packages:
    - numpy: 1.19.2
    - pyTorch_debug: True
    - pyTorch_version: 1.7.0
    - pytorch-lightning: 1.0.6
    - tqdm: 4.52.0
- System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #1 SMP Fri Oct 18 17:15:30 UTC 2019
Additional context
I tried installing torch, torchvision, and pytorch-lightning with both Conda and pip in fresh environments, and the problem persists.
This also happens if I select (free) GPUs manually by passing them to the gpus flag as a List[int]. Interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. Finally, with accelerator="dp" I have no issues either; a sketch of these variants is below.
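To be concrete, these are roughly the variants I mean (the GPU indices are purely illustrative):

```python
from pytorch_lightning import Trainer

# 1) Manually selecting free GPUs still fails with DDP:
trainer = Trainer(gpus=[1, 2], accelerator="ddp")

# 2) The same selection with DataParallel trains without issues:
trainer = Trainer(gpus=[1, 2], accelerator="dp")
```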
Thanks in advance!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 51 (27 by maintainers)
With `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9` I get the usual OOM exception. With `CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9`, instead, it works as with `ddp`.

Hi! Not what OP reported, but in case you land here because you had an OOM with PL 1.3.x, it may be because you were running a script on a GPU index > 0 (not the first GPU) and PL would still allocate some memory on GPU 0. If you had another experiment running on GPU 0, it could throw an OOM. The issue was fixed in #8165.
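For reference, a minimal sketch of that environment-variable workaround (the device list is specific to this 10-GPU machine and purely illustrative; exporting the variable in the shell before launching the script works the same way):

```python
import os

# Hide GPU 0 (the busy device) before anything initializes CUDA;
# the remaining devices are then re-indexed starting from 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7,8,9"

import torch  # noqa: E402  (imported only after the variable is set)
from pytorch_lightning import Trainer  # noqa: E402
```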
OK, it seems this has nothing to do with the selection of the backend. Running the examples on CPU (i.e. `Trainer(gpus=None)`) allocates 600 MB of memory on my GPU, and setting `CUDA_VISIBLE_DEVICES=""` prevents this. If I instead take a pure PyTorch script like https://github.com/pytorch/examples/blob/master/mnist/main.py and run it on CPU, it does not allocate any extra memory on the GPUs, so there is clearly something in Lightning that communicates with CUDA even in CPU mode.
That's all I have in terms of analysis for now. Don't yet have a clue where to look…
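To illustrate that observation, a rough repro sketch (TinyModel and the dataset are illustrative stand-ins; the ~600 MB figure is simply what was reported above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return {"loss": self.layer(batch[0]).sum()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# CPU-only run: no GPUs requested at all.
train_data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=2)
trainer = Trainer(gpus=None, max_epochs=1, limit_train_batches=1)
trainer.fit(TinyModel(), train_data)
# While this runs, `nvidia-smi` reportedly still shows memory allocated on a GPU,
# unless the process is launched with CUDA_VISIBLE_DEVICES="" to hide all devices.
```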
Same issue here: with `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` I get the usual OOM exception. With `CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7`, instead, it works as with `ddp`. How can this be solved?
Sorry I missed this. Yes, everything is on CPU for sure. I'll try with the BoringModel as well.
In the meantime, I can confirm that `ddp` has mysterious processes running while `ddp_spawn` doesn't (it only has a single extra process for the CUDA context).
EDIT: I can't reproduce it with the BoringModel. I'm trying to understand whether it depends on my custom Dataset…
Maybe related, not sure, but today I discovered that if you run on CPU but still have pin_memory in your dataloader, it will allocate memory on the GPU. This is what was happening in this comment: https://github.com/PyTorchLightning/pytorch-lightning/issues/4705#issuecomment-738331370
This could also result in an OOM, but I am not sure if this is the case here.
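A sketch of the dataloader setting being described, using a plain PyTorch DataLoader (the dataset and flag are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 32))

# Per the comment above, pin_memory=True can end up allocating GPU memory
# even in a CPU-only run, so only pin host memory when a GPU is actually used.
running_on_gpu = False  # e.g. a CPU-only debug run
loader = DataLoader(dataset, batch_size=2, pin_memory=running_on_gpu)
```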
@ktrapeznikov You probably get the NCCL error because you are using PyTorch 1.7; see my answer here, maybe it helps.
I forgot to mention that it also takes memory on GPU 1, so it is taking memory on both GPUs (around 10 GiB on GPU 1 and 500 MiB on GPU 0). In the afternoon (CET timezone) I'll do my best to create a reproducible script.
Tried with the version installed in a fresh environment: `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 python boring_model.py` still does not work, while `CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9 python boring_model.py` does. This still happens both with `ddp` and `ddp_spawn`.