pytorch-lightning: Training stuck before first epoch with `ddp` and multi-GPU
🐛 Bug
Training is stuck when using `ddp`, `gpus=[0, 1]`, and `num_sanity_val_steps=2`. The two validation sanity checks are executed, but then execution seems to be stuck at `self.scaler.step(optimizer)` in `pre_optimizer_step` in `pytorch_lightning/plugins/precision/native_amp.py`, and more specifically in PyTorch at
https://github.com/pytorch/pytorch/blob/4f8b986e28736b59bc46cd0873a0f36fdaa6f5b8/torch/cuda/amp/grad_scaler.py#L284
If I instead use `dp`, or `gpus=[0]`, or `num_sanity_val_steps=0`, training runs normally (any one of these changes makes the code work).
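For context (a minimal plain-PyTorch sketch, not Lightning's actual plugin code): under native AMP, `scaler.step(optimizer)` is where the scaler checks the unscaled gradients for infs/NaNs before stepping, which is roughly where the hang is reported. With `enabled=False` the scaler is a pass-through, so the same loop runs on CPU:

```python
import torch

# Minimal native-AMP-style training loop. enabled=False makes GradScaler
# a no-op, so scaler.step() simply calls optimizer.step() (CPU-friendly).
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=False)

x, y = torch.randn(8, 4), torch.randn(8, 1)
for _ in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # <- the call the issue reports as hanging under DDP
    scaler.update()
```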
Also, the code works with:
- `torch==1.8.1+cu111`, `pytorch-lightning==1.3.8`
- `torch==1.10.2+cu113`, `pytorch-lightning==1.3.8`

The code does not work with:
- `torch==1.10.2+cu113`, `pytorch-lightning==1.4.0`
- `torch==1.10.2+cu113`, `pytorch-lightning==1.5.10`
To Reproduce
Annoyingly, I cannot reproduce the issue with the BoringModel.
Environment
- PyTorch Lightning Version: 1.5.10
- PyTorch Version: 1.10.2+cu113
- Python version: 3.7
- OS: Ubuntu 18.04
- CUDA/cuDNN version: 11.6
- GPU models and configuration: 2*2080Ti
- How you installed PyTorch: pip
- Any other relevant information:
Code works with pytorch-lightning versions before 1.4.0
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @carmocca
About this issue
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 20 (6 by maintainers)
Just as a clarification: for me, training is stuck before the first training step is executed, i.e. after the validation sanity checks and before the second batch.
I had the same issue. I replaced the DDP sampler myself and set `drop_last=True` to make sure each node gets the same number of batches, but it still gets stuck at the end. The funny thing is that if `limit_train_batches` is set to an int, it works fine.
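To illustrate the `drop_last=True` workaround mentioned above (a sketch with explicit `num_replicas`/`rank` so it runs without initializing a process group): `DistributedSampler(..., drop_last=True)` truncates the dataset so every rank sees the same number of samples, instead of padding with repeated samples:

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(11))  # 11 samples, not divisible by 2 ranks

# drop_last=False (default): pads to 12 samples, so each of 2 ranks gets 6
padded = DistributedSampler(dataset, num_replicas=2, rank=0, drop_last=False)
# drop_last=True: truncates to 10 samples, so each rank gets exactly 5
truncated = DistributedSampler(dataset, num_replicas=2, rank=0, drop_last=True)

print(len(padded), len(truncated))
```

Either way, both ranks end up with the same per-rank length, so neither approach by itself explains a hang from mismatched batch counts.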
One possible reason I can think of is that the data workers might not have returned the same number of batches, resulting in GPUs waiting indefinitely. But I am not sure why it works on one version and doesn't work on the latest.
I would suggest using `num_sanity_val_steps`, as it identifies possible issues in the validation steps early, so we don't need to wait until epoch 0 completes to find them.
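A minimal config sketch of that suggestion, using the setup from the report (the `strategy` argument is the PL 1.5 spelling; `MyModel` is a placeholder for your own `LightningModule`):

```python
import pytorch_lightning as pl

# num_sanity_val_steps=2 (the default) runs 2 validation batches before
# training starts, surfacing validation bugs without waiting for epoch 0.
trainer = pl.Trainer(gpus=[0, 1], strategy="ddp", num_sanity_val_steps=2)
# trainer.fit(MyModel())  # MyModel: hypothetical LightningModule
```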