pytorch-lightning: Training stuck before first epoch with `ddp` and multi-gpu

🐛 Bug

Training is stuck when using ddp, gpus=[0, 1], and num_sanity_val_steps=2. The two validation sanity checks are executed, but then the code seems to be stuck at self.scaler.step(optimizer) in pre_optimizer_step in pytorch_lightning/plugins/precision/native_amp.py, and more specifically inside PyTorch at https://github.com/pytorch/pytorch/blob/4f8b986e28736b59bc46cd0873a0f36fdaa6f5b8/torch/cuda/amp/grad_scaler.py#L284

If I instead use dp, or gpus=[0], or num_sanity_val_steps=0, training runs normally (any one of these changes on its own makes the code work). A sketch of the failing configuration is shown below.
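For context, a minimal sketch of the Trainer setup described above (the model and data are not shown in the report; precision=16 is an assumption inferred from the native AMP traceback, and the argument names follow the 1.5.x API):

```python
from pytorch_lightning import Trainer

# Configuration that hangs after the sanity checks (pytorch-lightning >= 1.4.0):
trainer = Trainer(
    strategy="ddp",             # spelled `accelerator="ddp"` on releases before 1.5
    gpus=[0, 1],
    num_sanity_val_steps=2,     # the default; the two sanity batches still run
    precision=16,               # native AMP -- the hang is inside GradScaler.step()
)

# Any one of these changes on its own makes training proceed normally:
#   strategy="dp"           instead of "ddp"
#   gpus=[0]                (single GPU)
#   num_sanity_val_steps=0
```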

The code works with:

  • torch==1.8.1+cu111, pytorch-lightning==1.3.8
  • torch==1.10.2+cu113, pytorch-lightning==1.3.8

The code does not work with:

  • torch==1.10.2+cu113, pytorch-lightning==1.4.0
  • torch==1.10.2+cu113, pytorch-lightning==1.5.10

To Reproduce

Annoyingly, I cannot reproduce the issue with the BoringModel.

Environment

  • PyTorch Lightning Version: 1.5.10
  • PyTorch Version: 1.10.2+cu113
  • Python version: 3.7
  • OS: Ubuntu 18.04
  • CUDA/cuDNN version: 11.6
  • GPU models and configuration: 2*2080Ti
  • How you installed PyTorch: pip
  • Any other relevant information: The code works with pytorch-lightning versions before 1.4.0

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @carmocca

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 9
  • Comments: 20 (6 by maintainers)

Most upvoted comments

Just as a clarification: for me, the training is stuck before the first training step is executed, i.e. after the validation sanity checks and before the second batch.

I had the same issue. I replaced the DDP sampler myself and set `drop_last=True` to make sure each node gets the same number of batches, but it still got stuck at the end. The funny thing is that if `limit_train_batches` is set to an int, it works fine.
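A rough sketch of that sampler workaround, assuming a LightningModule with a train_dataset attribute (the dataset, batch size, and worker count are placeholders); replace_sampler_ddp=False also has to be passed to the Trainer so Lightning does not inject its own sampler on top:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_dataloader(self):
    # Explicit DistributedSampler with drop_last=True so that every rank
    # sees the same number of batches and no rank waits on a missing peer.
    sampler = DistributedSampler(self.train_dataset, shuffle=True, drop_last=True)
    return DataLoader(
        self.train_dataset,
        batch_size=32,       # placeholder
        sampler=sampler,
        num_workers=4,       # placeholder
        drop_last=True,      # also drop the ragged final batch on each rank
    )
```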

> Just as a clarification: for me, the training is stuck before the first training step is executed, i.e. after the validation sanity checks and before the second batch.

One possible reason I can think of is that the data workers might not all return the same number of batches, resulting in the GPUs waiting on each other indefinitely. But I am not sure why it works on one version and doesn’t work on the latest.
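To illustrate that hypothesis (a toy two-process script, not the reporter's code): when one rank runs more iterations than its peer, its backward() blocks in the gradient all-reduce, waiting for a rank that has already left the loop.

```python
# Launch with: torchrun --nproc_per_node=2 uneven_batches.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(8, 1).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Rank 0 gets one extra batch: its final backward() waits forever for
    # rank 1, which has already finished its loop.
    n_batches = 3 if rank == 0 else 2
    for _ in range(n_batches):
        loss = model(torch.randn(4, 8, device=f"cuda:{rank}")).sum()
        opt.zero_grad()
        loss.backward()      # hangs on rank 0's extra iteration
        opt.step()

if __name__ == "__main__":
    main()
```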

I would suggest keeping num_sanity_val_steps enabled, as it identifies issues in the validation steps early, so you don’t have to wait until epoch 0 completes to find them.
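For reference, the sanity check is controlled by a single Trainer argument (a sketch; all other arguments omitted):

```python
from pytorch_lightning import Trainer

# num_sanity_val_steps=2 is the default; -1 runs the entire validation set
# before training, and 0 disables the check (one of the workarounds above).
trainer = Trainer(num_sanity_val_steps=2)
```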