pytorch-lightning: Training is interrupted without error with Multi-GPU
🐛 Bug
The training is randomly interrupted in the middle of an epoch without any error; the console only prints: Terminated. The crash does not always occur, but when it does, it is mostly between epochs 2-4. Noticeably, processes keep running after the termination and the graphics cards are still occupied by python processes.
We train the PyTorch version of the ImageGPT model with huggingface transformers. It could also be a problem with huggingface; we are not sure.
Epoch 1:  29%|██▉       | 9413/32393 [3:28:18<8:28:33, 1.33s/it, loss=3.23, v_num=9]Terminated
Please reproduce using the BoringModel
Can't reproduce with the BoringModel.
Code
class ImageGPT(pl.LightningModule):
    def __init__(self, learning_rate=learning_rate):
        super().__init__()
        self.gpt2 = ImageGPT2LMHeadModel(config=...)
        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.gpt2(x, past_key_values=None)

....
logger = pl_loggers.TensorBoardLogger(save_dir="logs", name=name)

checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    monitor='val_loss',
    mode='min',
    filepath='../models',
    prefix='ImageGPT'
)

trainer = Trainer(
    accelerator='ddp',
    max_epochs=10,
    max_steps=None,
    precision=32,
    accumulate_grad_batches=1,
    gpus=[0, 1, 2],
    callbacks=[checkpoint_callback],
    logger=logger,
    gradient_clip_val=0.6
)

trainer.fit(model=model, datamodule=datamodule)
Expected behavior
The training is fully completed across all epochs.
Environment
- CUDA:
  - GPU:
    - TITAN RTX
    - TITAN RTX
    - TITAN RTX
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.4
  - pyTorch_debug: False
  - pyTorch_version: 1.7.1
  - pytorch-lightning: 1.1.2
  - transformers: 3.5.1
  - tqdm: 4.55.0
- System:
  - OS: Linux, 64bit
  - processor: x86_64
  - python: 3.7.4
  - version: #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
Additional context
We have already tried the following to solve the problem:
- set the num_workers of the dataloaders to 0 or 1 (instead of 32-64)
- went back to 32-bit precision
- tried different learning rates
- added gradient clipping
- used the AdamW implementation from huggingface
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 24 (10 by maintainers)
I have the same issue, but only running on one GPU.
I suggest:
- mentioning in the Logging section of the docs that logging with distributed training needs special handling and care, and then referring to the section Multi-GPU training / Prepare your code / Synchronize validation and test logging
- covering validation_epoch_end in the section Multi-GPU training / Prepare your code / Synchronize validation and test logging

This happens to me too, but I don't get "Terminated" at the end of the progress bar, it just stops, and when I check the system with "top i", I see 4 python processes running at 100% and 4 GPUs at about 90% capacity, but nothing is changing. Sometimes, after like an hour or two, it just randomly restarts, but the s/it jumps from 2.5 to like 85.
DDPPlugin(find_unused_parameters=True) seems to fix the problem for me too.
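For reference, a minimal sketch of how that plugin can be passed to the Trainer; the import path below is the one used by Lightning 1.2+, on 1.1.x the class may live in pytorch_lightning.plugins.ddp_plugin, so treat it as an approximation for your version:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

trainer = Trainer(
    accelerator='ddp',
    gpus=[0, 1, 2],
    # extra kwargs are forwarded to torch.nn.parallel.DistributedDataParallel
    plugins=[DDPPlugin(find_unused_parameters=True)],
)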
Sure, I'll try to get some time. In the meantime I tested the all_gather function, which works fine. Just change your validation_epoch_end to look like this:
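(The original snippet is not preserved in this thread, so here is a minimal sketch of the idea, assuming validation_step returns a dict with a "val_loss" tensor:)

import torch

# inside your LightningModule
def validation_epoch_end(self, outputs):
    # average this rank's per-batch validation losses
    loss = torch.stack([out["val_loss"] for out in outputs]).mean()
    # self.all_gather collects the value from every DDP process; taking the
    # mean gives each rank the same global score
    loss = self.all_gather(loss).mean()
    # every rank now logs/monitors an identical 'val_loss'
    self.log("val_loss", loss)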
Let me give some context as to why this fix works. When we run validation across distributed processes, each GPU/process gets a different set of data batches. This means the score calculated on every GPU is different unless we do some form of synchronisation between the processes. This can either be done via a pl.Metric (which handles the syncing for you) or manually in validation_epoch_end. Note that validation_epoch_end doesn't automatically sync the batches across processes; this is because in many cases you don't want to do this, e.g. if you're using a pl.Metric to handle it instead. If you do want to sync, you can sync tensors and some Python primitives using self.all_gather, like suggested above!

@marrrcin maybe the explanation could be insightful? If you're running into a different error and can get a reproducible script, let me know, I can help resolve it.
@PyTorchLightning/core-contributors @justusschock this has come up a few times as a bug. How about: if we receive different monitor scores in the model checkpoint from different processes, we throw a warning? I don't see too many cases where we'd have different processes giving different results to the model checkpoint. The check will involve gathering the monitor score across processes, but considering this happens only at saving time, it might be worth it.
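A rough sketch of what such a check could look like; the helper name and the place it would be called from are hypothetical, not existing Lightning API:

import warnings
import torch
import torch.distributed as dist

def warn_if_monitor_differs_across_ranks(score: torch.Tensor) -> None:
    # hypothetical helper: gather the monitored score from every rank and warn
    # if the values disagree, since that makes checkpointing nondeterministic
    if not (dist.is_available() and dist.is_initialized()):
        return
    gathered = [torch.zeros_like(score) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, score)
    if not all(torch.allclose(gathered[0], g) for g in gathered):
        warnings.warn(
            "ModelCheckpoint monitor value differs across processes; "
            "consider syncing it (sync_dist=True or self.all_gather)."
        )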
I'm only calling self.log in the *_step methods without any custom metrics (just with the loss), and adding sync_dist=True does not resolve the issue.

@carmocca Sorry for the late reproducible example! Please see below for a self-contained example that uses two GPUs for DDP. For me the code gets stuck at epoch 13, while the two GPUs stay busy at 100%. Switching to DP solves the problem.
Some hints: removing the dropout layer from the def shared_step(self, batch) method also solves the problem. But how could such a harmless dropout layer ruin the model?

My settings:
This suggestion didn't work for me, but setting rank_zero_only=True did the trick.
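For context, a minimal sketch of the two logging flags discussed above; sync_dist is available on the Lightning versions discussed in this thread, while a rank_zero_only argument to self.log only exists in newer releases, so check your version. compute_loss is a hypothetical helper:

# inside your LightningModule
def validation_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper returning a scalar tensor
    # sync_dist=True reduces the value across DDP processes before logging,
    # so every rank reports (and ModelCheckpoint monitors) the same number
    self.log("val_loss", loss, sync_dist=True)
    # reported alternative (newer releases only): log from rank 0 only
    # self.log("val_loss", loss, rank_zero_only=True)
    return loss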
@edenlightning @SeanNaren sorry for the late reply. Yes, the problem is solved, either by using a pl.Metric or by overriding validation_epoch_end with all_gather. Thank you so much for the help!

Have you guys tried updating to v1.2? I'm using the Metrics API now instead of returning a batch dict and everything is working fine, using the ModelCheckpoint callback too. I haven't got any freezing so far.
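For anyone reading along, a minimal sketch of that Metrics-API approach on the pre-torchmetrics interface (pl.metrics, PL <= 1.2); the MeanLoss metric below is a made-up example, not part of Lightning:

import torch
import pytorch_lightning as pl

class MeanLoss(pl.metrics.Metric):
    # toy metric tracking mean validation loss; its states are reduced across
    # DDP ranks, so every process ends up with the same value
    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, loss: torch.Tensor):
        self.total += loss.detach().sum()
        self.count += loss.numel()

    def compute(self):
        return self.total / self.count

# inside the LightningModule:
#   self.val_loss = MeanLoss()                          # in __init__
#   self.val_loss.update(loss)                          # in validation_step
#   self.log("val_loss", self.val_loss, on_epoch=True)  # Lightning syncs and resets it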
@angadkalra +1. For me switching to DP solves the problem, but at the cost of speed.
Update: I tested on PL v1.1.5-v1.2.0rc1, and the problem persists. I'm sorry I can't upload a reproducible example at the moment, but I will probably do that later.