pytorch-lightning: Training is interrupted without error with Multi-GPU
🐛 Bug
Training is interrupted randomly in the middle of an epoch without any error; the console only prints "Terminated". The interruption does not always occur, but when it does it is mostly between epochs 2 and 4. Notably, python processes are still running after the termination and the graphics cards are still in use by them.
We are training the PyTorch version of the ImageGPT model with huggingface transformers. This could also be a problem on the huggingface side; we are not sure.
Epoch 1: 29%|██▉       | 9413/32393 [3:28:18<8:28:33, 1.33s/it, loss=3.23, v_num=9]Terminated
Please reproduce using the BoringModel
Can't reproduce with the BoringModel.
Code
class ImageGPT(pl.LightningModule):
    def __init__(self, learning_rate=learning_rate):
        super().__init__()
        self.gpt2 = ImageGPT2LMHeadModel(config=...)
        self.criterion = nn.CrossEntropyLoss(reduction='none')
        self.learning_rate = learning_rate

    def forward(self, x):
        return self.gpt2(x, past_key_values=None)

    ....

logger = pl_loggers.TensorBoardLogger(save_dir="logs", name=name)

checkpoint_callback = ModelCheckpoint(
    save_top_k=1,
    verbose=True,
    monitor='val_loss',
    mode='min',
    filepath='../models',
    prefix='ImageGPT'
)

trainer = Trainer(
    accelerator='ddp',
    max_epochs=10,
    max_steps=None,
    precision=32,
    accumulate_grad_batches=1,
    gpus=[0, 1, 2],
    callbacks=[checkpoint_callback],
    logger=logger,
    gradient_clip_val=0.6
)

trainer.fit(model=model, datamodule=datamodule)
Expected behavior
The training is fully completed across all epochs.
Environment
- CUDA:
    - GPU:
        - TITAN RTX
        - TITAN RTX
        - TITAN RTX
    - available: True
    - version: 10.2
- Packages:
    - numpy: 1.19.4
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.1.2
    - transformers: 3.5.1
    - tqdm: 4.55.0
- System:
    - OS: Linux, 64bit
    - processor: x86_64
    - python: 3.7.4
    - version: #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
Additional context
We have already tried the following to solve the problem:
- set the num_workers of the dataloaders to 0 or 1 (instead of 32-64)
- went back to 32-bit precision
- tried different learning rates
- added gradient clipping
- used the AdamW implementation from huggingface
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 24 (10 by maintainers)
I have the same issue, but only running on one GPU.
I suggest:
- mentioning in the Logging section of the docs that logging with distributed training needs special handling and care, and then referring to the section Multi-GPU training / Prepare your code / Synchronize validation and test logging
- mentioning validation_epoch_end in the section Multi-GPU training / Prepare your code / Synchronize validation and test logging

This happens to me too, but I don't get "Terminated" at the end of the progress bar; it just stops, and when I check the system with "top i" I see 4 python processes running at 100% and 4 GPUs at about 90% capacity, but nothing is changing. Sometimes, after an hour or two, it just randomly restarts, but the s/it jumps from 2.5 to about 85.
DDPPlugin(find_unused_parameters=True) seems to fix the problem for me too.
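For reference, a minimal sketch of passing this plugin to the Trainer; the import path is assumed as in PL ~1.2 and may differ in other Lightning versions:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin  # import path assumed for PL ~1.2

trainer = Trainer(
    accelerator='ddp',
    gpus=[0, 1, 2],
    # find_unused_parameters is forwarded to torch's DistributedDataParallel
    plugins=[DDPPlugin(find_unused_parameters=True)],
)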
Sure, I'll try to get some time. In the meantime I tested the all_gather function, which works fine. Just change your validation_epoch_end to look like this:
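A minimal sketch of such an override, assuming validation_step returns a per-batch loss tensor and that val_loss is the monitored metric; _compute_val_loss is a hypothetical helper:

import torch

# Inside the LightningModule:
def validation_step(self, batch, batch_idx):
    loss = self._compute_val_loss(batch)  # hypothetical helper returning a scalar loss
    return loss

def validation_epoch_end(self, outputs):
    # Mean loss on this process only
    local_mean = torch.stack(outputs).mean()
    # Gather one value per DDP process, then average, so every rank logs
    # the identical score that ModelCheckpoint monitors
    gathered = self.all_gather(local_mean)
    self.log("val_loss", gathered.mean())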
Let me give some context as to why this fix works. When we run validation across distributed processes, each GPU/process gets a different set of data batches. This means the score calculated on every GPU is different unless we do some form of synchronisation between the processes. This can be done either with a pl.Metric or manually in validation_epoch_end, which doesn't automatically sync the batches across processes. This is because in many cases you don't want to do this, e.g. if you're using a pl.Metric to handle it instead. If you do want to sync, you can sync tensors and some python primitives using self.all_gather like suggested above! @marrrcin maybe the explanation could be insightful? If you're running into a different error and can get a reproducible script, let me know, I can help resolve it.
@PyTorchLightning/core-contributors @justusschock this has come up a few times as a bug. How about: if we receive different monitor scores in the model checkpoint from different processes, we throw a warning? I don't see too many cases where we'd have different processes giving different results to the model checkpoint. The check will involve gathering the monitor score across processes, but considering this happens only at saving time, it might be worth it.
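A rough sketch of the proposed check; the function name and placement are illustrative, this is not the actual ModelCheckpoint code:

import torch
import torch.distributed as dist
from pytorch_lightning.utilities import rank_zero_warn

def warn_if_monitor_diverges(monitor_score: torch.Tensor):
    # Proposed check: gather the monitor score from every process at saving
    # time and warn if the ranks disagree.
    if dist.is_available() and dist.is_initialized():
        scores = [torch.zeros_like(monitor_score) for _ in range(dist.get_world_size())]
        dist.all_gather(scores, monitor_score)
        if any(not torch.allclose(scores[0], s) for s in scores[1:]):
            rank_zero_warn(
                "ModelCheckpoint received different monitor scores on different "
                "processes; consider synchronizing your validation logging."
            )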
I'm only calling self.log in the *_step hooks without any custom metrics (just with the loss), and adding sync_dist=True does not resolve the issue.
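For clarity, this is the kind of logging call meant here; a sketch where the metric name and the _compute_val_loss helper are placeholders:

# Inside the LightningModule:
def validation_step(self, batch, batch_idx):
    loss = self._compute_val_loss(batch)  # hypothetical helper returning a scalar loss
    # sync_dist=True reduces the logged value across processes before logging
    self.log("val_loss", loss, sync_dist=True)
    return loss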
@carmocca Sorry for the late reproducible example! Please see below for a self-contained example that uses two GPUs for DDP. For me the code gets stuck at epoch 13, while the two GPUs stay busy at 100%. Switching to DP solves the problem.
Some hints: removing the dropout layer from the def shared_step(self, batch) method also solves the problem for me. But how could such a harmless dropout layer ruin the model? My settings:
This suggestion didn't work for me, but setting rank_zero_only=True did the trick.

@edenlightning @SeanNaren sorry for the late reply. Yes, the problem is solved, either by using pl.Metric or by overriding validation_epoch_end with all_gather. Thank you so much for the help!

Have you guys tried updating to v1.2? I'm using the Metrics API now instead of returning a batch dict and everything is working fine, using the ModelCheckpoint callback too. I haven't got any freezing so far.
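For anyone following along, a minimal sketch of the Metrics-API approach with a hypothetical MeanLoss metric; the base class is assumed to live under pl.metrics as in PL 1.2 (later versions use torchmetrics):

import torch
import pytorch_lightning as pl

class MeanLoss(pl.metrics.Metric):
    # Accumulates the loss over batches; the states are reduced across
    # DDP processes automatically via dist_reduce_fx.
    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, loss: torch.Tensor):
        self.total += loss.detach().sum()
        self.count += loss.numel()

    def compute(self):
        return self.total / self.count

# In the LightningModule __init__:
#     self.val_loss_metric = MeanLoss()
# and in validation_step:
#     self.val_loss_metric(loss)
#     self.log("val_loss", self.val_loss_metric)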
@angadkalra +1. For me switching to DP solves the problem, but at the cost of speed.
Update: I tested on PL v1.1.5 through v1.2.0rc1, and the problem persists. I'm sorry I can't upload a reproducible example at the moment, but I will probably do that later.