pytorch-lightning: Returning None from training_step with multi GPU DDP training
🐛 Bug
Returning None from training_step during multi-GPU DDP training freezes training without raising an exception.
To Reproduce
Start multi-GPU training with a training_step that returns None.
Example training_step function:
def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)
    loss = calc_loss(model_outputs, target)
    # the random condition only makes the rare NaN case easier to reproduce
    if torch.isnan(loss) or random.random() < .05:
        return None
    return loss
Example trainer:
trainer = Trainer(
    gpus=2,
    distributed_backend="ddp"
)
Expected behavior
Training should continue, skipping the current batch, as pointed out here.
Environment
No specific environment is needed to reproduce this bug.
Additional context
This issue was mentioned in #4956, but not with specifics.
Note: while this issue is being investigated, help with a workaround would be great!
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 2
- Comments: 26 (14 by maintainers)
Hi! I’ll give this a look 😃
No @AsaphLightricks. Returning `None` in `training_step` with DDP is currently unsupported.

Hi, is there a solution to this issue?
Hi, I’m in a similar situation. My batches are formed from the output of an object detector, so sometimes the batch will essentially be of size zero (I can’t think of a good way to explain this, but just trust that it makes sense). In this case, I would like to return None from training_step, or at least return some kind of zero loss tensor with zero gradients. If it’s not easy to return None, is there some way to artificially construct a zero tensor with all the appropriate gradients present so that the DDP sync will work?
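One pattern that should achieve the “zero loss with the right gradients” idea (a minimal sketch, not an official Lightning API; it reuses the `training_step` shape from the report above and assumes a standard `LightningModule`): build the zero from the model’s own parameters, so every parameter still receives a zero gradient and the DDP all-reduce has something to synchronize.

```python
def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)

    if model_outputs.numel() == 0:  # effectively empty batch from the detector
        # Zero-valued loss that is still attached to every parameter, so
        # backward() produces zero gradients and DDP can synchronize them.
        return sum(p.sum() for p in self.parameters()) * 0.0

    return calc_loss(model_outputs, target)
```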
I have a fix, but it lives inside the training loop (it is similar to @tchaton's suggestion above):
Basically, I gather a flag across all DDP workers; if any worker sets the flag, all workers return None. When all of them return None, the freeze no longer happens. It would be neat if this were handled inside Lightning, though; here I feel I’m just adding unnecessary synchronization.
I’m sure that for someone more familiar with Lightning's background magic it would be easy to add something similar in the right place.
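Roughly, the flag-gathering described above could look like this (a sketch only; it assumes the process group is already initialized by the DDP backend and uses a NaN check as the per-rank skip condition):

```python
import torch
import torch.distributed as dist

def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)
    loss = calc_loss(model_outputs, target)

    # Each rank raises a flag when its own loss is unusable.
    skip = torch.tensor(float(torch.isnan(loss).item()), device=loss.device)

    # After the all-reduce, skip > 0 on every rank if any rank raised the flag.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(skip, op=dist.ReduceOp.SUM)

    if skip > 0:
        # All ranks return None together, so no rank is left waiting in the
        # gradient all-reduce and training does not freeze.
        return None
    return loss
```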
@AAnoosheh no, #5359 was just closed but couldn’t be merged yet. We still need to work on it and figure out a solution.
Hi @awaelchli, I am not sure of it yet, but it may be an exploding-gradient issue where a single batch generates very large gradients. It happens rarely, and the model learns perfectly when it doesn’t happen. I tried clipping gradients, which seriously slows down learning in my case but seemingly solves the problem. Another approach I tried is accumulating gradients, which I think dampens the effect of a single batch producing the low-quality gradients I mentioned before, and it did reduce the NaN loss error significantly. However, the problem still persists.
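For reference, both of those mitigations are plain Trainer flags; the values below are only placeholders:

```python
trainer = Trainer(
    gpus=2,
    distributed_backend="ddp",
    gradient_clip_val=0.5,       # clip gradient norm to tame exploding gradients
    accumulate_grad_batches=4,   # accumulate gradients over several batches
)
```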
Another approach I tried was setting the loss to `torch.tensor(0)`, which I thought would let me avoid updating the model for that batch. However, it loses the computation graph.

The `random.random() < .05` condition just serves to reproduce the NaN loss error more often, as it happens very rarely. It has nothing to do with my training procedure.
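To illustrate why a bare `torch.tensor(0)` loses the graph: it has no `grad_fn`, whereas a zero built from graph-connected tensors keeps one. (Note that multiplying a NaN loss by zero still yields NaN, so the replacement zero has to be built from the parameters or outputs, not from the bad loss itself.)

```python
import torch

w = torch.nn.Parameter(torch.randn(3))

detached_zero = torch.tensor(0.0)   # no grad_fn: backward() reaches no parameters
attached_zero = (w * 0.0).sum()     # zero-valued, but still connected to w

print(detached_zero.grad_fn)        # None
print(attached_zero.grad_fn)        # <SumBackward0 object at ...>
```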