pytorch-lightning: Returning None from training_step with multi GPU DDP training
🐛 Bug
Returning None from training_step during multi-GPU DDP training freezes training without raising an exception.
To Reproduce
Start multi-GPU training with a training_step that returns None.
Example training_step function:
def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)
    loss = calc_loss(model_outputs, target)
    # the random condition only makes the rare NaN case easier to reproduce
    if torch.isnan(loss) or random.random() < .05:
        return None
    return loss
Example trainer:
trainer = Trainer(
    gpus=2,
    distributed_backend="ddp"
)
Expected behavior
Training should continue, skipping the current batch, as pointed out here.
Environment
No specific environment is needed to reproduce this bug.
Additional context
This issue was mentioned in #4956, but not with specifics.
Note: while this issue is being investigated, help with a workaround would be great!
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 2
- Comments: 26 (14 by maintainers)
Hi! I’ll give this a look 😃
No @AsaphLightricks. Returning `None` in `training_step` with DDP is currently unsupported.

Hi, is there a solution to this issue?
Hi, I’m in a similar situation. My batches are formed from the output of an object detector, so sometimes the batch will essentially be of size zero (I can’t think of a good way to explain this, but just trust that it makes sense). In this case, I would like to return None from training_step, or at least return some kind of zero loss tensor with zero gradients. If it’s not easy to return None, is there some way to artificially construct a zero tensor with all the appropriate gradients present so that the DDP sync will work?
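One pattern that should achieve the “zero loss with the right gradients” idea (a minimal sketch, not an official Lightning API; it reuses the `training_step` shape from the report above and assumes a standard `LightningModule`): build the zero from the model’s own parameters, so every parameter still receives a zero gradient and the DDP all-reduce has something to synchronize.

```python
def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)

    if model_outputs.numel() == 0:  # effectively empty batch from the detector
        # Zero-valued loss that is still attached to every parameter, so
        # backward() produces zero gradients and DDP can synchronize them.
        return sum(p.sum() for p in self.parameters()) * 0.0

    return calc_loss(model_outputs, target)
```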
I have a fix, but it lives inside the training loop (it is similar to @tchaton's suggestion above):
Basically, I gather a flag across all DDP workers; if any worker sets the flag, all workers return None. When all of them return None, the freeze no longer happens. It would be neat if this were handled inside Lightning, though; here I feel I’m just adding unnecessary synchronization.
I’m sure that for someone more familiar with Lightning's background magic it would be easy to add something similar in the right place.
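Roughly, the flag-gathering described above could look like this (a sketch only; it assumes the process group is already initialized by the DDP backend and uses a NaN check as the per-rank skip condition):

```python
import torch
import torch.distributed as dist

def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)
    loss = calc_loss(model_outputs, target)

    # Each rank raises a flag when its own loss is unusable.
    skip = torch.tensor(float(torch.isnan(loss).item()), device=loss.device)

    # After the all-reduce, skip > 0 on every rank if any rank raised the flag.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(skip, op=dist.ReduceOp.SUM)

    if skip > 0:
        # All ranks return None together, so no rank is left waiting in the
        # gradient all-reduce and training does not freeze.
        return None
    return loss
```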
@AAnoosheh no, #5359 was just closed but couldn’t be merged yet. We still need to work on it and figure out a solution.
Hi @awaelchli, I am not sure of it yet, but it may be an exploding-gradient issue where a single batch generates very large gradients. It happens rarely, and the model learns perfectly when it doesn’t happen. I tried clipping gradients, which seriously slows down learning in my case but seemingly solves the problem. Another approach I tried is accumulating gradients, which I think dampens the effect of a single batch producing the low-quality gradients I mentioned before, and it did reduce the NaN loss error significantly. However, the problem still persists.
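For reference, both of those mitigations are plain Trainer flags; the values below are only placeholders:

```python
trainer = Trainer(
    gpus=2,
    distributed_backend="ddp",
    gradient_clip_val=0.5,       # clip gradient norm to tame exploding gradients
    accumulate_grad_batches=4,   # accumulate gradients over several batches
)
```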
Another approach I tried was setting the loss to `torch.tensor(0)`, which I thought would let me avoid updating the model for that batch. However, it loses the computation graph.

The `random.random() < .05` condition just serves to reproduce the NaN loss error more often, as it happens very rarely. It has nothing to do with my training procedure.
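To illustrate why a bare `torch.tensor(0)` loses the graph: it has no `grad_fn`, whereas a zero built from graph-connected tensors keeps one. (Note that multiplying a NaN loss by zero still yields NaN, so the replacement zero has to be built from the parameters or outputs, not from the bad loss itself.)

```python
import torch

w = torch.nn.Parameter(torch.randn(3))

detached_zero = torch.tensor(0.0)   # no grad_fn: backward() reaches no parameters
attached_zero = (w * 0.0).sum()     # zero-valued, but still connected to w

print(detached_zero.grad_fn)        # None
print(attached_zero.grad_fn)        # <SumBackward0 object at ...>
```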