DeepSpeed assumes model returns just one variable: loss

https://github.com/microsoft/DeepSpeed/blob/4d735946b8f256bc80ba13e3530f85c91d041ff4/deepspeed/pt/deepspeed_light.py#L582-L606

As you can see on line 596 above, in the forward() call of the DeepSpeedLight engine, self.module (which gets initialized earlier with the model the client passes to deepspeed.initialize()) is assumed to return just one output: loss. However, as a model developer, my model may return many outputs, including several different losses. An example of such a model is GPT2DoubleHeadsModel in Hugging Face’s transformers repo, which returns two different losses, one for each head/task.
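
For concreteness, here is a minimal toy model in that spirit (an illustrative sketch, not the actual GPT2DoubleHeadsModel implementation), returning two losses from forward():

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadModel(nn.Module):
    """Toy stand-in for a multi-task model: one loss per head."""

    def __init__(self, hidden=128, vocab=1000, num_choices=4):
        super().__init__()
        self.body = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab)
        self.mc_head = nn.Linear(hidden, num_choices)

    def forward(self, x, lm_labels, mc_labels):
        h = torch.relu(self.body(x))
        lm_loss = F.cross_entropy(self.lm_head(h), lm_labels)
        mc_loss = F.cross_entropy(self.mc_head(h), mc_labels)
        # Two losses come back, but the engine's forward() assumes a single
        # scalar loss that it can scale in place.
        return lm_loss, mc_loss
```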

The consequence is that I can’t integrate DeepSpeed as-is with such models. Could you please make the necessary changes to support this use case?

I suspect what you need to do is:

  1. Move lines 599-600 (which perform loss scaling based on gradient accumulation steps) into your implementation of backward().
  2. Update line 596 to reflect the fact that self.module could return a generic tuple of outputs instead of a loss.

This, in turn, will allow the client to define a customized loss from all the outputs and then call your implementation of backward() with that loss as input. Things should still be fine because the loss scaling still happens, only in backward() instead of forward(). Does this make sense?
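
A rough sketch of what that split could look like (hypothetical engine code with assumed attribute names, not the actual DeepSpeedLight implementation):

```python
class EngineSketch:
    """Illustrative only: forward passes outputs through untouched,
    backward applies the gradient-accumulation scaling."""

    def __init__(self, module, gradient_accumulation_steps):
        self.module = module
        self.gradient_accumulation_steps = gradient_accumulation_steps

    def forward(self, *inputs, **kwargs):
        # Return whatever the client's module returns -- possibly a tuple
        # of several losses/outputs -- with no scaling applied here.
        return self.module(*inputs, **kwargs)

    def backward(self, loss):
        # Normalization by gradient accumulation steps now happens here,
        # on the single loss the client built from the raw outputs.
        (loss / self.gradient_accumulation_steps).backward()
```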

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@tjruwase I think instead of point 2, you could just change the signature of backward() to also return the loss. The returned loss would be scaled, and users typically log their loss after forward+backward anyway, not just after forward.

So it sounds like the consensus is:

  1. Remove _scale_loss() from forward() so the DeepSpeed forward is semantically equivalent to the client forward
  2. Expose model_engine.scale_loss(loss) so the client can obtain the scaled loss
  3. Call _scale_loss() in backward(), assuming the client passes unscaled loss values

Is this correct? I can start preparing the PR.
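
A sketch of the resulting client-side flow under this proposal (scale_loss is the proposed name from point 2, not an existing API; the loss combination mirrors the multi-task example discussed below):

```python
# Raw, unscaled outputs straight from the client's model.
lm_loss, mc_loss = model_engine(inputs, lm_labels, mc_labels)

# Client-defined combined loss (coefficients from the multi-task example).
loss = args.coef_a * lm_loss + args.coef_b * mc_loss

model_engine.backward(loss)             # engine scales by grad accumulation steps internally
scaled = model_engine.scale_loss(loss)  # proposed helper, e.g. for logging the scaled value
model_engine.step()
```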

@tjruwase Glad you agree with the points I’ve raised now! 😃

  1. It makes assumptions about the appropriate scaling function, i.e., normalization by grad_accumulation_steps. Your example involves an arbitrary scaling function.

But now I am skeptical that loss scaling belongs in backward() at all; rather, the client should be expected to pass appropriately scaled loss values to backward(). In the obvious case, loss scaling is a one-liner (we’re not saving the client much work), and in more exotic cases backward() can’t replicate it anyway.

Indeed, my example involves arbitrary scaling of the two losses returned (via args.coef_a and args.coef_b). But please note that this scaling is independent of gradient accumulation. This scaling is for multi-task learning, where we want to control the importance of each task to the overall loss.

The gradient accumulation scaling needs to happen on the overall loss separately. This could be either internal to DeepSpeed or externally handled by the client script when not using DeepSpeed.
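
A minimal illustration of keeping the two scalings separate (variable names assumed from the example above):

```python
# 1) Multi-task weighting: controls each head's contribution to the objective.
loss = args.coef_a * lm_loss + args.coef_b * mc_loss

# 2) Gradient-accumulation normalization: applied once to the combined loss,
#    either inside DeepSpeed's backward() or by the client when not using DeepSpeed.
(loss / args.gradient_accumulation_steps).backward()
```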

So I think moving gradient accumulation-related loss scaling to backward() will be just fine and should not be a breaking change. I’d also like to hear what @samyam, @jeffra and @ShadenSmith think.