transformers: A potential bug in ModuleUtilsMixin.get_extended_attention_mask

Environment info

  • transformers version: 4.13.0
  • Platform:
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.10.0+cu102
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet …): T5

There is a potential bug in ModuleUtilsMixin.get_extended_attention_mask; I ran into it while training a T5 model from scratch. The function masks positions by assigning them a large negative number (-1e4), which is then added to the raw attention scores before the softmax. However, -1e4 is occasionally not negative enough to nullify the scores at masked positions. In my case, some raw scores before the softmax dropped below -1e4 during training, so the masked positions still received attention weight and the model could not be trained correctly.
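
To illustrate the failure mode, here is a minimal, self-contained sketch with made-up score values (the real scores came from my T5 training run):

    import torch

    # Hypothetical raw attention scores for three key positions; the third
    # position is supposed to be masked out.
    scores = torch.tensor([-2.0e4, -2.0e4, -1.0e4])
    keep = torch.tensor([1.0, 1.0, 0.0])  # 1 = keep, 0 = mask

    # Adding -1e4 at the masked position does not push its score below the
    # unmasked ones, so the softmax gives it the same weight as the others.
    masked_scores = scores + (1.0 - keep) * -1e4
    print(torch.softmax(masked_scores, dim=-1))  # tensor([0.3333, 0.3333, 0.3333])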

Here is the code I mentioned: link

I think -1e4 was chosen for fp16 compatibility, so how about handling the cases separately based on the dtype, along the lines of the sketch below?
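
A minimal sketch of the idea (the helper name is illustrative, not the actual transformers API; it derives the masking value from the dtype instead of hard-coding -1e4):

    import torch

    def additive_attention_mask(keep_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
        """Illustrative helper: turn a 1/0 keep-mask into an additive mask whose
        masked positions hold the most negative value representable in `dtype`,
        which no finite attention score can undercut."""
        keep_mask = keep_mask.to(dtype=dtype)
        return (1.0 - keep_mask) * torch.finfo(dtype).min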

To reproduce

Expected behavior

The function get_extended_attention_mask should use a smaller (more negative) value to mask the tensor, for example the minimum representable value of the current dtype.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (13 by maintainers)

Most upvoted comments

Hi, this is on my TODO list - I have a few remaining things to finalize regarding more aggressive PT/TF/Flax testing, and I will come back to this issue!

@jk-jung

This is fixed now (finally) 😃

Starting to work on it 😃

We’ve used -10_000 for the attention_mask from the very beginning, so changing this now would affect all transformers versions.

Ok, happy to do a PR for this next week - putting it on my ToDo list

torch.finfo(self.dtype).min
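
For reference, this expression evaluates to the dtype-dependent minimum (illustrative check, not library code):

    import torch

    # The minimum representable value depends on the dtype, so the mask stays
    # effective under fp16 as well as fp32.
    print(torch.finfo(torch.float16).min)  # -65504.0
    print(torch.finfo(torch.float32).min)  # about -3.4028e+38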

Loving it!

We should do the same in all the other places where we have this conditional for masking and where no weights will be impacted by the change.

@LysandreJik related discussion #10484