transformers: A potential bug in ModuleUtilsMixin.get_extended_attention_mask

Environment info

  • transformers version: 4.13.0
  • Platform:
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.10.0+cu102
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet …): T5

There is a potential bug in ModuleUtilsMixin.get_extended_attention_mask; I ran into it while training a T5 model from scratch. The function masks positions by assigning them a large negative number (-1e4), which is then added to the raw attention scores before the softmax. However, -1e4 is occasionally not negative enough to nullify the scores at masked positions. In my case, some raw scores before the softmax dropped below -1e4 during training, so the masked positions still received attention weight and the model could not be trained correctly.
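
To illustrate the failure mode, here is a minimal, self-contained sketch with made-up score values (the real scores came from my T5 training run):

    import torch

    # Hypothetical raw attention scores for three key positions; the third
    # position is supposed to be masked out.
    scores = torch.tensor([-2.0e4, -2.0e4, -1.0e4])
    keep = torch.tensor([1.0, 1.0, 0.0])  # 1 = keep, 0 = mask

    # Adding -1e4 at the masked position does not push its score below the
    # unmasked ones, so the softmax gives it the same weight as the others.
    masked_scores = scores + (1.0 - keep) * -1e4
    print(torch.softmax(masked_scores, dim=-1))  # tensor([0.3333, 0.3333, 0.3333])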

Here is the code I mentioned: link

I think -1e4 was chosen for fp16 compatibility, so how about handling the cases separately based on the dtype, along the lines of the sketch below?
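
A minimal sketch of the idea (the helper name is illustrative, not the actual transformers API; it derives the masking value from the dtype instead of hard-coding -1e4):

    import torch

    def additive_attention_mask(keep_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
        """Illustrative helper: turn a 1/0 keep-mask into an additive mask whose
        masked positions hold the most negative value representable in `dtype`,
        which no finite attention score can undercut."""
        keep_mask = keep_mask.to(dtype=dtype)
        return (1.0 - keep_mask) * torch.finfo(dtype).min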

To reproduce

Expected behavior

The function get_extended_attention_mask should use a smaller (more negative) value to mask the tensor, for example the minimum representable value of the current dtype.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (13 by maintainers)

Most upvoted comments

Hi, this is on my TODO list - I have a few remaining things to finalize regarding more aggressive PT/TF/Flax testing, and I will come back to this issue!

@jk-jung

This is fixed now (finally) 😃

Starting to work on it 😃

We’ve used -10_000 for the attention_mask from the very beginning, so changing this now would affect all transformers versions.

Ok, happy to do a PR for this next week - putting it on my ToDo list

torch.finfo(self.dtype).min
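
For reference, this expression evaluates to the dtype-dependent minimum (illustrative check, not library code):

    import torch

    # The minimum representable value depends on the dtype, so the mask stays
    # effective under fp16 as well as fp32.
    print(torch.finfo(torch.float16).min)  # -65504.0
    print(torch.finfo(torch.float32).min)  # about -3.4028e+38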

Loving it!

We should do the same in all the other places where we have this conditional for masking and where no weights will be impacted by the change.

@LysandreJik related discussion #10484