transformers: A potential bug in ModuleUtilsMixin.get_extended_attention_mask
Environment info
- transformers version: 4.13.0
- Platform:
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.0+cu102
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
Information
Model I am using (Bert, XLNet …): T5
There is a potential bug in ModuleUtilsMixin.get_extended_attention_mask, and it actually affected me while training a T5 model from scratch. The function masks a tensor by adding a large negative number (-1e4) to the raw scores before the softmax. However, -1e4 is occasionally not small enough to nullify the scores at the masked positions. In my case, some of the raw scores before the softmax were smaller than -1e4 during training, so the model could not be trained correctly.
Here is the code I mentioned: link
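For illustration, here is a minimal, self-contained reproduction of the failure mode; the score values are hypothetical (not taken from my training run), chosen only so that the real token's score falls below -1e4:

```python
import torch

# Hypothetical raw attention scores: the real token's score happens to be
# far below -1e4, while the padding token's score is near zero.
scores = torch.tensor([-2.0e4, 0.0])       # index 0: real token, index 1: padding
additive_mask = torch.tensor([0.0, -1e4])  # -1e4 is supposed to nullify index 1

probs = torch.softmax(scores + additive_mask, dim=-1)
print(probs)  # tensor([0., 1.]) -> the *masked* position receives all the attention
```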
I think you use -1e4 for fp16 compatibility, so how about choosing the mask value based on the dtype instead (see the sketch below)?
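A rough sketch of what I have in mind (masked_bias is just an illustrative helper name, not the actual transformers implementation; it assumes the usual convention of 1 = keep, 0 = mask):

```python
import torch

def masked_bias(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Hypothetical helper: build the additive mask from the dtype's most
    # negative finite value instead of the hard-coded -1e4.
    # torch.finfo(dtype).min is still representable in fp16 (about -65504)
    # while dominating any finite attention score.
    return (1.0 - attention_mask.to(dtype)) * torch.finfo(dtype).min

mask = torch.tensor([[1, 1, 0]])
print(masked_bias(mask, torch.float16))  # masked position -> -65504.
print(masked_bias(mask, torch.float32))  # masked position -> about -3.4e38
```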
To reproduce
Expected behavior
The function get_extended_attention_mask uses a smaller (more negative) number to mask the tensor.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 20 (13 by maintainers)
Hi, this is on my TODO list - I have a few remaining things to finalize around more aggressive PT/TF/Flax testing, and I will come back to this issue!
@jk-jung
This is fixed now (finally) 😃
Starting to work on it 😃
We’ve always used -10_000 from the very beginning for the attention_mask, so changing this now would affect all transformers versions.
Ok, happy to do a PR for this next week - putting it on my ToDo list
Loving it!
We should do the same in all the other places where we have that conditional for masking, and where no weights will be impacted by this change.
@LysandreJik related discussion #10484