transformers: Corrupted Relative Attention in T5 Decoder
Environment info
Platform: Mac / Ubuntu 14, transformers==2.11.0, torch==1.4.0 (GPU), Python 3.6. I know this is an old version, but it supports important experiments in a paper under review. I would appreciate knowing what is wrong; I checked the commit log and I don't think any later commit resolves it.
Who can help
@patrickvonplaten (through Slack), @patil-suraj (mentioned below). Please let me know if there is anything else I can provide. Thank you!
Information
I made an artificial binary classification dataset where the input sequences are near-randomly generated tokens from the T5 vocab. The output sequences are balanced between "answer: correct" and "answer: restaurant" (the two binary tag words were chosen randomly). A data sample can be found here, in the format input_seq \t output_seq. The custom data reader parses this data with T5Tokenizer and is_pretokenized=True (see here).
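For illustration, a minimal sketch of the kind of reader this describes (hypothetical code, not the repo's actual reader; for brevity it encodes the raw strings instead of passing pre-tokenized lists with is_pretokenized=True):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")

def read_examples(path):
    """Parse lines of the form 'input_seq \t output_seq' into id sequences."""
    examples = []
    with open(path) as f:
        for line in f:
            input_seq, output_seq = line.rstrip("\n").split("\t")
            input_ids = tokenizer.encode(input_seq)   # near-random source tokens
            lm_labels = tokenizer.encode(output_seq)  # "answer: correct" / "answer: restaurant"
            examples.append((input_ids, lm_labels))
    return examples
```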
During training I feed the T5ForConditionalGeneration model (v2.11.0) with input_ids, lm_labels, and their corresponding attention masks. The model should not learn anything because the sequences are near-random, but in reality it converges to zero loss, meaning that the lm_logits from the decoder actually attend to future inputs (after shift_right()) and know the label. During evaluation, where I hide the binary tag, the model always predicts positive.
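A hedged sketch of that training forward pass under transformers 2.11.0 (stand-in random batch, not the repo's training loop; in this version the target argument is lm_labels, and the decoder inputs are derived from it internally via _shift_right):

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Stand-in batch: near-random source tokens, balanced binary targets.
# 1525 -> "answer", 10 -> ":", 2024 -> "correct", 2062 -> "restaurant", 1 -> </s>
input_ids = torch.randint(5, 32000, (2, 64))
attention_mask = torch.ones_like(input_ids)
lm_labels = torch.tensor([[1525, 10, 2024, 1],
                          [1525, 10, 2062, 1]])
decoder_attention_mask = torch.ones_like(lm_labels)

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    lm_labels=lm_labels,
    decoder_attention_mask=decoder_attention_mask,
)
loss, lm_logits = outputs[0], outputs[1]  # the loss reportedly converges to ~0
loss.backward()
```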
To reproduce
Steps to reproduce the behavior:
- Use the code in this repo: https://github.com/Slash0BZ/t5-investigation
- Run it with the sample data. I have tried both pre-trained T5-large and randomly initialized T5-large (written like this)
I am not sure if the training data size affects the result. I ran with a training size of 5M. I am happy to provide the full data and a trained model if actual experiments are needed.
Expected behavior
The training loss converges to near zero, and during training the lm_logits reflect predictions identical to the output sequence. However, in evaluation, where the data reader hides the binary tag in the output sequence (achieved by providing only "answer:" in decoder_input_ids), the prediction is uniform.
I also tried changing the decoder_input_ids. When they are [0, 1525, 10, 2024], the prediction at position 2 is 2024. When they are [0, 1525, 10, 2062], the prediction at position 2 is 2062.
Notes: 1525->“answer”, 10->“:”, 2024->“correct”, 2062->“restaurant”
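A sketch of how one might run this probe (stand-in input and a freshly loaded t5-large here, not the repo's evaluation code; the echoed-tag behaviour was observed with the model trained in the repo above):

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large").eval()

input_ids = torch.randint(5, 32000, (1, 64))  # stand-in near-random input
with torch.no_grad():
    for tag in (2024, 2062):  # 2024 -> "correct", 2062 -> "restaurant"
        decoder_input_ids = torch.tensor([[0, 1525, 10, tag]])
        lm_logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)[0]
        # Position 2 predicts the token following "answer :"; with the trained
        # model it echoes whichever tag was fed in at position 3.
        print(tag, lm_logits[0, 2].argmax().item())
```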
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
We use https://github.com/allenai/allennlp/blob/f091cb9cd92e767f55659b2b59f0ffb75bc613be/allennlp/nn/util.py#L239, which ultimately boils down to using this value:
torch.finfo(tensor.dtype).min
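For intuition, a toy example (a plain additive mask plus softmax, not AllenNLP's or transformers' actual masking code) showing how a sufficiently large score can override a fixed -10000 but not torch.finfo(dtype).min:

```python
import torch

def additive_mask(mask, dtype, mask_value):
    # mask: 1.0 for positions that may be attended to, 0.0 for masked positions
    return (1.0 - mask.to(dtype)) * mask_value

scores = torch.tensor([[0.0, 20000.0]])  # unusually large attention score
mask = torch.tensor([[1.0, 0.0]])        # the second position should be hidden

for mask_value in (-10000.0, torch.finfo(scores.dtype).min):
    probs = torch.softmax(scores + additive_mask(mask, scores.dtype, mask_value), dim=-1)
    print(mask_value, probs)
# With -10000 the "masked" position still receives almost all of the probability
# mass (20000 - 10000 = 10000 >> 0); with torch.finfo(dtype).min it is forced to 0.
```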
@patrickvonplaten I never faced this issue in my T5 experiments, but it does seem possible that -10000 can cause problems: while investigating the fp16 issue we saw that T5 produces very large activation values.
And I agree with @dirkgr's solution.
Interesting, so our causal mask doesn't actually make "attending to next tokens" impossible -> it just gives those positions a very large negative number (-10000) before the softmax, so that after the softmax their weight should be zero. Maybe your model has incredibly high activations that can somehow overturn the -10000. Could you maybe try replacing the -10000 with -float("inf")? Then the script above should definitely not yield an assertion error anymore.

@patrickvonplaten, yes, the -10000 can totally be cheated. We've seen that in the past in cases where the output values are passed through an argmax while the probability distribution is very uniform.

We've kept -10000 to stay as close as possible to the original BERT implementation, and we recommend using as few padding tokens as possible so that this has no effect (while keeping in mind that the -10000 should keep the masked values very, very small and should have a minimal impact). @dirkgr's solution is definitely more robust, and I don't think switching the -10000 to a lower value would change anyone's workflow, so I wouldn't be opposed to switching.
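For anyone who wants to try the -float("inf") experiment without editing the library, a hedged sketch (assuming the 2.11.0 layout, where the additive mask is built by ModuleUtilsMixin.get_extended_attention_mask in modeling_utils.py and masked positions end up at -10000.0; the exact location may differ in other versions):

```python
import torch
from transformers.modeling_utils import ModuleUtilsMixin

_orig_get_extended_attention_mask = ModuleUtilsMixin.get_extended_attention_mask

def _patched_get_extended_attention_mask(self, *args, **kwargs):
    # Build the additive mask as usual, then push masked positions from -10000
    # down to -inf so that no activation can outweigh them before the softmax.
    mask = _orig_get_extended_attention_mask(self, *args, **kwargs)
    return mask.masked_fill(mask <= -10000.0, float("-inf"))

ModuleUtilsMixin.get_extended_attention_mask = _patched_get_extended_attention_mask
```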