transformers: EncoderDecoder does not automatically create decoder_attention_mask to match decoder_input_ids
System Info
- `transformers` version: 4.31.0
- Platform: Linux-4.15.0-192-generic-x86_64-with-glibc2.27
- Python version: 3.11.4
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I’m using a pretrained BERT model to build a bert2bert model with an EncoderDecoderModel. According to the documentation and a deprecation warning in the source code, you no longer need to pass in `decoder_input_ids`, as they’ll be generated automatically from `labels`. The docs also state that the default behavior of `decoder_attention_mask` is to generate it automatically based on padded tokens in `decoder_input_ids`, so you shouldn’t need to pass the decoder attention mask either.
However, when I pass only `input_ids` and `attention_mask` for the encoder plus `labels`, I get a warning to the effect of “we strongly recommend passing an attention mask”. If I explicitly pass `input_ids`, `attention_mask`, `decoder_input_ids`, `decoder_attention_mask`, and `labels`, the warning goes away. Looking at the implementation that creates `decoder_input_ids` from `labels`, it does indeed skip generating a `decoder_attention_mask` and simply passes through the value from the arguments, in this case `None`:
https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L619-L637
You can recreate the warning in the notebook that Patrick made for the blog (https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb#scrollTo=yoN2q0hZUbXN&line=11&uniqifier=1). Specifically, in the `process_data_to_model_inputs` function, just comment out the lines that explicitly set `decoder_input_ids` and `decoder_attention_mask`.
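For reference, here is a minimal sketch of the setup that produces the warning (the checkpoint names and example sentences are just illustrative, not taken from the issue):

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# A padded batch for the encoder inputs and for the labels.
inputs = tokenizer(
    ["a short source", "a somewhat longer source sentence"],
    padding=True, return_tensors="pt",
)
labels = tokenizer(
    ["a short target", "a somewhat longer target sentence"],
    padding=True, return_tensors="pt",
).input_ids

# Only labels are passed for the decoder side: decoder_input_ids are created
# internally from labels, but no decoder_attention_mask is derived from them,
# so the decoder warns that no attention mask was provided.
outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,
)
```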
Expected behavior
I’d expect that if you can pass just `labels` to the forward call of EncoderDecoder and it creates `decoder_input_ids`, it would also create `decoder_attention_mask`. The fix is probably only a few lines:
```python
if (labels is not None) and (decoder_input_ids is None and decoder_inputs_embeds is None):
    decoder_input_ids = shift_tokens_right(
        labels, self.config.pad_token_id, self.config.decoder_start_token_id
    )
    if decoder_attention_mask is not None:
        # Some error for passing only one of decoder_input_ids / decoder_attention_mask?
        raise ValueError(
            "decoder_attention_mask was passed without decoder_input_ids; pass both or neither."
        )
    decoder_attention_mask = torch.where(decoder_input_ids == self.config.pad_token_id, 0, 1)
```
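Until something like this lands, a workaround is to build the mask yourself from the decoder-side token ids and pass it explicitly together with `decoder_input_ids`; a minimal sketch, with a hypothetical helper name:

```python
import torch

def make_decoder_attention_mask(token_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Hypothetical helper: 1 for real tokens, 0 for padding."""
    return (token_ids != pad_token_id).long()

# Example with pad_token_id = 0; the ids are illustrative only.
ids = torch.tensor([[101, 7592, 2088, 102, 0, 0],
                    [101, 7592, 102, 0, 0, 0]])
print(make_decoder_attention_mask(ids, pad_token_id=0))
# tensor([[1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0, 0]])
```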
@rangehow
That seems fair. If we are creating default decoder_input_ids from labels, then it makes sense to create a corresponding default for the decoder_attention_mask. Basically, do https://github.com/huggingface/transformers/pull/26752 but for more models.
Obviously, if we are supplied with either `decoder_input_ids` or `decoder_attention_mask`, we won’t generate any defaults.
Also, do you happen to have some code that illustrates this? I imagine you’d run into the “strongly recommend passing in attention mask” warning when training with default args using Trainer on BERT/T5, but can you verify this?
I’m curious to hear opinions from @gante and @ArthurZucker regarding this. I can add this to more models if they’re good with it (unless @rangehow wants to do it).
For reference, here’s the code in T5 where we create default decoder_input_ids: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1703
If we don’t pass in an attention mask, it just creates a default one with all ones: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1004
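Roughly, the labels → decoder_input_ids shift and the all-ones fallback amount to something like this (a paraphrase for illustration, not the exact library code):

```python
import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int, pad_token_id: int) -> torch.Tensor:
    """Prepend the start token, drop the last position, and replace -100 with the pad id."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

labels = torch.tensor([[42, 43, 44, 1, -100, -100]])  # -100 marks padded label positions
decoder_input_ids = shift_right(labels, decoder_start_token_id=0, pad_token_id=0)

# When no decoder_attention_mask is supplied, the model falls back to all ones,
# i.e. the padded positions are attended to as if they were real tokens.
default_mask = torch.ones_like(decoder_input_ids)
```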
Re: @StevenSong, regarding “labels are shifted automatically to the left for language modeling training”:
The colab you linked seems to use BertLMHeadModel for the decoder. I believe the left-shifting behavior is here.
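For context, the left shift in the causal-LM loss looks roughly like this (a paraphrase of the pattern used by LM-head models such as BertLMHeadModel, not a verbatim copy):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(prediction_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Predict token t+1 from position t: drop the last logit and the first label."""
    vocab_size = prediction_scores.size(-1)
    shifted_scores = prediction_scores[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()
    # F.cross_entropy ignores positions labelled -100 by default.
    return F.cross_entropy(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))
```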
Hi, I’ve noticed this seems to be the same for other model classes, e.g. BART/mBART and T5. For all of them, the documentation states that the default behavior of `decoder_attention_mask` is to generate a tensor that ignores pad tokens in `decoder_input_ids`, but it seems only a causal mask is used if no attention mask is passed to the model explicitly, see e.g. https://github.com/huggingface/transformers/blob/2f3ea08a077ba3133fa8a604b22436cad250b055/src/transformers/models/bart/modeling_bart.py#L932-L953. In comparison, the original fairseq implementation for BART/mBART takes padding into account by default: https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/fairseq/models/transformer/transformer_decoder.py#L327-L329. I would think this is the same for T5.
The fact that this doesn’t seem to be done here is a bit misleading. Users might not be aware that they need to pass the correct attention masks themselves, especially considering that none of the examples in the respective model docs or training scripts like https://github.com/huggingface/transformers/blob/v4.32.0/examples/pytorch/translation/run_translation_no_trainer.py pass decoder attention masks either.
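To spell out the difference: the causal mask alone only prevents attending to future positions, while ignoring padding additionally requires a padding-aware mask, which today you have to pass in yourself. A rough sketch (the pad id is purely illustrative):

```python
import torch

decoder_input_ids = torch.tensor([[2, 14, 15, 16, 1, 1]])  # 1 = pad, illustrative
seq_len = decoder_input_ids.size(1)

# Causal part: position i may only attend to positions j <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding part: padded key positions should never be attended to.
padding = decoder_input_ids != 1                      # (batch, seq_len)

# Roughly what fairseq applies by default; in transformers the padding part is
# only included if a decoder_attention_mask is passed explicitly.
combined = causal.unsqueeze(0) & padding.unsqueeze(1)  # (batch, seq_len, seq_len)
```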
Sorry @StevenSong, I did not really have the time to look at this, will do so when I can!