transformers: EncoderDecoder does not automatically create decoder_attention_mask to match decoder_input_ids
System Info
- `transformers` version: 4.31.0
- Platform: Linux-4.15.0-192-generic-x86_64-with-glibc2.27
- Python version: 3.11.4
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I’m using a pretrained BERT model to build a bert2bert model with an EncoderDecoderModel. According to the documentation and a deprecation warning in the source code, you no longer need to pass in `decoder_input_ids`, as they’ll be generated automatically from `labels`. The docs also state that the default behavior of `decoder_attention_mask` is to generate it automatically based on padded tokens in `decoder_input_ids`, so you shouldn’t need to pass the decoder attention mask either.
However, when I pass only `input_ids` and `attention_mask` for the encoder plus `labels`, I get a warning to the effect of “we strongly recommend passing an attention mask”. If I explicitly pass `input_ids`, `attention_mask`, `decoder_input_ids`, `decoder_attention_mask`, and `labels`, the warning goes away. Looking at the implementation that creates `decoder_input_ids` from `labels`, it does indeed skip generating a `decoder_attention_mask` and simply passes through the value from the arguments, in this case `None`:
https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L619-L637
You can recreate the warning in the notebook that Patrick made for the blog (https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb#scrollTo=yoN2q0hZUbXN&line=11&uniqifier=1). Specifically, in the `process_data_to_model_inputs` function, just comment out the lines that explicitly set `decoder_input_ids` and `decoder_attention_mask`.
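For reference, here is a minimal sketch of the setup that produces the warning (the checkpoint names and example sentences are just illustrative, not taken from the issue):

```python
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# A padded batch for the encoder inputs and for the labels.
inputs = tokenizer(
    ["a short source", "a somewhat longer source sentence"],
    padding=True, return_tensors="pt",
)
labels = tokenizer(
    ["a short target", "a somewhat longer target sentence"],
    padding=True, return_tensors="pt",
).input_ids

# Only labels are passed for the decoder side: decoder_input_ids are created
# internally from labels, but no decoder_attention_mask is derived from them,
# so the decoder warns that no attention mask was provided.
outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,
)
```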
Expected behavior
I’d expect that if you can pass just `labels` to the forward call of EncoderDecoder and it creates `decoder_input_ids`, it would also create `decoder_attention_mask`. The fix is probably only a few lines:
```python
if (labels is not None) and (decoder_input_ids is None and decoder_inputs_embeds is None):
    decoder_input_ids = shift_tokens_right(
        labels, self.config.pad_token_id, self.config.decoder_start_token_id
    )
    if decoder_attention_mask is not None:
        # Some error for passing only one of decoder_input_ids / decoder_attention_mask?
        raise ValueError(
            "decoder_attention_mask was passed without decoder_input_ids; pass both or neither."
        )
    decoder_attention_mask = torch.where(decoder_input_ids == self.config.pad_token_id, 0, 1)
```
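Until something like this lands, a workaround is to build the mask yourself from the decoder-side token ids and pass it explicitly together with `decoder_input_ids`; a minimal sketch, with a hypothetical helper name:

```python
import torch

def make_decoder_attention_mask(token_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Hypothetical helper: 1 for real tokens, 0 for padding."""
    return (token_ids != pad_token_id).long()

# Example with pad_token_id = 0; the ids are illustrative only.
ids = torch.tensor([[101, 7592, 2088, 102, 0, 0],
                    [101, 7592, 102, 0, 0, 0]])
print(make_decoder_attention_mask(ids, pad_token_id=0))
# tensor([[1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0, 0]])
```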
@rangehow
That seems fair. If we are creating default decoder_input_ids from labels, then it makes sense to create a corresponding default for the decoder_attention_mask. Basically, do https://github.com/huggingface/transformers/pull/26752 but for more models.
Obviously, if we are supplied with either `decoder_input_ids` or `decoder_attention_mask`, we won’t generate any defaults.
Also, do you happen to have some code that illustrates this? I imagine you’d run into the “strongly recommend passing in attention mask” warning when training with default args using Trainer on BERT/T5, but can you verify this?
I’m curious to hear opinions from @gante and @ArthurZucker regarding this. I can add this to more models if they’re good with it (unless @rangehow wants to do it).
For reference, here’s the code in T5 where we create default decoder_input_ids: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1703
If we don’t pass in an attention mask, it just creates a default one with all ones: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1004
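Roughly, the labels → decoder_input_ids shift and the all-ones fallback amount to something like this (a paraphrase for illustration, not the exact library code):

```python
import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int, pad_token_id: int) -> torch.Tensor:
    """Prepend the start token, drop the last position, and replace -100 with the pad id."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

labels = torch.tensor([[42, 43, 44, 1, -100, -100]])  # -100 marks padded label positions
decoder_input_ids = shift_right(labels, decoder_start_token_id=0, pad_token_id=0)

# When no decoder_attention_mask is supplied, the model falls back to all ones,
# i.e. the padded positions are attended to as if they were real tokens.
default_mask = torch.ones_like(decoder_input_ids)
```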
Re: @StevenSong, regarding “labels are shifted automatically to the left for language modeling training”:
The colab you linked seems to use BertLMHeadModel for the decoder. I believe the left-shifting behavior is here.
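For context, the left shift in the causal-LM loss looks roughly like this (a paraphrase of the pattern used by LM-head models such as BertLMHeadModel, not a verbatim copy):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(prediction_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Predict token t+1 from position t: drop the last logit and the first label."""
    vocab_size = prediction_scores.size(-1)
    shifted_scores = prediction_scores[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()
    # F.cross_entropy ignores positions labelled -100 by default.
    return F.cross_entropy(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))
```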
Hi, I’ve noticed this seems to be the same for other model classes, e.g. BART/mBART and T5. For all of them, the documentation states that the default behavior of `decoder_attention_mask` is to generate a tensor that ignores pad tokens in `decoder_input_ids`, but it seems only a causal mask is used if no attention mask is passed to the model explicitly, see e.g. https://github.com/huggingface/transformers/blob/2f3ea08a077ba3133fa8a604b22436cad250b055/src/transformers/models/bart/modeling_bart.py#L932-L953. In comparison, the original fairseq implementation for BART/mBART takes padding into account by default: https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/fairseq/models/transformer/transformer_decoder.py#L327-L329. I would think this is the same for T5.
The fact that this doesn’t seem to be done here is a bit misleading. Users might not be aware that they need to pass the correct attention masks themselves, especially considering that none of the examples in the respective model docs or training scripts like https://github.com/huggingface/transformers/blob/v4.32.0/examples/pytorch/translation/run_translation_no_trainer.py pass decoder attention masks either.
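To spell out the difference: the causal mask alone only prevents attending to future positions, while ignoring padding additionally requires a padding-aware mask, which today you have to pass in yourself. A rough sketch (the pad id is purely illustrative):

```python
import torch

decoder_input_ids = torch.tensor([[2, 14, 15, 16, 1, 1]])  # 1 = pad, illustrative
seq_len = decoder_input_ids.size(1)

# Causal part: position i may only attend to positions j <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding part: padded key positions should never be attended to.
padding = decoder_input_ids != 1                      # (batch, seq_len)

# Roughly what fairseq applies by default; in transformers the padding part is
# only included if a decoder_attention_mask is passed explicitly.
combined = causal.unsqueeze(0) & padding.unsqueeze(1)  # (batch, seq_len, seq_len)
```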
Sorry @StevenSong, I did not really have the time to look at this, will do so when I can!