transformers: encode_plus not returning attention_mask and not padding
🐛 Bug
Tested with RoBERTa and BERT on the master branch: the encode_plus method of the tokenizer does not return an attention mask. The documentation states that an attention_mask is returned by default, but I only get back the input_ids and the token_type_ids. Even when explicitly specifying return_attention_mask=True, I don't get it back.
If these specific tokenizers (RoBERTa/BERT) don't support this functionality (which would seem odd), it might be useful to note that in the documentation as well.
As a small note, there's also a typo in the documentation:
return_attention_mask – (optional) Set to False to avoir returning attention mask (default True)
Finally, it seems that pad_to_max_length isn't padding my input (see the example below). I also tried True instead of an integer, hoping that it would automatically pad up to the max sequence length in the batch, but to no avail.
```python
from transformers import BertTokenizer

if __name__ == '__main__':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # orig_text and edit_text are lists of sentences
    orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
    edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

    # iterate over sentence pairs
    for orig_sents, edit_sents in zip(orig_text, edit_text):
        orig_tokens = tokenizer.tokenize(orig_sents)
        edit_tokens = tokenizer.tokenize(edit_sents)
        seqs = tokenizer.encode_plus(orig_tokens,
                                     edit_tokens,
                                     return_attention_mask=True,
                                     return_tensors='pt',
                                     pad_to_max_length=120)
        print(seqs)
```
Output:
```
{'input_ids': tensor([[ 101, 1045, 2066, 26191, 1012, 102, 2079, 2017, 1029, 102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 7483, 1996, 5653, 2386, 2234, 2011, 999, 102, 2002, 5359, 1037, 6547, 7427, 1012, 102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 2079, 2017, 5959, 16324, 1029, 102, 2026, 13055, 2074, 17776, 2070, 999, 102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
```
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 16 (7 by maintainers)
Hey! For me, setting pad_to_max_length results in an error being thrown. I just tried it out with the master branch, but this resulted in the same error. The code I'm executing:
The error that I am getting:
Aha, great. I couldn't wait because I needed it for a shared task, but nice to see it's taking form. Almost there!
@BramVanroy Thanks for your comment! It made me try it out in just a plain Python file instead of a Jupyter notebook and it worked…
Hm, you're right. I think it was (again) an issue with the notebook that I was testing this time, where some values from previous cells were used or something like that.
Thanks for the fix!
Now that we're on the topic, though, it might be nice to have a convenience method for batch processing? Something along these lines, where a pad_to_batch_length argument pads up to the longest sequence in the batch (rather than the model's max_seq_length) to save computation/memory; a rough sketch of what I mean is below.
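Roughly, a manual version of what I have in mind (just a sketch: the encode_batch helper and the padding logic are my own illustration, not part of transformers, and it assumes encode_plus returns plain Python lists when return_tensors is omitted):

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_batch(orig_sents, edit_sents):
    # Hypothetical helper: encode sentence pairs, then pad everything to the
    # longest sequence in *this batch* rather than the model's maximum length.
    encoded = [tokenizer.encode_plus(o, e, return_attention_mask=True)
               for o, e in zip(orig_sents, edit_sents)]
    batch_max = max(len(item['input_ids']) for item in encoded)

    input_ids, token_type_ids, attention_mask = [], [], []
    for item in encoded:
        pad_len = batch_max - len(item['input_ids'])
        input_ids.append(item['input_ids'] + [tokenizer.pad_token_id] * pad_len)
        token_type_ids.append(item['token_type_ids'] + [0] * pad_len)
        attention_mask.append(item['attention_mask'] + [0] * pad_len)

    return {'input_ids': torch.tensor(input_ids),
            'token_type_ids': torch.tensor(token_type_ids),
            'attention_mask': torch.tensor(attention_mask)}

batch = encode_batch(['I like bananas.', 'Yesterday the mailman came by!'],
                     ['Do you?', 'He delivered a mystery package.'])
print(batch['input_ids'].shape)  # (2, length of the longest pair in the batch)
```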
Hi, thanks for raising this issue! When running this code on the master branch, I do get the attention mask as output, but only when removing the return_tensors argument. When running with this argument, it crashes because a list is being concatenated to a tensor. I'm fixing this in #2148.
It's weird that you didn't get an error when running this line. On which commit are you based?
encode and encode_plus accept keyword arguments, so they won't raise an error if one of your arguments (pad_to_max_length) isn't supposed to be there (e.g. if you're running an old version of transformers). pad_to_max_length is a boolean flag: if set to True with no max_length specified, it pads the sequence up to the maximum sequence length the model can handle; if a max_length is specified, it pads the sequence up to that number.
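To illustrate the two cases described above, a minimal sketch (based on that explanation; exact behaviour and defaults may vary between versions):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# No max_length: pad up to the model's maximum input length (512 for bert-base-uncased).
enc = tokenizer.encode_plus('I like bananas.', 'Do you?',
                            pad_to_max_length=True,
                            return_attention_mask=True)
print(len(enc['input_ids']))       # 512

# With max_length: pad up to that number instead.
enc = tokenizer.encode_plus('I like bananas.', 'Do you?',
                            max_length=120,
                            pad_to_max_length=True,
                            return_attention_mask=True)
print(len(enc['input_ids']))       # 120
print(sum(enc['attention_mask']))  # number of real (non-padding) tokens
```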