transformers: encode_plus not returning attention_mask and not padding

šŸ› Bug

Tested with RoBERTa and BERT on the master branch: the tokenizer’s encode_plus method does not return an attention mask. The documentation states that an attention_mask is returned by default, but I only get back the input_ids and the token_type_ids. Even when explicitly passing return_attention_mask=True, I don’t get one back.

If these specific tokenizers (RoBERTa/BERT) don’t support this functionality (which would seem odd), it might be useful to also put that in the documentation.

As a small note, there’s also a typo in the documentation:

return_attention_mask – (optional) Set to False to avoir returning attention mask (default True)

Finally, it seems that pad_to_max_length isn’t padding my input (see the example below). I also tried passing True instead of an integer, hoping that it would automatically pad up to the maximum sequence length in the batch, but to no avail.


from transformers import BertTokenizer

if __name__ == '__main__':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
    edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

    # orig_text and edit_text are lists of sentences; iterate over sentence pairs
    for orig_sents, edit_sents in zip(orig_text, edit_text):
        orig_tokens = tokenizer.tokenize(orig_sents)
        edit_tokens = tokenizer.tokenize(edit_sents)

        seqs = tokenizer.encode_plus(orig_tokens,
                                     edit_tokens,
                                     return_attention_mask=True,
                                     return_tensors='pt',
                                     pad_to_max_length=120)
        print(seqs)

Output:

{'input_ids': tensor([[  101,  1045,  2066, 26191,  1012,   102,  2079,  2017,  1029,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 7483, 1996, 5653, 2386, 2234, 2011,  999,  102, 2002, 5359, 1037, 6547, 7427, 1012,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[  101,  2079,  2017,  5959, 16324,  1029,   102,  2026, 13055,  2074, 17776,  2070,   999,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Hey! For me, setting pad_to_max_length results in an error. I just tried it out with the master branch, but that resulted in the same error. The code I’m executing:

titles = [['allround developer', 'Visual Studio Code'],
          ['allround developer', 'IntelliJ IDEA / PyCharm'],
          ['allround developer', 'Version Control']]
enc_titles = [[tokenizer.encode_plus(title[0], max_length=13, pad_to_max_length=True),
               tokenizer.encode_plus(title[1], max_length=13, pad_to_max_length=True)]
              for title in titles]

The error that I am getting:

<ipython-input-213-349f66a39abe> in <module>
      4 # titles = [' '.join(title) for title in titles]
      5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

<ipython-input-213-349f66a39abe> in <listcomp>(.0)
      4 # titles = [' '.join(title) for title in titles]
      5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, return_tensors, return_token_type_ids, return_overflowing_tokens, return_special_tokens_mask, **kwargs)
    816                 If there are overflowing tokens, those will be added to the returned dictionary
    817             stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
--> 818                 from the main sequence returned. The value of this argument defines the number of additional tokens.
    819             truncation_strategy: string selected in the following options:
    820                 - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
    808                 the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
    809                 method)
--> 810             text_pair: Optional second sequence to be encoded. This can be a string, a list of strings (tokenized
    811                 string using the `tokenize` method) or a list of integers (tokenized string ids using the
    812                 `convert_tokens_to_ids` method)

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:
--> 659                     result += [tok]
    660                 elif i == len(split_text) - 1:
    661                     if sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
    654             result = []
    655             split_text = text.split(tok)
--> 656             for i, sub_text in enumerate(split_text):
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in <genexpr>(.0)
    654             result = []
    655             split_text = text.split(tok)
--> 656             for i, sub_text in enumerate(split_text):
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:

TypeError: _tokenize() got an unexpected keyword argument 'pad_to_max_length'

Aha, great. I couldn’t wait because I needed it for a shared task, but nice to see it’s taking form. Almost there!

@BramVanroy Thanks for your comment! It made me try it out in just a plain Python file instead of a Jupyter notebook and it worked… 😄

Hm, you’re right. I think it was (again) an issue with the notebook I was testing in this time, where stale values from previous cells were being used or something like that.

Thanks for the fix!

Now that we’re on the topic, though, it might be nice to have a convenience method for batch processing? Something along these lines, where pad_to_batch_length pads up to the maximum length in the batch (rather than the model’s max_seq_length) to save computation/memory.

from collections import defaultdict

def encode_batch_plus(batch, batch_pair=None, pad_to_batch_length=False, return_tensors=None, **kwargs):
    def merge_dicts(list_of_ds):
        # there's probably a better way of doing this
        d = defaultdict(list)
        for _d in list_of_ds:
            for _k, _v in _d.items():
                d[_k].append(_v)

        return dict(d)

    encoded_inputs = []
    batch_pair = [None] * len(batch) if batch_pair is None else batch_pair
    for first_sent, second_sent in zip(batch, batch_pair):
        encoded_inputs.append(tokenizer.encode_plus(first_sent,
                                                    second_sent,
                                                    **kwargs))

    encoded_inputs = merge_dicts(encoded_inputs)

    if pad_to_batch_length:
        max_batch_len = max(len(ids) for ids in encoded_inputs['input_ids'])
        # pad up to max_batch_len, similar to how it's done in prepare_for_model()

    if return_tensors:
        # convert to tensors, similar to how it's done in prepare_for_model()
        pass

    return encoded_inputs
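
For illustration, a hypothetical call to the sketch above could look like this (encode_batch_plus and pad_to_batch_length are not an existing transformers API, and the global tokenizer is assumed to be defined as in the earlier snippets):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

batch = ['I like bananas.', 'Yesterday the mailman came by!']
batch_pair = ['Do you?', 'He delivered a mystery package.']

# Returns a dict of lists (input_ids, token_type_ids, ...), one entry per sentence pair.
encoded = encode_batch_plus(batch, batch_pair, add_special_tokens=True)
print(len(encoded['input_ids']))  # 2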

Hi, thanks for raising this issue!

When running this code on the master branch, I do get the attention mask as output, but only when removing the return_tensors argument. When running with this argument, it crashes because a list is being concatenated to a tensor. I’m fixing this in #2148.
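
In the meantime, a possible workaround (just a sketch, assuming master behaves as described above) is to drop return_tensors and build the tensors yourself from the returned lists:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Without return_tensors, encode_plus returns plain Python lists.
encoded = tokenizer.encode_plus('I like bananas.', 'Do you?',
                                return_attention_mask=True)
# Add a batch dimension and convert to tensors manually.
tensors = {k: torch.tensor([v]) for k, v in encoded.items()}
print(tensors['attention_mask'])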

It’s weird that you didn’t get an error when running that line. Which commit are you on? encode and encode_plus accept **kwargs, so they wouldn’t raise an error if one of your arguments (pad_to_max_length) wasn’t supposed to be there (e.g. if you’re running an old version of transformers).

pad_to_max_length is a boolean flag: if set to True with no max_length specified, it will pad the sequence up to the maximum sequence length the model can handle. If a max_length is specified, it will pad the sequence up to that number.
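
To make the two behaviours concrete, here is a minimal sketch (assuming a checkout where pad_to_max_length is supported, i.e. a recent master at the time of writing):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# With max_length: pad (or truncate) to exactly 13 tokens.
padded = tokenizer.encode_plus('allround developer',
                               max_length=13,
                               pad_to_max_length=True)
print(len(padded['input_ids']))  # 13

# Without max_length: pad up to the model's maximum input size
# (512 for bert-base-uncased).
padded_full = tokenizer.encode_plus('allround developer',
                                    pad_to_max_length=True)
print(len(padded_full['input_ids']))  # 512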