transformers: Slow tokenizers return overflowing tokens in reversed order

When implementing the slow tokenizer for LayoutLMv2, I spotted some weird behaviour of slow tokenizers when specifying return_overflowing_tokens=True: the overflowing tokens are returned in reversed order, and no padding is performed, unlike with fast tokenizers.

Small example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "hello my name is niels"

encoding = tokenizer(text, padding=True, max_length=6, truncation=True, return_overflowing_tokens=True)

When checking out the encoding, it looks as follows:

print(tokenizer.decode(encoding.input_ids))
# prints '[CLS] hello my name is [SEP]'

print(tokenizer.decode(encoding.overflowing_tokens))
# prints '##els ni'

As you can see, the overflowing tokens are returned in reversed order, and they are not padded up to the max length of 6 tokens. In contrast, BertTokenizerFast does everything correctly:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "hello my name is niels"

encoding = tokenizer(text, padding=True, max_length=6, truncation=True, return_overflowing_tokens=True)

Inspecting this encoding gives:

print(tokenizer.decode(encoding.input_ids[0]))
# prints '[CLS] hello my name is [SEP]'

print(tokenizer.decode(encoding.input_ids[1]))
# prints '[CLS] niels [SEP] [PAD] [PAD]'

So I guess we have some work to do for slow tokenizers to work correctly.
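
Until this is fixed, a possible workaround for the single-sequence, stride=0 case is to reverse the overflowing token ids before decoding. This is only a sketch that relies on the reversed-order behaviour shown above; it does not add the missing special tokens or padding:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer(
    "hello my name is niels",
    padding=True,
    max_length=6,
    truncation=True,
    return_overflowing_tokens=True,
)

# The slow tokenizer builds the overflow list token by token from the end of the
# sequence, so for a single sequence with stride=0 it comes back reversed.
overflow_in_order = list(reversed(encoding["overflowing_tokens"]))
print(tokenizer.decode(overflow_in_order))
# should print 'niels' instead of '##els ni'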

cc @LysandreJik @SaulLu @n1t0

Most upvoted comments

Yes, the LayoutLMv2 PR was merged before the PR that fixed the reverse order. So feel free to update the truncate_sequences method of LayoutLMv2Tokenizer.
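
For reference, the fix essentially boils down to taking the overflow as one contiguous slice instead of building it up token by token. A minimal sketch of that pattern for the simple single-sequence case (the function name and signature here are illustrative, not the actual truncate_sequences code in tokenization_utils_base.py):

def truncate_keeping_order(ids, num_tokens_to_remove, stride=0):
    # Nothing to truncate.
    if num_tokens_to_remove <= 0:
        return ids, []
    # Take the overflowing tokens as a single slice from the end so they keep
    # their original left-to-right order; the stride re-includes a few of the
    # kept tokens for context.
    window_len = min(len(ids), stride + num_tokens_to_remove)
    overflowing_tokens = ids[-window_len:]
    truncated_ids = ids[:-num_tokens_to_remove]
    return truncated_ids, overflowing_tokens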

@Apoorvgarg-creator I can't explain it, but a fresh environment solved the issue with the toy example above; it now correctly prints niels. However, I'm still seeing unexpected behavior with the following example:

Environment:

$ conda create -n test python=3.8
$ source activate test
$ pip install git+https://github.com/huggingface/transformers.git
...
$ pip list
Package            Version
------------------ -------------------
certifi            2021.5.30
charset-normalizer 2.0.4
click              8.0.1
filelock           3.0.12
huggingface-hub    0.0.16
idna               3.2
joblib             1.0.1
numpy              1.21.2
packaging          21.0
pip                21.0.1
pyparsing          2.4.7
PyYAML             5.4.1
regex              2021.8.28
requests           2.26.0
sacremoses         0.0.45
setuptools         52.0.0.post20210125
six                1.16.0
tokenizers         0.10.3
tqdm               4.62.2
transformers       4.11.0.dev0
typing-extensions  3.10.0.2
urllib3            1.26.6
wheel              0.37.0

Reproducible example:

from transformers import BertTokenizer, LayoutLMv2Tokenizer

max_length = 8
n_src_tok_per_sample = max_length - 2  # account for the [CLS] and [SEP] special tokens
words = (
    n_src_tok_per_sample * ["a"]
    + n_src_tok_per_sample * ["b"]
    + n_src_tok_per_sample * ["c"]
)
print("Original words: ", words)


print(50 * "=" + "\nBERT\n" + 50 * "=")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded_inputs = tokenizer(
    text=words,
    padding="max_length",
    pad_to_multiple_of=8,
    truncation=True,
    max_length=max_length,
    return_overflowing_tokens=True,
    return_tensors="pt",
    is_split_into_words=True,
)
input_ids = encoded_inputs["input_ids"]
print("Decoded input_ids: ", [tokenizer.decode(x) for x in input_ids])

overflowing_tokens = encoded_inputs["overflowing_tokens"]
print("Decoded overflow tokens: ", [tokenizer.decode(x) for x in overflowing_tokens])

print(50 * "=" + "\nLayout\n" + 50 * "=")
tokenizer = LayoutLMv2Tokenizer.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    only_label_first_subword=False,
)

encoded_inputs = tokenizer(
    text=words,
    boxes=len(words) * [[1, 1, 1, 1]],
    padding="max_length",
    pad_to_multiple_of=8,
    truncation=True,
    max_length=max_length,
    return_overflowing_tokens=True,
    return_tensors="pt",
    is_split_into_words=True,
)
input_ids = encoded_inputs["input_ids"]
print("Decoded input_ids: ", [tokenizer.decode(x) for x in input_ids])

overflowing_tokens = encoded_inputs["overflowing_tokens"]
print("Decoded overflow tokens: ", [tokenizer.decode(x) for x in overflowing_tokens])

Output:

Original words:  ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c']
==================================================
BERT
==================================================
Decoded input_ids:  ['[CLS] a a a a a a [SEP]']
Decoded overflow tokens:  ['b b b b b b c c c c c c']
==================================================
Layout
==================================================
Decoded input_ids:  ['[CLS] a a a a a a [SEP]']
Decoded overflow tokens:  ['c c c c c c b b b b b b']

@SaulLu @NielsRogge Thank you for the guidance. I will go through the truncate_sequences method.

@Apoorvgarg-creator It is extremely kind of you to offer your help on this problem!

Since I had started looking into the problem of the strange order of tokens in overflowing_tokens (“making sure overflowing tokens are returned in the correct order”), let me share what I had identified, in case it is of any help:

  • Some behaviours are not covered by the test_maximum_encoding_length_pair_input and test_maximum_encoding_length_single_input tests in the test_tokenization_common.py file. We should extend these tests so that overflowing tokens are checked for all TruncationStrategy types, with both a single sequence and a pair of sequences (a minimal sketch of such a check follows this list);
  • As said by @NielsRogge, the problem is most likely with the truncate_sequences method in tokenization_utils_base.py.
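
As an illustration, a check along these lines should pass once the order is fixed (a minimal sketch assuming the default LONGEST_FIRST truncation and stride=0, not the actual test_tokenization_common.py code):

from transformers import BertTokenizer

def check_overflowing_tokens_order():
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    text = "hello my name is niels"
    ids = tok.encode(text, add_special_tokens=False)

    enc = tok(text, max_length=6, truncation=True, return_overflowing_tokens=True)

    # With max_length=6 and two special tokens, four content tokens fit, so the
    # overflow should be the remaining tail of the sequence, in original order.
    assert enc["overflowing_tokens"] == ids[4:]

check_overflowing_tokens_order()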

I would like to take this opportunity to comment on the other 2 points raised by @NielsRogge (“add special tokens to the overflowing tokens” and “add an overflow_to_sample_mapping, similar to the fast tokenizers”). Indeed, the slow and fast tokenizers handle overflowing tokens quite differently. I think it would be good to have the opinion of @LysandreJik, @sgugger and @n1t0 (and if anyone else wants to weigh in, it would be a pleasure!) on changing the API of the slow tokenizers so that it matches that of the fast tokenizers, since backward compatibility may need to be preserved.
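
For context, here is a minimal illustration of the fast-tokenizer behaviour referred to above; the exact shapes depend on the transformers version, but the idea is that every overflow chunk becomes a full, padded sample plus a mapping back to its original input:

from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

enc = tok(
    ["hello my name is niels", "short text"],
    max_length=6,
    padding=True,
    truncation=True,
    return_overflowing_tokens=True,
)

# Each overflow chunk is returned as its own row, with special tokens and padding.
print(len(enc["input_ids"]))              # more rows than input texts if anything overflowed
# overflow_to_sample_mapping maps every row back to the original sample index.
print(enc["overflow_to_sample_mapping"])  # e.g. [0, 0, 1]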

I see someone also already noticed this: #6697