transformers: Slow tokenizers return overflowing tokens in reversed order
When implementing the slow tokenizer for LayoutLMv2, I noticed some unexpected behaviour in slow tokenizers when return_overflowing_tokens=True is specified: the overflowing tokens are returned in reversed order, and no padding is applied to them, unlike with fast tokenizers.
Small example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "hello my name is niels"
encoding = tokenizer(text, padding=True, max_length=6, truncation=True, return_overflowing_tokens=True)
When checking out the encoding, it looks as follows:
print(tokenizer.decode(encoding.input_ids))
# prints '[CLS] hello my name is [SEP]'
print(tokenizer.decode(encoding.overflowing_tokens))
# prints '##els ni'
As you can see, the overflowing tokens are returned in reversed order, and they are not padded up to the max length of 6 tokens. In contrast, BertTokenizerFast handles both correctly:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "hello my name is niels"
encoding = tokenizer(text, padding=True, max_length=6, truncation=True, return_overflowing_tokens=True)
Decoding the resulting sequences gives:
print(tokenizer.decode(encoding.input_ids[0]))
# prints '[CLS] hello my name is [SEP]'
print(tokenizer.decode(encoding.input_ids[1]))
# prints '[CLS] niels [SEP] [PAD] [PAD]'
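The fast tokenizer also reports which original sample each chunk belongs to via an overflow_to_sample_mapping entry in the encoding, which becomes useful when several texts are batched. Continuing the example above; the printed value is my expectation for this single-sentence input, not something verified here:

print(encoding["overflow_to_sample_mapping"])
# expected: [0, 0], i.e. both chunks come from the single input sentence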
So I guess we have some work to do for slow tokenizers to work correctly.
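Until this is fixed, a possible stopgap for the slow-tokenizer snippet at the top is to reverse the returned list. This is only a sketch, under the assumption (suggested by the output above) that the overflow is emitted token by token from the end; it restores the order but does not add any padding:

# Stopgap sketch, reusing `encoding` and `tokenizer` from the first (slow) BertTokenizer example.
overflow_ids = list(reversed(encoding.overflowing_tokens))
print(tokenizer.decode(overflow_ids))
# should print 'niels' instead of '##els ni'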
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 15 (12 by maintainers)
Commits related to this issue
- correct order of overflowing_tokens for slow tokenizer (issue fix #13148) — committed to Apoorvgarg-creator/transformers by Apoorvgarg-creator 3 years ago
- Correct order of overflowing_tokens for slow tokenizer (#13179) * correct order of overflowing_tokens for slow tokenizer (issue fix #13148) * python 3.9 requires sentencepiece version 0.1.94 or ab... — committed to huggingface/transformers by Apoorvgarg-creator 3 years ago
Yes, the LayoutLMv2 PR was merged before the PR that fixed the reverse order. So feel free to update the truncate_sequences method of LayoutLMv2Tokenizer.

@Apoorvgarg-creator – I can't explain it, but a fresh environment solved the issue with the toy example above. It is now correctly printing niels. However, I'm still seeing unexpected behavior with the following example:

Environment:
Reproducible example:
Output:
@SaulLu @NielsRogge Thank you for the guidance. I will go through the truncate_sequences method.

@Apoorvgarg-creator It is extremely kind of you to offer your help on this problem!
As I had started to look at the problem of the strange order of tokens in overflowing_tokens ("making sure overflowing tokens are returned in the correct order"), let me share what I had identified, in case it is of any help:
- the overflowing tokens are currently not checked by the test_maximum_encoding_length_pair_input and test_maximum_encoding_length_single_input tests in the test_tokenization_common.py file, so we should extend these tests to make sure that overflowing tokens are tested for all TruncationStrategy types and with a single sequence or a pair of sequences;
- the code to look at is the truncate_sequences method in tokenization_utils_base.py.

I would like to take this opportunity to comment on the other 2 points ("add special tokens to the overflowing tokens" and "add an overflow_to_sample_mapping, similar to the fast tokenizers") raised by @NielsRogge. Indeed, the slow and fast tokenizers handle overflowing tokens quite differently. I think it would be nice to have the opinion of @LysandreJik, @sgugger and @n1t0 (and if anyone else wants to give their opinion too, it would be a pleasure!) on changing the API of the slow tokenizers so that it matches the one of the fast tokenizers (as there is perhaps a need for backward compatibility).

I see someone also already noticed this: #6697
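For illustration, here is a minimal standalone sketch (my own simplification, not the transformers implementation and not the truncate_sequences signature) of how a truncation step can return the overflow as one contiguous slice, which keeps the tokens in their original reading order instead of popping them off the end one at a time:

def truncate_keep_overflow(ids, max_length, stride=0):
    # Hypothetical helper, not part of transformers: cut `ids` down to
    # `max_length` tokens and hand back the overflow as a contiguous slice,
    # optionally including `stride` tokens of overlap with the kept window.
    if len(ids) <= max_length:
        return ids, []
    overflow_start = max(0, max_length - stride)
    return ids[:max_length], ids[overflow_start:]

kept, overflow = truncate_keep_overflow([1, 2, 3, 4, 5, 6, 7, 8], max_length=6)
# kept     -> [1, 2, 3, 4, 5, 6]
# overflow -> [7, 8]  (original order preserved)

With stride=2 the overflow would be [5, 6, 7, 8], in the same spirit as the stride argument of the tokenizers, which gives the overflowing chunk some overlapping context with the kept window.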