transformers: Word offsets of some fast tokenizers are not compatible with token classification pipeline label aggregation
System Info
- `transformers` version: 4.21.0.dev0
- Platform: macOS-12.4-x86_64-i386-64bit
- Python version: 3.9.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): 2.9.1 (False)
- Flax version (CPU?/GPU?/TPU?): 0.5.2 (cpu)
- Jax version: 0.3.6
- JaxLib version: 0.3.5
- Using GPU in script?: N
- Using distributed or parallel set-up in script?: N
Who can help?
Tagging @Narsil for pipelines and @SaulLu for tokenization. Let me know if I should tag anyone for specific models, but it’s not really a model issue, except in terms of tokenization.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I noticed this issue with a DeBERTa model, but it also affects some others. The high-level issue is that some tokenizers include leading spaces in the offset indices, some exclude them, and some are configurable with `trim_offsets`. When offsets include leading spaces (equivalent to `trim_offsets==False`), the pipeline's word heuristic doesn't work, and the result is that all tokens in the sequence are aggregated into one label. Simple example:
model_name = "brandon25/deberta-base-finetuned-ner"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
ner_aggregate = pipeline("ner", model=model, tokenizer=tokenizer, ignore_labels=[], aggregation_strategy="max")
ner_aggregate("We're from New York")
Result:
[{'entity_group': 'O', 'score': 0.9999778, 'word': " We're from New York", 'start': 0, 'end': 19}]
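To make the root cause concrete, here is a quick way to inspect the offsets this checkpoint's fast tokenizer returns (a minimal sketch; `return_offsets_mapping` requires a fast tokenizer, and the values in the comment are what I expect to see rather than guaranteed output for every checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("brandon25/deberta-base-finetuned-ner")
encoding = tokenizer("We're from New York", return_offsets_mapping=True)

# With this tokenizer I expect the span for "New" to start at the preceding
# space, e.g. (10, 14) rather than (11, 14), so the pipeline's
# "character before the offset is a space" check never fires.
for token, span in zip(encoding.tokens(), encoding["offset_mapping"]):
    print(token, span)
```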
Expected behavior
Expected result, something like:
[{'entity_group': 'O', 'score': 0.9999778, 'word': " We're from", 'start': 0, 'end': 10}, {'entity_group': 'O', 'score': 0.9xxx, 'word': "New York", 'start': 11, 'end': 19}]
If you’d like to see actual output, here’s a colab notebook with relevant models for comparison.
This affects at least these:
- DeBERTa V1
- DeBERTa V2/3
- GPT2 (tested because `DebertaTokenizerFast` is a subclass of `GPT2TokenizerFast`)
- Depending on config, Roberta (and any other tokenizer that honors `trim_offsets==False`; see the snippet below)
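For the Roberta case, here is a rough comparison of the two `trim_offsets` settings (`trim_offsets` is a constructor argument of `RobertaTokenizerFast`; the behavior described in the comments is my understanding of the setting rather than verified output for every checkpoint):

```python
from transformers import RobertaTokenizerFast

text = "We're from New York"

# Default: offsets are trimmed and exclude the leading space.
trimmed = RobertaTokenizerFast.from_pretrained("roberta-base")
# With trim_offsets=False, the leading space stays inside each token's span,
# which is the case that breaks the pipeline heuristic.
untrimmed = RobertaTokenizerFast.from_pretrained("roberta-base", trim_offsets=False)

for name, tok in (("trim_offsets=True", trimmed), ("trim_offsets=False", untrimmed)):
    enc = tok(text, return_offsets_mapping=True)
    print(name, list(zip(enc.tokens(), enc["offset_mapping"])))
```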
The easiest solution would be to update the heuristic. Here is a change that works for a preceding space in the sequence (like the current heuristic) or a leading space in the token. I can turn this into a PR if desired.
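For reference, here is a minimal sketch of the idea behind that change (illustrative only, not the actual pipeline code or variable names): a token counts as starting a new word if there is whitespace either immediately before its offset or at the first character of the token itself.

```python
def starts_new_word(sentence: str, start_ind: int) -> bool:
    """Illustrative heuristic, not the actual transformers implementation.

    A token starts a new word if the character just before its offset is
    whitespace (the current heuristic) or if its own first character is
    whitespace (the case where offsets include the leading space).
    """
    if start_ind == 0:
        return True
    return sentence[start_ind - 1].isspace() or sentence[start_ind].isspace()


# The pipeline's subword fusion would then key off the negation, e.g.:
# is_subword = not starts_new_word(sentence, start_ind)
```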
I know a lot of the default configuration matches reference implementations or published research, so I'm not sure whether the inconsistencies between tokenizers are desired behavior. I did notice, for example, that some sentencepiece tokenizers include leading spaces in the offset indices (DeBERTa V2/3), and some don't (Albert, XLNet). I looked at the converter config and the Rust code (which is pretty opaque to me), but it's not obvious to me why the offsets are different. Do you know, @SaulLu? Is that expected?
I am comparing different architectures to replace a production Bert model and was evaluating models fine-tuned on an internal dataset when I ran into this. I have my manager’s blessing to spend some time on this (and already have! 😂), so I’m happy to work on a PR or help out however I can.
About this issue
- State: closed
- Created 2 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Update pipeline word heuristic to work with whitespace in token offsets This change checks for whitespace in the input string at either the character preceding the token or in the first character of ... — committed to davidbenton/transformers by davidbenton 2 years ago
- Update pipeline word heuristic to work with whitespace in token offsets (#18402) * Update pipeline word heuristic to work with whitespace in token offsets This change checks for whitespace in the ... — committed to huggingface/transformers by davidbenton 2 years ago
- Update pipeline word heuristic to work with whitespace in token offsets (#18402) * Update pipeline word heuristic to work with whitespace in token offsets This change checks for whitespace in the ... — committed to oneraghavan/transformers by davidbenton 2 years ago
- Tokenizer treats space differently Some tokenizers will count space into words for example. Given text: 'hello world', normal bert will output: [('hello', (0, 5)), ('world', (6, 11))] w... — committed to cheungdaven/autogluon by cheungdaven 2 years ago
- [NER] fix issues with some checkpoints (#2301) * reset model_max_length * Tokenizer treats space differently Some tokenizers will count space into words for example. Given text: 'hell... — committed to autogluon/autogluon by cheungdaven 2 years ago
@davidbenton what’s your environment? I can’t seem to reproduce in my local env.
Do you mind creating a new issue for this? Report it like a regular bug; there should be tools to print your exact env. https://github.com/huggingface/transformers/issues/new?assignees=&labels=bug&template=bug-report.yml
As I said, slow tests can sometimes be a little more flaky than fast tests, but usually within acceptable bounds (PyTorch will modify kernels, which affects values ever so slightly, but it can pile up; the Python version can break dictionary order, etc.).
Thanks for flagging, I am looking into it right now 😃
I’m not sure how to read your answer ahah. The tokenizer I have in mind is, for example, Bert’s: Bert’s tokenizer doesn’t have `trim_offsets` set to True, but the spaces are removed during the pre-tokenization step and the “word” boundaries are built the other way, by adding “##” to tokens that don’t start a word.
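To make that concrete, a quick illustration of the “##” convention (assuming `bert-base-uncased`; the exact splits depend on the vocab):

```python
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

# BERT drops the spaces during pre-tokenization; continuation pieces are
# marked with "##" instead of the space being encoded into the offsets.
print(tok.tokenize("huggingface tokenizers"))

enc = tok("huggingface tokenizers", return_offsets_mapping=True)
print(list(zip(enc.tokens(), enc["offset_mapping"])))
```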
`trim_offsets` cannot be linked to the heuristic IMO. The heuristic is just trying to determine whether what we’re looking at is a “word”. Currently it only checks whether the character before the offset is a space. But a prefix space could also exist, so checking whether the first character (in the original string) corresponding to this token is a space is also valid IMO. Again, extremely biased towards space-separated languages, but working.
I may have to dive and see really what the issue is, but this is my current understanding without exactly looking at the issue in detail.