transformers: Your example code for WNUT NER produces array indexing ValueError
Environment info
- `transformers` version: 3.4
- Platform: Google Colab
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0
- Tensorflow version (GPU?):
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (Bert, XLNet …): DistilBERT
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I’m trying to run the example code from Advanced Guides --> Fine-tuning with custom datasets --> Token Classification with W-NUT Emerging Entities.
Steps to reproduce the behavior:
- I already have a Google CoLab notebook with your code.
- I use the tokenizer with `max_length=64`, which is typically my “best practice” choice. Note that if I set `max_length=None`, everything runs successfully.
```python
max_length = 64
encodings = tokenizer(texts, is_split_into_words=True, max_length=max_length, return_offsets_mapping=True, padding=True, truncation=True)
```
- When I run `encode_tags()` on the WNUT data, I get a `ValueError`:
```python
labels = encode_tags(tags, encodings)
```

which fails with:

```
     11         # set labels whose first offset position is 0 and the second is not 0
---> 12         doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
     13         encoded_labels.append(doc_enc_labels.tolist())
     14

ValueError: NumPy boolean array indexing assignment cannot assign 29 input values to the 24 output values where the mask is true
```
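For reference, the `encode_tags()` helper from that guide looks roughly like this (reproduced from memory, so take it as a sketch rather than a verbatim copy); the traceback's lines 11–14 point at the boolean-mask assignment inside it, and `tag2id` is the tag-to-index mapping built earlier in the guide:

```python
import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100 (ignored by the loss)
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels
```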
Expected behavior
I expect `encode_tags()` to return the correct IOB tag labels when I run your tokenizer with `max_length=64`.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 15
- Comments: 16 (3 by maintainers)
I solved the issue by replacing the boolean-mask assignment line with a version that slices `doc_labels` down to the number of positions the mask selects (a hedged reconstruction follows below). This way, it will only map the first `np.sum(mask)` true indices of `doc_labels` in case of any indexing problem. I am a newbie 🤗 Transformers user, and I wonder if this solution may cause any problems.
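A minimal sketch of that replacement, using the variable names from the guide's `encode_tags()` (`doc_enc_labels`, `arr_offset`, `doc_labels`, `np`); the thread does not show the commenter's exact code, so treat this as a reconstruction:

```python
# Name the mask, then assign only as many labels as it has True positions;
# labels for tokens lost to truncation are silently dropped.
mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
```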
For me, the error occurred using the example code in combination with a SentencePiece tokenizer (e.g. XLM-RoBERTa). Switching to the updated code used in the run_ner.py script (https://github.com/huggingface/transformers/blob/ad072e852816cd32547504c2eb018995550b126a/examples/token-classification/run_ner.py) solved the issue for me.
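For anyone who wants the gist of that newer approach without digging through the script, here is a minimal sketch (my own paraphrase, not a verbatim copy of run_ner.py) of label alignment via `word_ids()`, which stays consistent with whatever truncation the tokenizer applies. The function name `encode_tags_by_word_ids` and the DistilBERT checkpoint are placeholders; it assumes a fast tokenizer and the `tag2id` mapping from the guide:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

def encode_tags_by_word_ids(texts, tags, tag2id, max_length=64):
    # Tokenize pre-split words; truncation happens here, and word_ids() reflects it.
    encodings = tokenizer(texts, is_split_into_words=True, max_length=max_length,
                          padding=True, truncation=True)
    encoded_labels = []
    for i, doc_tags in enumerate(tags):
        word_ids = encodings.word_ids(batch_index=i)  # None for [CLS]/[SEP]/padding
        doc_enc_labels, previous_word = [], None
        for word_id in word_ids:
            if word_id is None:
                doc_enc_labels.append(-100)                       # ignored by the loss
            elif word_id != previous_word:
                doc_enc_labels.append(tag2id[doc_tags[word_id]])  # label first sub-token
            else:
                doc_enc_labels.append(-100)                       # mask continuation pieces
            previous_word = word_id
        encoded_labels.append(doc_enc_labels)
    return encodings, encoded_labels
```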
I am also facing this issue. I am using a custom dataset and haven't passed any `max_length` argument to the tokenizer. Any idea how to fix this? The same piece of code works fine on the W-NUT dataset.
I figured out the problem. A typical input instance has `N` tokens and `N` NER tags with a one-to-one correspondence. When you pass the sentence to the tokenizer, it adds `k` more tokens, either (1) subword tokens (e.g. `##ing`) or (2) special model-specific tokens (e.g. `[CLS]` or `[SEP]`). So now you have `N + k` tokens but still `N` NER tags.

If you apply a max-length truncation (e.g. 64), those `N + k` tokens get truncated to 64, leaving an unpredictable mix of valid tokens and special tokens, because both kinds may have been cut off. However, there are still `N` NER tags, which may no longer line up against the valid tokens since some of the latter may have been truncated.

I fixed the problem by one of several approaches:
- Increasing `max_length` so that it is at least `N + k` (a sketch for choosing such a value follows below). However, this increase forces me to reduce my batch size due to GPU memory constraints.
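As a rough way to pick a `max_length` that covers `N + k` for every instance, one could measure the longest tokenized sequence up front. This is only a sketch under the assumption that `texts` holds the pre-split word lists from the guide and `tokenizer` is the same fast tokenizer:

```python
# Longest tokenized length in the corpus, counting subword and special tokens.
longest = max(
    len(tokenizer(words, is_split_into_words=True)["input_ids"])
    for words in texts
)
print("max_length must be at least", longest, "to keep every labeled token")
```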