transformers: Your example code for WNUT NER produces array indexing ValueError

Environment info

  • transformers version: 3.4
  • Platform: Google Colab
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@stefan-it, @sgugger

Information

Model I am using (Bert, XLNet …): DistilBERT

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I’m trying to run the example code from Advanced Guides --> Fine-tuning with custom datasets --> Token Classification with W-NUT Emerging Entities.

Steps to reproduce the behavior:

  1. I already have a Google Colab notebook with your code.
  2. I use the tokenizer with max_length=64, which is typically my “best practice” choice. Note that if I set max_length=None, everything runs successfully.

max_length = 64
encodings = tokenizer(texts, is_split_into_words=True, max_length=max_length, return_offsets_mapping=True, padding=True, truncation=True)

  3. When I run encode_tags() on the W-NUT data, I get a ValueError:
labels = encode_tags(tags, encodings)
     11         # set labels whose first offset position is 0 and the second is not 0
---> 12         doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
     13         encoded_labels.append(doc_enc_labels.tolist())
     14 

ValueError: NumPy boolean array indexing assignment cannot assign 29 input values to the 24 output values where the mask is true
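
For reference, encode_tags() is the helper from that tutorial; reconstructed from the guide (tag2id is the tag-to-index mapping built earlier on the same page), it looks roughly like this:

import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an array of -100 (ignored by the loss) with one entry per encoded token
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels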

Expected behavior

I expect encode_tags() to return the correct IOB tag labels when I run your tokenizer with max_length=64.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 15
  • Comments: 16 (3 by maintainers)

Most upvoted comments

I solved the issue by replacing

doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
encoded_labels.append(doc_enc_labels.tolist())

with

mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
encoded_labels.append(doc_enc_labels.tolist())

This way, only the first np.sum(mask) entries of doc_labels are assigned wherever the mask is true, so the indexing mismatch can no longer occur. I am a newbie 🤗 Transformers user, and I wonder whether this solution may cause any problems.
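
For completeness, here is a sketch of the tutorial helper with that workaround dropped in. Note that any tags belonging to truncated tokens are silently discarded, which is exactly the length mismatch explained further down in this thread:

import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # only assign as many labels as there are surviving first-subword positions;
        # labels for truncated words are dropped
        mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels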

For me, the error occurred when using the example code in combination with a SentencePiece tokenizer (e.g. XLM-RoBERTa). Switching to the updated code used in the run_ner.py script (https://github.com/huggingface/transformers/blob/ad072e852816cd32547504c2eb018995550b126a/examples/token-classification/run_ner.py) solved the issue for me.
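
The newer script aligns labels through the tokenizer's word IDs rather than through character offsets. A rough sketch of that idea (not a verbatim copy of run_ner.py; it assumes a fast tokenizer so encodings.word_ids() is available, and it reuses the tutorial's tag2id mapping):

def encode_tags_via_word_ids(tags, encodings, tag2id):
    encoded_labels = []
    for i, doc_tags in enumerate(tags):
        word_ids = encodings.word_ids(batch_index=i)
        doc_enc_labels = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                # special tokens and padding are ignored by the loss
                doc_enc_labels.append(-100)
            elif word_id != previous_word_id:
                # label only the first sub-token of each word
                doc_enc_labels.append(tag2id[doc_tags[word_id]])
            else:
                doc_enc_labels.append(-100)
            previous_word_id = word_id
        encoded_labels.append(doc_enc_labels)
    return encoded_labels

Because the alignment walks over the already-truncated encoding, labels for truncated words are simply never emitted, so the length mismatch cannot occur.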

I am also facing this issue. I am using a custom dataset and haven’t passed any max_length argument to the tokenizer.

Any idea how to fix this? The same piece of code works fine on the W-NUT dataset.

I figured out the problem. A typical input instance has N tokens and N NER tags with a one-to-one correspondence. When you pass the sentence to the tokenizer, it adds k more tokens for either (1) subword tokens (e.g. ##ing) or (2) special model-specific tokens (e.g. [CLS] or [SEP]). So now you have N+k tokens but still only N NER tags.

If you apply a max-length truncation (e.g. 64), those N+k tokens get cut down to 64, leaving an unpredictable mix of valid tokens and special tokens, because tokens of both kinds may have been dropped. However, there are still N NER tags, which may no longer match up against the valid tokens because some of those tokens were truncated away.
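
A minimal sketch that makes the mismatch visible (assumes a fast DistilBERT tokenizer; the sentence is just an illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

words = ["Swimming", "in", "Llanfairpwllgwyngyll", "is", "refreshing"]  # N = 5 words, so 5 NER tags
enc = tokenizer([words], is_split_into_words=True, return_offsets_mapping=True)

# N + k encoded positions: subword pieces plus special tokens like [CLS]/[SEP]
print(len(words), len(enc["input_ids"][0]))

# With truncation=True and a small max_length, some of the N "first subword"
# positions are cut off, so the boolean mask has fewer True entries than there
# are tags, hence the ValueError.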

I fixed the problem with either of the following approaches:

  1. Removing data instances that are problematically long. For example, I removed sentences with more than 45 tokens; using Pandas really helps here (see the sketch after this list).
  2. Increasing the truncation length to, say, 128, or any number longer than the largest N+k. However, this increase forces me to reduce my batch size due to GPU memory constraints.
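
A sketch of the filtering idea in approach 1 (assumes texts and tags are parallel lists of word lists and tag lists, as in the tutorial; the 45-token cutoff is the one mentioned above):

import pandas as pd

df = pd.DataFrame({"words": texts, "tags": tags})
df["n_words"] = df["words"].apply(len)

# drop instances that would blow past the truncation length once subwords
# and special tokens are added
keep = df[df["n_words"] <= 45]
texts = keep["words"].tolist()
tags = keep["tags"].tolist()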