transformers: Error with run_seq2seq_qa.py official script (pyarrow.lib.ArrowInvalid: Column 4 named labels expected length 1007 but got length 1000)

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.12
  • PyTorch version (GPU?): 1.10.0+cu111 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

pinging @sgugger and @patil-suraj

Information

Model I am using (Bert, XLNet …): T5-base

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: SQuAD
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior: Here is a notebook to reproduce the issue: REMOVED

I have been able to reproduce this issue on Google Colab as well as my local machine.

  1. Clone the transformers repo and run the run_seq2seq_qa.py script with the following command:

    python examples/pytorch/question-answering/run_seq2seq_qa.py \
      --model_name_or_path t5-small \
      --dataset_name squad_v2 \
      --context_column context \
      --question_column question \
      --answer_column answers \
      --do_train \
      --do_eval \
      --per_device_train_batch_size 12 \
      --learning_rate 3e-5 \
      --num_train_epochs 2 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir /tmp/debug_seq2seq_squad/

The issue arises during the dataset preprocessing step. Specifically, I get the following error while the preprocessed dataset is being cached:

Running tokenizer on train dataset:   0% 0/131 [00:00<?, ?ba/s]01/28/2022 18:26:06 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-61cf0f14d28995e4.arrow
Running tokenizer on train dataset: 100% 131/131 [01:12<00:00,  1.81ba/s]
Running tokenizer on validation dataset:   0% 0/12 [00:00<?, ?ba/s]01/28/2022 18:27:18 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-d54ee58948263e43.arrow
Running tokenizer on validation dataset:   0% 0/12 [00:08<?, ?ba/s]
Traceback (most recent call last):
  File "examples/pytorch/question-answering/run_seq2seq_qa.py", line 678, in <module>
    main()
  File "examples/pytorch/question-answering/run_seq2seq_qa.py", line 522, in main
    desc="Running tokenizer on validation dataset",
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2125, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 519, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 486, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 413, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2503, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 500, in write_batch
    pa_table = pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 1532, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1181, in pyarrow.lib.Table.validate
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 4 named labels expected length 1007 but got length 1000

Expected behavior

The preprocessing step uses the map function, which runs over the entire dataset, but the error arises while the preprocessed dataset is being cached. I found that map processes the data in batches of 1000 samples, so I suspect that in one of the batches the columns end up with different lengths, which leads to this error.
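
For illustration only (not part of the original report), here is a minimal sketch of how the length mismatch can arise: with return_overflowing_tokens=True the tokenizer emits one row per feature, while the labels are built once per example, so the columns handed to pyarrow have different lengths. The dummy context and the exact counts below are assumptions.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    # One example whose question + context is far longer than max_seq_length.
    inputs = ["question: What repeats?  context: " + "word " * 2000]
    targets = ["an answer"]

    model_inputs = tokenizer(
        inputs,
        max_length=384,
        truncation=True,
        return_overflowing_tokens=True,
    )
    labels = tokenizer(targets, max_length=30, truncation=True)

    # More feature rows than label rows (e.g. 6 vs 1); datasets then fails to
    # build an Arrow table because the columns have unequal lengths.
    print(len(model_inputs["input_ids"]), len(labels["input_ids"]))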

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 9
  • Comments: 40 (4 by maintainers)

Most upvoted comments

Hi, that happens because the tokenizer returns multiple features when question + context is longer than max_seq_length. Commenting out return_overflowing_tokens and the related lines works around this problem.

    # Validation preprocessing
    def preprocess_validation_function(examples):
        inputs, targets = preprocess_squad_batch(examples, question_column, context_column, answer_column)
    
        model_inputs = tokenizer(
            inputs,
            max_length=max_seq_length,
            padding=padding,
            truncation=True,
            return_offsets_mapping=True,
            # return_overflowing_tokens=True,
        )
        
        # Setup the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=max_answer_length, padding=padding, truncation=True)

        # Since one example might give us several features if it has a long context, we need a map from a feature to
        # its corresponding example. This key gives us just that.
        # sample_mapping = model_inputs.pop("overflow_to_sample_mapping")
        sample_mapping = list(range(len(model_inputs["input_ids"])))

        # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the
        # corresponding example_id and we will store the offset mappings.
        model_inputs["example_id"] = []

        for i in range(len(model_inputs["input_ids"])):
            # One example can give several spans, this is the index of the example containing this span of text.
            sample_index = sample_mapping[i]
            model_inputs["example_id"].append(examples["id"][sample_index])

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length" and data_args.ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]

        return model_inputs

You may also need to adjust the listed command with the following arguments:

--answer_column answers \
--eval_accumulation_steps 1 \
--predict_with_generate \
--version_2_with_negative \

@patil-suraj Thank you for this amazing effort to take care of seq2seq models and their applications. It would be great if this issue got fixed.

Thank you for reporting the issue, I can reproduce it. Looking into it.

Hi @patil-suraj, Thanks for looking into this. Is there any update?

I think a fix for this issue is crucial for the seq2seq QA task, because I observe that seq2seq models are more prone to overfitting, so evaluating the model during training would definitely help us find the best checkpoint, which may lie somewhere in the middle of training.

Hi, I am facing the same issue. Is there any update?

+1 @sgugger @patil-suraj @LysandreJik Would be great if this got fixed! Thanks 😃

Hi @sgugger @patil-suraj @LysandreJik ! is there any update on that issue? Thanks in advance 😃

Hi @patil-suraj , Thanks for looking into the issue. Is there any update?

This is a temporary solution until @patil-suraj fixes the issue. It is taken from the T5 SQuAD colab by @patil-suraj: https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=H8AbD1B7TR0k. You can also use that notebook to train a seq2seq model on TPU with PyTorch XLA; I use it to fine-tune BART on SQuAD and it works like a charm. However, if you want to use the training script from the huggingface transformers library, first train your model with run_seq2seq_qa.py using the do_train flag only and save the model to the "out" dir. Then create a file called eval.py and copy this code:

from __future__ import print_function
from collections import Counter
import string
import re
import argparse
import json
import sys
import torch
from tqdm import tqdm
import datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("out")
# format the examples as input/target text and append the eos token at the end
def add_eos_to_examples(example):
    example['input_text'] = 'question: %s  context: %s </s>' % (example['question'], example['context'])
    example['target_text'] = '%s </s>' % example['answers']['text'][0]
    return example

# tokenize the examples
def convert_to_features(example_batch):
    input_encodings = tokenizer.batch_encode_plus(example_batch['input_text'], pad_to_max_length=True, max_length=512)
    target_encodings = tokenizer.batch_encode_plus(example_batch['target_text'], pad_to_max_length=True, max_length=16)

    encodings = {
        'input_ids': input_encodings['input_ids'], 
        'attention_mask': input_encodings['attention_mask'],
        'target_ids': target_encodings['input_ids'],
        'target_attention_mask': target_encodings['attention_mask']
    }

    return encodings
from datasets import load_dataset
# load train and validation split of squad
train_dataset  = load_dataset("squad", split="train")
valid_dataset = load_dataset("squad", split="validation")

# map add_eos_to_examples function to the dataset example wise 
train_dataset = train_dataset.map(add_eos_to_examples)
# map convert_to_features batch wise
train_dataset = train_dataset.map(convert_to_features, batched=True)

valid_dataset = valid_dataset.map(add_eos_to_examples, load_from_cache_file=False)
valid_dataset = valid_dataset.map(convert_to_features, batched=True, load_from_cache_file=False)


# set the tensor type and the columns which the dataset should return
columns = ['input_ids', 'target_ids', 'attention_mask', 'target_attention_mask']
train_dataset.set_format(type='torch', columns=columns)
valid_dataset.set_format(type='torch', columns=columns)

torch.save(train_dataset, 'train_data.pt')
torch.save(valid_dataset, 'valid_data.pt')
valid_dataset = torch.load('valid_data.pt')
dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=32)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("out").to('cuda')
tokenizer = AutoTokenizer.from_pretrained("out")
def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)

        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def evaluate(gold_answers, predictions):
    f1 = exact_match = total = 0

    for ground_truths, prediction in zip(gold_answers, predictions):
      total += 1
      exact_match += metric_max_over_ground_truths(
                    exact_match_score, prediction, ground_truths)
      f1 += metric_max_over_ground_truths(
          f1_score, prediction, ground_truths)
    
    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total

    return {'exact_match': exact_match, 'f1': f1}

answers = []

for batch in tqdm(dataloader):
  outs = model.generate(input_ids=batch['input_ids'].to('cuda'), 
                        attention_mask=batch['attention_mask'].to('cuda'),
                        max_length=32,
                        early_stopping=True)
  outs = [tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]
  answers.extend(outs)
predictions = []
references = []
for ref, pred in zip(valid_dataset["answers"], answers):
  predictions.append(pred)
  references.append(ref['text'])
print(evaluate(references, predictions))

Thank you for taking a look @patil-suraj. Also curious if you found a fix for this.

Regarding the fix quoted above (commenting out return_overflowing_tokens), I would stress that both of the following updates have to be made:

model_inputs = tokenizer(
    inputs,
    max_length=max_seq_length,
    padding=padding,
    truncation=True,
    return_offsets_mapping=True,
    #### HERE - comment out this line
    # return_overflowing_tokens=True,
    ####
)

And:

        #### HERE replace the first line with the second
        # sample_mapping = model_inputs.pop("overflow_to_sample_mapping")
        sample_mapping = list(range(len(model_inputs["input_ids"])))
        ####

I initially applied only the second update and got an error, so both changes are needed.

@anas-awadalla T5 does not have a hard limit on sequence length. As long as you are not restricted by GPU memory or compute speed, you can feed in sequences longer than 512 as input.
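
As a rough illustration of that point (my own sketch, not from the thread): T5 uses relative position embeddings, so the tokenizer's 512 model_max_length is only a convention, and feeding a longer input works as long as memory allows. The 1024 max_length and the dummy input below are arbitrary choices.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # The tokenizer warns about exceeding 512 tokens, but generation still runs.
    long_input = "question: What repeats?  context: " + "word " * 1500
    enc = tokenizer(long_input, max_length=1024, truncation=True, return_tensors="pt")

    with torch.no_grad():
        out = model.generate(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"], max_length=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))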

I totally agree with others that this code needs to be fixed. I notice that seq2seq models (e.g. BART and T5) overfit more easily with longer fine-tuning than other Transformer models (e.g. ELECTRA, BERT, and ALBERT). That means the best epoch or checkpoint may lie somewhere in the middle of training, and to catch it we need evaluation inside this fine-tuning script at the end of each epoch or every x steps.

I have also noticed that this code works very well on TPU with XLA, but I am wondering what --per_device_train_batch_size represents in this case: is it the total batch size across all cores of a TPUv3-8, or per core? With other PyTorch scripts it is per core, so I am assuming it is the same here?

Hi. Suppose, for example, a test example is split into 3 instances and the model makes predictions for each of them. You then need to merge the predictions from these 3 instances and rank them by probability. You can refer to the code here.
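
A minimal sketch of what such merging could look like (my own illustration, not the code the comment refers to; the argument names are assumptions): group the per-feature predictions by example_id and keep the highest-scoring candidate.

    def merge_predictions(example_ids, predictions, scores):
        """example_ids: one id per feature; predictions: decoded answer strings,
        one per feature; scores: e.g. sequence scores returned by generate()."""
        best = {}
        for ex_id, pred, score in zip(example_ids, predictions, scores):
            # Keep only the best-scoring prediction for each original example.
            if ex_id not in best or score > best[ex_id][1]:
                best[ex_id] = (pred, score)
        return {ex_id: pred for ex_id, (pred, _) in best.items()}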

Sure, this works in the sense that it gets the code to run, but the evaluation results might then be lower than the true performance of the model, because you may be cutting out parts of the context that contain the answer, so some samples become unanswerable.
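
For completeness, here is a hedged sketch (my own adaptation, not code from the thread or the repository) of a workaround that keeps the overflowing features instead of truncating them: reuse the quoted preprocess_validation_function but repeat each example's labels once per feature via overflow_to_sample_mapping, so every column ends up with the same number of rows. It assumes the same surrounding names (preprocess_squad_batch, max_seq_length, padding, max_answer_length, data_args) as the script.

    def preprocess_validation_function(examples):
        inputs, targets = preprocess_squad_batch(examples, question_column, context_column, answer_column)

        model_inputs = tokenizer(
            inputs,
            max_length=max_seq_length,
            padding=padding,
            truncation=True,
            return_offsets_mapping=True,
            return_overflowing_tokens=True,
        )

        # Tokenize the targets once per example.
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=max_answer_length, padding=padding, truncation=True)

        # One row per feature, mapped back to the example it came from.
        sample_mapping = model_inputs.pop("overflow_to_sample_mapping")
        model_inputs["example_id"] = [examples["id"][i] for i in sample_mapping]

        if padding == "max_length" and data_args.ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        # Repeat each example's labels once per feature so the "labels" column
        # has the same length as "input_ids" and the other per-feature columns.
        model_inputs["labels"] = [labels["input_ids"][i] for i in sample_mapping]

        return model_inputs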

This error occurs because some columns do not have the same number of examples as the other columns.

Thanks @salrowili, this seems like a good temporary fix! I would still love to see the script itself fixed, as it provides a much smoother development experience.