transformers: Error with run_seq2seq_qa.py official script (pyarrow.lib.ArrowInvalid: Column 4 named labels expected length 1007 but got length 1000)
Environment info
- `transformers` version: 4.17.0.dev0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.12
- PyTorch version (GPU?): 1.10.0+cu111 (True)
- Tensorflow version (GPU?): 2.7.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
pinging @sgugger and @patil-suraj
Information
Model I am using (Bert, XLNet …): T5-base
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQuAD task: SQuAD
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior: Here is a notebook to reproduce the issue: REMOVED
I have been able to reproduce this issue on Google Colab as well as my local machine.
- Clone the transformers repo and run the run_seq2seq_qa.py script with the following command:
python examples/pytorch/question-answering/run_seq2seq_qa.py \
  --model_name_or_path t5-small \
  --dataset_name squad_v2 \
  --context_column context \
  --question_column question \
  --answer_column answers \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_seq2seq_squad/
The issue arises during the dataset preprocessing step. Specifically, I get the following error while the preprocessed validation set is being cached:
Running tokenizer on train dataset: 0% 0/131 [00:00<?, ?ba/s]01/28/2022 18:26:06 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-61cf0f14d28995e4.arrow
Running tokenizer on train dataset: 100% 131/131 [01:12<00:00, 1.81ba/s]
Running tokenizer on validation dataset: 0% 0/12 [00:00<?, ?ba/s]01/28/2022 18:27:18 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-d54ee58948263e43.arrow
Running tokenizer on validation dataset: 0% 0/12 [00:08<?, ?ba/s]
Traceback (most recent call last):
File "examples/pytorch/question-answering/run_seq2seq_qa.py", line 678, in <module>
main()
File "examples/pytorch/question-answering/run_seq2seq_qa.py", line 522, in main
desc="Running tokenizer on validation dataset",
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2125, in map
desc=desc,
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 519, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 486, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 413, in wrapper
out = func(self, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2503, in _map_single
writer.write_batch(batch)
File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 500, in write_batch
pa_table = pa.Table.from_arrays(arrays, schema=schema)
File "pyarrow/table.pxi", line 1532, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1181, in pyarrow.lib.Table.validate
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Column 4 named labels expected length 1007 but got length 1000
Expected behavior
The preprocessing step uses the `map` function, which appears to run through the examples, but the error arises while the preprocessed data is being written to the cache file. Since `map` processes the data in batches of 1000 samples, I suspect that one of the batches ends up with columns of different lengths, which leads to this error.
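As a sanity check on that hypothesis, here is a minimal, self-contained repro (my own sketch, not taken from the script): if a batched `datasets.map` function returns columns of different lengths for the same batch, the Arrow writer fails with exactly this kind of error.

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(5)]})

def bad_preprocess(batch):
    # The tokenized inputs gain extra rows (as overflowing windows would),
    # while "labels" keeps one row per original example.
    n = len(batch["text"])
    return {
        "input_ids": [[0]] * (n + 2),  # 7 rows
        "labels": [[0]] * n,           # 5 rows
    }

# Fails with an ArrowInvalid complaining that the "labels" column has a
# different length than the other columns in the same batch.
ds.map(bad_preprocess, batched=True, remove_columns=["text"])
```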
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 40 (4 by maintainers)
Hi, that happens because the tokenizer returns multiple instances when the question + context is longer than max_seq_length. Commenting out return_overflowing_tokens and the related lines works around the problem.
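A small sketch of the mechanism described above (illustrative only; the exact preprocessing call in the script differs): with `return_overflowing_tokens`, one long example is expanded into several input features, while the answer is still tokenized into a single `labels` row, so the column lengths diverge.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

question = ["What is the capital of France?"]
context = ["Paris is the capital and largest city of France. " * 200]
answer = ["Paris"]

inputs = tokenizer(
    question,
    context,
    max_length=384,
    truncation="only_second",
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
)
labels = tokenizer(answer, max_length=30, truncation=True, padding="max_length")

# One original example, but several overflowing input features vs. a single
# labels row -- exactly the length mismatch the Arrow writer complains about.
print(len(inputs["input_ids"]), len(labels["input_ids"]))
```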
You may also need to adjust the listed command with the following arguments:
@patil-suraj Thank you for this amazing effort to take care of seq2seq models and their applications. It would be great if this issue got fixed.
Thank you for reporting the issue, I can reproduce it. Looking into it.
Hi @patil-suraj, Thanks for looking into this. Is there any update?
I think a fix for this issue is crucial for the seq2seq QA task, because I observe that seq2seq models are more likely to overfit, so evaluating the model during training would definitely help us find the best checkpoint, which may reside somewhere in the middle of training.
Hi, I am facing the same issue. Is there any update?
+1 @sgugger @patil-suraj @LysandreJik Would be great if this got fixed! Thanks 😃
Hi @sgugger @patil-suraj @LysandreJik ! is there any update on that issue? Thanks in advance 😃
Hi @patil-suraj , Thanks for looking into the issue. Is there any update?
This is a temporary solution until @patil-suraj fixes the issue. It is taken from the T5 SQuAD colab by @patil-suraj https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=H8AbD1B7TR0k . You can also use that notebook to train a seq2seq model on TPU with PyTorch XLA; I use it to fine-tune BART on SQuAD and it works like a charm. However, if you want to use the training script from the Hugging Face Transformers library, first train your model with run_seq2seq_qa.py using the do_train flag only and save the model in the "out" dir. Then create a file called eval.py and copy this code:
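A minimal sketch of what such an eval.py could look like (my own illustration, not the exact code from the colab): it loads the checkpoint saved by the --do_train run, generates answers for the SQuAD v2 validation set, and scores them with the squad_v2 metric. The "out" directory, the "question: ... context: ..." prompt format, and the generation settings are assumptions.

```python
import torch
from datasets import load_dataset, load_metric
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "out"  # directory written by the --do_train-only run (assumption)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()

dataset = load_dataset("squad_v2", split="validation")
metric = load_metric("squad_v2")

predictions, references = [], []
# One example at a time for clarity; batch the generation for speed.
for example in dataset:
    prompt = f"question: {example['question']} context: {example['context']}"
    inputs = tokenizer(prompt, max_length=384, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=30, num_beams=4)
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    predictions.append(
        {"id": example["id"], "prediction_text": answer, "no_answer_probability": 0.0}
    )
    references.append({"id": example["id"], "answers": example["answers"]})

print(metric.compute(predictions=predictions, references=references))
```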
Thank you for taking a look @patil-suraj. Also curious if you found a fix for this.
I would stress that we have to do the following updates:
And:
I have applied only the second update and received an error.
@anas-awadalla T5 does not have a hard limit on sequence length. As long as you are not restricted by GPU memory or compute speed, you can feed in sequences longer than 512 as input.
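A quick way to see this (illustrative sketch): T5's relative position buckets aren't tied to a fixed maximum, so an input well past 512 tokens still runs, memory permitting. The tokenizer may warn that the sequence exceeds its model_max_length, but that warning is only a heuristic.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Build an input well beyond 512 tokens and run it without truncation.
text = "question: who scored last? context: " + "The home team scored again. " * 150
inputs = tokenizer(text, return_tensors="pt", truncation=False)
print(inputs["input_ids"].shape)  # comfortably longer than 512 tokens

with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```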
I totally agree with the others that this code needs to be fixed. I notice that seq2seq models (e.g. BART and T5) are more prone to overfitting with longer fine-tuning than other Transformer models (e.g. ELECTRA, BERT, and ALBERT). That means the best epoch or checkpoint may reside in the middle, and to catch that checkpoint we need evaluation code inside this fine-tuning script that evaluates the model at the end of each epoch or every x steps.
I have also noticed that this code works very well on TPU with XLA, but I am wondering what --per_device_train_batch_size represents in this case. Is it the total batch size across all cores of a TPUv3-8, or the per-core batch size? With other PyTorch scripts it is per core, so I am assuming it is the same here?
Hi, suppose a test example is split into 3 instances and the model makes predictions for all 3. You then need to merge the predictions from these 3 instances and rank them by probability. You can refer to the code here.
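For illustration, a minimal sketch of that merging step (my own, not the linked code): group the per-window predictions by example id and keep the highest-scoring answer.

```python
def merge_window_predictions(window_preds):
    """window_preds: one dict per window, e.g.
    {"example_id": str, "answer": str, "score": float}."""
    best = {}
    for pred in window_preds:
        current = best.get(pred["example_id"])
        if current is None or pred["score"] > current["score"]:
            best[pred["example_id"]] = pred
    return {ex_id: p["answer"] for ex_id, p in best.items()}

# Example: three windows produced for the same question.
windows = [
    {"example_id": "q1", "answer": "", "score": -3.2},
    {"example_id": "q1", "answer": "Denver Broncos", "score": -0.7},
    {"example_id": "q1", "answer": "Broncos", "score": -1.4},
]
print(merge_window_predictions(windows))  # {'q1': 'Denver Broncos'}
```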
Sure, this works in the sense that it gets the code to run, but the evaluation results might then be lower than the model's true performance, because you may be cutting out parts of the context that do contain the answer, and some samples may become unanswerable.
This error occurs because some columns do not have the same number of examples as the other columns.
Thanks @salrowili, this seems like a good temporary fix! I would still love to see the script itself fixed, since that provides a much smoother development experience.