transformers: Custom train/validation file not supported in run_qa.py

Environment info

  • transformers version: 4.0.1
  • Platform: Linux-5.4.0-58-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.7.1+cu110 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes

I am trying to pass a custom dataset, or a modified SQuAD dataset (in valid SQuAD format), using the parameters
--train_file = train-v1.1.json
--validation_file = dev-v1.1.json
but it does not work for me.

From the official documentation, https://github.com/huggingface/transformers/tree/master/examples/question-answering, this script runs fine:

python run_qa.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

But if I use the script below:

python run_qa.py \
  --model_name_or_path bert-base-uncased \
  --train_file = train-v1.1.json \
  --validation_file = dev-v1.1.json \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /data1/debug_squad1/

with train-v1.1.json / dev-v1.1.json (and likewise with train.csv / dev.csv) as the data files, I get this error:

2020-12-31 12:00:59.821145: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2020-12-31 12:00:59.821182: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "run_qa.py", line 469, in <module>
    main()
  File "run_qa.py", line 159, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/media/data2/anaconda/envs/bertQA-env/lib/python3.8/site-packages/transformers/hf_argparser.py", line 135, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 16, in __init__
  File "run_qa.py", line 142, in __post_init__
    assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
AssertionError: `train_file` should be a csv or a json file.

train_file and validation_file are valid parameters in run_qa.py. Can someone please help with how to train on a specific dataset?
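
The assertion most likely fires because of the spaces around =: argparse assigns the literal string "=" to --train_file, so the extension check sees "=" rather than "json". Passing --train_file train-v1.1.json (or --train_file=train-v1.1.json with no spaces) avoids it. A minimal sketch of the behaviour, not the actual run_qa.py code:

import argparse

# Simulate how HfArgumentParser (built on argparse) sees the spaced arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--train_file")
args, unknown = parser.parse_known_args(["--train_file", "=", "train-v1.1.json"])

print(args.train_file)                 # '=' -- the literal equals sign became the value
print(args.train_file.split(".")[-1])  # '=' -- not 'csv' or 'json', so the assertion fails
print(unknown)                         # ['train-v1.1.json'] is left over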

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Removing the spaces worked for me, though I'm still not able to run the script. I'm getting:

Traceback (most recent call last):
  File "run_qa.py", line 469, in <module>
    main()
  File "run_qa.py", line 252, in main
    answer_column_name = "answers" if "answers" in column_names else column_names[2]
IndexError: list index out of range

Note: I am using the official training and dev JSON files to run the script. Please see if someone can help. @patrickvonplaten / @stas00 / @vasudevgupta7
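
The IndexError most likely means the flat columns run_qa.py expects are missing: when the official nested SQuAD JSON is loaded with field="data", the resulting columns are only "title" and "paragraphs", so there is no "answers" column and column_names[2] is out of range. Below is a sketch of flattening the official files into a {"data": [...]} layout with per-example question/context/answers fields; the helper name and output file names are illustrative, not part of the repo:

import json

def flatten_squad(in_path, out_path):
    """Flatten nested SQuAD JSON into {"data": [flat examples]} for run_qa.py."""
    with open(in_path) as f:
        articles = json.load(f)["data"]
    examples = []
    for article in articles:
        title = article.get("title", "")
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                examples.append({
                    "id": qa["id"],
                    "title": title,
                    "context": context,
                    "question": qa["question"],
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    with open(out_path, "w") as f:
        json.dump({"data": examples}, f)

flatten_squad("train-v1.1.json", "train_flat.json")
flatten_squad("dev-v1.1.json", "dev_flat.json")

The flattened files can then be passed as --train_file train_flat.json --validation_file dev_flat.json.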

After using a modified squad.py and converting the data to JSON, it loads the data without error, but when training starts I get the following error message. @gowtham1997

[INFO|trainer.py:837] 2021-03-04 01:19:16,915 >> ***** Running training *****
[INFO|trainer.py:838] 2021-03-04 01:19:16,915 >>   Num examples = 14842
[INFO|trainer.py:839] 2021-03-04 01:19:16,916 >>   Num Epochs = 5
[INFO|trainer.py:840] 2021-03-04 01:19:16,916 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:841] 2021-03-04 01:19:16,916 >>   Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:842] 2021-03-04 01:19:16,916 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-03-04 01:19:16,916 >>   Total optimization steps = 1550

  0%|          | 0/1550 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/okyanus/users/ctantug/transformers/examples/question-answering/run_qa.py", line 507, in <module>
    main()
  File "/okyanus/users/ctantug/transformers/examples/question-answering/run_qa.py", line 481, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1304, in training_step
    loss = self.compute_loss(model, inputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1334, in compute_loss
    outputs = model(**inputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1793, in forward
    start_logits, end_logits = logits.split(1, dim=-1)
ValueError: too many values to unpack (expected 2)

Solved it. It turns out I had to change the config file to have only two labels (one for the start position and one for the end position of the answer).
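
For context, the question-answering head splits its logits into exactly two tensors, one for the answer start and one for the answer end, so the config needs num_labels = 2. A minimal sketch of the fix described above, with the model name assumed:

from transformers import AutoConfig, AutoModelForQuestionAnswering

# num_labels=2: one logit column for the answer start, one for the answer end.
config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=2)
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased", config=config)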

@sgugger can you lend a hand here?

I have run into the same problem but with a different error:

Traceback (most recent call last):
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/builder.py", line 434, in incomplete_dir
    yield tmp_dir
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/builder.py", line 476, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/builder.py", line 553, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/builder.py", line 897, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/home/abashir/.cache/huggingface/modules/datasets_modules/datasets/json/fb88b12bd94767cb0cc7eedcd82ea1f402d2162addc03a37e81d4f8dc7313ad9/json.py", line 75, in _generate_tables
    parse_options=self.config.pa_parse_options,
  File "pyarrow/_json.pyx", line 247, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/GW/Health-Corpus/work/UMLS/transformers/examples/question-answering/run_qa.py", line 495, in <module>
    main()
  File "/GW/Health-Corpus/work/UMLS/transformers/examples/question-answering/run_qa.py", line 222, in main
    datasets = load_dataset(extension, data_files=data_files, field="data")
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/load.py", line 611, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/builder.py", line 483, in download_and_prepare
    self._save_info()
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/site-packages/datasets/builder.py", line 440, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/shutil.py", line 498, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/abashir/anaconda3/envs/mpi/lib/python3.7/shutil.py", line 496, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/home/abashir/.cache/huggingface/datasets/json/default-43dfe5d134316dba/0.0.0/fb88b12bd94767cb0cc7eedcd82ea1f402d2162addc03a37e81d4f8dc7313ad9.incomplete'

When I tried the above fixes, altering the datasets line to load the modified squad.py script, I ran into:

30a174f57e692deb3b377336683/squad.py", line 106, in _split_generators
    datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": downloaded_files["dev"]}),
KeyError: 'dev'
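
The ArrowInvalid "straddling object" error usually means pyarrow's JSON reader hit a single JSON object larger than its read block, which is easy to do with the big nested SQuAD files; a flattened file as sketched earlier, or a newer datasets release, typically avoids it. The final KeyError: 'dev' likely just means the modified squad.py still asks its download manager for a "dev" entry that the local files don't define. Before launching run_qa.py, a quick sanity check that the converted files load and expose the expected columns (the file names are illustrative):

from datasets import load_dataset

datasets = load_dataset(
    "json",
    data_files={"train": "train_flat.json", "validation": "dev_flat.json"},
    field="data",  # run_qa.py passes field="data" when loading local files
)
print(datasets["train"].column_names)
# expected: ['id', 'title', 'context', 'question', 'answers']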