transformers: Custom train/validation file not supported in run_qa.py
Environment info
- transformers version: 4.0.1
- Platform: Linux-5.4.0-58-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1+cu110 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
I am trying to pass a custom dataset, or a modified SQuAD dataset (in valid SQuAD format), using the parameters
--train_file = train-v1.1.json
--validation_file = dev-v1.1.json
but it does not work for me.
From the official documentation (https://github.com/huggingface/transformers/tree/master/examples/question-answering), this script runs fine:
python run_qa.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
but if I use the below script:
python run_qa.py \
--model_name_or_path bert-base-uncased \
--train_file = train-v1.1.json \
--validation_file = dev-v1.1.json \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /data1/debug_squad1/
With either train-v1.1.json / dev-v1.1.json or train.csv / dev.csv as the data, I get this error:
2020-12-31 12:00:59.821145: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2020-12-31 12:00:59.821182: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "run_qa.py", line 469, in <module>
    main()
  File "run_qa.py", line 159, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/media/data2/anaconda/envs/bertQA-env/lib/python3.8/site-packages/transformers/hf_argparser.py", line 135, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 16, in __init__
  File "run_qa.py", line 142, in __post_init__
    assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
AssertionError: `train_file` should be a csv or a json file.
train_file and validation_file are valid parameters in run_qa.py. Can someone please help with how to train on a specific dataset?
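A likely cause of the assertion above is the spaces around `=`: the shell splits `--train_file = train-v1.1.json` into three separate tokens, so the option receives the literal string `=` as its value. The snippet below is a minimal sketch of that behaviour using plain argparse. It is not the code from run_qa.py; the one-option parser, the `parse` helper, and the `has_supported_extension` helper are hypothetical stand-ins, and the extension check assumes the script takes the text after the last dot of the path.

```python
import argparse

# Hypothetical stand-in for the script's parser: only one option is modelled.
parser = argparse.ArgumentParser()
parser.add_argument("--train_file", type=str, default=None)


def parse(argv):
    """Return (train_file, leftover_tokens) as argparse sees them."""
    namespace, remaining = parser.parse_known_args(argv)
    return namespace.train_file, remaining


def has_supported_extension(path):
    # Assumption: the extension is whatever follows the last dot in the path,
    # and it must be csv or json, as in the assertion quoted in the traceback.
    return path.split(".")[-1] in ["csv", "json"]


# `--train_file = train-v1.1.json` reaches Python as three tokens.
spaced = parse(["--train_file", "=", "train-v1.1.json"])
# `--train_file train-v1.1.json` reaches Python as two tokens.
plain = parse(["--train_file", "train-v1.1.json"])
# `--train_file=train-v1.1.json` (no spaces) reaches Python as one token.
joined = parse(["--train_file=train-v1.1.json"])

for label, (value, leftover) in [("spaced", spaced), ("plain", plain), ("joined", joined)]:
    print(label, repr(value), leftover, has_supported_extension(value))
```

In the spaced form the file name is left over as an unconsumed token and the option value `=` has no csv or json extension. Writing the arguments as `--train_file train-v1.1.json` and `--validation_file dev-v1.1.json`, or with `=` and no surrounding spaces, avoids this.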
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (2 by maintainers)
Removing the spaces worked for me, though I'm still not able to run that script. I'm getting:
Note: I am using the official training and dev JSON files to run the script. Please see if someone can help. @patrickvonplaten / @stas00 / @vasudevgupta7
Solved it. Turns out I have to change the config file to have only two labels (one for the first sentence and one for the second).
@sgugger can you lend a hand here?
I have run into the same problem, but with a different error.
When I tried the above fixes, altering the datasets line to load the altered squad.py script, I ran into