transformers: [example scripts] inconsistency around eval vs val

  • val == validation set (split)
  • eval == evaluation (mode)

These two are orthogonal to each other - one is a split, the other is a model's run mode.

The Trainer args and the example scripts are inconsistent about when it's val and when it's eval in variable names and metric keys.

Examples:

  • eval_dataset but --validation_file
  • eval_* metric keys for the validation dataset - yet prediction then reports test_* metric keys?
  • data_args.max_val_samples vs eval_dataset on the same line (see the sketch below)
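
A minimal sketch of how the clash reads inside a script (illustrative names, not the actual example-script code):

import argparse

# The CLI side uses split-style names...
parser = argparse.ArgumentParser()
parser.add_argument("--validation_file")
parser.add_argument("--max_val_samples", type=int)

# ...while the Trainer side uses mode-style names: the same data ends up in
# eval_dataset, Trainer.evaluate() returns keys prefixed with eval_,
# and Trainer.predict() reports its metrics with test_* keys.
metrics = {"eval_loss": 0.7, "eval_accuracy": 0.9, "test_accuracy": 0.88}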

The three parallels:

  • train is easy - it’s both the process and the split
  • prediction is almost never used in the scripts - it's all test in variable names, metrics, and command-line args
  • eval vs val vs validation is very inconsistent. When writing tests, I'm never sure whether to look up an eval_* or a val_* key. And one could run evaluation on the test dataset.

Perhaps asking a question would help, and then a consistent answer becomes obvious:

Are metrics reporting stats on a split or a mode?

A. split - rename all metric keys to train|val|test
B. mode - rename all metric keys to train|eval|predict
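
To make the two options concrete, the same metrics would be keyed like this (illustrative values):

# A. keys named after the split the metrics were computed on
metrics_a = {"train_loss": 1.2, "val_loss": 0.9, "test_loss": 0.95}

# B. keys named after the mode that produced them, matching do_train / do_eval / do_predict
metrics_b = {"train_loss": 1.2, "eval_loss": 0.9, "predict_loss": 0.95}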

Thank you.

@sgugger, @patil-suraj, @patrickvonplaten

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (16 by maintainers)

Most upvoted comments

No, the key in the dataset dictionary is "validation", so it should be validation_file.
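
For reference, that key comes straight from the datasets library - loading a dataset that ships a validation split returns a DatasetDict keyed "validation", not "val" or "eval" (a quick sketch using wikitext, as in the commands below):

from datasets import load_dataset

# wikitext-2-raw-v1 ships train/validation/test splits,
# so the returned DatasetDict is keyed "train", "validation", "test".
raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
print(sorted(raw_datasets.keys()))  # ['test', 'train', 'validation']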

Awesome! Thank you, @bhadreshpsavani!

So the changes we need are:

  1. use eval instead of val
  2. use predict instead of test

in command-line args and variable names in the example scripts (only the active ones; please ignore the legacy/research subdirs).
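
As a rough sketch of the renames this implies (illustrative mapping, not an exhaustive list of the actual script args):

# old name -> new name
renames = {
    "max_val_samples": "max_eval_samples",      # 1. val -> eval
    "max_test_samples": "max_predict_samples",  # 2. test -> predict
    "test_loss": "predict_loss",                # metric keys follow the mode
}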

I hope this will be the last rename for a while.

@bhadreshpsavani, would this be something you'd like to work on, by chance? That is, if you haven't tired of the examples yet.

I vote for B, for consistency with do_train, do_eval, do_predict.

For the examples: switching an arg name can be done without taking precautions for backward compatibility as long as the README is updated at the same time, but for TrainingArguments (if any are concerned), a proper deprecation cycle has to be followed.
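
A minimal sketch of what such a deprecation cycle could look like for a dataclass-style argument (hypothetical field names, not the actual TrainingArguments code):

import warnings
from dataclasses import dataclass
from typing import Optional

@dataclass
class MyArguments:
    max_eval_samples: Optional[int] = None  # new name
    max_val_samples: Optional[int] = None   # deprecated alias, kept for BC

    def __post_init__(self):
        if self.max_val_samples is not None:
            warnings.warn(
                "`max_val_samples` is deprecated and will be removed; "
                "use `max_eval_samples` instead.",
                FutureWarning,
            )
            if self.max_eval_samples is None:
                self.max_eval_samples = self.max_val_samples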

OK, so I had to disable the Privacy Badger Firefox extension and Colab started working.

First, make a habit of starting Colab with:

!free -h

Sometimes I get 12GB of RAM, other times 25GB; 12GB is typically too low for much.

So run_clm.py works just fine even on 12GB. I had to use a small batch size, so I edited your command lines to limit it:

!python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--max_train_samples 5 \
--max_val_samples 5 \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-clm \
--per_device_eval_batch_size 2 \
--per_device_train_batch_size 2 \
--overwrite_output_dir

This worked too:

!python examples/pytorch/language-modeling/run_plm.py \
--model_name_or_path xlnet-base-cased \
--dataset_name wikitext \
--max_train_samples 5 \
--max_val_samples 5 \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir /tmp/test-clm \
--per_device_eval_batch_size 2 \
--per_device_train_batch_size 2 \
--overwrite_output_dir

And so did:

!python examples/pytorch/question-answering/run_qa.py \
--model_name_or_path distilbert-base-uncased \
--train_file tests/fixtures/tests_samples/SQUAD/sample.json \
--validation_file tests/fixtures/tests_samples/SQUAD/sample.json \
--test_file tests/fixtures/tests_samples/SQUAD/sample.json \
--do_train \
--do_eval \
--do_predict \
--max_train_samples 5 \
--max_val_samples 5 \
--max_test_samples 5 \
--learning_rate 3e-5 \
--max_seq_length 384 \
--doc_stride 128 \
--version_2_with_negative \
--output_dir /tmp/debug_squad/ \
--per_device_eval_batch_size 2 \
--per_device_train_batch_size 2 \
--overwrite_output_dir

Yeah, I think it silently aborted the run without any traceback. It might be because it occupies the entire RAM somehow. I observed similar behavior when running a really big Docker image locally.

I will definitely try this command and dig more!

Thanks a lot for your input. This is really insightful! I will note this down as well 😃

Hi @stas00, yeah, I will be happy to work on more. Actually, I was looking for some issues to work on!

Not really my area of expertise here, but I do agree with @stas00 -> I think we should keep the liberty of quickly adapting the examples.