datasets: [experiment] missing default_experiment-1-0.arrow

the original report was pretty bad and incomplete - my apologies!

Please see the complete version here: https://github.com/huggingface/datasets/issues/1942#issuecomment-786336481


As mentioned in https://github.com/huggingface/datasets/issues/1939, metrics don’t get cached. Looking at my local ~/.cache/huggingface/metrics, there are many *.arrow.lock files but zero metric files.

w/o the network I get:

FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow'

there is just ~/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow.lock

I did run the same run_seq2seq.py script on the instance with network access and it worked just fine, but only the lock file was left behind.

this is with master.

Thank you.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

When you’re using metrics in a distributed setup, there are two cases:

  1. you’re doing two completely different experiments (two evaluations) and the two metric jobs have nothing to do with each other
  2. you’re doing one experiment (one evaluation) but use multiple processes to feed the data to the metric.

In case 1 you just need to provide two different experiment_id values so that the metrics don’t collide. In case 2 they must have the same experiment_id (or use the default one), but then you also need to provide num_process and process_id.

If I understand correctly, you’re in situation 2.

If so, can you make sure that you instantiate the metrics with both the right num_process and process_id parameters?

If they’re not set, then the cache files of the two metrics collide and it can cause issues. For example, if one metric finishes before the other, the cache file is deleted and the other metric gets a FileNotFoundError. There’s more information in the documentation if you want.
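
For reference, here’s a minimal sketch of case 2, assuming a torch.distributed launcher that sets the RANK and WORLD_SIZE environment variables (the experiment_id value is just an illustrative placeholder):

# Minimal sketch of case 2: one evaluation fed by several processes.
# Assumes a torch.distributed launcher that sets RANK and WORLD_SIZE.
import os
from datasets import load_metric

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

metric = load_metric(
    "sacrebleu",
    num_process=world_size,       # total number of processes feeding the metric
    process_id=rank,              # this process's index within the group
    experiment_id="my_eval_run",  # must be shared by the whole group
)

# Each process adds its own shard of predictions/references.
metric.add_batch(
    predictions=["hello there general kenobi"],
    references=[["hello there general kenobi"]],
)

# Every process calls compute(); only process 0 performs the final
# computation (the other processes may get None back).
score = metric.compute()
if rank == 0:
    print(score)

In case 1 you would instead give each run its own experiment_id and leave num_process/process_id at their defaults.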

Hope that helps !

I just opened #1966 to fix this 😃 @stas00 if you have a chance, feel free to try it!

The fact that a datasets.Metric object cannot be used as a simple compute function in a multi-process environment is, in my opinion, a bug in datasets

Yes, totally: this use case is supposed to be supported by datasets. And in this case there shouldn’t be any collision between the metrics. I’m looking into it 😃 My guess is that at some point the metric isn’t using the right file name. It’s supposed to use one with a unique uuid in order to avoid collisions.

Right, to clarify: I meant it’d be good to have it sorted out on the library side rather than requiring the user to figure it out. This is too complex and error-prone, and if not coded correctly the bug will be intermittent, which is even worse.

Oh, I guess I wasn’t clear in my message - in no way am I proposing that we use this workaround code - I was just showing what I had to do to make it work.

We are on the same page.

The changes you are proposing, Stas, make the code less readable and also concatenate all the predictions and labels number_of_processes times, I believe, which is not going to make the metric computation any faster.

And yes, this is another problem that my workaround introduces. Thank you for pointing it out, @sgugger

To give more context, we are just using the metrics for the compute_metrics function and nothing else. Is there something else we can use that just applies the function to the full arrays of predictions and labels? Because that’s all we need: all the gathering has already been done by the Trainer, since the datasets Metric multiprocessing relies on file storage and thus does not work in a multi-node distributed setup (whereas the Trainer’s gathering does).

Otherwise, we’ll have to switch to something else to compute the metrics 😦
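
For illustration, one workaround sketch, assuming compute_metrics already receives the fully gathered predictions and labels from the Trainer: give every process its own experiment_id so the metric behaves like a plain per-process compute function and its cache files cannot collide (the id format and function name below are arbitrary, not a datasets convention):

# Sketch only: treat the metric as a plain per-process compute function,
# assuming the Trainer already passes fully gathered predictions/labels.
# A unique experiment_id per process avoids cache collisions.
import os
import uuid
from datasets import load_metric

metric = load_metric(
    "sacrebleu",
    experiment_id=f"run_seq2seq-{os.getpid()}-{uuid.uuid4().hex}",
)

def compute_metrics_fn(decoded_preds, decoded_labels):
    # sacrebleu expects a list of reference lists
    return metric.compute(
        predictions=decoded_preds,
        references=[[label] for label in decoded_labels],
    )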

Thank you, @lhoestq - I will experiment and report back.

edit: It works! Thank you

I don’t see how this could be the responsibility of the Trainer, which hasn’t the faintest idea of what a datasets.Metric is. The Trainer takes a function compute_metrics that goes from predictions + labels to metric results; there is nothing more to it. That computation is done on all processes.

The fact that a datasets.Metric object cannot be used as a simple compute function in a multi-process environment is, in my opinion, a bug in datasets. Especially since, as I mentioned before, the multiprocessing part of datasets.Metric has a deep flaw: it can’t work in a multi-node environment. So you actually need to do the job of gathering predictions and labels yourself.

The changes you are proposing, Stas, make the code less readable and also concatenate all the predictions and labels number_of_processes times, I believe, which is not going to make the metric computation any faster.

OK, it definitely leads to a race condition in how it’s used right now. Here is how you can reproduce it: inject a random sleep time, different for each process, before the locks are acquired.

--- a/src/datasets/metric.py
+++ b/src/datasets/metric.py
@@ -348,6 +348,16 @@ class Metric(MetricInfoMixin):

         elif self.process_id == 0:
             # Let's acquire a lock on each node files to be sure they are finished writing
+
+            import time
+            import random
+            import os
+            pid = os.getpid()
+            random.seed(pid)
+            secs = random.randint(1, 15)
+            time.sleep(secs)
+            print(f"sleeping {secs}")
+
             file_paths, filelocks = self._get_all_cache_files()

             # Read the predictions and references
@@ -385,7 +395,10 @@ class Metric(MetricInfoMixin):

         if predictions is not None:
             self.add_batch(predictions=predictions, references=references)
+        print("FINALIZE START")
+
         self._finalize()
+        print("FINALIZE END")

         self.cache_file_name = None
         self.filelock = None

then run with 2 procs: python -m torch.distributed.launch --nproc_per_node=2

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --do_train --do_predict --evaluation_strategy=steps  --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 --max_train_samples 10 --max_val_samples 10 --max_test_samples 10  --dataset_name wmt16 --dataset_config ro-en --source_prefix "translate English to Romanian: "
***** Running Evaluation *****
  Num examples = 10
  Batch size = 16
  0%|                                                                                                                                      | 0/1 [00:00<?, ?it/s]FINALIZE START
FINALIZE START
sleeping 11
FINALIZE END
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.06s/it]
sleeping 11
Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/metric.py", line 368, in _finalize
    self.data = Dataset(**reader.read_files([{"filename": f} for f in file_paths]))
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 236, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 171, in _read_files
    pa_table: pa.Table = self._get_dataset_from_filename(f_dict, in_memory=in_memory)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 302, in _get_dataset_from_filename
    pa_table = ArrowReader.read_table(filename, in_memory=in_memory)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 322, in read_table
    stream = stream_from(filename)
  File "pyarrow/io.pxi", line 782, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 743, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/home/stas/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow'. Detail: [errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/seq2seq/run_seq2seq.py", line 645, in <module>
    main()
  File "examples/seq2seq/run_seq2seq.py", line 601, in main
    metrics = trainer.evaluate(
  File "/mnt/nvme1/code/huggingface/transformers-mp-pp/src/transformers/trainer_seq2seq.py", line 74, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/mnt/nvme1/code/huggingface/transformers-mp-pp/src/transformers/trainer.py", line 1703, in evaluate
    output = self.prediction_loop(
  File "/mnt/nvme1/code/huggingface/transformers-mp-pp/src/transformers/trainer.py", line 1876, in prediction_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))
  File "examples/seq2seq/run_seq2seq.py", line 556, in compute_metrics
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/metric.py", line 402, in compute
    self._finalize()
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/metric.py", line 370, in _finalize
    raise ValueError(
ValueError: Error in finalize: another metric instance is already using the local cache file. Please specify an experiment_id to avoid colision between distributed metric instances.

OK, this issue is not about caching but some internal conflict/race condition, it seems; I have just run into it in my normal env:

Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/metric.py", line 356, in _finalize
    self.data = Dataset(**reader.read_files([{"filename": f} for f in file_paths]))
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 236, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 171, in _read_files
    pa_table: pa.Table = self._get_dataset_from_filename(f_dict, in_memory=in_memory)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 302, in _get_dataset_from_filename
    pa_table = ArrowReader.read_table(filename, in_memory=in_memory)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_reader.py", line 322, in read_table
    stream = stream_from(filename)
  File "pyarrow/io.pxi", line 782, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 743, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/home/stas/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow'. Detail: [errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/seq2seq/run_seq2seq.py", line 655, in <module>
    main()
  File "examples/seq2seq/run_seq2seq.py", line 619, in main
    test_results = trainer.predict(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer_seq2seq.py", line 121, in predict
    return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1706, in predict
    output = self.prediction_loop(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1813, in prediction_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))
  File "examples/seq2seq/run_seq2seq.py", line 556, in compute_metrics
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/metric.py", line 388, in compute
    self._finalize()
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/metric.py", line 358, in _finalize
    raise ValueError(
ValueError: Error in finalize: another metric instance is already using the local cache file. Please specify an experiment_id to avoid colision between distributed metric instances.

I’m just running run_seq2seq.py under DeepSpeed:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 deepspeed --num_gpus=2 examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --do_train --do_predict --evaluation_strategy=steps  --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 --max_train_samples 100 --max_val_samples 100 --max_test_samples 100 --dataset_name wmt16 --dataset_config ro-en  --source_prefix "translate English to Romanian: " --deepspeed examples/tests/deepspeed/ds_config.json

It finished the evaluation OK and crashed on the prediction part of the Trainer. But the eval/predict parts no longer run under DeepSpeed; at that point it’s just plain DDP.

Is this some kind of race condition? It happens intermittently - there is nothing else running at the same time.

But if two independent instances of the same script were to run at the same time, it’s clear that this problem would happen. Perhaps it’d help to create a unique hash which is shared by all processes in the group and use that as the default experiment id?
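
For what it’s worth, a rough sketch of that idea, assuming torch.distributed is already initialized and a PyTorch version that has broadcast_object_list (shared_experiment_id is a made-up helper, not a datasets API):

# Rough sketch of the proposal above: rank 0 generates a run-unique id and
# broadcasts it so every process in the group uses the same experiment_id.
# Assumes torch.distributed is initialized; shared_experiment_id is a
# hypothetical helper, not a datasets API.
import uuid
import torch.distributed as dist
from datasets import load_metric

def shared_experiment_id():
    ids = [uuid.uuid4().hex if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(ids, src=0)  # available in recent PyTorch
    return ids[0]

metric = load_metric(
    "sacrebleu",
    num_process=dist.get_world_size(),
    process_id=dist.get_rank(),
    experiment_id=shared_experiment_id(),
)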

Hi !

The cache at ~/.cache/huggingface/metrics stores the user’s data for metric computations (hence the arrow files).

However, Python modules (i.e. dataset scripts and metric scripts) are stored in ~/.cache/huggingface/modules/datasets_modules.

In particular, the metric scripts are cached in ~/.cache/huggingface/modules/datasets_modules/metrics/.

Feel free to take a look at your cache and let me know if you find anything that would help explain why you had an issue with rouge with no connection. I’m doing some tests on my side to try to reproduce the issue you have.
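
If it helps, here is a quick sketch for inspecting both locations, assuming the default cache directory under ~/.cache/huggingface:

# Quick sketch to list both cache locations mentioned above; paths assume
# the default HF cache under ~/.cache/huggingface.
import os

roots = [
    os.path.expanduser("~/.cache/huggingface/metrics"),                           # metric data (arrow files)
    os.path.expanduser("~/.cache/huggingface/modules/datasets_modules/metrics"),  # metric scripts
]

for root in roots:
    print(root)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            print("   ", os.path.join(dirpath, name))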