datasets: [experiment] missing default_experiment-1-0.arrow
The original report was pretty bad and incomplete - my apologies!
Please see the complete version here: https://github.com/huggingface/datasets/issues/1942#issuecomment-786336481
As mentioned in https://github.com/huggingface/datasets/issues/1939, metrics don't get cached. Looking at my local ~/.cache/huggingface/metrics there are many *.arrow.lock files but zero metrics files.

Without the network I get:

FileNotFoundError: [Errno 2] No such file or directory: '~/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow'

There is just ~/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow.lock.
I did run the same run_seq2seq.py script on the instance with network access and it worked just fine, but only the lock file was left behind.

This is with master.
Thank you.
When you’re using metrics in a distributed setup, there are two cases:

1. you’re running two completely independent evaluations, and the two metric jobs have nothing to do with each other
2. you’re running a single evaluation but using several processes to feed the data to the metric

In case 1 you just need to provide two different experiment_id values so that the metrics don’t collide. In case 2 they must have the same experiment_id (or use the default one), but then you also need to provide the num_process and process_id parameters.

If I understand correctly you’re in situation 2.

If so, can you make sure that you instantiate the metrics with both the right num_process and process_id parameters? If they’re not set, then the cache files of the two metrics collide and that can cause issues. For example, if one metric finishes before the other, its cache file is deleted and the other metric gets a FileNotFoundError. There’s more information in the documentation if you want.

Hope that helps!
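A minimal sketch of case 2, assuming the script is launched with torch.distributed.launch so that RANK and WORLD_SIZE are set in the environment, and using sacrebleu as a stand-in for whatever metric you load:

```python
import os

from datasets import load_metric

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# One evaluation, several processes feeding the same metric (case 2):
# same (default) experiment_id everywhere, plus num_process / process_id.
metric = load_metric(
    "sacrebleu",
    num_process=world_size,  # total number of processes writing to this metric
    process_id=rank,         # this process' index within the group
)

metric.add_batch(predictions=["hello there"], references=[["hello there"]])

# Only process_id == 0 gathers the per-process cache files and returns the
# score; the other processes get None back.
result = metric.compute()
```

For case 1 you would instead pass a distinct experiment_id per run, e.g. load_metric("sacrebleu", experiment_id="run_a").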
I just opened #1966 to fix this 😃 @stas00 if you have a chance feel free to try it!
Yes totally, this use case is supposed to be supported by datasets. And in this case there shouldn’t be any collision between the metrics. I’m looking into it 😃 My guess is that at one point the metric isn’t using the right file name. It’s supposed to use one with a unique uuid in order to avoid the collisions.

Right, to clarify, I meant it’d be good to have it sorted out on the library side rather than requiring the user to figure it out. This is too complex and error-prone, and if not coded correctly the bug will be intermittent, which is even worse.
Oh, I guess I wasn’t clear in my message - in no way am I proposing that we use this workaround code - I was just showing what I had to do to make it work.
We are on the same page.
And yes, this is another problem that my workaround introduces. Thank you for pointing it out, @sgugger
To give more context, we are just using the metrics for the compute_metrics function and nothing else. Is there something else we can use that just applies the function to the full arrays of predictions and labels? That’s all we need: all the gathering has already been done, because the datasets Metric multiprocessing relies on file storage and thus does not work in a multi-node distributed setup (whereas the Trainer does).

Otherwise, we’ll have to switch to something else to compute the metrics 😦
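For reference, this is roughly the shape of that compute_metrics function; a hedged sketch, assuming a seq2seq setup where the Trainer has already gathered the full prediction and label arrays, with the t5-small tokenizer standing in for whatever the script actually uses:

```python
import numpy as np
from datasets import load_metric
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # illustrative choice
metric = load_metric("sacrebleu")

def compute_metrics(eval_preds):
    # predictions and label_ids here are the full, already-gathered arrays.
    preds, labels = eval_preds
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 (ignored positions) before decoding the labels.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # sacrebleu expects one list of references per prediction.
    result = metric.compute(
        predictions=decoded_preds,
        references=[[label] for label in decoded_labels],
    )
    return {"bleu": result["score"]}
```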
Thank you, @lhoestq - I will experiment and report back.
edit: It works! Thank you
I don’t see how this could be the responsibility of Trainer, which hasn’t the faintest idea of what a datasets.Metric is. The Trainer takes a function compute_metrics that goes from predictions + labels to metric results, there is nothing more to it. That computation is done on all processes.

The fact that a datasets.Metric object cannot be used as a simple compute function in a multi-process environment is, in my opinion, a bug in datasets. Especially since, as I mentioned before, the multiprocessing part of datasets.Metric has a deep flaw: it can’t work in a multi-node environment. So you actually need to do the job of gathering predictions and labels yourself.

The changes you are proposing, Stas, make the code less readable and also concatenate all the predictions and labels number_of_processes times, I believe, which is not going to make the metric computation any faster.

OK, it definitely leads to a race condition in how it’s used right now. Here is how you can reproduce it - by injecting a random sleep time, different for each process, before the locks are acquired.
then run with 2 procs:
python -m torch.distributed.launch --nproc_per_node=2
OK, this issue is not about caching but some internal conflict/race condition, it seems; I have just run into it on my normal env. I’m just running run_seq2seq.py under DeepSpeed. It finished the evaluation OK and crashed on the prediction part of the Trainer. But the eval/predict parts no longer run under DeepSpeed, it’s just plain DDP.
Is this some kind of race condition? It happens intermittently - there is nothing else running at the same time.
But if 2 independent instances of the same script were to run at the same time, it’s easy to see that this problem would happen. Perhaps it’d help to create a unique hash which is shared between all processes in the group and use that as the default experiment id?
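A sketch of that suggestion, assuming torch.distributed is already initialized and the PyTorch version is recent enough to have broadcast_object_list; none of this exists in datasets itself:

```python
import uuid

import torch.distributed as dist
from datasets import load_metric

def shared_experiment_id() -> str:
    # Rank 0 draws a unique id and broadcasts it, so every process in the
    # group ends up using the same experiment_id for this run.
    payload = [uuid.uuid4().hex if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]

metric = load_metric(
    "sacrebleu",
    experiment_id=shared_experiment_id(),
    num_process=dist.get_world_size(),
    process_id=dist.get_rank(),
)
```

Two concurrent runs of the same script would then use different experiment ids, so their cache files could no longer collide.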
Hi!

The cache at ~/.cache/huggingface/metrics stores the user's data for metric computations (hence the arrow files). However, python modules (i.e. dataset scripts, metric scripts) are stored in ~/.cache/huggingface/modules/datasets_modules. In particular the metrics are cached in ~/.cache/huggingface/modules/datasets_modules/metrics/
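If it helps, here is a quick way to see what's actually in those two locations (the paths are just the defaults; adjust them if your HF cache dir is customized):

```python
import os

for d in (
    "~/.cache/huggingface/metrics",                           # metric computation data (.arrow, .lock)
    "~/.cache/huggingface/modules/datasets_modules/metrics",  # downloaded metric scripts
):
    d = os.path.expanduser(d)
    for root, _dirs, files in os.walk(d):
        for name in files:
            print(os.path.join(root, name))
```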
Feel free to take a look at your cache and let me know if you find anything that would help explain why you had an issue with rouge with no connection. I’m doing some tests on my side to try to reproduce the issue you have.