datasets: map/filter multiprocessing raises errors and corrupts datasets
After upgrading to 1.0 I started seeing errors in my data loading script after enabling multiprocessing.
...
ner_ds_dict = ner_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
ner_ds_dict["validation"] = ner_ds_dict["test"]
rel_ds_dict = rel_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
rel_ds_dict["validation"] = rel_ds_dict["test"]
return ner_ds_dict, rel_ds_dict
The first `train_test_split`, `ner_ds`/`ner_ds_dict`, returns a train and test split that are iterable. The second, `rel_ds`/`rel_ds_dict` in this case, returns a `DatasetDict` that has rows, but selecting from or slicing into it returns an empty dictionary, e.g. `rel_ds_dict['train'][0] == {}` and `rel_ds_dict['train'][0:100] == {}`.
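For reference, here is a minimal sketch of the kind of pipeline that triggers this for me; the toy data and the mapping function below are made-up stand-ins for my real script:

```python
from datasets import Dataset

# Toy stand-in for the real relation dataset (hypothetical data).
rel_ds = Dataset.from_dict({"text": [f"sentence {i}" for i in range(100)]})

# The multiprocessed map; this is the step that seems to corrupt things.
rel_ds = rel_ds.map(lambda example: {"n_chars": len(example["text"])}, num_proc=12)

rel_ds_dict = rel_ds.train_test_split(test_size=0.2, shuffle=True, seed=42)
rel_ds_dict["validation"] = rel_ds_dict["test"]

# On 1.0, after the multiprocessed map above, these reportedly come back empty:
print(rel_ds_dict["train"][0])      # {} instead of a row
print(rel_ds_dict["train"][0:100])  # {} instead of a batch
```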
Ok, I think I know the problem: `rel_ds` was mapped through a mapper with `num_proc=12`. If I remove `num_proc`, the dataset loads. I also see errors with other map and filter functions when `num_proc` is set (a sketch of the single-process workaround follows the log output below).
Done writing 67 indices in 536 bytes .
Done writing 67 indices in 536 bytes .
Fatal Python error: PyCOND_WAIT(gil_cond) failed
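For now the workaround is simply dropping `num_proc` from the map call; a sketch of the same hypothetical map from above, run single-process:

```python
# Same toy map as above, but single-process: the resulting dataset loads,
# indexes, and slices normally after train_test_split.
rel_ds = rel_ds.map(lambda example: {"n_chars": len(example["text"])})
```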
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 22 (14 by maintainers)
Thanks for reporting. I’m going to fix that and add a test case so that it doesn’t happen again 😃 I’ll let you know when it’s done
In the meantime, if you could make a Google Colab that reproduces the issue it would be helpful! @timothyjlaurent
Hi @lhoestq,
Thanks for letting me know about the update.
So I don't think it's the caching, because the hashing mechanism isn't stable for me, but that's a different issue. In any case I ran `rm -rf ~/.cache/huggingface` to make a clean slate. I synced with master and the key error has gone away. I tried with and without the `TOKENIZERS_PARALLELISM` variable set and see the log line for setting the value to false before the map. Now I'm seeing an issue with `.train_test_split()` on datasets that are the product of a multiprocess map. Here is the stack trace