datasets: map/filter multiprocessing raises errors and corrupts datasets

After upgrading to 1.0, I started seeing errors in my data loading script after enabling multiprocessing.

    ...
    ner_ds_dict = ner_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
    ner_ds_dict["validation"] = ner_ds_dict["test"]
    rel_ds_dict = rel_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
    rel_ds_dict["validation"] = rel_ds_dict["test"]
    return ner_ds_dict, rel_ds_dict

The first train_test_split (ner_ds/ner_ds_dict) returns train and test splits that are iterable. The second (rel_ds/rel_ds_dict in this case) returns a DatasetDict that reports rows, but selecting from or slicing into it returns an empty dictionary, e.g. rel_ds_dict['train'][0] == {} and rel_ds_dict['train'][0:100] == {}.

OK, I think I know the problem: rel_ds was mapped through a mapper with num_proc=12. If I remove num_proc, the dataset loads fine.
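
Roughly, this is the pattern (toy data and a toy mapper here, not my real schema or map function):

    from datasets import Dataset

    # Toy stand-in for my data and mapper: real columns and map function differ,
    # but the pattern (map with num_proc, then train_test_split) is the same
    rel_ds = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})
    rel_ds = rel_ds.map(lambda x: {"n_tokens": len(x["text"].split())}, num_proc=12)

    rel_ds_dict = rel_ds.train_test_split(test_size=0.2, shuffle=True, seed=38)
    print(rel_ds_dict["train"][0])      # {} instead of a row when num_proc was used
    print(rel_ds_dict["train"][0:100])  # also {}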

I also see errors with other map and filter functions when num_proc is set, for example:

    Done writing 67 indices in 536 bytes .
    Done writing 67 indices in 536 bytes .
    Fatal Python error: PyCOND_WAIT(gil_cond) failed
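
For illustration (a toy dataset and predicate, not my exact filter), a call along these lines also fails for me when num_proc is set:

    from datasets import Dataset

    # Illustrative only: mirrors the filter + num_proc pattern from my script
    ds = Dataset.from_dict({"relations": [[1, 2], [], [3]] * 30})
    ds = ds.filter(lambda x: len(x["relations"]) > 0, num_proc=12)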

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 22 (14 by maintainers)

Most upvoted comments

Thanks for reporting. I’m going to fix that and add a test case so that it doesn’t happen again 😃 I’ll let you know when it’s done

In the meantime, if you could make a Google Colab that reproduces the issue, it would be helpful! @timothyjlaurent

Hi @lhoestq,

Thanks for letting me know about the update.

So I don't think it's the caching, because the hashing mechanism isn't stable for me, but that's a different issue. In any case, I ran rm -rf ~/.cache/huggingface to start from a clean slate.

I synced with master and the key error has gone away. I tried with and without the TOKENIZERS_PARALLELISM variable set, and I see the log line that sets the value to false before the map.
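
(For reference, this is just how I was toggling the variable at the top of my script, nothing fancy:)

    import os

    # how I was setting it (tried "false" and also leaving it unset entirely)
    os.environ["TOKENIZERS_PARALLELISM"] = "false"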

Now I’m seeing an issue with .train_test_split() on datasets that are the product of a multiprocess map.

Here is the stack trace:

  File "/Users/timothy.laurent/src/inv-text2struct/text2struct/model/dataset.py", line 451, in load_prodigy_arrow_datasets
    ner_ds_dict = ner_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
  File "/Users/timothy.laurent/.virtualenvs/inv-text2struct/src/datasets/src/datasets/arrow_dataset.py", line 168, in wrapper
    dataset.set_format(**new_format)
  File "/Users/timothy.laurent/.virtualenvs/inv-text2struct/src/datasets/src/datasets/fingerprint.py", line 163, in wrapper
    out = func(self, *args, **kwargs)
  File "/Users/timothy.laurent/.virtualenvs/inv-text2struct/src/datasets/src/datasets/arrow_dataset.py", line 794, in set_format
    list(filter(lambda col: col not in self._data.column_names, columns)), self._data.column_names
ValueError: Columns ['train', 'test'] not in the dataset. Current columns in the dataset: ['_input_hash', '_task_hash', '_view_id', 'answer', 'encoding__ids', 'encoding__offsets', 'encoding__overflowing', 'encoding__tokens', 'encoding__words', 'ner_ids', 'ner_labels', 'relations', 'spans', 'text', 'tokens']