datasets: load_dataset for text files not working

Trying the following snippet, I get different problems on Linux and Windows.

dataset = load_dataset("text", data_files="data.txt")
# or 
dataset = load_dataset("text", data_files=["data.txt"])

(P.S. This example shows that a plain string works as input for data_files, even though the annotated signature is Union[Dict, List].)
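For completeness, here is a minimal sketch of the three forms I would expect to be accepted (the file names are placeholders, and the dict form is my reading of the docs rather than something I tested here):

from datasets import load_dataset

# a single file as a plain string
dataset = load_dataset("text", data_files="data.txt")
# a list of files, which all end up in the "train" split
dataset = load_dataset("text", data_files=["data.txt", "more.txt"])
# a dict mapping split names to files
dataset = load_dataset("text", data_files={"train": "train.txt", "test": "test.txt"})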

The problem on Linux is that the script crashes with a CSV parse error (even though the input is not a CSV file). On Windows the script just seems to freeze or get stuck after loading the config file.

Linux stack trace:

PyTorch version 1.6.0+cu101 available.
Checking /home/bram/.cache/huggingface/datasets/b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text
Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py
Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/dataset_infos.json
Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.json
Using custom data configuration default
Generating dataset text (/home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7)
Downloading and preparing dataset text/default-0907112cc6cd2a38 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7...
Dataset not on Hf google storage. Downloading and preparing it from source
Downloading took 0.0 min
Checksum Computation took 0.0 min
Unable to verify checksums.
Generating split train
Traceback (most recent call last):
  File "/home/bram/Python/projects/dutch-simplification/utils.py", line 45, in prepare_data
    dataset = load_dataset("text", data_files=dataset_f)
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 468, in download_and_prepare
    self._download_and_prepare(
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 546, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 888, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py", line 100, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2

Windows just seems to get stuck. Even with a tiny dataset of 10 lines, it has been stuck for 15 minutes already at this message:

Checking C:\Users\bramv\.cache\huggingface\datasets\b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text
Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.py
Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text\dataset_infos.json
Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.json
Using custom data configuration default

Most upvoted comments

I found a way to implement it without a third-party lib and without separator/delimiter logic. Creating a PR now.

I’d love to have your feedback on the PR, @Skyy93. Hopefully this is the final iteration of the text dataset 😃

Let me know if it works on your side!

Until there’s a new release, you can test it with:

from datasets import load_dataset

d = load_dataset("text", data_files=..., script_version="master")

@banunitte Please do not post screenshots in the future; instead, copy-paste your code and the errors, so that others can copy your code and test it. You may also want to provide the Python version that you are using.

@lhoestq I’ve opened a new issue #743 and added a colab. Whenever you have free time, could you please look into it? I would appreciate it 😄. Thank you.

Indeed, good catch! I’m going to add the lineterminator and add tests that include the \r character. According to the Stack Overflow thread, this issue occurs when there is a \r in the text file.

You can expect a patch release by tomorrow
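In the meantime, here is a quick diagnostic of my own (not part of the library; "data.txt" is a placeholder path) to check whether a file contains the stray \r characters described above:

with open("data.txt", "rb") as f:
    for lineno, line in enumerate(f, start=1):
        # reading in binary mode splits only on \n, so any \r still present
        # in `line` is a carriage return that the CSV-based parser can trip over
        if b"\r" in line:
            print(f"line {lineno} contains a carriage return: {line!r}")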

Traceback (most recent call last):
  File "main.py", line 281, in <module>
    main()
  File "main.py", line 190, in main
    train_data, test_data = data_factory(
  File "main.py", line 129, in data_factory
    train_data = load_dataset('text', 
  File "/home/me/Downloads/datasets/src/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/me/Downloads/datasets/src/datasets/builder.py", line 468, in download_and_prepare
    self._download_and_prepare(
  File "/home/me/Downloads/datasets/src/datasets/builder.py", line 546, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/me/Downloads/datasets/src/datasets/builder.py", line 888, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/me/.local/lib/python3.8/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/home/me/.cache/huggingface/modules/datasets_modules/datasets/text/512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7fcc649178b014/text.py", line 103, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 617, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2

Unfortunately, I am still getting this issue on Linux. I installed datasets from source and set script_version to master.

To complete what @lhoestq is saying: to use the new version of the text processing script (which is on master right now), you need to either specify that the script version is the master one, or install the lib from source (in which case it uses the master version of the script by default):

dataset = load_dataset('text', script_version='master', data_files=XXX)

We do versioning by default, i.e. your version of the datasets lib will use the script with the same version (e.g. only the 1.0.1 version of the script if you have PyPI version 1.0.1 of the lib).
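A quick way to double-check which script version your install will resolve to (my own sanity check, not from the thread):

import datasets

# an install of PyPI 1.0.1 fetches the 1.0.1 text.py unless script_version overrides it
print(datasets.__version__)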