datasets: load_dataset for CSV files not working

Similar to #622, I’ve noticed there is a problem when trying to load a CSV file with datasets.

from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")

Displayed error: ... ArrowInvalid: CSV parse error: Expected 2 columns, got 1

I should mention that when I tried to read data from https://github.com/lhoestq/transformers/tree/custom-dataset-in-rag-retriever/examples/rag/test_data/my_knowledge_dataset.csv it worked without a problem. I’ve read that there might be some problems with the \r character, so I removed it from the custom dataset, but the problem still remains.
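If the delimiter passed to load_dataset doesn't match the separator actually used in the file, pyarrow raises exactly this kind of column-count error. As a quick sanity check, here is a minimal sketch (the path is a placeholder for your own file) that lets Python's built-in csv.Sniffer guess the delimiter before loading:

import csv

# Placeholder path; point this at your own CSV file.
path = "./sample_data.csv"

with open(path, newline="") as f:
    sample = f.read(4096)  # a few KB is enough for sniffing

# Restrict the candidates to the usual suspects to make sniffing more reliable.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
print(repr(dialect.delimiter))  # e.g. '\t' or ','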

I’ve added a colab reproducing the bug, but unfortunately I cannot provide the dataset. https://colab.research.google.com/drive/1Qzu7sC-frZVeniiWOwzoCe_UHZsrlxu8?usp=sharing

Is there any workaround for this? Thank you

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 22 (11 by maintainers)

Most upvoted comments

This is because load_dataset without split= returns a dictionary mapping split names (train/validation/test) to datasets. You can do

from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",")
print(dataset["train"][0])

Or if you want to directly get the train split:

from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",", split="train")
print(dataset[0])

Hi! The split argument in load_dataset is used to select the splits you want among the available splits. However, when loading a CSV with a single file as you did, only a train split is available by default.

Indeed, since data_files='./amazon_data/Video_Games_5.csv' is equivalent to data_files={"train": './amazon_data/Video_Games_5.csv'}, you can get a dataset with

from datasets import load_dataset
dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split="train")

And then to get both a train and test split you can do

dataset = dataset.train_test_split()
print(dataset.keys())
# ['train', 'test']
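
train_test_split also takes a test_size and a seed if you want a reproducible split; a minimal sketch (the 0.1 and 42 are illustrative values, not from the comment above):

# Hold out 10% of the rows as a test set, with a fixed seed for reproducibility.
dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(dataset["train"].num_rows, dataset["test"].num_rows)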

Also note that a CSV dataset may have several available splits if it is defined this way:

from datasets import load_dataset
dataset = load_dataset('csv', data_files={
    "train": './amazon_data/Video_Games_5_train.csv',
    "test": './amazon_data/Video_Games_5_test.csv'
})
print(dataset.keys())
# ['train', 'test']
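
In that case the split argument selects one of the declared splits directly; a minimal sketch reusing the same hypothetical file paths:

from datasets import load_dataset
# Load only the test split defined in the data_files dict.
test_set = load_dataset('csv', data_files={
    "train": './amazon_data/Video_Games_5_train.csv',
    "test": './amazon_data/Video_Games_5_test.csv'
}, split="test")
print(test_set.num_rows)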

Oh… I figured it out. According to issue #42387 from pandas, this new version does not accept None for both parameters (which is what the repo I’m testing was doing). Downgrading to pandas==1.0.4 and Python 3.8 worked.

Hi, could this be a permission error? I think it fails to close the arrow file that contains the data from your CSVs in the cache.

By default, datasets are cached in ~/.cache/huggingface/datasets; could you check that you have the right permissions? You can also try to change the cache directory by passing cache_dir="path/to/my/cache/dir" to load_dataset.
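
For example, a minimal sketch (the cache path is just an illustrative placeholder):

from datasets import load_dataset
# Write the cache to a directory you know you can write to.
dataset = load_dataset('csv', data_files='sample_data.csv', cache_dir="/tmp/hf_cache")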

Hi @kauvinlucas

You can use the latest versions of datasets to do this. To do so, just pip install datasets instead of nlp (the library was renamed) and then

from datasets import load_dataset
dataset = load_dataset('csv', data_files='sample_data.csv')