datasets: load_dataset for CSV files not working
Similar to #622, I’ve noticed there is a problem when trying to load a CSV file with datasets.
from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")
Displayed error:
... ArrowInvalid: CSV parse error: Expected 2 columns, got 1
I should mention that when I’ve tried to read data from https://github.com/lhoestq/transformers/tree/custom-dataset-in-rag-retriever/examples/rag/test_data/my_knowledge_dataset.csv
it worked without a problem. I’ve read that there might be some problems with \r characters, so I’ve removed them from the custom dataset, but the problem still remains.
I’ve added a colab reproducing the bug, but unfortunately I cannot provide the dataset. https://colab.research.google.com/drive/1Qzu7sC-frZVeniiWOwzoCe_UHZsrlxu8?usp=sharing
Is there any workaround for it? Thank you.
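Since the dataset itself cannot be shared, one way to narrow down an error like "Expected 2 columns, got 1" is to scan the file locally for rows whose field count differs from what you expect. The following is only a minimal sketch using Python's standard csv module; the helper name find_bad_rows and the sample data are made up for illustration, and Arrow's CSV parser may handle quoting differently from the csv module.

```python
import csv
import io


def find_bad_rows(lines, delimiter="\t", expected_columns=2):
    """Return (line_number, field_count) for rows with an unexpected field count.

    `lines` is any iterable of text lines, for example an open file.
    """
    bad = []
    reader = csv.reader(lines, delimiter=delimiter)
    for row in reader:
        if len(row) != expected_columns:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
    return bad


# Illustrative data: the second row uses a comma instead of a tab.
sample = io.StringIO("title one\ttext one\ntitle two,text two\n")
print(find_bad_rows(sample))  # [(2, 1)]
```

For a real file, pass open("./sample_data.csv", newline="", encoding="utf-8") instead of the in-memory sample. Any line reported with a field count of 1 probably does not contain the delimiter that was passed to load_dataset.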
About this issue
- State: open
- Created 4 years ago
- Comments: 22 (11 by maintainers)
This is because load_dataset without split= returns a dictionary mapping split names (train/validation/test) to datasets. You can access a split by name from that dictionary, or, if you want to directly get the train split, pass split="train" to load_dataset.
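The difference between the two calls can be sketched roughly as follows. This is not the snippet from the original reply; the helper names write_sample_csv and load_both_ways, the file contents, and the comma delimiter are assumptions made for illustration. It needs the datasets package installed.

```python
import csv


def write_sample_csv(path):
    """Write a tiny comma-separated CSV with a header row (illustrative data)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "text"])
        writer.writerow(["first title", "first text"])
        writer.writerow(["second title", "second text"])


def load_both_ways(path):
    """Load the same CSV without and with split=, and return both results."""
    from datasets import load_dataset  # requires `pip install datasets`

    # Without split=: a dict-like object keyed by split name.
    dataset_dict = load_dataset("csv", data_files=[path], delimiter=",")
    # With split="train": the train dataset itself.
    train_only = load_dataset("csv", data_files=[path], delimiter=",", split="train")
    return dataset_dict, train_only
```

After write_sample_csv("sample_data.csv") and dataset_dict, train_only = load_both_ways("sample_data.csv"), the first row should be reachable either as dataset_dict["train"][0] or as train_only[0].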
Hi! The split argument in load_dataset is used to select the splits you want among the available splits. However, when loading a csv with a single file as you did, only a train split is available by default.
Indeed, since data_files='./amazon_data/Video_Games_5.csv' is equivalent to data_files={"train": './amazon_data/Video_Games_5.csv'}, you can get a dataset by passing split="train". You can then split that dataset further to get both a train and a test split.
Also note that a csv dataset may have several available splits if data_files maps several split names to files.
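A rough sketch of both options, assuming the datasets package is installed. The helper names build_data_files and load_train_and_test are hypothetical, as are the test_size and seed values; the original code from this reply was not preserved, so this is one plausible way to do it rather than the exact snippet.

```python
def build_data_files(train_path, test_path=None):
    """Map split names to CSV paths; a single file becomes the train split."""
    data_files = {"train": train_path}
    if test_path is not None:
        data_files["test"] = test_path
    return data_files


def load_train_and_test(train_path, test_path=None, delimiter=","):
    """Return a dict-like object with "train" and "test" splits."""
    from datasets import load_dataset  # requires `pip install datasets`

    data_files = build_data_files(train_path, test_path)
    if test_path is not None:
        # Each split is read from its own file.
        return load_dataset("csv", data_files=data_files, delimiter=delimiter)

    # Only a train split is available, so carve a test split out of it.
    train = load_dataset(
        "csv", data_files=data_files, delimiter=delimiter, split="train"
    )
    return train.train_test_split(test_size=0.2, seed=0)
```

For example, load_train_and_test('./amazon_data/Video_Games_5.csv') would split the single file, while passing a second path (any separate test file you have) would load the two files as separate splits.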
Oh… I figured it out. According to issue #42387 from pandas, this new version does not accept None for both parameters (which was being done by the repo I’m testing). Downgrading to pandas==1.0.4 and Python 3.8 worked.
Hi, could this be a permission error? I think it fails to close the arrow file that contains the data from your CSVs in the cache.
By default, datasets are cached in ~/.cache/huggingface/datasets. Could you check that you have the right permissions? You can also try to change the cache directory by passing cache_dir="path/to/my/cache/dir" to load_dataset.
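A quick way to check the permissions on the cache directory is sketched below, using only the standard library. The helper name check_cache_dir is made up for illustration, and os.access only reports what the current user is nominally allowed to do, so treat the result as a hint rather than a guarantee.

```python
import os
import tempfile


def check_cache_dir(path):
    """Report whether `path` exists and whether the current user can write to it."""
    expanded = os.path.expanduser(path)
    return {
        "path": expanded,
        "exists": os.path.isdir(expanded),
        "writable": os.access(expanded, os.W_OK),
    }


# Default location mentioned above; it may not exist yet on a fresh machine.
print(check_cache_dir("~/.cache/huggingface/datasets"))

# A directory we just created should be reported as writable.
with tempfile.TemporaryDirectory() as tmp:
    print(check_cache_dir(tmp))
```

If the default directory turns out not to be writable, pointing cache_dir at a directory that is writable is the workaround suggested in this reply.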
Hi @kauvinlucas
You can use the latest version of datasets to do this. To do so, just pip install datasets instead of nlp (the library was renamed) and then