datasets: load_dataset raises Unknown split "validation" even though the directory exists

Describe the bug

Calling datasets.load_dataset(local_data_dir_path, split="validation") raises a ValueError (Unknown split "validation". Should be one of ['train', 'test'].) even though the validation sub-directory exists in the local data path.

The data directories are as follows and attached to this issue:

test_data1
  |_ train
      |_ 1012.png
      |_ metadata.jsonl
      ...
  |_ test
      ...
  |_ validation
      |_ 234.png
      |_ metadata.jsonl
      ...
test_data2
  |_ train
      |_ train_1012.png
      |_ metadata.jsonl
      ...
  |_ test
      ...
  |_ validation
      |_ val_234.png
      |_ metadata.jsonl
      ...

Both directories contain the same image files and metadata.jsonl files, but the images in test_data2 have the split names prepended to the image names (i.e. train_1012.png, val_234.png), while the images in test_data1 do not (i.e. 1012.png, 234.png).

I saw in another issue that val is not recognized as a split name, but here I would expect the files to take their split from the parent directory name, i.e. files under validation should become part of the validation split.

Steps to reproduce the bug

import datasets
datasets.logging.set_verbosity_error()
from datasets import load_dataset, get_dataset_split_names


# the following correctly finds the train, test and validation splits
path = "./test_data1"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)


# the following finds only the train and test splits
path = "./test_data2"
print("######################", get_dataset_split_names(path), "######################")

dataset_list = []
for spt in ["train", "test", "validation"]:
    dataset = load_dataset(path, split=spt)
    dataset_list.append(dataset)

Expected results

###################### ['train', 'test', 'validation'] ######################
###################### ['train', 'test', 'validation'] ######################

Actual results

Traceback (most recent call last):
  File "test_data_loader.py", line 11, in <module>

    dataset = load_dataset(path, split=spt)
  File "/home/venv/lib/python3.8/site-packages/datasets/load.py", line 1758, in load_dataset
    ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 893, in as_dataset
    datasets = map_nested(
  File "/home/venv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 385, in map_nested
    return function(data_struct)
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 924, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/venv/lib/python3.8/site-packages/datasets/builder.py", line 993, in _as_dataset
    dataset_kwargs = ArrowReader(self._cache_dir, self.info).read(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 211, in read
    files = self.get_file_instructions(name, instructions, split_infos)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 184, in get_file_instructions
    file_instructions = make_file_instructions(
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 107, in make_file_instructions
    absolute_instructions = instruction.to_absolute(name2len)
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in to_absolute
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 616, in <listcomp>
    return [_rel_to_abs_instr(rel_instr, name2len) for rel_instr in self._relative_instructions]
  File "/home/venv/lib/python3.8/site-packages/datasets/arrow_reader.py", line 433, in _rel_to_abs_instr
    raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
ValueError: Unknown split "validation". Should be one of ['train', 'test'].

Environment info

  • datasets version:
  • Platform: Linux Ubuntu 18.04
  • Python version: 3.8.12
  • PyArrow version: 9.0.0

Data files

test_data1.zip test_data2.zip

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (8 by maintainers)

Most upvoted comments

@polinaeterna I have solved the issue. The solution was to call: load_dataset("csv", data_files={split: files}, split=split)
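
For the image data in this issue, a minimal sketch of the same idea, assuming the imagefolder packaged builder and the test_data2 layout above (the glob pattern and paths are illustrative):

import glob

from datasets import load_dataset

# Workaround sketch: sidestep split inference by assigning data files
# to splits explicitly. "imagefolder" and the glob pattern below are
# assumptions for the directory layout above, not part of the comment.
dataset_dict = {}
for split in ["train", "test", "validation"]:
    files = sorted(glob.glob(f"./test_data2/{split}/*"))
    dataset_dict[split] = load_dataset(
        "imagefolder", data_files={split: files}, split=split
    )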

This code indeed behaves as expected on main. But suppose val_234.png is renamed to some other value that does not contain one of these keywords; in that case this issue becomes relevant again, because its real cause is the order in which we check the predefined split patterns when assigning data files to splits: first we assign data files based on filenames, and only if this fails, meaning not a single split is found (val is not recognized in the older versions of datasets, which results in an empty validation split), do we assign based on directory names.
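
A rough, self-contained illustration of that resolution order (the patterns below are simplified stand-ins for the library's actual split patterns, not its real API):

import fnmatch
import os

SPLITS = ("train", "test", "validation")

def match_by_filename(files):
    # First pass: keyword match on the file name only. "train_1012.png"
    # matches "train", but "val_234.png" matches no keyword, since "val"
    # is not in the list in the older versions of datasets discussed here.
    return {s: [f for f in files
                if fnmatch.fnmatch(os.path.basename(f), f"*{s}*")]
            for s in SPLITS}

def match_by_dirname(files):
    # Second pass: match the parent directory name instead, so files
    # under validation/ land in the validation split (posix-style paths).
    return {s: [f for f in files if f"/{s}/" in f"/{f}"] for s in SPLITS}

def resolve_splits(files):
    # Directory patterns are consulted only when *no* split matched on
    # filenames. For test_data2, train_*.png and test_*.png do match in
    # the first pass, so that pass wins and "validation" comes back empty.
    for match in (match_by_filename, match_by_dirname):
        splits = {s: fs for s, fs in match(files).items() if fs}
        if splits:
            return splits
    return {}

With this ordering, test_data1 resolves correctly because the filename pass matches nothing at all and the directory pass then finds all three splits, while in test_data2 the train_*/test_* filenames short-circuit that fallback.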

@polinaeterna @lhoestq Perhaps one way to fix this would be to swap the order of the patterns if data_dir is specified (or if load_dataset(data_dir) is called)?
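
In terms of the sketch above (reusing its match_by_filename and match_by_dirname helpers), that swap could look roughly like this; hypothetical, not the library's actual code:

def resolve_splits_for_local_data_dir(files):
    # Hypothetical variant of resolve_splits: when load_dataset points
    # at a local data_dir, try directory-name patterns first, so that
    # validation/ wins over the unrecognized "val" filename prefix.
    for match in (match_by_dirname, match_by_filename):
        splits = {s: fs for s, fs in match(files).items() if fs}
        if splits:
            return splits
    return {}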