datasets: Dataset librispeech_asr fails to load

Describe the bug

The dataset librispeech_asr (standard Librispeech) fails to load.

Steps to reproduce the bug

datasets.load_dataset("librispeech_asr")

Expected results

It should download and prepare the whole dataset (all subsets).

In the docs, it says the dataset has two configurations (clean and other). However, the docs also say that not specifying a split should simply load the whole dataset, which is what I want.

Also, in the case of this specific dataset, loading everything is the community standard: any publication reporting results on Librispeech uses the whole train set for training.

Actual results

...
  File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
    line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
    locals:
      archive_path = <not found>
      dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
      dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
      _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
      self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
      self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
      self.config.name = <local> 'default', len = 7
KeyError: 'default'

Environment info

  • datasets version: 2.1.0
  • Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.9.9
  • PyArrow version: 6.0.1
  • Pandas version: 1.4.2

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (14 by maintainers)

Most upvoted comments

load_dataset("librispeech_asr", "clean", split="train.100") actually downloads the whole dataset and not just the 100-hour split, is this a bug?

Since this bug is still there and Google led me here when I was searching for a solution, I am writing down how to quickly work around it (as suggested by @mariosasko) for whoever else is not familiar with how the HF Hub works.

Download the librispeech_asr.py script and remove the unwanted splits from both the _DL_URLS dictionary and the _split_generators function. Below is an example keeping only the test sets.
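A hypothetical sketch of the trimmed script (the exact gen_kwargs depend on the version of librispeech_asr.py you downloaded; everything else in the original file stays unchanged):

import datasets

# Keep only the test archives; all other entries are removed from the original dict.
_DL_URLS = {
    "clean": {"test": "http://www.openslr.org/resources/12/test-clean.tar.gz"},
    "other": {"test": "http://www.openslr.org/resources/12/test-other.tar.gz"},
}

class LibrispeechASR(datasets.GeneratorBasedBuilder):
    ...  # BUILDER_CONFIGS, _info, and _generate_examples stay as in the original script

    def _split_generators(self, dl_manager):
        archive_path = dl_manager.download(_DL_URLS[self.config.name])
        # Return generators only for the splits kept in _DL_URLS above.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"files": dl_manager.iter_archive(archive_path["test"])},
            ),
        ]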

Then either save the script locally and load the dataset via

load_dataset("${local_path}/librispeech_asr.py")

or create a new dataset repo on the Hub named "librispeech_asr" and upload the script there, then you can just run

load_dataset("${hugging_face_username}/librispeech_asr")

Would it make sense to have clean as the default config?

I think a user would expect that the default would give you the full dataset.

Also, I think load_dataset("librispeech_asr") should have raised an error telling you that you need to specify a config

It does raise an error, but this error confused me because I did not understand why I needed a config, or why I could not simply download the whole dataset, which is what people usually do with Librispeech.
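For anyone hitting the same KeyError: passing one of the two configs explicitly makes the call go through, for example:

from datasets import load_dataset

# Either config name works; only the implicit "default" config fails with KeyError.
clean = load_dataset("librispeech_asr", "clean")
other = load_dataset("librispeech_asr", "other")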

@patrickvonplaten This problem is a bit harder than it may seem, and it has to do with how our scripts are structured - _split_generators downloads data for a split before its definition. There was an attempt to fix this in https://github.com/huggingface/datasets/pull/2249, but it wasn’t flexible enough. Luckily, I have a plan of attack, and this issue is on our short-term roadmap, so I’ll work on it soon.
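A simplified sketch of the pattern being described, assuming the usual script layout: the whole URL dict of a config is downloaded up front, before any per-split selection can happen, which is also why requesting a single split (as asked above) still fetches every archive of that config:

def _split_generators(self, dl_manager):
    # One call fetches every archive of the config (dev, test, train.100, ...),
    # so the requested split cannot narrow the download.
    archive_path = dl_manager.download(_DL_URLS[self.config.name])
    return [
        datasets.SplitGenerator(name=key, gen_kwargs={"archive": archive_path[key]})
        for key in _DL_URLS[self.config.name]  # simplified; the real script spells these out
    ]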

In the meantime, one can use streaming, or manually download the dataset script, remove the unwanted splits, and load the dataset via load_dataset.
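For the streaming route, a minimal sketch (examples are fetched on the fly instead of downloading the archives up front):

from datasets import load_dataset

# No upfront download; audio examples are streamed as you iterate.
stream = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
for example in stream:
    ...  # process one example at a time
    break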

If you need both "clean" and "other", I think you'll have to concatenate them as follows:

from datasets import concatenate_datasets, load_dataset

# concatenate_datasets expects Dataset objects, so request concrete splits;
# without `split`, load_dataset returns a DatasetDict, which cannot be concatenated.
# Split names follow the script's _DL_URLS keys ("other" has a single train.500 set).
other = load_dataset("librispeech_asr", "other", split="train.500")
clean = load_dataset("librispeech_asr", "clean", split="train.100+train.360")

librispeech = concatenate_datasets([other, clean])

See https://huggingface.co/docs/datasets/v2.1.0/en/process#concatenate