datasets: Dataset librispeech_asr fails to load
## Describe the bug
The dataset librispeech_asr (standard Librispeech) fails to load.
## Steps to reproduce the bug

```python
datasets.load_dataset("librispeech_asr")
```
## Expected results
It should download and prepare the whole dataset (all subsets).
The docs say the dataset has two configurations (`clean` and `other`). However, the docs also say that not specifying a split should load the whole dataset, which is what I want. For this specific dataset, that is also the community standard: any publication reporting results on Librispeech uses the whole training data for training.
## Actual results
```
...
File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
  line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
  locals:
    archive_path = <not found>
    dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
    dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
    _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
    self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
    self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
    self.config.name = <local> 'default', len = 7
KeyError: 'default'
```
## Environment info

- `datasets` version: 2.1.0
- Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
- Python version: 3.9.9
- PyArrow version: 6.0.1
- Pandas version: 1.4.2
## About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (14 by maintainers)
Since this bug is still there and Google led me here when I was searching for a solution, I am writing down how to quickly fix it (as suggested by @mariosasko) for whoever else is not familiar with how the HF Hub works.

Download the `librispeech_asr.py` script and remove the unwanted splits both from the `_DL_URLS` dictionary and from the `_split_generators` function. Here I made an example with only the test sets. Then either save the script locally and load the dataset via the script path, or create a new dataset repo on the Hub named "librispeech_asr", upload the script there, and load it by repo name.
I think a user would expect that the default would give you the full dataset.
It does raise an error, but this error confused me because I did not understand why I needed a config, or why I could not simply download the whole dataset, which is what people usually do with Librispeech.
Fixed by https://github.com/huggingface/datasets/pull/4184
@patrickvonplaten This problem is a bit harder than it may seem, and it has to do with how our scripts are structured - `_split_generators` downloads data for a split before its definition. There was an attempt to fix this in https://github.com/huggingface/datasets/pull/2249, but it wasn't flexible enough. Luckily, I have a plan of attack, and this issue is on our short-term roadmap, so I'll work on it soon.

In the meantime, one can use streaming, or manually download a dataset script, remove the unwanted splits, and load the dataset via `load_dataset`.

If you need both `"clean"` and `"other"`, I think you'll have to concatenate them (see https://huggingface.co/docs/datasets/v2.1.0/en/process#concatenate).