datasets: [Audio] Path of Common Voice cannot be used for audio loading anymore
Describe the bug
Steps to reproduce the bug
from datasets import load_dataset
from torchaudio import load
ds = load_dataset("common_voice", "ab", split="train")
# both of the following commands fail at the moment
load(ds[0]["audio"]["path"])
load(ds[0]["path"])
Expected results
The path should be the complete absolute path to the downloaded audio file, not a relative path.
Actual results
~/hugging_face/venv_3.9/lib/python3.9/site-packages/torchaudio/backend/sox_io_backend.py in load(filepath, frame_offset, num_frames, normalize, channels_first, format)
150 filepath, frame_offset, num_frames, normalize, channels_first, format)
151 filepath = os.fspath(filepath)
--> 152 return torch.ops.torchaudio.sox_io_load_audio_file(
153 filepath, frame_offset, num_frames, normalize, channels_first, format)
154
RuntimeError: Error loading audio file: failed to open file cv-corpus-6.1-2020-12-11/ab/clips/common_voice_ab_19904194.mp3
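A quick way to confirm that the returned path is the problem, rather than the audio backend, is to inspect the path itself. The following is only a minimal sketch: `describe_audio_path` is a hypothetical helper (not part of `datasets` or `torchaudio`), and the path is copied from the error message above.

```python
import os


def describe_audio_path(path):
    """Report whether a path is absolute and whether a file exists there.

    Hypothetical helper for debugging; not part of datasets or torchaudio.
    """
    return {
        "path": path,
        "is_absolute": os.path.isabs(path),
        "exists": os.path.isfile(path),
    }


# The path from the error message is relative, so a loader can only open it
# if the current working directory happens to contain the extracted archive.
report = describe_audio_path(
    "cv-corpus-6.1-2020-12-11/ab/clips/common_voice_ab_19904194.mp3"
)
print(report)
```

In the reproduction above, the same check could be applied to `ds[0]["path"]` and `ds[0]["audio"]["path"]`.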
Environment info
- datasets version: 1.18.3.dev0
- Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.27
- Python version: 3.9.1
- PyArrow version: 3.0.0
About this issue
- State: closed
- Created 2 years ago
- Comments: 19 (13 by maintainers)
It appears downgrading to torchaudio 0.11.0 fixed this problem.
As of https://github.com/huggingface/datasets/pull/3736, the Common Voice dataset gives access to the local audio files as before.
Yes!
Yes, this might be, but I highly doubt that soundfile is the go-to library for audio then. @anton-l and I have tried out a bunch of different audio loading libraries (soundfile, librosa, torchaudio, pure ffmpeg, audioread, …). One thing that was pretty clear to me is that there is just no “de-facto standard” library, and they all have pros and cons. None of the libraries really supports “batch”-ed audio loading. Some depend on PyTorch. torchaudio is 100x faster (really!) than librosa's fallback on MP3. torchaudio often has problems with multi-processing, … Also, we should keep in mind that resampling is similarly not as simple as reading a text file. It’s a pretty complex signal processing transform, and people very well might want to use special filters, etc. At the moment we just hard-code torchaudio's or librosa's default filter when doing resampling.
=> All this to say that we should definitely care about whether we rely on local paths or bytes IMO. We don’t want to lose all the users who are forced to use datasets decoding or resampling, or who have to build a very unintuitive way of loading bytes into a numpy array. It’s much more intuitive to be able to inspect a local file. I feel pretty strongly about this and am happy to also jump on a call. Keeping libraries flexible and lean as well as exposing internals is very important IMO (this philosophy has worked quite well so far with Transformers).
Related to this discussion: in https://github.com/huggingface/datasets/pull/3664#issuecomment-1031866858 I propose how we could change iter_archive to work for streaming and also return local paths (as it used to!). I’d love your opinions on this.
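To make the "bytes versus local path" point concrete, here is a rough sketch of what decoding in-memory audio bytes into a numpy array can look like. It is an illustration under stated assumptions, not how `datasets` decodes audio: it uses the standard-library `wave` module, so it only handles 16-bit mono WAV (not the MP3 files in Common Voice), and `make_wav_bytes` and `decode_wav_bytes` are hypothetical helper names.

```python
import io
import wave

import numpy as np


def make_wav_bytes(samples, sampling_rate=16_000):
    """Encode int16 mono samples as an in-memory WAV file.

    Stand-in for audio bytes obtained without a local file (hypothetical helper).
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(np.asarray(samples, dtype="<i2").tobytes())
    return buffer.getvalue()


def decode_wav_bytes(data):
    """Decode 16-bit mono WAV bytes into a float32 array and its sampling rate."""
    with wave.open(io.BytesIO(data), "rb") as wav_file:
        if wav_file.getsampwidth() != 2 or wav_file.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono WAV")
        sampling_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())
    # Interpret the raw little-endian int16 frames and scale to [-1, 1).
    array = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0
    return array, sampling_rate


wav_bytes = make_wav_bytes([0, 16384, -16384, 32767])
array, sampling_rate = decode_wav_bytes(wav_bytes)
print(array.shape, sampling_rate)
```

With a local path, the equivalent step is a single call to whichever loader the user prefers, which is the flexibility argued for above.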