datasets: [Audio] Path of Common Voice cannot be used for audio loading anymore

Describe the bug

Steps to reproduce the bug

from datasets import load_dataset
from torchaudio import load

ds = load_dataset("common_voice", "ab", split="train")

# both of the following commands fail at the moment
load(ds[0]["audio"]["path"])
load(ds[0]["path"])

Expected results

The path should be the complete absolute path to the downloaded audio file, not a relative path.

Actual results

~/hugging_face/venv_3.9/lib/python3.9/site-packages/torchaudio/backend/sox_io_backend.py in load(filepath, frame_offset, num_frames, normalize, channels_first, format)
    150                 filepath, frame_offset, num_frames, normalize, channels_first, format)
    151         filepath = os.fspath(filepath)
--> 152     return torch.ops.torchaudio.sox_io_load_audio_file(
    153         filepath, frame_offset, num_frames, normalize, channels_first, format)
    154

RuntimeError: Error loading audio file: failed to open file cv-corpus-6.1-2020-12-11/ab/clips/common_voice_ab_19904194.mp3
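The path in the error above is relative to the extracted archive, so the loader cannot find it from the current working directory. As a minimal sketch of a guard on the user side, the helper below joins a relative path to a known extraction directory and fails early if the file is missing. The function name `resolve_audio_path` and the `extracted_root` argument are hypothetical; the issue does not say where the archive is extracted, so that directory has to be supplied by the caller.

```python
import os


def resolve_audio_path(path, extracted_root):
    """Return an absolute path for `path`.

    `extracted_root` is an assumption: the directory the archive was
    extracted into. It is only used when `path` is relative.
    """
    if not os.path.isabs(path):
        path = os.path.join(extracted_root, path)
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        # Fail with the full path so the problem is visible before decoding.
        raise FileNotFoundError(path)
    return path
```

With such a guard, a relative path like the one in the traceback either resolves to a real file or raises a clear `FileNotFoundError` before the audio backend is called.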

Environment info

  • datasets version: 1.18.3.dev0
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.27
  • Python version: 3.9.1
  • PyArrow version: 3.0.0

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

Despite the comments that this has been fixed, I am finding that the exact same problem is occurring again (with datasets version 2.3.2).

It appears downgrading to torchaudio 0.11.0 fixed this problem.

As of https://github.com/huggingface/datasets/pull/3736, the Common Voice dataset gives access to the local audio files again, as before.

@patrickvonplaten

The other solution of providing a path-like object derived from the bytes stored in the .arrow file is not nearly as user-friendly, but better than nothing.

Just to clarify, here you describe the approach that uses the Audio.decode attribute to access the underlying bytes?

Yes!

The official example currently doesn’t work, and we don’t even have a workaround for MP3 files at the moment.

I’d assume this is because we use sox_io as a backend for decoding. However, soon we should be able to use soundfile, which supports path-like objects, for MP3 (#3667 (comment)). Your concern is reasonable, but there are situations where we can only serve bytes (see #3685 for instance). IMO it makes sense to fix the affected datasets for now, but I don’t think we should care too much whether we rely on local paths or bytes after soundfile adds support for MP3 as long as our examples work (shouldn’t be too hard to update the map_to_array functions) and we properly document how to access the underlying path/bytes for custom decoding (via ds.cast_column("audio", Audio(decode=False))).
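For custom decoding after `ds.cast_column("audio", Audio(decode=False))`, one possible workaround for path-only loaders is to write the served bytes to a temporary file. The sketch below assumes each undecoded record is a dict with a `"path"` entry and a `"bytes"` entry, either of which may be `None`; the helper name `materialize_audio` is hypothetical and not part of datasets.

```python
import os
import tempfile


def materialize_audio(record, default_suffix=".mp3"):
    """Return a local file path for an undecoded audio record.

    Assumed record shape: {"path": str or None, "bytes": bytes or None}.
    """
    path = record.get("path")
    if path and os.path.isfile(path):
        # The path already points at a real local file, so use it directly.
        return path
    data = record.get("bytes")
    if data is None:
        raise ValueError("record has neither a readable path nor bytes")
    # Keep the original extension so format detection by suffix still works.
    suffix = os.path.splitext(path)[1] if path else ""
    with tempfile.NamedTemporaryFile(suffix=suffix or default_suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

The returned path can then be handed to any loader that only accepts file names. The caller is responsible for deleting the temporary file afterwards.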

Yes, this might be, but I highly doubt that soundfile is the go-to library for audio then. @anton-l and I have tried out a bunch of different audio loading libraries (soundfile, librosa, torchaudio, pure ffmpeg, audioread, …). One thing that was pretty clear to me is that there is just no “de-facto standard” library and they all have pros and cons. None of the libraries really supports “batch”-ed audio loading. Some depend on PyTorch. torchaudio is 100x faster (really!) than librosa's fallback on MP3. torchaudio often has problems with multi-processing, … Also, we should keep in mind that resampling is not as simple as reading a text file. It’s a pretty complex signal processing transform, and people very well might want to use special filters, etc. At the moment we just hard-code torchaudio's or librosa's default filter when doing resampling.

=> All this to say that we should definitely care about whether we rely on local paths or bytes IMO. We don’t want to lose all the users who are forced to use datasets decoding or resampling, or who have to build a very unintuitive way of loading bytes into a numpy array. It’s much more intuitive to be able to inspect a local file. I feel pretty strongly about this and am happy to also jump on a call. Keeping libraries flexible and lean as well as exposing internals is very important IMO (this philosophy has worked quite well so far with Transformers).
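To illustrate what loading bytes into an array involves when no local file is available, here is a minimal sketch using only the Python standard library. It handles 16-bit PCM WAV bytes only, since the standard library cannot decode MP3, and it returns a plain `array.array` rather than a numpy array. The function name `wav_bytes_to_samples` is hypothetical.

```python
import array
import io
import sys
import wave


def wav_bytes_to_samples(data):
    """Decode 16-bit PCM WAV bytes into (samples, sample_rate).

    For multi-channel audio the returned samples are interleaved.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        frames = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        # WAV data is little-endian, so swap on big-endian machines.
        samples.byteswap()
    return samples, rate
```

Even this restricted case needs an in-memory file wrapper, a sample-width check, and byte-order handling, which is the kind of overhead the comment above argues users should not have to build themselves.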

Related to this discussion: in https://github.com/huggingface/datasets/pull/3664#issuecomment-1031866858 I propose how we could change iter_archive to work for streaming and also return local paths (as it used to!). I’d love your opinions on this.