audio: MelSpectrogram inconsistency with librosa melspectrogram

Hello! I am very excited about this framework and its ability to run transformations on the GPU.

Problem: the output of transforms.Spectrogram with power=1. (which is real-valued) equals the absolute value of librosa.stft (which is complex-valued) when called with the same parameters.

Here are the spectrograms for my example audio (the results are really close): [screenshots of the torchaudio and librosa spectrograms]

The next step is to get a mel spectrogram, using transforms.MelScale (applied to the power=1. Spectrogram) and librosa.feature.melspectrogram (passing the previous spectrogram as S; the power argument is effectively 1. because it is not used when S is given). And here we cannot get the same result:

  • in both implementations this step is just a matmul
  • in transforms.MelScale real-valued tensors are multiplied, while librosa.feature.melspectrogram multiplies matrices derived from the complex STFT, so the results can end up completely different
  • the use of power in transforms.Spectrogram is also quite misleading (it is not needed for librosa.stft)

And the result differs not only in individual values but in scale too: [screenshots of the two mel spectrograms]
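A minimal sketch of the comparison (parameter values and the file name are placeholders for my setup, not the exact values I used):

# Minimal sketch of the comparison (placeholder parameters and file name)
import numpy as np
import librosa
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")   # (channels, samples), mono assumed
n_fft, hop_len, n_mels = 1024, 256, 64

# torchaudio: magnitude spectrogram (power=1.) followed by MelScale
spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop_len, power=1.)(waveform)
mel_torchaudio = torchaudio.transforms.MelScale(n_mels=n_mels, sample_rate=sample_rate)(spec)

# librosa: mel spectrogram computed from the same magnitude spectrogram
# (power is unused when S is given)
spec_np = np.abs(librosa.stft(waveform[0].numpy(), n_fft=n_fft, hop_length=hop_len))
mel_librosa = librosa.feature.melspectrogram(S=spec_np, sr=sample_rate, n_mels=n_mels)

print(np.abs(mel_torchaudio[0].numpy() - mel_librosa).max())  # large difference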

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 17


Most upvoted comments

Okay, I did further research and could reproduce librosa’s melspectrogram with torchaudio. The parameters added in #1212 helped.

Numerical compatibility

torchaudio_spec = torchaudio.transforms.Spectrogram(
    n_fft=n_fft,
    win_length=win_len,
    hop_length=hop_len,
    center=True,
    pad_mode="reflect",
    power=2.0,
)(waveform)
librosa_spec, _ = librosa.core.spectrum._spectrogram(
    waveform.numpy(),
    n_fft=n_fft,
    hop_length=hop_len,
    win_length=win_len,
    center=True,
    pad_mode="reflect",
    power=2.0,
)

spec

MSE: 5.792542556726232e-10
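(For reference, the MSE values here and below can be computed along these lines; a minimal sketch assuming both outputs have the same shape, e.g. a 1-D mono waveform was used:)

# Sketch of the MSE computation for the comparisons (assumes matching shapes)
import torch

mse = torch.mean((torchaudio_spec - torch.from_numpy(librosa_spec)) ** 2)
print(f"MSE: {mse.item()}")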

torchaudio_mel = torchaudio.functional.create_fb_matrix(
    int(n_fft // 2 + 1),
    n_mels=n_mels,
    f_min=0.,
    f_max=sample_rate/2.,
    sample_rate=sample_rate,
    norm='slaney'
)

librosa_mel = librosa.filters.mel(
    sample_rate,
    n_fft,
    n_mels=n_mels,
    fmin=0.,
    fmax=sample_rate/2.,
    norm='slaney',
    htk=True,
).T

mel_bins

MSE: 3.6859009276685303e-16

torchaudio_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    win_length=win_len,
    hop_length=hop_len,
    center=True,
    pad_mode="reflect",
    power=2.0,
    norm='slaney',
    onesided=True,
    n_mels=n_mels,
)(waveform)
librosa_melspec = librosa.feature.melspectrogram(
    waveform.numpy(),
    sr=sample_rate,
    n_fft=n_fft,
    hop_length=hop_len,
    win_length=win_len,
    center=True,
    pad_mode="reflect",
    power=2.0,
    n_mels=n_mels,
    norm='slaney',
    htk=True,
)

mel_spec

MSE: 3.748331423025775e-09

Call-stacks

@eldrin @SolomidHero

I have merged #1212, so we can now pass slaney normalization as a parameter to the MelSpectrogram transform. I will keep looking into a way to add the other filter bank option and reach numerical parity with librosa.

Hi @mthrok @SolomidHero

sorry for the late response.

Does adding an option to change the normalization method for MelSpectrogram completely solve the issue discussed here?

I am afraid not. I dug into the issue a bit and found that the discrepancy comes both from the slaney normalization and from the mel filter bank itself.

The F.create_fb_matrix function generates a mel filterbank kernel based on the same formula as the Wikipedia entry, while librosa offers two implementations (htk and Slaney's Auditory Toolbox). On top of that there is a separate normalization option: slaney or None.

From the equations and the outputs, torchaudio implements [htk + None] and [htk + slaney], while librosa provides the full combination of {htk, auditory toolbox} x {None, slaney}. The default setup for each package so far is torchaudio: [htk + None] and librosa: [auditory toolbox + slaney].
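For reference, the two Hz-to-mel mappings differ as follows (standard formulas written out as a sketch, not copied from either library's source):

# Sketch: the two standard Hz-to-mel conversions (illustrative)
import numpy as np

def hz_to_mel_htk(f):
    # HTK formula: used by torchaudio's create_fb_matrix and by librosa with htk=True
    return 2595.0 * np.log10(1.0 + np.asanyarray(f, dtype=float) / 700.0)

def hz_to_mel_slaney(f):
    # Slaney / Auditory Toolbox formula: librosa's default (htk=False),
    # linear below 1000 Hz, logarithmic above
    f = np.asanyarray(f, dtype=float)
    f_sp = 200.0 / 3.0                    # Hz per mel in the linear region
    min_log_hz = 1000.0                   # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp       # mel value at 1000 Hz (= 15)
    logstep = np.log(6.4) / 27.0          # step size in the logarithmic region
    log_part = min_log_mel + np.log(np.maximum(f, min_log_hz) / min_log_hz) / logstep
    return np.where(f >= min_log_hz, log_part, f / f_sp)

print(hz_to_mel_htk(4000.0), hz_to_mel_slaney(4000.0))  # the two scales diverge above 1 kHz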


I plugged the default librosa filterbank into a custom MelSpectrogram transform, and the result complies with the librosa output. I measured this by computing the mean squared error of the mel-spectrograms produced by the torchaudio transform (using different filter banks) against the one computed by librosa as the reference.

Mean Squared Error[fb:None]: 275233.8125
Mean Squared Error[fb:slaney]: 3064.0825
Mean Squared Error[fb:librosa_auditory_toolbox+slaney]: 0.0000


The result shows that once exactly the same kernel and normalization are used, the output matches the librosa output!

So, ultimately, an implementation of the Auditory-Toolbox-based mel filterbank is needed for complete parity.

Here’s the code snippet for this test
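The snippet is not reproduced here, but the approach was roughly the following (a sketch with placeholder parameters, not the exact code from the test):

# Sketch of the test above: swap librosa's default filterbank into the torchaudio pipeline
import torch
import torchaudio
import librosa

sample_rate, n_fft, hop_len, n_mels = 22050, 1024, 256, 64   # placeholder values

spectrogram = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop_len, power=2.0)

# librosa's default filterbank: Auditory Toolbox mel scale + slaney normalization
librosa_fb = torch.from_numpy(
    librosa.filters.mel(sample_rate, n_fft, n_mels=n_mels, htk=False, norm="slaney")
).float()                                   # shape: (n_mels, n_fft // 2 + 1)

def melspectrogram_with_librosa_fb(waveform):
    spec = spectrogram(waveform)            # (..., n_fft // 2 + 1, time)
    return torch.matmul(librosa_fb, spec)   # (..., n_mels, time)

# this output matched librosa.feature.melspectrogram(..., power=2.0) up to numerical precision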

Hi @SolomidHero @eldrin

I looked into the difference in MelSpectrogram. As the documentation says, the code originates from some notebooks, and it seems there is no test that checks the output values of this transform. This makes it difficult to tell whether the current implementation is correct, or to define what the correct behavior here should be.

I looked at librosa, hoping that they have a test that validates the resulting values, but they do not have such a test either.

We could change the implementation to match librosa, but that would be a backward-incompatible change, so we really need a good reason to do so. (PyTorch core has recently been making a similar move towards NumPy API compatibility.)

Does adding an option to change the normalization method for MelSpectrogram completely solve the issue discussed here?

But another problem is the confusing use of the power argument.

@SolomidHero Can you elaborate on this one? I think I have a similar opinion on how power is applied to complex values in torchaudio. I am thinking about a way to improve it while I work on the migration to the native complex type.

To summarize:

  1. torchaudio.functional.create_fb_matrix(..., norm='slaney') is numerically compatible with librosa.filters.mel(..., htk=True, norm="slaney")
    • Subtask: make an htk option available to create_fb_matrix and the transforms that use this function.
  2. torchaudio.transforms.Spectrogram is numerically compatible with librosa.core.spectrum._spectrogram. gist
  3. Something in torchaudio.transforms.MelScale does not match librosa.

Hi! I also happened to have a similar issue with MFCC. After some digging, I found that librosa uses slaney normalization for the mel-filterbank creation by default, while torchaudio applies no normalization by default (and there is no option to pass a normalization setting when constructing MelScale/MelSpectrogram):

https://github.com/pytorch/audio/blob/fb3ef9ba427acd7db3084f988ab55169fab14854/torchaudio/transforms.py#L243-L245

After setting the normalization to slaney I got a reasonably comparable result.


It is still numerically not exactly the same, but much closer.

Right now my solution is to write custom MelScale and MelSpectrogram classes that inherit from the originals and add an explicit mel_norm argument, which is passed down to the filterbank creation 😃
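Roughly like this (a sketch; the class and argument names are just what I use locally, not from torchaudio):

# Sketch of the workaround: subclasses that pass an explicit mel_norm down to create_fb_matrix
import torchaudio
import torchaudio.functional as F


class NormedMelScale(torchaudio.transforms.MelScale):
    """MelScale with an explicit mel_norm for the filterbank."""

    def __init__(self, n_stft, mel_norm="slaney", **kwargs):
        super().__init__(n_stft=n_stft, **kwargs)
        # Rebuild the registered filterbank buffer with the requested normalization.
        self.fb = F.create_fb_matrix(
            n_stft, self.f_min, self.f_max, self.n_mels, self.sample_rate, norm=mel_norm
        )


class NormedMelSpectrogram(torchaudio.transforms.MelSpectrogram):
    """MelSpectrogram whose MelScale filterbank uses the requested normalization."""

    def __init__(self, *args, mel_norm="slaney", **kwargs):
        super().__init__(*args, **kwargs)
        # Swap in a MelScale whose filterbank uses the requested normalization.
        self.mel_scale = NormedMelScale(
            n_stft=self.n_fft // 2 + 1,
            mel_norm=mel_norm,
            n_mels=self.n_mels,
            sample_rate=self.sample_rate,
            f_min=self.f_min,
            f_max=self.f_max,
        )


# melspec = NormedMelSpectrogram(sample_rate=22050, n_fft=1024, n_mels=64)(waveform)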