audio: MelSpectrogram inconsistency with librosa melspectrogram
Hello! I am very excited about this framework and its ability to run transformations on the GPU.
Problem:
The output of `transforms.Spectrogram` with `power=1.` (which is real) equals the absolute value of `librosa.stft` (which is complex) when the same parameters are used.
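As a rough illustration of that relationship, the sketch below computes a naive one-sided DFT of a single windowed frame and derives the magnitude (`power=1`) and power (`power=2`) values from the complex bins. It uses only the standard library; `dft_frame` and the test frame are hypothetical and are not part of torchaudio or librosa.

```python
import cmath
import math


def dft_frame(frame):
    """Naive one-sided DFT of a real frame; returns complex bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


# Hypothetical test frame: a sinusoid centred on bin 2, Hann-windowed.
n = 16
frame = [
    math.sin(2 * math.pi * 2 * t / n) * 0.5 * (1 - math.cos(2 * math.pi * t / n))
    for t in range(n)
]

complex_bins = dft_frame(frame)               # analogous to one column of a complex STFT
magnitude = [abs(z) for z in complex_bins]    # real values, corresponds to power=1
power = [abs(z) ** 2 for z in complex_bins]   # real values, corresponds to power=2
```

Taking the absolute value discards the phase, which is why a real-valued magnitude spectrogram can agree with the modulus of a complex STFT but not with the complex STFT itself.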
Here are the spectrograms for my example audio (the results are really close):

The next step is to get a mel spectrogram using `transforms.MelScale` (on the `Spectrogram` output with `power=1`) and `librosa.feature.melspectrogram` (the power is effectively 1.; this argument is not used) (using the previous spectrogram). Here we can't get the same result:
- in both steps only a matmul takes place
- in `transforms.MelScale`, tensors with real values are multiplied, while `librosa.feature.melspectrogram` multiplies complex-valued matrices, so the results can have completely different values
- the use of `power` in `transforms.Spectrogram` is also quite misleading (it is not needed in `librosa.stft`)
And the result (it differs not only in some entries, but in scale too):
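A small sketch of why the two orders of operation disagree, assuming the complex STFT was projected through the filter bank before taking the modulus. The bin values and filter weights are made up for illustration; only the standard library is used.

```python
import cmath

# Hypothetical values: three STFT bins with equal magnitude but different phase.
bins = [cmath.rect(1.0, 0.0), cmath.rect(1.0, 2.0), cmath.rect(1.0, 4.0)]

# One row of a mel filter bank (non-negative triangular weights).
weights = [0.5, 1.0, 0.5]

# Projecting magnitudes: what a real-valued pipeline (power=1) computes.
mel_from_magnitude = sum(w * abs(z) for w, z in zip(weights, bins))

# Projecting complex bins first: phases partially cancel before the modulus is taken.
mel_from_complex = abs(sum(w * z for w, z in zip(weights, bins)))

print(mel_from_magnitude, mel_from_complex)
```

By the triangle inequality the complex projection can never exceed the magnitude projection, so a difference in overall scale is expected whenever the phases within a band are not aligned.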

About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17
Okay, I did further research and could reproduce librosa’s melspectrogram with torchaudio. The parameters added in #1212 helped.
Numerical compatibility
MSE: 5.792542556726232e-10
MSE: 3.6859009276685303e-16
MSE: 3.748331423025775e-09
Call-stacks
@eldrin @SolomidHero
I have merged #1212, so we can pass `slaney` normalization as a parameter to the `MelSpectrogram` transform. I will keep looking at a way to add the other filter bank option and numerical parity to `librosa`.
Hi @mthrok @SolomidHero
sorry for the late response.
I am afraid not. I dug into the issue a bit and found that it comes from both the `slaney` normalization and the `mel filter bank`. The `F.create_fb_matrix` function generates a mel filterbank kernel based on the same formula as the Wikipedia entry, while librosa seems to offer 2 implementations (`htk` and `auditory toolbox`). There is also another option: the normalization scheme, either `slaney` or `None`.
From the equation and the outputs, torchaudio implements [`htk`+`None`] and [`htk`+`slaney`], while librosa provides the full combination of {`htk`, `auditory toolbox`} x {`None`, `slaney`}. The default setups so far are `torchaudio`: [`htk`+`None`] and `librosa`: [`auditory toolbox`+`slaney`].
I plugged the default librosa filterbank into the custom `MelSpectrogram` transform, and the result complies with the librosa output. I measured this by computing the mean squared error of the mel-spectrograms computed from the torchaudio object (using different filter banks) against the one computed from `librosa` as the reference.
The result shows that once exactly the same kernel and normalization are used, the output matches librosa!
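To make the difference between the two mel scales concrete, here is a minimal standard-library sketch of the HTK formula and the Auditory Toolbox (Slaney) formula, together with the band edges each one produces and a simple MSE between them. The helper names (`band_edges`, `mse`) and the choice of 8 bands up to 8000 Hz are assumptions for illustration; this is not the code used for the measurements above.

```python
import math


def hz_to_mel_htk(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz_htk(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Slaney / Auditory Toolbox scale: linear below 1 kHz, logarithmic above.
F_SP = 200.0 / 3.0
MIN_LOG_HZ = 1000.0
MIN_LOG_MEL = MIN_LOG_HZ / F_SP
LOGSTEP = math.log(6.4) / 27.0


def hz_to_mel_slaney(f):
    if f < MIN_LOG_HZ:
        return f / F_SP
    return MIN_LOG_MEL + math.log(f / MIN_LOG_HZ) / LOGSTEP


def mel_to_hz_slaney(m):
    if m < MIN_LOG_MEL:
        return m * F_SP
    return MIN_LOG_HZ * math.exp(LOGSTEP * (m - MIN_LOG_MEL))


def band_edges(f_min, f_max, n_mels, to_mel, to_hz):
    """Frequencies (Hz) of n_mels + 2 points equally spaced on the given mel scale."""
    lo, hi = to_mel(f_min), to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [to_hz(lo + i * step) for i in range(n_mels + 2)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


htk_edges = band_edges(0.0, 8000.0, 8, hz_to_mel_htk, mel_to_hz_htk)
slaney_edges = band_edges(0.0, 8000.0, 8, hz_to_mel_slaney, mel_to_hz_slaney)

print(mse(htk_edges, htk_edges))      # identical scales agree exactly
print(mse(htk_edges, slaney_edges))   # different scales place the bands differently
```

Because the triangular filters are built on these edges, two filter banks that differ only in the mel scale already weight different FFT bins, independently of the normalization choice.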
So eventually an implementation of the `auditory toolbox` based mel-filterbank is needed for complete compliance.
Here's the code snippet for this test
Hi @SolomidHero @eldrin
I looked into the difference of MelSpectrogram. As the documentation says, the origin of the code is from some notebooks, and it seems that there is no test to check the output values of this transform. This makes it difficult to tell if the current implementation is correct or define what is the correct behavior here.
I looked at `librosa`, hoping that they have some test to validate the resulting signal, but they do not have such a test either.
We could change the implementation to match `librosa`, but that would break backward compatibility, so we really need a good reason to do so. (Recently PyTorch core has been making a similar move toward NumPy API compatibility.)
Would adding an option to change the normalization method for `MelSpectrogram` completely solve the issue discussed here?
@SolomidHero Can you elaborate on this one? I think I have a similar opinion on the `power` used with complex values in torchaudio. I am thinking of a way to improve it while I work on the migration to the native complex type.
To summarize:
- `torchaudio.functional.create_fb_matrix(..., norm='slaney')` is numerically compatible with `librosa.filters.mel(..., htk=True, norm="slaney")`
- `htk` option available to `create_fb_matrix` and Transforms that use this function.
- `torchaudio.transforms.Spectrogram` is numerically compatible with `librosa.core.spectrum._spectrogram`. (gist)
- `torchaudio.transforms.MelScale` is not matching with librosa.
Hi! I also happened to have a similar issue with MFCC. After some digging, I found that librosa uses `slaney` normalization for the mel-filterbank creation by default, while torchaudio uses no normalization by default (and there is no option to pass a normalization setup at `MelSpec`/`MelSpectrogram` construction).
https://github.com/pytorch/audio/blob/fb3ef9ba427acd7db3084f988ab55169fab14854/torchaudio/transforms.py#L243-L245
After specifying the normalization to `slaney` I got a reasonably comparable result.
It is still numerically not exactly the same, but much closer.
Right now my solution is to write custom `MelSpec` and `MelSpectrogram` classes that inherit from the originals and add an explicit `mel_norm` argument to pass down to the filterbank creation 😃
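For readers wondering what the `slaney` normalization changes numerically, the sketch below builds two triangular filters and scales each by 2 / bandwidth, which is the area normalization both libraries apply under that name. The helper functions, bin spacing, and band edges are hypothetical and use only the standard library; this is not the custom `MelSpec` subclass described above.

```python
def triangular_filter(freqs, left, center, right):
    """Triangular weights rising from `left` to `center`, then falling to `right`."""
    weights = []
    for f in freqs:
        up = (f - left) / (center - left)
        down = (right - f) / (right - center)
        weights.append(max(0.0, min(up, down)))
    return weights


def slaney_normalize(weights, left, right):
    """Scale a filter by 2 / bandwidth so every band has roughly unit area."""
    scale = 2.0 / (right - left)
    return [w * scale for w in weights]


def area(weights, bin_width):
    return sum(weights) * bin_width


# Hypothetical FFT bin frequencies (10 Hz spacing) and two band definitions.
BIN_WIDTH = 10.0
freqs = [BIN_WIDTH * k for k in range(201)]   # 0 .. 2000 Hz
narrow = (100.0, 200.0, 300.0)
wide = (800.0, 1200.0, 1600.0)

raw_narrow = triangular_filter(freqs, *narrow)
raw_wide = triangular_filter(freqs, *wide)
norm_narrow = slaney_normalize(raw_narrow, narrow[0], narrow[2])
norm_wide = slaney_normalize(raw_wide, wide[0], wide[2])

print(area(raw_narrow, BIN_WIDTH), area(raw_wide, BIN_WIDTH))     # areas grow with bandwidth
print(area(norm_narrow, BIN_WIDTH), area(norm_wide, BIN_WIDTH))   # areas are equalized
```

Without the normalization every filter peaks at 1, so wider high-frequency bands accumulate more energy; with it the output of each band is rescaled, which is consistent with the scale mismatch reported at the top of this issue.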