audio: MelSpectrogram inconsistency with librosa melspectrogram
Hello! I am very excited about this framework and its ability to run transformations on the GPU.
Problem:
The output of transforms.Spectrogram (with power 1., which is real) equals the absolute value of librosa.stft (which is complex) when the parameters are equal.
Here are the spectrograms for my example audio (really close results):
The next step is to get a mel spectrogram using transforms.MelScale (on a Spectrogram with power 1) and librosa.feature.melspectrogram (power is actually 1.; this argument is not in use) (using the previous spectrogram). And here we can’t get the same result:
- in both steps only a matmul takes place
- in transforms.MelScale, tensors with real values are multiplied, whereas librosa.feature.melspectrogram gives us a multiplication of complex-valued matrices, so the results can be completely different
- also, the use of power in transforms.Spectrogram is quite misleading (it is not needed in librosa.stft)
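To make the second bullet concrete, here is a minimal pure-Python sketch with made-up numbers. The filter bank fb and the STFT column are hypothetical, and this is not a description of how either library works internally. It only shows that taking magnitudes before the matmul and multiplying complex values first generally give different results:

```python
def matvec(fb, col):
    """Multiply a filter bank (list of rows) by one spectrogram column."""
    return [sum(w * x for w, x in zip(row, col)) for row in fb]

# Hypothetical 2-band filter bank over 3 frequency bins.
fb = [[0.5, 0.5, 0.0],
      [0.0, 0.5, 0.5]]

# Hypothetical complex STFT column (one frame, 3 bins).
stft_col = [1 + 0j, -1 + 0j, 0 + 2j]

# Path A: magnitude first (power = 1), then the mel projection.
mag_then_mel = matvec(fb, [abs(z) for z in stft_col])

# Path B: mel projection on complex values, magnitude afterwards.
mel_then_mag = [abs(z) for z in matvec(fb, stft_col)]

print(mag_then_mel)
print(mel_then_mag)
```

In path B the first two bins cancel each other because of their phase, which cannot happen once magnitudes have been taken.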
And the result (differs not only in some fields, but in scale too):
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17
Commits related to this issue
- Merge pull request #1058 from jlin27/master Add new directory for prototype tutorials — committed to mthrok/audio by deleted user 4 years ago
- Update (#1058) — committed to mpc001/audio by deleted user 2 years ago
Okay, I did further research and could reproduce librosa’s melspectrogram with torchaudio. The parameters added in #1212 helped.
Numerical compatibility
MSE: 5.792542556726232e-10
MSE: 3.6859009276685303e-16
MSE: 3.748331423025775e-09
Call-stacks
@eldrin @SolomidHero
I have merged #1212 so we can pass slaney normalization as a parameter to the MelSpectrogram transform. I will keep looking for a way to add the other filter bank option and numerical parity with librosa.
Hi @mthrok @SolomidHero
sorry for the late response.
I am afraid not. I dug into the issue a bit and found that it comes from both the slaney normalization and the mel filter bank.
The F.create_fb_matrix function generates a mel filterbank kernel based on the same formula as the Wikipedia entry, while librosa seems to offer two implementations (htk and auditory toolbox). And there is another option: the normalization scheme, either slaney or None.
From the equation and the outputs, torchaudio implements [htk + None] and [htk + slaney], while librosa provides the full combination of {htk, auditory toolbox} x {None, slaney}. The default setup for each package so far is torchaudio: [htk + None] and librosa: [auditory toolbox + slaney].
I plugged the default librosa filterbank into the custom MelSpectrogram transform, and the result complies with the librosa output. I measured this by computing the mean squared error of the mel-spectrograms computed from the torchaudio object (using different filter banks) against the one computed from librosa as the reference.
The result says that once exactly the same kernel and normalization are used, the result complies with the librosa output!
So eventually an implementation of the auditory toolbox based mel-filterbank is needed for complete compliance.
Here’s the code snippet for this test
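As a rough, dependency-free illustration of the comparison described above (it is not the snippet from this comment), the sketch below builds unnormalized triangular filter banks on the two mel scales and compares them by mean squared error. The helper names, the 201 bins, the 16 kHz sample rate and the 40 bands are assumptions chosen for the example. The slaney-style scale follows the commonly used Auditory Toolbox convention: linear below 1 kHz and logarithmic above.

```python
import math

def hz_to_mel_htk(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz_htk(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Auditory Toolbox (slaney) style scale: linear below 1 kHz, log above.
F_SP = 200.0 / 3.0                 # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0
MIN_LOG_MEL = MIN_LOG_HZ / F_SP    # about 15 mel
LOGSTEP = math.log(6.4) / 27.0

def hz_to_mel_slaney(f):
    if f < MIN_LOG_HZ:
        return f / F_SP
    return MIN_LOG_MEL + math.log(f / MIN_LOG_HZ) / LOGSTEP

def mel_to_hz_slaney(m):
    if m < MIN_LOG_MEL:
        return m * F_SP
    return MIN_LOG_HZ * math.exp(LOGSTEP * (m - MIN_LOG_MEL))

def mel_filterbank(n_freqs, sample_rate, n_mels, hz_to_mel, mel_to_hz):
    """Unnormalized triangular filters, equally spaced on the given mel scale."""
    nyquist = sample_rate / 2
    freqs = [k * nyquist / (n_freqs - 1) for k in range(n_freqs)]
    lo, hi = hz_to_mel(0.0), hz_to_mel(nyquist)
    step = (hi - lo) / (n_mels + 1)
    edges = [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]
    fb = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        fb.append([
            max(0.0, min((f - left) / (center - left),
                         (right - f) / (right - center)))
            for f in freqs
        ])
    return fb

def mse(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

fb_htk = mel_filterbank(201, 16000, 40, hz_to_mel_htk, mel_to_hz_htk)
fb_slaney = mel_filterbank(201, 16000, 40, hz_to_mel_slaney, mel_to_hz_slaney)

print("same scale:     ", mse(fb_htk, fb_htk))
print("htk vs. slaney: ", mse(fb_htk, fb_slaney))
```

Identical kernels give an error of zero, while the two scales place the band edges differently and so give a nonzero error before any normalization is applied.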
Hi @SolomidHero @eldrin
I looked into the difference in MelSpectrogram. As the documentation says, the code originates from some notebooks, and it seems that there is no test that checks the output values of this transform. This makes it difficult to tell whether the current implementation is correct, or to define what the correct behavior is here.
I looked at librosa, hoping that they have some test to validate the resulting signal, but they do not have such a test either.
We could change the implementation to match librosa, but that would break backward compatibility, so we really need a good reason to do so. (Recently PyTorch core has been making a similar move for NumPy API compatibility.)
Would adding an option to change the normalization method for MelSpectrogram completely solve the issue discussed here?
@SolomidHero Can you elaborate on this one? I think I have a similar opinion on the power used with complex values in torchaudio. I am thinking of a way to improve it while I work on the migration to the native complex type.
To summarize:
- torchaudio.functional.create_fb_matrix(..., norm='slaney') is numerically compatible with librosa.filters.mel(..., htk=True, norm="slaney")
- htk option available to create_fb_matrix and Transforms that use this function.
- torchaudio.transforms.Spectrogram is numerically compatible with librosa.core.spectrum._spectrogram.
- torchaudio.transforms.MelScale does not match librosa.
Hi! I also happened to have a similar issue with MFCC. After some digging, I found that librosa uses slaney normalization for the mel-filterbank creation by default, while torchaudio uses no normalization by default (and there is no option to pass a normalization setup at MelSpec / MelSpectrogram construction).
https://github.com/pytorch/audio/blob/fb3ef9ba427acd7db3084f988ab55169fab14854/torchaudio/transforms.py#L243-L245
After setting the normalization to slaney I got a reasonably comparable result. It is still not numerically identical, but much closer.
Right now my solution is to write custom MelSpec and MelSpectrogram classes that inherit from the originals and add an explicit mel_norm argument, which is passed down to the filterbank creation 😃
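For readers unfamiliar with what the normalization changes, here is a small pure-Python sketch of slaney-style area normalization as it is commonly defined: each triangular filter is scaled by 2 / (right edge - left edge) in Hz, so that every band has roughly unit area. The band edges and the 1 Hz frequency grid are made-up values for illustration, not taken from either library.

```python
def triangle(f, left, center, right):
    """Unnormalized triangular filter: 0 at the edges, 1 at the center."""
    up = (f - left) / (center - left)
    down = (right - f) / (right - center)
    return max(0.0, min(up, down))

# Hypothetical band edges in Hz (two overlapping filters) on a 1 Hz grid.
edges_hz = [0.0, 100.0, 300.0, 700.0]
freqs_hz = [float(f) for f in range(0, 801)]

raw_areas, normed_areas = [], []
for m in range(len(edges_hz) - 2):
    left, center, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
    raw = [triangle(f, left, center, right) for f in freqs_hz]
    scale = 2.0 / (right - left)          # slaney-style area normalization
    normed = [w * scale for w in raw]
    # With a 1 Hz grid, the sum approximates the area under each filter.
    raw_areas.append(sum(raw))
    normed_areas.append(sum(normed))

print("areas without normalization:", raw_areas)
print("areas with slaney scaling:  ", normed_areas)
```

Without the scaling, wider (higher-frequency) bands accumulate more energy than narrow ones, which is consistent with the scale difference reported at the top of this thread.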