FreeVC: Error with 24khz data utils
So when trying to resume one of the existing 24khz models I am getting the following error.
  File "Z:\FreeVCtrain24khz\train_24.py", line 65, in run
    train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps)
  File "Z:\FreeVCtrain24khz\data_utils_24.py", line 34, in __init__
    self._filter()
  File "Z:\FreeVCtrain24khz\data_utils_24.py", line 46, in _filter
    lengths.append(os.path.getsize(audiopath[0]) // (2 * self.hop_length))
  File "C:\Users\steven\AppData\Local\Programs\Python\Python37\lib\genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'DUMMY\\p337\\p337_014.wav'
I noticed this line is missing in the 24khz data utils:
audiopath = audiopath[0].replace("\\","/").replace("DUMMY", "dataset/vctk-16k")
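A minimal sketch of the same idea as a standalone helper, in case it helps when patching data_utils_24.py. The function name `resolve_audiopath` and the root `dataset/vctk-24k` are assumptions for illustration; only the `DUMMY` placeholder and the backslash replacement come from the line above.

```python
from pathlib import PurePosixPath


def resolve_audiopath(listed_path, dataset_root, placeholder="DUMMY"):
    """Turn a filelist entry that starts with the placeholder into a real path."""
    # Normalise Windows separators so the result works on any platform.
    normalised = listed_path.replace("\\", "/")
    parts = PurePosixPath(normalised).parts
    # Only swap the placeholder when it is the leading path component.
    if parts and parts[0] == placeholder:
        return str(PurePosixPath(dataset_root, *parts[1:]))
    return normalised


# "dataset/vctk-24k" is a hypothetical folder name; use wherever your wavs live.
print(resolve_audiopath("DUMMY\\p337\\p337_014.wav", "dataset/vctk-24k"))
```

Checking only the leading component avoids rewriting a speaker or file name that happens to contain the placeholder text.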
About this issue
- State: closed
- Created a year ago
- Comments: 27 (14 by maintainers)
@OlaWod Alright, I have figured out the problem. The original preprocess is incorrect for both the 24khz and the 16khz versions. The problem is this line here:
wav, sr = librosa.load(wav_path)
According to the librosa documentation, librosa.load loads the file with a default sampling rate of 22050 Hz (22khz). So the audio is loaded at only 22khz in both the 16khz and the 24khz preprocess. This is theoretically fine when doing the 16khz preprocess, but it is obviously a problem when trying to save at 24khz. I noticed this after I had already run the preprocess and then changed how the wav file is loaded, because the output of the downsampled files was blank above 10khz. So the 24khz files were sampled at 24khz, but only had the audio information of a 22khz file. See the image below of two 24khz files. The top file was loaded with
wav, sr = librosa.load(wav_path)
The bottom file was loaded with
wav, sr = librosa.load(wav_path, sr=None)
Here you can see that the files loaded at 22khz are longer than the one loaded at 48khz. This length variance is what causes the tensor size errors. Also look closely at the circled section: the trimming on the 16khz is stretching and mashing words together. You no longer have clean breaks on formants between sounds. Everything gets blurred together, and this hurts the intelligibility of the model. This is even more apparent when we compare to the original file below.
The phrase is “please call stella”. What we see here is that the trim has removed the beginning of the word please: it has removed the P. This explains why the model has a hard time with S and P pronunciations, as the trim in the preprocess removes these sounds. If you listen to the 2 files below, you can clearly hear that the trimmed file has a very weak P sound.
https://drive.google.com/file/d/1r-pra0feL3aWpWUQDWxGJ7QNf9f4xxj3/view?usp=share_link
https://drive.google.com/file/d/1q4m_to27nYUqpu9EOWpi7Ol5UwsqF2PX/view?usp=sharing
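To illustrate how a threshold-based trim can eat a quiet onset like the P in "please", here is a toy sample-wise trim in plain Python. It is not librosa's implementation (librosa.effects.trim works on frame energies and takes a `top_db` argument); the function name `trim_silence` and the signal values are made up for the example.

```python
def trim_silence(samples, top_db):
    """Drop leading/trailing samples more than top_db below the peak (toy version)."""
    if not samples:
        return samples
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return samples
    # Convert the dB margin to a linear amplitude threshold.
    threshold = peak * 10 ** (-top_db / 20)
    keep = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    return samples[keep[0]: keep[-1] + 1]


# Silence, a quiet consonant-like onset, a loud vowel-like body, then silence.
signal = [0.0] * 10 + [0.05] * 10 + [1.0] * 10 + [0.0] * 10

aggressive = trim_silence(signal, top_db=20)  # threshold 0.1: the onset is cut
gentle = trim_silence(signal, top_db=40)      # threshold 0.01: the onset survives
print(len(aggressive), len(gentle))
```

With the smaller margin the quiet onset falls under the threshold and is discarded together with the silence, which is the kind of loss described above. A larger margin keeps it.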
So the trim function needs to be made less aggressive or left out completely, as it is removing key sounds from words that the model needs in order to learn properly. The preprocess scripts for both 16khz and 24khz also need to be adjusted to use
wav, sr = librosa.load(wav_path, sr=None)
so that the files load with the correct sampling rate. After rerunning the preprocess for both 16khz and 24khz with sr=None, the 24khz training now works. Of course, when resuming the original model the mel_loss is over 30, as the new files actually contain audio data above 10khz and the existing model was trained on files with the blank section. OlaWod, you may want to train a new 24khz model.
Make sure that before you run this you remove all spec.pt files from your dataset
find . -name "*.spec.pt" -type f -delete
It works! Thanks a lot
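Since the traceback at the top of this issue comes from Windows, where `find` is usually not available, a rough Python equivalent of the command may be useful. This is only a sketch; the function name `delete_cached_specs` and the `dry_run` flag are my own additions.

```python
from pathlib import Path


def delete_cached_specs(dataset_root, pattern="*.spec.pt", dry_run=True):
    """Find cached spectrogram files under dataset_root and optionally delete them."""
    matches = sorted(p for p in Path(dataset_root).rglob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()  # remove the stale cache so it gets recomputed
    return matches
```

Calling it once with the default `dry_run=True` and printing the result shows what would be removed. Passing `dry_run=False` then deletes those files.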
If I set the hop_length to 480 manually:
It seems that the sizes almost match:
RuntimeError: The expanded size of the tensor (297) must match the existing size (298) at non-singleton dimension 1. Target sizes: [1024, 297]. Tensor sizes: [1024, 298]
self.hop_length seems to be 320 by default. Also, audio_norm is actually the 24khz version of the audio. Is it supposed to be like this? I’ve tried importing the 16khz versions and computing the spectrograms for them instead, but the error persists. Sorry, I’m not really smart enough for all this, especially this 16 to 24khz hack thing.
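For what it is worth, the arithmetic behind 320 versus 480 can be sketched like this. It assumes the goal is for features computed on the 16khz audio and spectrograms computed on the 24khz audio to have the same number of frames per second. The function names are hypothetical and not part of the repo.

```python
def frames_per_second(sample_rate, hop_length):
    """How many frames one second of audio produces at this hop length."""
    return sample_rate / hop_length


def matching_hop_length(ref_sample_rate, ref_hop_length, target_sample_rate):
    """Hop length at target_sample_rate that keeps the reference frame rate."""
    hop, remainder = divmod(ref_hop_length * target_sample_rate, ref_sample_rate)
    if remainder:
        raise ValueError("no integer hop length gives the same frame rate")
    return hop


print(frames_per_second(16000, 320))            # 50.0 frames per second
print(frames_per_second(24000, 320))            # 75.0, does not line up with the 16khz side
print(matching_hop_length(16000, 320, 24000))   # 480
print(frames_per_second(24000, 480))            # 50.0, lines up again
```

Even with matching frame rates, a one-frame difference such as 297 versus 298 can remain if the two versions of a file do not cover exactly the same duration, which fits the length variance described earlier in this thread.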
I get that error regardless of what folder I use. Well, a similar one, with different sizes listed:
RuntimeError: The expanded size of the tensor (61920) must match the existing size (33824) at non-singleton dimension 1. Target sizes: [1, 61920]. Tensor sizes: [33824]
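One way to check whether the paired files in two folders actually cover the same duration is to read the WAV headers directly. The sketch below uses only the standard library `wave` module, so it handles plain PCM wav files only. The function names and the 10 ms tolerance are assumptions for illustration.

```python
import wave


def wav_info(path):
    """Return (sample_rate, num_frames, duration_seconds) read from a WAV header."""
    with wave.open(str(path), "rb") as wav_file:
        rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return rate, frames, frames / rate


def durations_match(path_a, path_b, tolerance=0.01):
    """True when both files have the same duration within tolerance seconds."""
    duration_a = wav_info(path_a)[2]
    duration_b = wav_info(path_b)[2]
    return abs(duration_a - duration_b) <= tolerance
```

Running `wav_info` on the same utterance in the 16khz and 24khz folders should report 16000 and 24000 as the sample rates with nearly identical durations. If the durations differ, the two files were probably loaded or trimmed differently during preprocessing.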