TTS: RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]
Hey,
I’m trying to run training with Tacotron 1 using GST, and I get the error on the very first batch.
PyTorch version: 1.8 and 1.7.1 (both yielded the same error)
Python version: 3.8.0
```
Traceback (most recent call last):
  File "TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "TTS/bin/train_tacotron.py", line 619, in main
    train_avg_loss_dict, global_step = train(train_loader, model,
  File "TTS/bin/train_tacotron.py", line 168, in train
    decoder_output, postnet_output, alignments, stop_tokens = model(
  File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/big-boy/projects/TTS/TTS/tts/models/tacotron.py", line 173, in forward
    decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]
```
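The failing line is just the length mask being broadcast over the decoder output. A standalone sketch of the mismatch (tensor names are illustrative; shapes are taken from the traceback):

```python
import torch

decoder_outputs = torch.zeros(64, 80, 64)  # [batch, n_mels, T] as the decoder produced it
output_mask = torch.ones(64, 112)          # [batch, T'] built from the padded mel lengths

# expand_as can only broadcast singleton dimensions; 112 != 64 at dim 2,
# so this raises the RuntimeError above
decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
```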
My hyperparams:

```
// TRAINING
"batch_size": 64,
"eval_batch_size": 16,
"r": 4,
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
"mixed_precision": true,

// MULTI-SPEAKER and GST
"use_speaker_embedding": false,  // use speaker embedding to enable multi-speaker learning.
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "…/…/speakers-vctk-en.json",
"gst": {  // gst parameters if gst is enabled
    "gst_style_input": null,  // Condition the style input either on a
                              // -> wave file [path to wave] or
                              // -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'},
                              //    example {"0": 0.15, "1": 0.15, "5": -0.15},
                              //    with the dictionary being len(dict) <= len(gst_style_tokens).
    "gst_embedding_dim": 512,
    "gst_num_heads": 4,
    "gst_style_tokens": 10,
    "gst_use_speaker_embedding": false
},
```
About this issue
- State: closed
- Created 3 years ago
- Comments: 28 (26 by maintainers)
Ok, I figured it out. It’s a 🐛 😃
@erogol yes I did, same error with T1. T2 threw a librosa error; my dataset apparently contains stereo audio, will check back.

```
librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(19680, 2)
```
I’ve checked, and weirdly SOME of the wav files in my corpus are indeed stereo. Going to convert them to mono now and try again.

Training is running, thank you guys so much!
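In case it helps anyone else, a minimal sketch of the in-place downmix (the glob pattern is a placeholder for your corpus path; librosa and soundfile are assumed to be installed):

```python
import glob

import librosa
import soundfile as sf

for path in glob.glob("wavs/*.wav"):      # placeholder corpus path
    audio, sr = librosa.load(path, sr=None, mono=False)
    if audio.ndim == 2:                   # stereo: librosa returns [channels, samples]
        audio = librosa.to_mono(audio)    # average the channels
        sf.write(path, audio, sr)         # overwrite in place as mono
```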
I had to downgrade to librosa==0.6.3, though, because of this:

```
librosa.util.exceptions.ParameterError: Audio buffer is not Fortran-contiguous. Use numpy.asfortranarray to ensure Fortran contiguity.
```
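If you’d rather stay on a newer librosa, the message itself points at the workaround: wrap the offending buffer before passing it on. An illustrative sketch (the path and the slice are made up):

```python
import numpy as np
import librosa

audio, sr = librosa.load("sample.wav", sr=None)  # placeholder path
segment = audio[::2]                             # strided views are not contiguous
segment = np.asfortranarray(segment)             # copy into a Fortran-contiguous buffer
spec = librosa.stft(segment)                     # no longer raises ParameterError
```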
Did you `pip install -e .` again after the checkout?
@a-froghyar the dev branch should work now.
There is no “need”, but it just works better with GL (Griffin-Lim).
For debugging you can use the small dataset under the `tests` folder. There are also some sample configs for model testing that you can copy and paste for debugging.

@WeberJulian not yet, I’ll be on it later today and I’ll report back. Thanks again!
Set this to `"r": 7`. I think since you are using gradual training, it loads the first batch of data with r=4 but tries to train with r=7.
Edit: Just checked and got the same error
The error is not related to GST. Just set `"r": 7` in your config and it should work.
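To spell it out, the top-level `"r"` has to agree with the r value in the first `gradual_training` entry (`[0, 7, 64]`), so the config above becomes:

```
"r": 7,
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
```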