TTS: [Bug] [v0.9.0] "Kernel size" error when using model "tts_models/fr/mai/tacotron2-DDC"
Describe the bug
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
The command fails with: RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size. The full output is in the Logs section below.
To Reproduce
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
or
tts --text "chat" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
Expected behavior
The command synthesizes the text and writes speech.wav without raising an error.
Logs
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
> tts_models/fr/mai/tacotron2-DDC is already downloaded.
> vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
> Using model: Tacotron2
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Model's reduction rate `r` is set to: 1
> Vocoder Model: fullband_melgan
> Setting up Audio Processor...
| > sample_rate:24000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:0
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Generator Model: fullband_melgan_generator
> Discriminator Model: melgan_multiscale_discriminator
> Text: autobus
> Text splitted to sentences.
['autobus']
Traceback (most recent call last):
File "/home/iwater/miniconda3/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/bin/synthesize.py", line 357, in main
wav = synthesizer.tts(
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 279, in tts
outputs = synthesis(
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 207, in synthesis
outputs = run_model_torch(
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
outputs = _func(
File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
o = layer(o)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 307, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
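The error itself is PyTorch's Conv1d shape check inside the Tacotron2 encoder: the (padded) input sequence reaching the convolution is shorter than its kernel. A minimal sketch, outside of Coqui TTS, that reproduces the same RuntimeError; the 512 channels and kernel size 5 mirror Tacotron2's encoder convolutions, and the length-4 input stands in for a text that was reduced to too few input symbols:

import torch
from torch import nn

# Tacotron2's encoder uses 1-D convolutions with kernel size 5; if the input
# symbol sequence (after padding) is shorter than 5, the shape check in
# F.conv1d fails before any computation happens.
conv = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=5)
too_short = torch.randn(1, 512, 4)  # batch, channels, sequence length 4 < kernel size 5
conv(too_short)
# RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5).
# Kernel size can't be greater than actual input size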
Environment
{
"CUDA": {
"GPU": [
"NVIDIA GeForce GTX 1080 Ti"
],
"available": true,
"version": "10.2"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.12.0+cu102",
"TTS": "0.9.0",
"numpy": "1.21.6"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.9.12",
"version": "#58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022"
}
}
Additional context
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 27 (16 by maintainers)
Yes, after installing gruut-lang-fr everything works fine, thanks.
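For anyone landing here with the same trace: the French mai/tacotron2-DDC model phonemizes its input with gruut, and without the French language pack the front end apparently ends up handing the encoder a symbol sequence that is too short for its convolutions. A quick sanity check of the phonemizer after running "pip install gruut-lang-fr"; this uses gruut's documented sentences() API rather than any Coqui code, so treat it as an illustrative check, not the project's own diagnostic:

# Check that French phonemization works once gruut-lang-fr is installed.
# "fr-fr" is gruut's identifier for French.
from gruut import sentences

for sent in sentences("autobus", lang="fr-fr"):
    for word in sent:
        print(word.text, word.phonemes)
# With the language pack installed this prints a phoneme sequence for "autobus";
# if it fails or prints nothing, the TTS front end has nothing usable to feed
# the Tacotron2 encoder, which matches the Conv1d error above.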