TTS: [Bug] [v0.9.0] "Kernel size" error when using model "tts_models/fr/mai/tacotron2-DDC"

Describe the bug

tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

Running the command above fails during encoder inference with:

RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size

Both models are already downloaded. The full output and traceback are in the Logs section below.

To Reproduce

tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

or

tts --text "chat" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

Expected behavior

The command runs without error and writes speech.wav.

Logs

tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
 > tts_models/fr/mai/tacotron2-DDC is already downloaded.
 > vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: fullband_melgan
 > Setting up Audio Processor...
 | > sample_rate:24000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Generator Model: fullband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: autobus
 > Text splitted to sentences.
['autobus']
Traceback (most recent call last):
  File "/home/iwater/miniconda3/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/bin/synthesize.py", line 357, in main
    wav = synthesizer.tts(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 279, in tts
    outputs = synthesis(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 207, in synthesis
    outputs = run_model_torch(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
    outputs = _func(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
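For context, the failure is arithmetic: Tacotron2's encoder convolutions use kernel size 5 (with padding 2 in the standard configuration, which is an assumption here, not read from this model's config), and a padded input of length 4 therefore implies the phonemized input sequence had length 0. A minimal, illustrative sketch of the size check that PyTorch performs:

```python
# Illustrative sketch only, not TTS code. Assumes the standard Tacotron2
# encoder convolution settings: kernel_size=5, padding=2, stride=1.
def conv1d_out_len(n_in, kernel=5, padding=2, stride=1):
    """Output length of a 1-D convolution, mirroring PyTorch's size check."""
    padded = n_in + 2 * padding
    if kernel > padded:
        # Same condition that produces the RuntimeError in the traceback.
        raise RuntimeError(
            f"Calculated padded input size per channel: ({padded}). "
            f"Kernel size: ({kernel}). Kernel size can't be greater "
            f"than actual input size"
        )
    return (padded - kernel) // stride + 1

print(conv1d_out_len(10))  # a normal-length input works
```

With padding of 2 on each side, a "padded input size per channel" of 4 means the raw input length was 0, i.e. the phonemizer produced no tokens for "autobus".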

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 1080 Ti"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.0+cu102",
        "TTS": "0.9.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.12",
        "version": "#58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022"
    }
}

Additional context

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 27 (16 by maintainers)

Most upvoted comments

HAHAHHAHAHHAHHAHAA

Yes, after installing gruut-lang-fr everything works fine, thanks!
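For anyone else hitting this: the resolution above was installing the French language pack for the gruut phonemizer (`pip install gruut-lang-fr`). A quick way to check whether the pack is present, assuming its importable module name is `gruut_lang_fr` (following gruut's usual gruut-lang-XX package naming, which is an assumption here):

```python
import importlib.util

def check_gruut_fr():
    """Return an actionable message about the gruut French language pack.

    The module name "gruut_lang_fr" is an assumption, based on gruut's
    usual gruut-lang-XX naming convention.
    """
    if importlib.util.find_spec("gruut_lang_fr") is None:
        return "missing: run `pip install gruut-lang-fr`"
    return "installed"

print(check_gruut_fr())
```

Without the pack, gruut cannot phonemize French text, the model receives an empty input sequence, and the kernel-size error above is the symptom.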