TTS: train_yourtts speaker embeddings does not generate audio
Describe the bug
Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth
produces the log message "Model restored from step 0", along with many warnings about layers that are missing or have mismatched dimensions.
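For anyone trying to reproduce this, the two kinds of warnings in the log ("Layer missing in the model definition" and "Layer dimention missmatch") can be sketched as a comparison between the checkpoint's tensor names and shapes and the model's. The snippet below is only a minimal illustration of that idea, read literally from the log messages; it is not the library's actual restore code. The helper `classify_checkpoint_keys`, the key `some_layer.weight`, and all shapes are made up for the example.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def classify_checkpoint_keys(
    model_shapes: Dict[str, Shape],
    checkpoint_shapes: Dict[str, Shape],
) -> Tuple[List[str], List[str], List[str]]:
    """Split checkpoint keys into (loadable, missing_in_model, shape_mismatch).

    Hypothetical helper: a key is 'missing' when the model has no tensor
    with that name, and 'mismatched' when the name exists but the shape differs.
    """
    loadable: List[str] = []
    missing: List[str] = []
    mismatched: List[str] = []
    for name, shape in checkpoint_shapes.items():
        if name not in model_shapes:
            missing.append(name)
        elif model_shapes[name] != shape:
            mismatched.append(name)
        else:
            loadable.append(name)
    return loadable, missing, mismatched


# Toy shapes, invented for illustration; they are not the real tensor sizes.
model_shapes = {
    "text_encoder.encoder.attn_layers.0.conv_q.weight": (192, 192, 1),
    "some_layer.weight": (8, 8),
}
checkpoint_shapes = {
    "text_encoder.encoder.attn_layers.0.conv_q.weight": (196, 196, 1),
    "some_layer.weight": (8, 8),
    "emb_l.weight": (4, 4),
}

loadable, missing, mismatched = classify_checkpoint_keys(model_shapes, checkpoint_shapes)
for name in missing:
    print(f"missing in model definition: {name}")
for name in mismatched:
    print(f"shape mismatch: {name}")
print(f"{len(loadable)} of {len(checkpoint_shapes)} checkpoint tensors would be loaded")
```

With a real PyTorch model and checkpoint, the two dictionaries could be built with something like `{k: tuple(v.shape) for k, v in state_dict.items()}`. Where the model state lives inside the checkpoint file is an assumption that has to be checked against the file itself.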
Full log:
> Training Environment:
| > Current device: 0
| > Num. of GPUs: 1
| > Num. of CPUs: 16
| > Num. of Torch Threads: 24
| > Torch seed: 54321
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
> Restoring from model_file.pth ...
> Restoring Model...
> Partial model initialization...
| > Layer missing in the model definition: speaker_encoder.conv1.weight
| > Layer missing in the model definition: speaker_encoder.conv1.bias
| > Layer missing in the model definition: speaker_encoder.bn1.weight
| > Layer missing in the model definition: speaker_encoder.bn1.bias
| > Layer missing in the model definition: speaker_encoder.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
| > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
> `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
> `speakers_file` is updated in the config.json.
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
| > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
| > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
| > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
| > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
| > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
| > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
| > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
| > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
| > Layer missing in the model definition: speaker_encoder.attention.0.weight
| > Layer missing in the model definition: speaker_encoder.attention.0.bias
| > Layer missing in the model definition: speaker_encoder.attention.2.weight
| > Layer missing in the model definition: speaker_encoder.attention.2.bias
| > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
| > Layer missing in the model definition: speaker_encoder.attention.2.running_var
| > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
| > Layer missing in the model definition: speaker_encoder.attention.3.weight
| > Layer missing in the model definition: speaker_encoder.attention.3.bias
| > Layer missing in the model definition: speaker_encoder.fc.weight
| > Layer missing in the model definition: speaker_encoder.fc.bias
| > Layer missing in the model definition: emb_l.weight
| > Layer missing in the model definition: duration_predictor.cond_lang.weight
| > Layer missing in the model definition: duration_predictor.cond_lang.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
| > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
| > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
| > 724 / 896 layers are restored.
> Model restored from step 0
> Model has 86565676 parameters
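The "Layer missing" and "Layer dimention missmatch" lines above amount to a name-by-name comparison between the checkpoint and the freshly built model. Here is a minimal sketch of that comparison, assuming both sides are reduced to plain name-to-shape dictionaries. `classify_layers` and the toy shapes are hypothetical and are not the TTS implementation:

```python
def classify_layers(model_shapes, checkpoint_shapes):
    """Split checkpoint entries into restorable, missing, and mismatched.

    Both arguments map a layer name to its shape tuple.
    """
    restorable, missing, mismatched = [], [], []
    for name, shape in checkpoint_shapes.items():
        if name not in model_shapes:
            missing.append(name)        # layer is not in the model definition
        elif model_shapes[name] != shape:
            mismatched.append(name)     # same name, different dimensions
        else:
            restorable.append(name)
    return restorable, missing, mismatched


# Toy shapes for illustration only; real values come from the state dicts.
model_shapes = {
    "text_encoder.proj.weight": (384, 192, 1),
    "duration_predictor.pre.weight": (192, 192, 1),
}
checkpoint_shapes = {
    "text_encoder.proj.weight": (384, 196, 1),
    "duration_predictor.pre.weight": (192, 192, 1),
    "speaker_encoder.conv1.weight": (32, 1, 3, 3),
}
restorable, missing, mismatched = classify_layers(model_shapes, checkpoint_shapes)
print(len(restorable), "/", len(checkpoint_shapes), "layers are restorable")
```

With real PyTorch models, the shape dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.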
Also, when I run:
output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
--model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
--config_path "/workspace/project/output/{output_dir}/config.json" \
--list_speaker_idxs \
--out_path /workspace/output.wav
to test, I get:
> Using model: vits
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:0
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:None
| > fft_size:1024
| > power:None
| > preemphasis:0.0
| > griffin_lim_iters:None
| > signal_norm:None
| > symmetric_norm:None
| > mel_fmin:0
| > mel_fmax:None
| > pitch_fmin:None
| > pitch_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:1.0
| > clip_norm:True
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
> Model fully restored.
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:64
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:512
| > power:1.5
| > preemphasis:0.97
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:False
| > mel_fmin:0
| > mel_fmax:8000.0
| > pitch_fmin:1.0
| > pitch_fmax:640.0
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:False
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:True
| > db_level:-27.0
| > stats_path:None
| > base:10
| > hop_length:160
| > win_length:400
> External Speaker Encoder Loaded !!
> Model fully restored.
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:64
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:512
| > power:1.5
| > preemphasis:0.97
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:False
| > mel_fmin:0
| > mel_fmax:8000.0
| > pitch_fmin:1.0
| > pitch_fmax:640.0
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:False
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:True
| > db_level:-27.0
| > stats_path:None
| > base:10
| > hop_length:160
| > win_length:400
> Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}
Somehow inference cannot read the speaker embeddings.
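One way to narrow this down is to check which speaker names the d-vector file actually contains. Here is a minimal sketch, assuming the file deserializes to a mapping from clip name to a record with `name` and `embedding` fields. That layout and the helper `speaker_names` are assumptions for illustration, and the sample records are made up:

```python
def speaker_names(d_vectors):
    """Return the sorted, de-duplicated speaker names in a d-vector mapping."""
    return sorted({record["name"] for record in d_vectors.values()})


# Made-up records standing in for the loaded contents of speakers.pth.
d_vectors = {
    "p225_001.wav": {"name": "VCTK_p225", "embedding": [0.1, 0.2]},
    "p225_002.wav": {"name": "VCTK_p225", "embedding": [0.3, 0.4]},
    "p226_001.wav": {"name": "VCTK_p226", "embedding": [0.5, 0.6]},
}
names = speaker_names(d_vectors)
print(names)  # ['VCTK_p225', 'VCTK_p226']
```

An empty result would point at the file itself. A non-empty one would suggest the problem is in how the file path reaches the speaker manager.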
Here is my config:
{
"output_path": "/workspace/project/output",
"logger_uri": null,
"run_name": "YourTTS-EN-VCTK",
"project_name": "YourTTS",
"run_description": "\n - Original YourTTS trained using VCTK dataset\n ",
"print_step": 50,
"plot_step": 100,
"model_param_stats": false,
"wandb_entity": null,
"dashboard_logger": "tensorboard",
"log_model_step": 1000,
"save_step": 500,
"save_n_checkpoints": 2,
"save_checkpoints": true,
"save_all_best": false,
"save_best_after": 10000,
"target_loss": "loss_1",
"print_eval": true,
"test_delay_epochs": 0,
"run_eval": true,
"run_eval_steps": null,
"distributed_backend": "nccl",
"distributed_url": "tcp://localhost:54321",
"mixed_precision": false,
"epochs": 1,
"batch_size": 18,
"eval_batch_size": 18,
"grad_clip": [
1000,
1000
],
"scheduler_after_epoch": true,
"lr": 0.001,
"optimizer": "AdamW",
"optimizer_params": {
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"weight_decay": 0.01
},
"lr_scheduler": null,
"lr_scheduler_params": null,
"use_grad_scaler": false,
"cudnn_enable": true,
"cudnn_deterministic": false,
"cudnn_benchmark": false,
"training_seed": 54321,
"model": "vits",
"num_loader_workers": 8,
"num_eval_loader_workers": 4,
"use_noise_augment": false,
"audio": {
"fft_size": 1024,
"sample_rate": 16000,
"win_length": 1024,
"hop_length": 256,
"num_mels": 80,
"mel_fmin": 0.0,
"mel_fmax": null
},
"use_phonemes": false,
"phonemizer": "espeak",
"phoneme_language": "en",
"compute_input_seq_cache": true,
"text_cleaner": "multilingual_cleaners",
"enable_eos_bos_chars": false,
"test_sentences_file": "",
"phoneme_cache_path": null,
"characters": {
"characters_class": "TTS.tts.models.vits.VitsCharacters",
"vocab_dict": null,
"pad": "_",
"eos": "&",
"bos": "*",
"blank": null,
"characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
"punctuations": "!'(),-.:;? ",
"phonemes": "",
"is_unique": true,
"is_sorted": true
},
"add_blank": true,
"batch_group_size": 5,
"loss_masking": null,
"min_audio_len": 1,
"max_audio_len": 240000,
"min_text_len": 1,
"max_text_len": Infinity,
"compute_f0": false,
"compute_linear_spec": true,
"precompute_num_workers": 12,
"start_by_longest": true,
"shuffle": false,
"drop_last": false,
"datasets": [
{
"formatter": "vctk",
"dataset_name": "vctk",
"path": "/workspace/project/VCTK",
"meta_file_train": "",
"ignored_speakers": null,
"language": "en",
"meta_file_val": "",
"meta_file_attn_mask": ""
}
],
"test_sentences": [
[
"It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
"VCTK_p277",
null,
"en"
],
[
"Be a voice, not an echo.",
"VCTK_p239",
null,
"en"
],
[
"I'm sorry Dave. I'm afraid I can't do that.",
"VCTK_p258",
null,
"en"
],
[
"This cake is great. It's so delicious and moist.",
"VCTK_p244",
null,
"en"
],
[
"Prior to November 22, 1963.",
"VCTK_p305",
null,
"en"
]
],
"eval_split_max_size": 256,
"eval_split_size": 0.01,
"use_speaker_weighted_sampler": false,
"speaker_weighted_sampler_alpha": 1.0,
"use_language_weighted_sampler": false,
"language_weighted_sampler_alpha": 1.0,
"use_length_weighted_sampler": false,
"length_weighted_sampler_alpha": 1.0,
"model_args": {
"num_chars": 165,
"out_channels": 513,
"spec_segment_size": 32,
"hidden_channels": 192,
"hidden_channels_ffn_text_encoder": 768,
"num_heads_text_encoder": 2,
"num_layers_text_encoder": 10,
"kernel_size_text_encoder": 3,
"dropout_p_text_encoder": 0.1,
"dropout_p_duration_predictor": 0.5,
"kernel_size_posterior_encoder": 5,
"dilation_rate_posterior_encoder": 1,
"num_layers_posterior_encoder": 16,
"kernel_size_flow": 5,
"dilation_rate_flow": 1,
"num_layers_flow": 4,
"resblock_type_decoder": "2",
"resblock_kernel_sizes_decoder": [
3,
7,
11
],
"resblock_dilation_sizes_decoder": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates_decoder": [
8,
8,
2,
2
],
"upsample_initial_channel_decoder": 512,
"upsample_kernel_sizes_decoder": [
16,
16,
4,
4
],
"periods_multi_period_discriminator": [
2,
3,
5,
7,
11
],
"use_sdp": true,
"noise_scale": 1.0,
"inference_noise_scale": 0.667,
"length_scale": 1,
"noise_scale_dp": 1.0,
"inference_noise_scale_dp": 1.0,
"max_inference_len": null,
"init_discriminator": true,
"use_spectral_norm_disriminator": false,
"use_speaker_embedding": false,
"num_speakers": 0,
"speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
"d_vector_file": [
"/workspace/project/VCTK/speakers.pth"
],
"speaker_embedding_channels": 256,
"use_d_vector_file": true,
"d_vector_dim": 512,
"detach_dp_input": true,
"use_language_embedding": false,
"embedded_language_dim": 4,
"num_languages": 0,
"language_ids_file": null,
"use_speaker_encoder_as_loss": true,
"speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
"speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
"condition_dp_on_speaker": true,
"freeze_encoder": false,
"freeze_DP": false,
"freeze_PE": false,
"freeze_flow_decoder": false,
"freeze_waveform_decoder": false,
"encoder_sample_rate": null,
"interpolate_z": true,
"reinit_DP": false,
"reinit_text_encoder": false
},
"lr_gen": 0.0002,
"lr_disc": 0.0002,
"lr_scheduler_gen": "ExponentialLR",
"lr_scheduler_gen_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"lr_scheduler_disc": "ExponentialLR",
"lr_scheduler_disc_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"kl_loss_alpha": 1.0,
"disc_loss_alpha": 1.0,
"gen_loss_alpha": 1.0,
"feat_loss_alpha": 1.0,
"mel_loss_alpha": 45.0,
"dur_loss_alpha": 1.0,
"speaker_encoder_loss_alpha": 9.0,
"return_wav": true,
"use_weighted_sampler": false,
"weighted_sampler_attrs": null,
"weighted_sampler_multipliers": null,
"r": 1,
"num_speakers": 0,
"use_speaker_embedding": false,
"speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
"speaker_embedding_channels": 256,
"language_ids_file": null,
"use_language_embedding": false,
"use_d_vector_file": true,
"d_vector_file": [
"/workspace/project/VCTK/speakers.pth"
],
"d_vector_dim": 512
}
It might be because of a typo on line 114:
https://github.com/coqui-ai/TTS/blob/9e5a469c64ca7121d3558f3ddf40b1a3e993ffcc/TTS/tts/utils/speakers.py#L110-L120
Shouldn't it be speakers_file instead of speaker_file?
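To illustrate why a one-letter typo like this can fail silently: a lookup with a default returns the default for a misspelled key instead of raising an error. Here is a minimal sketch with a hypothetical `get_with_default` helper and a plain dictionary standing in for the config. It is not the code at the link above:

```python
def get_with_default(config, key, default=None):
    """Look up `key` in a config mapping, falling back to `default`."""
    return config.get(key, default)


# Stand-in config; the path is shortened for illustration.
config = {"speakers_file": "speakers.pth"}

right = get_with_default(config, "speakers_file")  # key that exists
wrong = get_with_default(config, "speaker_file")   # misspelled key, no error raised

print(right)  # speakers.pth
print(wrong)  # None
```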
Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding, I get this error:
> Using model: vits
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:0
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:None
| > fft_size:1024
| > power:None
| > preemphasis:0.0
| > griffin_lim_iters:None
| > signal_norm:None
| > symmetric_norm:None
| > mel_fmin:0
| > mel_fmax:None
| > pitch_fmin:None
| > pitch_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:1.0
| > clip_norm:True
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
> Model fully restored.
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:64
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:512
| > power:1.5
| > preemphasis:0.97
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:False
| > mel_fmin:0
| > mel_fmax:8000.0
| > pitch_fmin:1.0
| > pitch_fmax:640.0
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:False
| > do_trim_silence:False
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:True
| > db_level:-27.0
| > stats_path:None
| > base:10
| > hop_length:160
| > win_length:400
> initialization of speaker-embedding layers.
> External Speaker Encoder Loaded !!
Traceback (most recent call last):
File "/opt/conda/bin/tts", line 8, in <module>
sys.exit(main())
File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
args.use_cuda,
File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
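The traceback shows the checkpoint's state dict being indexed with `emb_g.weight`, a key that a checkpoint trained with `use_d_vector_file` enabled apparently does not contain. Here is a minimal sketch of a defensive check, using plain dictionaries and a hypothetical `embedding_shape_differs` helper. It only illustrates the guard and is not a proposed patch to `vits.py`:

```python
def embedding_shape_differs(checkpoint_shapes, model_shape, key="emb_g.weight"):
    """Return True only if the checkpoint has `key` and its shape differs.

    A checkpoint without the key yields False instead of raising KeyError.
    """
    saved_shape = checkpoint_shapes.get(key)
    if saved_shape is None:
        return False
    return saved_shape != model_shape


# Toy shapes for illustration only.
trained_with_d_vectors = {"text_encoder.proj.weight": (384, 192, 1)}
trained_with_embeddings = {"emb_g.weight": (109, 256)}

print(embedding_shape_differs(trained_with_d_vectors, (110, 256)))   # False
print(embedding_shape_differs(trained_with_embeddings, (110, 256)))  # True
```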
Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22 kHz sample files, I do get "Model restored from step 1000000", but the rest of the inference errors are the same.
@erogol @Edresson, am I doing something wrong, or should I create an issue?
To Reproduce
Run train_yourtts with default params
Expected behavior
No response
Logs
No response
Environment
Nvidia 3090
Additional context
No response
About this issue
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 16 (16 by maintainers)
After struggling for days, I found why I had an issue with my fine-tuned model: the YourTTS model is multilingual, so I had to turn on
use_language_embedding=True
to tell my new model which language to train on.
It should not affect training or inference, because we use the speaker name and not the IDs. But yes, it is odd and can cause confusion. I fixed it in https://github.com/coqui-ai/TTS/pull/2234/commits/c8245cde075911a2137d7963feff2abbf48d5d07
Also @Edresson, using
/root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth
and d_vector_file
as a string, plus the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad after 1 epoch. I just wanted to check whether the model is restored correctly: even after 1 epoch I should not get a bad voice for already-trained voices, right? I think it's because the model tts_models--en--vctk--vits
uses speaker IDs such as p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc., because of how this code formats the speaker names (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402
Is this the reason, or do I have to look somewhere else?
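If the naming difference is the cause, the two conventions could be reconciled by stripping or adding the prefix. Here is a minimal sketch, assuming the only difference is a leading `VCTK_`. `strip_prefix` is a hypothetical helper, and the name lists are shortened examples:

```python
def strip_prefix(speaker_name, prefix="VCTK_"):
    """Drop a leading dataset prefix from a speaker name, if present."""
    if speaker_name.startswith(prefix):
        return speaker_name[len(prefix):]
    return speaker_name


config_names = ["VCTK_p231", "VCTK_p232"]
checkpoint_names = {"p231", "p232"}

# Config speakers whose stripped name is known to the pretrained checkpoint.
matched = [n for n in config_names if strip_prefix(n) in checkpoint_names]
print(matched)  # ['VCTK_p231', 'VCTK_p232']
```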