TTS: train_yourtts speaker embeddings does not generate audio

Describe the bug

Running this code with restore_path=/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth gives the log output "Model restored from step 0", and most checkpoint layers are reported as missing or mismatched during restore.
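
To narrow down which layers fail to restore, it can help to compare parameter names and shapes in the checkpoint against those in the freshly built model. The following is a minimal sketch of that comparison, not the trainer's own code: `diff_state_dicts` is a hypothetical helper, and the shapes in the toy dictionaries are made-up values for illustration only. It sorts checkpoint entries into the same two groups the log reports (missing in the model definition, and dimension mismatch).

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_state_dicts(
    model_shapes: Dict[str, Shape], ckpt_shapes: Dict[str, Shape]
) -> Tuple[List[str], List[str]]:
    """Classify checkpoint entries the way a partial restore would.

    Returns (missing, mismatched):
      missing    -> keys present in the checkpoint but absent from the model
      mismatched -> keys present in both, but with different shapes
    """
    missing = sorted(k for k in ckpt_shapes if k not in model_shapes)
    mismatched = sorted(
        k
        for k, shape in ckpt_shapes.items()
        if k in model_shapes and model_shapes[k] != shape
    )
    return missing, mismatched


# Toy data: the layer names come from the log, the shapes are hypothetical.
model_shapes = {
    "text_encoder.encoder.attn_layers.0.conv_q.weight": (192, 192, 1),
    "text_encoder.encoder.attn_layers.0.conv_q.bias": (192,),
}
ckpt_shapes = {
    "text_encoder.encoder.attn_layers.0.conv_q.weight": (196, 196, 1),
    "text_encoder.encoder.attn_layers.0.conv_q.bias": (196,),
    "emb_l.weight": (4, 4),
}

missing, mismatched = diff_state_dicts(model_shapes, ckpt_shapes)
for name in missing:
    print(f"missing in model definition: {name}")
for name in mismatched:
    print(f"shape mismatch: {name} "
          f"(model {model_shapes[name]}, checkpoint {ckpt_shapes[name]})")
```

With PyTorch, the two dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`, once from `model.state_dict()` and once from the checkpoint loaded with `torch.load(path, map_location="cpu")`. This assumes the checkpoint keeps its weights under a `"model"` key; if it does not, adjust the lookup accordingly. Printing the actual shapes usually makes it clear which config values (for example hidden sizes or language/speaker embedding settings) differ from the ones the checkpoint was trained with.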

Full log:

 > Training Environment:
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 16
 | > Num. of Torch Threads: 24
 | > Torch seed: 54321
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 > Restoring from model_file.pth ...
 > Restoring Model...
 > Partial model initialization...
 | > Layer missing in the model definition: speaker_encoder.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.conv1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer1.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer1.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.bias
 > `speakers.pth` is saved to /workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth.
 > `speakers_file` is updated in the config.json.
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer2.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer2.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.3.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.3.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.4.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.4.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer3.5.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer3.5.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.0.downsample.1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.1.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.1.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn1.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.conv2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_mean
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.running_var
 | > Layer missing in the model definition: speaker_encoder.layer4.2.bn2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.0.bias
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.weight
 | > Layer missing in the model definition: speaker_encoder.layer4.2.se.fc.2.bias
 | > Layer missing in the model definition: speaker_encoder.torch_spec.0.filter
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.spectrogram.window
 | > Layer missing in the model definition: speaker_encoder.torch_spec.1.mel_scale.fb
 | > Layer missing in the model definition: speaker_encoder.attention.0.weight
 | > Layer missing in the model definition: speaker_encoder.attention.0.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.weight
 | > Layer missing in the model definition: speaker_encoder.attention.2.bias
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_mean
 | > Layer missing in the model definition: speaker_encoder.attention.2.running_var
 | > Layer missing in the model definition: speaker_encoder.attention.2.num_batches_tracked
 | > Layer missing in the model definition: speaker_encoder.attention.3.weight
 | > Layer missing in the model definition: speaker_encoder.attention.3.bias
 | > Layer missing in the model definition: speaker_encoder.fc.weight
 | > Layer missing in the model definition: speaker_encoder.fc.bias
 | > Layer missing in the model definition: emb_l.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.weight
 | > Layer missing in the model definition: duration_predictor.cond_lang.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.0.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.1.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.2.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.3.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.4.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.5.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.6.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.7.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.8.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_k
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.emb_rel_v
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_q.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_k.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_v.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.attn_layers.9.conv_o.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_1.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.0.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.1.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.2.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.3.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.4.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.5.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.6.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.7.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.8.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_1.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.weight
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.ffn_layers.9.conv_2.bias
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.0.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.1.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.2.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.3.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.4.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.5.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.6.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.7.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.8.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.gamma
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.encoder.norm_layers_2.9.beta
 | > Layer dimention missmatch between model definition and checkpoint: text_encoder.proj.weight
 | > Layer dimention missmatch between model definition and checkpoint: duration_predictor.pre.weight
 | > 724 / 896 layers are restored.
 > Model restored from step 0

 > Model has 86565676 parameters
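
The three outcomes visible in the log above (a layer is restored, a layer is "missing in the model definition", or its dimensions do not match) can be illustrated with a minimal sketch. This is not the library's restore code: the function name `partial_restore` and all shapes below are made up for illustration, and plain tuples stand in for tensor shapes so the example runs without PyTorch.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def partial_restore(
    model_shapes: Dict[str, Shape], ckpt_shapes: Dict[str, Shape]
) -> Tuple[List[str], List[str], List[str]]:
    """Classify checkpoint layers by whether they could be copied into the model.

    Hypothetical helper: it only compares names and shapes.
    """
    restored, missing, mismatched = [], [], []
    for name, shape in ckpt_shapes.items():
        if name not in model_shapes:
            # Checkpoint has a layer the current model does not define.
            missing.append(name)
        elif model_shapes[name] != shape:
            # Same name, different dimensions: cannot be copied as-is.
            mismatched.append(name)
        else:
            restored.append(name)
    return restored, missing, mismatched


# Illustrative shapes only; they are NOT taken from the real checkpoints.
model_shapes = {
    "text_encoder.proj.weight": (384, 192, 1),
    "waveform_decoder.conv_pre.weight": (512, 192, 7),
}
ckpt_shapes = {
    "speaker_encoder.conv1.weight": (32, 1, 3, 3),
    "text_encoder.proj.weight": (384, 196, 1),
    "waveform_decoder.conv_pre.weight": (512, 192, 7),
}

restored, missing, mismatched = partial_restore(model_shapes, ckpt_shapes)
print(f"{len(restored)} restored, {len(missing)} missing, {len(mismatched)} mismatched")
```

Running the same kind of comparison against the real state dicts would show which groups of layers start from random initialization.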

Also, when I run:

output_dir="YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76"
!tts --text "Hello, Michael how are you?" \
    --model_path "/workspace/project/output/{output_dir}/checkpoint_500.pth" \
    --config_path "/workspace/project/output/{output_dir}/config.json" \
    --list_speaker_idxs \
    --out_path /workspace/output.wav

to test it, I get:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{}

Somehow, inference cannot read the speaker embeddings.

Here is my config:

{
    "output_path": "/workspace/project/output",
    "logger_uri": null,
    "run_name": "YourTTS-EN-VCTK",
    "project_name": "YourTTS",
    "run_description": "\n            - Original YourTTS trained using VCTK dataset\n        ",
    "print_step": 50,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": 1000,
    "save_step": 500,
    "save_n_checkpoints": 2,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": "loss_1",
    "print_eval": true,
    "test_delay_epochs": 0,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": false,
    "epochs": 1,
    "batch_size": 18,
    "eval_batch_size": 18,
    "grad_clip": [
        1000,
        1000
    ],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [
            0.8,
            0.99
        ],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": null,
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 16000,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0.0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": "espeak",
    "phoneme_language": "en",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": null,
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "_",
        "eos": "&",
        "bos": "*",
        "blank": null,
        "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ",
        "punctuations": "!'(),-.:;? ",
        "phonemes": "",
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": 240000,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 12,
    "start_by_longest": true,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "vctk",
            "dataset_name": "vctk",
            "path": "/workspace/project/VCTK",
            "meta_file_train": "",
            "ignored_speakers": null,
            "language": "en",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "VCTK_p277",
            null,
            "en"
        ],
        [
            "Be a voice, not an echo.",
            "VCTK_p239",
            null,
            "en"
        ],
        [
            "I'm sorry Dave. I'm afraid I can't do that.",
            "VCTK_p258",
            null,
            "en"
        ],
        [
            "This cake is great. It's so delicious and moist.",
            "VCTK_p244",
            null,
            "en"
        ],
        [
            "Prior to November 22, 1963.",
            "VCTK_p305",
            null,
            "en"
        ]
    ],
    "eval_split_max_size": 256,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 165,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 10,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "2",
        "resblock_kernel_sizes_decoder": [
            3,
            7,
            11
        ],
        "resblock_dilation_sizes_decoder": [
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ],
            [
                1,
                3,
                5
            ]
        ],
        "upsample_rates_decoder": [
            8,
            8,
            2,
            2
        ],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [
            16,
            16,
            4,
            4
        ],
        "periods_multi_period_discriminator": [
            2,
            3,
            5,
            7,
            11
        ],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
        "d_vector_file": [
            "/workspace/project/VCTK/speakers.pth"
        ],
        "speaker_embedding_channels": 256,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json",
        "speaker_encoder_model_path": "https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 9.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": null,
    "weighted_sampler_multipliers": null,
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": "/workspace/project/output/YourTTS-EN-VCTK-December-22-2022_06+26AM-0910cb76/speakers.pth",
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": true,
    "d_vector_file": [
        "/workspace/project/VCTK/speakers.pth"
    ],
    "d_vector_dim": 512
}
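
To see at a glance which speaker-conditioning path a config like the one above selects, a small inspection helper can be useful. This is only a sketch: `summarize_speaker_settings` is a hypothetical name, and the dictionary below copies just the relevant `model_args` fields from the config. It highlights that `d_vector_file` is a list here, whereas a later comment in this thread reports success after passing it as a string.

```python
def summarize_speaker_settings(config: dict) -> dict:
    """Collect the speaker-related fields of a training config (hypothetical helper)."""
    args = config.get("model_args", {})
    d_vector_file = args.get("d_vector_file")
    return {
        "use_speaker_embedding": args.get("use_speaker_embedding"),
        "use_d_vector_file": args.get("use_d_vector_file"),
        "num_speakers": args.get("num_speakers"),
        "d_vector_dim": args.get("d_vector_dim"),
        # The type matters: a list and a plain string may be handled differently.
        "d_vector_file_type": type(d_vector_file).__name__,
    }


# Subset of the config shown above.
config = {
    "model_args": {
        "use_speaker_embedding": False,
        "num_speakers": 0,
        "d_vector_file": ["/workspace/project/VCTK/speakers.pth"],
        "use_d_vector_file": True,
        "d_vector_dim": 512,
    }
}

summary = summarize_speaker_settings(config)
for key, value in summary.items():
    print(f"{key}: {value}")
```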

This might be caused by a typo on line 114 (https://github.com/coqui-ai/TTS/blob/9e5a469c64ca7121d3558f3ddf40b1a3e993ffcc/TTS/tts/utils/speakers.py#L110-L120), where it should be speakers_file instead of speaker_file?

Also, after disabling model_args.use_d_vector_file and enabling model_args.use_speaker_embedding, I get this error:

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > initialization of speaker-embedding layers.
 > External Speaker Encoder Loaded !!
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/workspace/project/TTS/TTS/bin/synthesize.py", line 325, in main
    args.use_cuda,
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/workspace/project/TTS/TTS/utils/synthesizer.py", line 117, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/workspace/project/TTS/TTS/tts/models/vits.py", line 1703, in load_checkpoint
    if hasattr(self, "emb_g") and state["model"]["emb_g.weight"].shape != self.emb_g.weight.shape:
KeyError: 'emb_g.weight'
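
The traceback shows load_checkpoint indexing state["model"]["emb_g.weight"] whenever the model has an emb_g attribute. A checkpoint that was trained without a speaker-embedding layer presumably has no such key, which would explain the KeyError. The sketch below shows a guarded version of that comparison. It is not the library's fix: `emb_g_shape_differs` is a hypothetical helper, and tuples stand in for tensor shapes.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def emb_g_shape_differs(ckpt_shapes: Dict[str, Shape], model_emb_shape: Shape) -> bool:
    """Return True only if the checkpoint has emb_g.weight AND its shape differs.

    Hypothetical helper: a missing key is treated as "nothing to compare"
    instead of raising KeyError.
    """
    key = "emb_g.weight"
    if key not in ckpt_shapes:
        return False
    return ckpt_shapes[key] != model_emb_shape


# Checkpoint saved without a speaker-embedding layer (illustrative shapes).
ckpt_without_emb = {"text_encoder.emb.weight": (165, 192)}
# Checkpoint saved with a speaker-embedding layer of a different size.
ckpt_with_emb = {"emb_g.weight": (108, 256)}

print(emb_g_shape_differs(ckpt_without_emb, (109, 256)))
print(emb_g_shape_differs(ckpt_with_emb, (109, 256)))
```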

Also, when restoring from /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth for 22 kHz sample files, I do get "Model restored from step 1000000", but the rest of the inference errors are the same.

Guys @erogol @Edresson Am I doing something wrong or should I create an issue?

To Reproduce

Run train_yourtts with default params

Expected behavior

No response

Logs

No response

Environment

Nvidia 3090

Additional context

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Also @Edresson, using /root/.local/share/tts/tts_models--en--vctk--vits/model_file.pth, d_vector_file as a string, and the typo fix suggested in PR #2234, I am able to generate audio, but the quality is bad after 1 epoch. I just wanted to check whether the model is restored correctly: even after 1 epoch, I should not get a bad voice for the already-trained voices, right? I think it's because the model tts_models--en--vctk--vits uses speaker ids such as p231, p232, etc., while my config uses VCTK_p231, VCTK_p232, etc., because of how the formatter builds the speaker name (line 399): https://github.com/coqui-ai/TTS/blob/2e153d54a8f3b997ecb822aaf7add4f4f140908c/TTS/tts/datasets/formatters.py#L395-L402

Is this the reason, or do I have to look somewhere else?
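
One way to test the naming hypothesis in the question above is to compare the two sets of speaker names directly, with and without the VCTK_ prefix. The sketch below assumes both name lists have already been extracted. The function names are hypothetical, and whether a name mismatch actually affects the restored weights is the commenter's hypothesis, not an established fact.

```python
from typing import Iterable, List, Tuple


def strip_prefix(name: str, prefix: str = "VCTK_") -> str:
    """Remove a leading dataset prefix if present."""
    return name[len(prefix):] if name.startswith(prefix) else name


def compare_speaker_names(
    checkpoint_names: Iterable[str], config_names: Iterable[str], prefix: str = "VCTK_"
) -> Tuple[List[str], List[str]]:
    """Return (names matching exactly, names matching once the prefix is stripped)."""
    ckpt = set(checkpoint_names)
    cfg = set(config_names)
    exact = sorted(ckpt & cfg)
    after_strip = sorted(ckpt & {strip_prefix(n, prefix) for n in cfg})
    return exact, after_strip


exact, after_strip = compare_speaker_names(
    ["p231", "p232"], ["VCTK_p231", "VCTK_p232"]
)
print("exact matches:", exact)
print("matches after stripping prefix:", after_strip)
```

No exact matches but full overlap after stripping would confirm that the two models name the same speakers differently.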

The vocoder, the text encoder, and the speaker embedding approach are different between YourTTS and VITS. Because of that, you are losing a lot of weights, so you will need many more epochs for the model to converge. However, if you use the YourTTS checkpoint for transfer learning, you will be able to get really great results in the first epoch: "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth"

Got it! So should I ignore this message (from the logs above):

 | > 724 / 896 layers are restored.
 > Model restored from step 0

This happens when I use "/root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts/model_file.pth" as the restore path.

Yes, the following weights will not be loaded: "speaker_encoder.*" (because we changed this part and it is now in speaker_manager), "emb_l.weight" (because this is not a multilingual training), and "duration_predictor.cond_lang.*" (because this is not a multilingual training).
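
To check a long restore log against the list of weights that are expected to be skipped, the skipped layer names can be split by prefix. This is a sketch under assumptions: `split_skipped` is a hypothetical helper, the prefixes are taken from the reply above, and the name duration_predictor.cond_lang.weight is an illustrative guess at a full key.

```python
from typing import Iterable, List, Tuple

# Prefixes the maintainer says will not be loaded in this setup.
EXPECTED_SKIPPED_PREFIXES = (
    "speaker_encoder.",
    "emb_l.weight",
    "duration_predictor.cond_lang.",
)


def split_skipped(skipped_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split skipped layer names into (expected, unexpected) by prefix."""
    expected, unexpected = [], []
    for name in skipped_names:
        # str.startswith accepts a tuple of candidate prefixes.
        if name.startswith(EXPECTED_SKIPPED_PREFIXES):
            expected.append(name)
        else:
            unexpected.append(name)
    return expected, unexpected


skipped = [
    "speaker_encoder.conv1.weight",
    "emb_l.weight",
    "duration_predictor.cond_lang.weight",  # illustrative full key name
    "text_encoder.proj.weight",
]
expected, unexpected = split_skipped(skipped)
print("expected:", expected)
print("worth a closer look:", unexpected)
```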

After struggling for days, I found out why I had issues with my fine-tuned model: the YourTTS model is multilingual, so I had to set use_language_embedding=True to tell my new model which language to train on.
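
In the config posted earlier, use_language_embedding appears both at the top level and under model_args. A minimal sketch of flipping both copies on a loaded config dictionary is shown below. The helper name is hypothetical, and the thread does not establish whether both copies must be changed or whether other fields (for example num_languages or language_ids_file) also need values.

```python
import copy


def enable_language_embedding(config: dict) -> dict:
    """Return a copy of the config with use_language_embedding switched on.

    Hypothetical helper: sets the flag at the top level and in model_args,
    leaving the input dictionary untouched.
    """
    updated = copy.deepcopy(config)
    updated["use_language_embedding"] = True
    updated.setdefault("model_args", {})["use_language_embedding"] = True
    return updated


original = {
    "use_language_embedding": False,
    "model_args": {"use_language_embedding": False, "num_languages": 0},
}
patched = enable_language_embedding(original)
print(patched["use_language_embedding"], patched["model_args"]["use_language_embedding"])
```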

Thanks @Edresson. Just a quick concern I wanted your input on: my speakers.pth contains the following when --list_speaker_idxs is used:

{'VCTK_p225': 0, 'VCTK_p226': 1, 'VCTK_p227': 2, 'VCTK_p228': 3, 'VCTK_p229': 4, 'VCTK_p230': 5, 'VCTK_p231': 6, 'VCTK_p232': 7, 'VCTK_p233': 8, 'VCTK_p234': 9, 'VCTK_p236': 10, 'VCTK_p237': 11, 'VCTK_p238': 12, 'VCTK_p239': 13, 'VCTK_p240': 14, 'VCTK_p241': 15, 'VCTK_p243': 16, 'VCTK_p244': 17, 'VCTK_p245': 18, 'VCTK_p246': 19, 'VCTK_p247': 20, 'VCTK_p248': 21, 'VCTK_p249': 22, 'VCTK_p250': 23, 'VCTK_p251': 24, 'VCTK_p252': 25, 'VCTK_p253': 26, 'VCTK_p254': 27, 'VCTK_p255': 28, 'VCTK_p256': 29, 'VCTK_p257': 30, 'VCTK_p258': 31, 'VCTK_p259': 32, 'VCTK_p260': 33, 'VCTK_p261': 34, 'VCTK_p262': 35, 'VCTK_p263': 36, 'VCTK_p264': 37, 'VCTK_p265': 38, 'VCTK_p266': 39, 'VCTK_p267': 40, 'VCTK_p268': 41, 'VCTK_p269': 42, 'VCTK_p270': 43, 'VCTK_p271': 44, 'VCTK_p272': 45, 'VCTK_p273': 46, 'VCTK_p274': 47, 'VCTK_p275': 48, 'VCTK_p276': 49, 'VCTK_p277': 50, 'VCTK_p278': 51, 'VCTK_p279': 52, 'VCTK_p280': 53, 'VCTK_p281': 54, 'VCTK_p282': 55, 'VCTK_p283': 56, 'VCTK_p284': 57, 'VCTK_p285': 58, 'VCTK_p286': 59, 'VCTK_p287': 60, 'VCTK_p288': 61, 'VCTK_p292': 62, 'VCTK_p293': 63, 'VCTK_p294': 64, 'VCTK_p295': 65, 'VCTK_p297': 66, 'VCTK_p298': 67, 'VCTK_p299': 68, 'VCTK_p300': 69, 'VCTK_p301': 70, 'VCTK_p302': 71, 'VCTK_p303': 72, 'VCTK_p304': 73, 'VCTK_p305': 74, 'VCTK_p306': 75, 'VCTK_p307': 76, 'VCTK_p308': 77, 'VCTK_p310': 78, 'VCTK_p311': 79, 'VCTK_p312': 80, 'VCTK_p313': 81, 'VCTK_p314': 82, 'VCTK_p316': 83, 'VCTK_p317': 84, 'VCTK_p318': 85, 'VCTK_p323': 86, 'VCTK_p326': 87, 'VCTK_p329': 88, 'VCTK_p330': 89, 'VCTK_p333': 90, 'VCTK_p334': 91, 'VCTK_p335': 92, 'VCTK_p336': 93, 'VCTK_p339': 94, 'VCTK_p340': 95, 'VCTK_p341': 96, 'VCTK_p343': 97, 'VCTK_p345': 98, 'VCTK_p347': 99, 'VCTK_p351': 100, 'VCTK_p360': 101, 'VCTK_p361': 102, 'VCTK_p363': 103, 'VCTK_p364': 104, 'VCTK_p374': 105, 'VCTK_p376': 106, 'VCTK_s5': 107, 'VCTK_old_new_voice': 0}

As you can see, the voices VCTK_p225 and VCTK_old_new_voice (my new voice, loaded with the vctk_old formatter) both have id 0 after I pass my new voice in the DATASETS_CONFIG_LIST. Is this a problem?

Also, it looks like my new voice (VCTK_old_new_voice) and all the VCTK voices keep getting better even after 4k steps, but they sound non-English (in another language, pr maybe?): https://voca.ro/1lwuJZycpSl7 even though I have set everything to en.

It should not affect training or inference, because we use the speaker name and not the IDs. But yes, it is weird and can cause confusion. I fixed it in https://github.com/coqui-ai/TTS/pull/2234/commits/c8245cde075911a2137d7963feff2abbf48d5d07
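
A duplicate ID like the one reported above can be detected, and the mapping renumbered, with a few lines of Python. This is a sketch only: the helper names are hypothetical, renumbering by sorted name is just one possible convention, and it is not necessarily what the linked commit does.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_ids(name_to_id: Dict[str, int]) -> Dict[int, List[str]]:
    """Return every ID that is shared by more than one speaker name."""
    by_id = defaultdict(list)
    for name, idx in name_to_id.items():
        by_id[idx].append(name)
    return {idx: names for idx, names in by_id.items() if len(names) > 1}


def reassign_ids(name_to_id: Dict[str, int]) -> Dict[str, int]:
    """Give each speaker a unique ID based on sorted name order."""
    return {name: i for i, name in enumerate(sorted(name_to_id))}


# Shortened version of the mapping shown above.
speakers = {"VCTK_p225": 0, "VCTK_p226": 1, "VCTK_old_new_voice": 0}

duplicates = find_duplicate_ids(speakers)
unique = reassign_ids(speakers)
print("duplicates:", duplicates)
print("reassigned:", unique)
```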
