- When I try to load the pretrained model `output/LJSpeech/ckpt/900000.pth.tar`, I get the following error:

  ```
  size mismatch for encoder.src_word_emb.weight: copying a param with shape torch.Size([361, 256]) from checkpoint, the shape in current model is torch.Size([151, 256]).
  ```
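  The first dimension of `encoder.src_word_emb.weight` is the size of the text symbol vocabulary, so the checkpoint (361 symbols) and the current code (151 symbols) are building different symbol sets, e.g. from different `text_cleaners` settings in `preprocess.yaml` or a modified `text/symbols.py`. A quick check of what the checkpoint actually stores, assuming it keeps the weights under the `"model"` key as in the loading code below:

  ```python
  import torch

  # Inspect the embedding table shipped in the checkpoint.
  ckpt = torch.load("output/LJSpeech/ckpt/900000.pth.tar", map_location="cpu")
  print(ckpt["model"]["encoder.src_word_emb.weight"].shape)  # torch.Size([361, 256])
  ```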
- The code that loads the model (adapted from the repo):
```python
import torch
import yaml

from model import FastSpeech2  # repo-local import

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

base_config_path = "config/LJSpeech"
prepr_path = f"{base_config_path}/preprocess.yaml"
model_path = f"{base_config_path}/model.yaml"
train_path = f"{base_config_path}/train.yaml"

with open(prepr_path, "r") as f:
    prepr_config = yaml.load(f, Loader=yaml.FullLoader)
with open(model_path, "r") as f:
    model_config = yaml.load(f, Loader=yaml.FullLoader)
with open(train_path, "r") as f:
    train_config = yaml.load(f, Loader=yaml.FullLoader)
configs = (prepr_config, model_config, train_config)

ckpt_path = "output/LJSpeech/ckpt/900000.pth.tar"


def get_model(ckpt_path, configs):
    preprocess_config, model_config, train_config = configs
    model = FastSpeech2(preprocess_config, model_config).to(device)
    ckpt = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(ckpt["model"])
    model.eval()
    model.requires_grad_(False)  # method call; assigning False only created a dead attribute
    return model
```
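For completeness, the call site (where `load_state_dict` raises the size mismatch), plus a check of the vocabulary size the current code builds; a sketch assuming the objects defined above:

```python
# Vocabulary size the current code builds; per the error this is 151.
model = FastSpeech2(prepr_config, model_config).to(device)
print(model.encoder.src_word_emb.weight.shape)  # torch.Size([151, 256])

# Raises the size mismatch until the two symbol sets agree.
model = get_model(ckpt_path, configs)
```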
@leminhnguyen That is a good suggestion. I think using ground-truth pitch, duration, and energy as the inputs of FastSpeech2 is somewhat similar to using teacher forcing mode in Tacotron 2 (but still different). Looking forward to your result!
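For reference, a minimal sketch of that idea, assuming the batch layout produced by this repo's dataset/collate function (`ids, raw_texts, speakers, texts, src_lens, max_src_len, mels, mel_lens, max_mel_len, pitches, energies, durations`); `train_loader` is a hypothetical training dataloader:

```python
import torch

model.eval()
with torch.no_grad():
    for batch in train_loader:  # a training loader, so ground-truth targets exist
        # Passing the ground-truth pitch/energy/duration makes the variance
        # adaptor use them instead of its own predictions -- the "teacher
        # forcing"-like mode discussed above.
        output = model(*batch[2:])
```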