autovc: Bad conversion quality after retraining

Hi, first of all thanks for the great work on the AutoVC system. I have tried to replicate the system, but could not reach nearly the same quality as the pre-trained model. I use the same pre-processing for the mel-spectrograms as discussed in issue #4 and have trained the system with the same 20 VCTK speakers used in the paper's experiment (plus 8 speakers from the VCC data set, although the results were similar when these were omitted). Additionally, I used one-hot encodings instead of speaker embeddings from an encoder.
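
For clarity, here is a minimal sketch of how such one-hot speaker codes can be built in place of encoder embeddings (the speaker count and the lookup are illustrative; they mirror the embedding lookup in the training loop further down):

import torch

# Minimal sketch (illustrative): with N training speakers, the speaker "embedding"
# matrix becomes the N x N identity, so row i is the one-hot code of speaker i.
# The generator's speaker-input dimension then has to equal N instead of the
# encoder's embedding size.
NUM_SPEAKERS = 20  # assumption: one entry per training speaker
speaker_embedding_mat = torch.eye(NUM_SPEAKERS)

# Lookup by speaker index, as in the training loop below:
speaker_idx_batch = torch.tensor([0, 7])
spkr_embeddings = speaker_embedding_mat[speaker_idx_batch]  # shape (2, NUM_SPEAKERS)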

I trained for about 300,000 steps using Adam with default parameters and a learning rate of 0.0001; the training loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I have also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice. In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when used with Griffin-Lim. Here are the mel-spectrograms of my retrained model and of the model from the repo:

[Mel-spectrograms for p270 → p228: retrained model (p270-p228-own) vs. supplied model (p270-p228-paper)]
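
For reference, a mel-spectrogram in this format can be resynthesized with Griffin-Lim roughly as in the sketch below, using librosa's mel inversion; the de-normalization constants and STFT parameters are assumptions based on the issue #4 preprocessing and may need adjusting:

import numpy as np
import librosa

def mel_to_wav_griffin_lim(mel_norm: np.ndarray, sr: int = 16000) -> np.ndarray:
    # mel_norm: (num_frames, 80) normalized log-mel from the issue #4 preprocessing.
    # The constants below (100 dB range, -16 dB reference) are assumptions; adjust
    # them to the exact normalization you use.
    db = mel_norm * 100.0 - 100.0 + 16.0     # undo the [0, 1] normalization back to dB
    magnitude = np.power(10.0, db / 20.0)    # dB -> linear magnitude mel
    return librosa.feature.inverse.mel_to_audio(
        magnitude.T,                         # librosa expects (n_mels, num_frames)
        sr=sr, n_fft=1024, hop_length=256,   # assumed analysis parameters
        power=1.0,                           # input is magnitude, not power
        fmin=90, fmax=7600, n_iter=60)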

Here is a minimal example of the loss computation and training loop I use. I can also provide more of my code if needed.

import torch.nn.functional as F

def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()

    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1) # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns the content codes from self.encoder without running the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)

    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss

    total_loss.backward()
    optimizer.step()


# Training loop
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over Mel-Spec Slices and the index of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])  # == 1.0
        # The rest computes the validation loss, resynthesizes utterances, saves the model every n epochs, etc.
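
For completeness, the validation loss is essentially the same forward pass without gradient updates; a minimal sketch, assuming a val_set_loader built like train_set_loader and `import torch` at the top:

# Minimal validation sketch (val_set_loader is assumed to be built like train_set_loader)
generator.eval()
val_losses = []
with torch.no_grad():
    for mel_spec_batch, speaker_idx_batch in val_set_loader:
        mel_spec_batch = mel_spec_batch.to(device)
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        _, mel_outputs_postnet, _ = generator(mel_spec_batch, spkr_embeddings, spkr_embeddings)
        val_losses.append(F.mse_loss(mel_outputs_postnet, mel_spec_batch.unsqueeze(1)).item())
val_loss = sum(val_losses) / max(len(val_losses), 1)
generator.train()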

Does anyone have an idea what might be wrong with my re-implementation, or has anyone managed to reimplement the system with good quality?

Thanks a lot in advance.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (6 by maintainers)

Most upvoted comments

I finally got this model working on a Chinese corpus and get quite good output quality, even when converting from an unseen speaker to an unseen speaker. I use 120 speakers from the corpus with 120 utterances per speaker, a 1e-4 learning rate, and a batch size of 4. The content embedding is downsampled by a factor of 16 (i.e. Generator(32, 256, 512, 16)); the default factor of 32 does not work. I use a GE2E speaker encoder trained on a dataset combining several Chinese corpora with ~2800 speakers in total, and I use l1_loss instead of mse_loss for all three losses. The model is trained for 370k steps and reaches a training loss of ~0.045.
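
In code, that configuration corresponds roughly to the sketch below (constructor arguments follow Generator(dim_neck, dim_emb, dim_pre, freq) as in this repo; the loss lines are drop-in replacements for the MSE losses in a train step like the one above):

# Configuration sketch based on the description above
generator = Generator(32, 256, 512, 16).to(device)   # freq=16: content codes downsampled by 16
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)  # batch size 4, ~370k steps

# All three loss terms switched from MSE to L1 (weights kept at 1.0):
rec_loss = F.l1_loss(mel_outputs_postnet, mel_spec_batch_exp)
rec_0_loss = F.l1_loss(mel_outputs, mel_spec_batch_exp)
content_loss = F.l1_loss(content_codes_gen_output, content_codes_mel_input)
total_loss = rec_loss + rec_0_loss + content_loss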

This is what one of the training samples looks like; I also plot the downsampled content embedding along with the upsampled one: [Screen Shot 2020-01-22 at 4 47 53 AM]

This is a plot of a conversion from an unseen male to an unseen female speaker. I plot the source utterance, the converted utterance, one of the target speaker's utterances (the target speaker embedding is averaged over 5 utterances at inference time), and the self-reconstructed utterance: [Screen Shot 2020-01-22 at 5 12 16 AM]