TTS: [Bug] The model does not train with the new hyperparameters given on the command line when restarting a training with `restore_path` or `continue_path`

Describe the bug

I am trying to continue the training of a multi-speaker VITS model in Catalan with four 16 GB V100 GPUs.

I want to modify some hyperparameters (e.g. the learning rate) to find the optimal configuration. When I launch the new training with the --restore_path argument plus the hyperparameter overrides, a new config is created with the updated values. However, during training the model does not use these new hyperparameters; it keeps using the ones from the original model config.

In the “To Reproduce” section I attach the config of the original training, as well as the config, the logs and the command line used to run the new training.

Regarding the --continue_path argument: when continuing the training from the point where it stopped, the model resets the learning rate to the one in the original config.

Since the behavior is the same in both cases (the parameters of the original config are used and the new ones passed on the command line are ignored), I thought it appropriate to report them in the same issue.
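A quick way to confirm that the newly generated config really contains the updated values is to diff the two configs directly. A minimal sketch (the file names are placeholders for the attached configs):

import json

# Placeholder file names standing in for the original and the newly generated config.
with open("config_original.json") as f:
    old_cfg = json.load(f)
with open("config_new.json") as f:
    new_cfg = json.load(f)

# Compare the hyperparameters overridden on the command line.
for key in ("lr_gen", "lr_disc", "batch_size", "eval_batch_size", "epochs"):
    print(key, old_cfg.get(key), "->", new_cfg.get(key))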

To Reproduce

Original config: config.txt

Newly generated config: config.txt
Logs of the new training: trainer_0_log.txt
The logs show current_lr_0: 0.00050 and current_lr_1: 0.00050 instead of the 0.0002 passed on the command line:

   --> STEP: 24/1620 -- GLOBAL_STEP: 170025
     | > loss_disc: 2.35827  (2.46076)
     | > loss_disc_real_0: 0.14623  (0.14530)
     | > loss_disc_real_1: 0.23082  (0.20939)
     | > loss_disc_real_2: 0.22020  (0.21913)
     | > loss_disc_real_3: 0.19430  (0.22623)
     | > loss_disc_real_4: 0.21045  (0.22390)
     | > loss_disc_real_5: 0.20165  (0.23435)
     | > loss_0: 2.35827  (2.46076)
     | > grad_norm_0: 24.36758  (16.55595)
     | > loss_gen: 2.37695  (2.37794)
     | > loss_kl: 2.56117  (2.30560)
     | > loss_feat: 9.57505  (8.38634)
     | > loss_mel: 22.84378  (22.47223)
     | > loss_duration: 1.59958  (1.55717)
     | > loss_1: 38.95654  (37.09929)
     | > grad_norm_1: 192.16046  (145.46979)
     | > current_lr_0: 0.00050 
     | > current_lr_1: 0.00050 
     | > step_time: 0.96620  (1.22051)
     | > loader_time: 0.00510  (0.00600)

Below I attach the command line used to launch the new training:

export RECIPE="${RUN_DIR}/recipes/multispeaker/vits/experiments/train_vits_ca.py"
export RESTORE="${RUN_DIR}/recipes/multispeaker/vits/experiments/checkpoint_vits_170000.pth"

python -m trainer.distribute --script ${RECIPE} -gpus "0,1,2,3" \
--restore_path ${RESTORE} --coqpit.lr_gen 0.0002 --coqpit.lr_disc 0.0002 \
--coqpit.eval_batch_size 8 --coqpit.epochs 4 --coqpit.batch_size 16
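As a sanity check, the restored checkpoint itself can be opened with PyTorch to see which learning rate is stored in its optimizer state. A rough diagnostic sketch; the key names ("optimizer", "param_groups") are assumptions about the checkpoint layout and may differ for an actual VITS checkpoint:

import torch

# Load the checkpoint on CPU just to inspect it.
ckpt = torch.load("checkpoint_vits_170000.pth", map_location="cpu")
print(ckpt.keys())  # check which keys the checkpoint actually contains

# Assumption: optimizer state is stored under a key such as "optimizer";
# VITS trains with two optimizers, so this may be a list of state dicts.
opt_state = ckpt.get("optimizer")
if opt_state is not None:
    states = opt_state if isinstance(opt_state, (list, tuple)) else [opt_state]
    for i, state in enumerate(states):
        for group in state["param_groups"]:
            print(f"optimizer {i}: stored lr = {group['lr']}")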

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.9.0a0+git3d70ab0",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "ppc64le",
        "python": "3.7.4",
        "version": "#1 SMP Tue Sep 25 12:28:39 EDT 2018"
    }
}

Additional context

Trainer was updated to trainer==0.0.13. Please let me know if you need more information and thank you in advance.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (15 by maintainers)

Most upvoted comments

When you restore the model you probably also restore the scheduler, and it overrides what you define on the terminal? @loganhart420 can you check if that is the case?
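If that is the mechanism, it can be reproduced with plain PyTorch, independently of the Trainer: loading an optimizer state dict restores the learning rate that was saved with it and silently discards the lr the new optimizer was constructed with. A minimal standalone sketch:

import torch

model = torch.nn.Linear(4, 4)

# "Original" run: optimizer built with lr=0.0005, its state saved in the checkpoint.
opt_old = torch.optim.AdamW(model.parameters(), lr=0.0005)
saved_state = opt_old.state_dict()

# "Restored" run: optimizer built with the new lr from the command line...
opt_new = torch.optim.AdamW(model.parameters(), lr=0.0002)
print(opt_new.param_groups[0]["lr"])  # 0.0002

# ...but loading the saved state brings back the old lr.
opt_new.load_state_dict(saved_state)
print(opt_new.param_groups[0]["lr"])  # 0.0005 again

# A possible fix is to re-apply the configured lr after the restore:
for group in opt_new.param_groups:
    group["lr"] = 0.0002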

Hi @loganhart420, I am a colleague of @GerrySant. In the end we restructured our data in the vctk_old format and launched some runs using train_tts.py, and we still have the same problem, i.e. stderr shows that the lr_gen and lr_disc actually used are not consistent with the values passed via coqpit. This time we tried both v0.6.2 and v0.8.0.

Although the results are the same in all cases (the initially shared configs and the two new ones), I am attaching the input and output configs plus the log for the run launched with TTS v0.8.0.

For the command:

export RUN_DIR=./TTS_v0.8.0
module purge
source $RUN_DIR/use_venv.sh

export RECIPE=${RUN_DIR}/TTS/bin/train_tts.py
export CONFIG=${RUN_DIR}/recipes/multispeaker/config_experiments/config_mixed.json
export RESTORE=${RUN_DIR}/../TTS/recipes/multispeaker/vits/config_experiments/best_model.pth

CUDA_VISIBLE_DEVICES="0" python ${RECIPE} --config_path ${CONFIG} --restore_path ${RESTORE} \
                                          --coqpit.lr_disc 0.0001 --coqpit.lr_gen 0.0001 \
                                          --coqpit.batch_size 32

Files: trainer_0_log.txt, config_input.txt, config_output.txt
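In case it helps: one way to see which learning rate is actually applied during the run is to print the optimizers' param groups right after the restore step. A rough sketch, assuming the trainer object exposes its optimizer(s) as trainer.optimizer (the attribute name, and whether it is a single optimizer or a list, may differ between versions):

def print_effective_lrs(trainer):
    # Assumption: `trainer.optimizer` holds the optimizer(s) after the restore.
    opts = trainer.optimizer
    if not isinstance(opts, (list, tuple)):
        opts = [opts]
    for i, opt in enumerate(opts):
        for group in opt.param_groups:
            print(f"optimizer {i}: effective lr = {group['lr']}")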

Thanks for letting me know, I’ll run the same setup and look into it.