TTS: [Bug] The model does not train with the new hyperparameters given on the command line when restarting a training with `restore_path` or `continue_path`
Describe the bug
I am trying to continue the training of a multi-speaker VITS model in Catalan with 4 16GB V100 GPUs.
I want to modify some hyperparameters (such as the learning rate) to find the optimal configuration. When launching the new training with the --restore_path argument plus additional hyperparameter arguments, a new config is created with the updated hyperparameters. However, during training the model does not use these new hyperparameters; it keeps using the ones from the original model config.
In the "To Reproduce" section I attach the config of the original training, as well as the config, the logs and the command line used to run the new training.
Regarding the --continue_path argument: when continuing the training from the point where it stopped, the model resets the learning rate to the one in the original config.
Since the behavior is the same in both cases (the parameters of the original config are used and the new ones passed on the command line are ignored), I thought it appropriate to report them in the same issue.
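For completeness, a small illustrative check (not part of the original report) that the regenerated config really contains the values passed via --coqpit.*; the file names are placeholders for the two attached configs, and lr_gen / lr_disc are the fields targeted by --coqpit.lr_gen / --coqpit.lr_disc:

import json

# Placeholder paths for the attached original and regenerated configs.
for name in ("config_original.json", "config_new.json"):
    with open(name) as f:
        cfg = json.load(f)
    # These are the fields the --coqpit.lr_gen / --coqpit.lr_disc overrides should update.
    print(name, "lr_gen:", cfg.get("lr_gen"), "lr_disc:", cfg.get("lr_disc"))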
To Reproduce
Original config: config.txt
Newly generated config: config.txt
Logs of the new training: trainer_0_log.txt
The attached logs show current_lr_0: 0.00050 and current_lr_1: 0.00050 instead of the 0.0002 passed on the command line:
 --> STEP: 24/1620 -- GLOBAL_STEP: 170025
| > loss_disc: 2.35827 (2.46076)
| > loss_disc_real_0: 0.14623 (0.14530)
| > loss_disc_real_1: 0.23082 (0.20939)
| > loss_disc_real_2: 0.22020 (0.21913)
| > loss_disc_real_3: 0.19430 (0.22623)
| > loss_disc_real_4: 0.21045 (0.22390)
| > loss_disc_real_5: 0.20165 (0.23435)
| > loss_0: 2.35827 (2.46076)
| > grad_norm_0: 24.36758 (16.55595)
| > loss_gen: 2.37695 (2.37794)
| > loss_kl: 2.56117 (2.30560)
| > loss_feat: 9.57505 (8.38634)
| > loss_mel: 22.84378 (22.47223)
| > loss_duration: 1.59958 (1.55717)
| > loss_1: 38.95654 (37.09929)
| > grad_norm_1: 192.16046 (145.46979)
| > current_lr_0: 0.00050
| > current_lr_1: 0.00050
| > step_time: 0.96620 (1.22051)
| > loader_time: 0.00510 (0.00600)
Below I attach the command line used to launch the new training:
export RECIPE="${RUN_DIR}/recipes/multispeaker/vits/experiments/train_vits_ca.py"
export RESTORE="${RUN_DIR}/recipes/multispeaker/vits/experiments/checkpoint_vits_170000.pth"
python -m trainer.distribute --script ${RECIPE} -gpus "0,1,2,3" \
--restore_path ${RESTORE} --coqpit.lr_gen 0.0002 --coqpit.lr_disc 0.0002 \
--coqpit.eval_batch_size 8 --coqpit.epochs 4 --coqpit.batch_size 16
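For reference, here is a hedged sketch (not from the report) of how one could inspect the learning rate stored inside the restored checkpoint itself; the "optimizer" key name and the exact layout of the stored state are assumptions about how the Trainer serializes the .pth file and may differ between versions:

import torch

# Load the checkpoint passed to --restore_path on CPU.
ckpt = torch.load("checkpoint_vits_170000.pth", map_location="cpu")

# VITS trains with two optimizers (discriminator and generator), so the stored
# optimizer state may be a list; the "optimizer" key is an assumption.
opt_state = ckpt.get("optimizer")
states = opt_state if isinstance(opt_state, (list, tuple)) else [opt_state]

for i, state in enumerate(states):
    if not isinstance(state, dict):
        continue
    for group in state.get("param_groups", []):
        # A standard PyTorch optimizer state_dict stores the lr per param group.
        print(f"optimizer {i} stored lr: {group.get('lr')}")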
Expected behavior
No response
Logs
No response
Environment
{
"CUDA": {
"GPU": [
"Tesla V100-SXM2-16GB",
"Tesla V100-SXM2-16GB",
"Tesla V100-SXM2-16GB",
"Tesla V100-SXM2-16GB"
],
"available": true,
"version": "10.2"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.9.0a0+git3d70ab0",
"TTS": "0.6.2",
"numpy": "1.19.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "ppc64le",
"python": "3.7.4",
"version": "#1 SMP Tue Sep 25 12:28:39 EDT 2018"
}
}
Additional context
Trainer was updated to trainer==0.0.13. Please let me know if you need more information and thank you in advance.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (15 by maintainers)
When you restore the model you also restore the scheduler, and it probably overrides what you define on the terminal? @loganhart420 can you check if that is the case?
Thanks for letting me know, I'll run the same setup and look into it.
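For context, the suspected mechanism can be reproduced with plain PyTorch in a minimal, self-contained sketch (illustrative only, not code from the Trainer): load_state_dict restores the learning rate saved in the checkpointed optimizer state, so the value built from the new config is overwritten unless it is reapplied afterwards.

import torch

param = torch.nn.Parameter(torch.zeros(1))

# Original run: optimizer built with lr=0.0005, then its state is saved.
old_opt = torch.optim.AdamW([param], lr=0.0005)
saved_state = old_opt.state_dict()

# Restored run: optimizer built with the new lr=0.0002 from the CLI override...
new_opt = torch.optim.AdamW([param], lr=0.0002)
# ...but loading the saved state brings back the old lr.
new_opt.load_state_dict(saved_state)
print(new_opt.param_groups[0]["lr"])  # 0.0005

# Manual workaround: reapply the intended lr (and rebuild any scheduler) after restoring.
for group in new_opt.param_groups:
    group["lr"] = 0.0002
print(new_opt.param_groups[0]["lr"])  # 0.0002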