vits2_pytorch: Training code error
Hi p0p4k, thanks for sharing the code. It's a great project. I have been following it for a while and have tried it several times, but when I run the training code I still get the error below. I suspect some parameters are being passed incorrectly, i.e. the actual parameters are not fully taken from vits2_ljs_base.json. I tried to debug and fix it myself, but without success. Looking forward to your review and reply.
When I run:
python train.py -c configs/vits2_ljs_base.json -m ljs_base
I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 157, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 191, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 748, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 495, in forward
    x = self.pre(x) * x_mask
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 80, 1], expected input[32, 513, 298] to have 80 channels, but got 513 channels instead
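For context, the shape mismatch in that last frame can be reproduced in isolation. A minimal sketch (channel counts and the tensor shape are taken from the error message; the 1x1 convolution mirrors the posterior encoder's `pre` layer):

```python
import torch
import torch.nn as nn

# The posterior encoder's input projection expects an 80-channel mel-spectrogram.
pre = nn.Conv1d(in_channels=80, out_channels=192, kernel_size=1)

# The dataloader, however, produced a 513-channel linear spectrogram
# (filter_length=1024 gives 1024 // 2 + 1 = 513 frequency bins).
lin_spec = torch.randn(32, 513, 298)  # [batch, channels, frames], as in the traceback

pre(lin_spec)  # raises: expected input[32, 513, 298] to have 80 channels, but got 513 channels
```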
Hi, I tried using PyTorch 1.13.1 and training worked for me. I suggest using the same version.
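For example, in a pip-based environment that would be something like:
pip install torch==1.13.1
(adjust the index URL or CUDA build to match your setup).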
I'm training on LJSpeech. If I have results by tomorrow, I'll report back. Thanks again for the update!
Okay. Then give me 2 hours, I will fix the bug and let you know. Thanks.
When I try 30adb2d, I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 346, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 160, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 …
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Check again. Thanks.
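For reference, the workaround that the error message itself suggests would look roughly like the sketch below (the DDP wrapping of `net_g` in train.py is assumed, not quoted; note this only hides the symptom, and the actual fix in this thread was making all generator outputs contribute to the loss):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(module: nn.Module, rank: int) -> DDP:
    # find_unused_parameters=True tells DDP to detect parameters that received
    # no gradient in an iteration instead of raising the error above.
    return DDP(module, device_ids=[rank], find_unused_parameters=True)
```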
Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 345, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 128, in run
    noise_scale_delta = noise_scale_delta,
UnboundLocalError: local variable 'noise_scale_delta' referenced before assignment
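That UnboundLocalError is the usual pattern of a name being assigned only inside a conditional branch and then used unconditionally. A hedged sketch of the shape of the fix (the hyperparameter names below are assumptions, not verified against the repo):

```python
def get_noise_scale_delta(hps) -> float:
    # Sketch: give noise_scale_delta a value on every path, instead of only
    # inside the noise-scaled-MAS branch, so the keyword argument
    # noise_scale_delta=noise_scale_delta at train.py line 128 never
    # references an unbound local variable.
    if getattr(hps.model, "use_noise_scaled_mas", False):  # assumed flag name
        return hps.model.noise_scale_delta  # assumed config field
    return 0.0
```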
This issue does not seem to be related to the PyTorch version. I still have this problem with PyTorch 2.0.
I am downloading data and trying to train one step and check the previous error regarding loss.
Currently I am using PyTorch 1.13. Do I have to use PyTorch 2.0?
I tried the latest code, ee1c94d.

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 192, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 …
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Explanation for the bug: After generating the waveform output (wav_pred), the model converts wav_pred to a mel-spectrogram to compare it with the mel-spectrogram of wav_real. In the VITS-1 model that mel-spectrogram is obtained from the linear spectrogram used as input. However, in VITS-2 the input is already a mel-spectrogram, so the bug occurs when trying to convert a mel-spec to a mel-spec (???). We must directly use the mel-spectrogram that was fed into the model and compare it with wav_pred's mel-spectrogram.
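In code, the fix described above amounts to skipping the linear-to-mel conversion when the posterior encoder already consumes mel-spectrograms. A hedged sketch of what the branch in train_and_evaluate could look like (the `use_mel_posterior_encoder` flag is the one discussed in this thread; the spec_to_mel_torch arguments mirror the upstream VITS mel_processing.py and may differ here):

```python
# `spec` is the spectrogram the dataloader fed to the posterior encoder.
if hps.data.use_mel_posterior_encoder:
    # VITS-2 path: spec is already an 80-channel mel-spectrogram; use it directly.
    mel = spec
else:
    # VITS-1 path: spec is a 513-channel linear spectrogram; convert it to mel first.
    mel = spec_to_mel_torch(
        spec,
        hps.data.filter_length,
        hps.data.n_mel_channels,
        hps.data.sampling_rate,
        hps.data.mel_fmin,
        hps.data.mel_fmax,
    )
```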
Fixed. @UESTCgan, thanks a lot for letting me know about the errors. This feedback is really helpful! Let's get the model working ASAP!
Haha, of course. I am making so many silly mistakes. Fixing it right now.
I tried the latest code, committed an hour ago, ca7e41d.

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 338, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
    mel = spec_to_mel_torch(
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/mel_processing.py", line 85, in spec_to_mel_torch
    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (9536x80 and 513x80)
Thank you very much. I will try this latest code and report back the results.
Hello, I made a really silly mistake. Please try the latest patch and let me know. In train.py, I was supposed to modify `hps.data.use_mel_posterior_encoder` based on `hps.model.use_mel_posterior_encoder` before passing `hps.data` to the dataloader. However, I loaded the dataloader first, which generates linear spectrograms with 513 channels; the model parameters then load a model that accepts mel-spectrograms with 80 channels, and only after that did I modify the `hps.data` params (which are never used, since the dataloader is already loaded). I fixed the order and also added an additional flag in `hps.data` just to be sure for now. I will do a clean-up later to avoid model/data parameter mismatches (minor stuff). Thanks.

If any of you have solved this problem, I look forward to you sharing your solutions. Thank you very much!
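For reference, the ordering fix p0p4k describes can be sketched as below (the dataset class is a stand-in passed as a parameter; attribute names follow the comment above, not the exact repo code):

```python
def build_train_dataset(hps, dataset_cls):
    # Copy the mel flag from hps.model into hps.data *before* the dataset is
    # constructed, so the dataloader emits 80-channel mel-spectrograms that
    # match what the mel posterior encoder expects, instead of 513-channel
    # linear spectrograms.
    if getattr(hps.model, "use_mel_posterior_encoder", False):
        hps.data.use_mel_posterior_encoder = True
    return dataset_cls(hps.data.training_files, hps.data)
```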