vits2_pytorch: Training code error
Hi p0p4k, thanks for sharing the code. It's a great project. I have been following it for a while and have tried it several times, but when I run the training code I still get the error below. I suspect some parameters are being passed incorrectly, i.e. the actual parameters are not fully taken from vits2_ljs_base.json. I tried to debug and fix it myself, but without success. Looking forward to your review and reply.
When I run:
python train.py -c configs/vits2_ljs_base.json -m ljs_base
I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 157, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 191, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 748, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 495, in forward
    x = self.pre(x) * x_mask
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 80, 1], expected input[32, 513, 298] to have 80 channels, but got 513 channels instead
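For context, the shape mismatch in that last frame can be reproduced in isolation. A minimal sketch (channel counts and the tensor shape are taken from the error message; the 1x1 convolution mirrors the posterior encoder's `pre` layer):

```python
import torch
import torch.nn as nn

# The posterior encoder's input projection expects an 80-channel mel-spectrogram.
pre = nn.Conv1d(in_channels=80, out_channels=192, kernel_size=1)

# The dataloader, however, produced a 513-channel linear spectrogram
# (filter_length=1024 gives 1024 // 2 + 1 = 513 frequency bins).
lin_spec = torch.randn(32, 513, 298)  # [batch, channels, frames], as in the traceback

pre(lin_spec)  # raises: expected input[32, 513, 298] to have 80 channels, but got 513 channels
```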
Hi, I tried using PyTorch 1.13.1 and training worked for me. I suggest using the same version.
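For example, in a pip-based environment that would be something like:
pip install torch==1.13.1
(adjust the index URL or CUDA build to match your setup).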
I'm training on LJSpeech. If I have results by tomorrow, I'll report back. Thanks again for the update!
Okay. Then give me 2 hours, I will fix the bug and let you know. Thanks.
When I try 30adb2d, I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 346, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 160, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 …
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Check again. Thanks.
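For reference, the workaround that the error message itself suggests would look roughly like the sketch below (the DDP wrapping of `net_g` in train.py is assumed, not quoted; note this only hides the symptom, and the actual fix in this thread was making all generator outputs contribute to the loss):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(module: nn.Module, rank: int) -> DDP:
    # find_unused_parameters=True tells DDP to detect parameters that received
    # no gradient in an iteration instead of raising the error above.
    return DDP(module, device_ids=[rank], find_unused_parameters=True)
```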
Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 345, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 128, in run
    noise_scale_delta = noise_scale_delta,
UnboundLocalError: local variable 'noise_scale_delta' referenced before assignment
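That UnboundLocalError is the usual pattern of a name being assigned only inside a conditional branch and then used unconditionally. A hedged sketch of the shape of the fix (the hyperparameter names below are assumptions, not verified against the repo):

```python
def get_noise_scale_delta(hps) -> float:
    # Sketch: give noise_scale_delta a value on every path, instead of only
    # inside the noise-scaled-MAS branch, so the keyword argument
    # noise_scale_delta=noise_scale_delta at train.py line 128 never
    # references an unbound local variable.
    if getattr(hps.model, "use_noise_scaled_mas", False):  # assumed flag name
        return hps.model.noise_scale_delta  # assumed config field
    return 0.0
```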
This issue does not seem to be related to the PyTorch version. I still have this problem with PyTorch 2.0.
I am downloading data and trying to train one step and check the previous error regarding loss.
Currently I am using PyTorch 1.13. Do I have to use PyTorch 2.0?
I tried the latest code, ee1c94d.

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 192, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 …
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Explanation for the bug: After generating the waveform output (wav_pred), the model converts wav_pred to a mel-spectrogram to compare it with the mel-spectrogram of wav_real. In the VITS-1 model that mel-spectrogram is obtained from the linear spectrogram used as input. However, in VITS-2 the input is already a mel-spectrogram, so the bug occurs when trying to convert a mel-spec to a mel-spec (???). We must directly use the mel-spectrogram that was fed into the model and compare it with wav_pred's mel-spectrogram.
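In code, the fix described above amounts to skipping the linear-to-mel conversion when the posterior encoder already consumes mel-spectrograms. A hedged sketch of what the branch in train_and_evaluate could look like (the `use_mel_posterior_encoder` flag is the one discussed in this thread; the spec_to_mel_torch arguments mirror the upstream VITS mel_processing.py and may differ here):

```python
# `spec` is the spectrogram the dataloader fed to the posterior encoder.
if hps.data.use_mel_posterior_encoder:
    # VITS-2 path: spec is already an 80-channel mel-spectrogram; use it directly.
    mel = spec
else:
    # VITS-1 path: spec is a 513-channel linear spectrogram; convert it to mel first.
    mel = spec_to_mel_torch(
        spec,
        hps.data.filter_length,
        hps.data.n_mel_channels,
        hps.data.sampling_rate,
        hps.data.mel_fmin,
        hps.data.mel_fmax,
    )
```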
Fixed. @UESTCgan, thanks a lot for letting me know about the errors. This feedback is really helpful! Let's get the model working ASAP!
Haha, of course. I am making so many silly mistakes. Fixing it right now.
I tried the latest code, committed an hour ago, ca7e41d.

When I run: python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the error:

Traceback (most recent call last):
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 338, in <module>
    main()
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
    mel = spec_to_mel_torch(
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/mel_processing.py", line 85, in spec_to_mel_torch
    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (9536x80 and 513x80)
Thank you very much. I will try this latest code and report back the results.
Hello, I made a really silly mistake. Please try the latest patch and let me know. In train.py, I was supposed to modify `hps.data.use_mel_posterior_encoder` based on `hps.model.use_mel_posterior_encoder` before passing `hps.data` to the dataloader. However, I loaded the dataloader first, which generates linear spectrograms with 513 channels; the model parameters then load a model that accepts mel-spectrograms with 80 channels, and only after that did I modify the `hps.data` params (which are never used, since the dataloader is already loaded). I fixed the order and also added an additional flag in `hps.data` just to be sure for now. I will do a clean-up later to avoid model/data parameter mismatches (minor stuff). Thanks.

If any of you have solved this problem, I look forward to you sharing your solutions. Thank you very much!
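For reference, the ordering fix p0p4k describes can be sketched as below (the dataset class is a stand-in passed as a parameter; attribute names follow the comment above, not the exact repo code):

```python
def build_train_dataset(hps, dataset_cls):
    # Copy the mel flag from hps.model into hps.data *before* the dataset is
    # constructed, so the dataloader emits 80-channel mel-spectrograms that
    # match what the mel posterior encoder expects, instead of 513-channel
    # linear spectrograms.
    if getattr(hps.model, "use_mel_posterior_encoder", False):
        hps.data.use_mel_posterior_encoder = True
    return dataset_cls(hps.data.training_files, hps.data)
```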