SoundStorm-pytorch: Problems with SoundStorm

Have trained the update_v2 branch on:

  • Semantic tokens extracted from HuBERT Large layer 16 with 1024-cluster K-means (50 tok/sec).
  • Acoustic tokens extracted from EnCodec at 24 kHz sample rate, 240 hop length, with the 8-codebook config from here (100 tok/sec).
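The two rates above imply a fixed 2:1 alignment between the streams: EnCodec at 24 kHz with hop 240 yields 100 frames/sec, while HuBERT gives 50 tokens/sec, so each semantic token spans exactly two acoustic frames. A minimal sketch of that alignment (the `upsample_semantic` helper is hypothetical, not part of the repo):

```python
import numpy as np

# Assumed rates from the setup above: HuBERT semantic tokens at 50 Hz,
# EnCodec acoustic frames at 24000 / 240 = 100 Hz.
SAMPLE_RATE = 24_000
HOP_LENGTH = 240
SEMANTIC_RATE = 50                          # tokens/sec from HuBERT layer 16
ACOUSTIC_RATE = SAMPLE_RATE // HOP_LENGTH   # 100 frames/sec

def upsample_semantic(semantic_tokens):
    """Repeat each semantic token so it lines up 1:1 with acoustic frames."""
    factor = ACOUSTIC_RATE // SEMANTIC_RATE  # = 2
    return np.repeat(np.asarray(semantic_tokens), factor)

sem = np.array([3, 7, 7, 1])                # 4 tokens = 80 ms of audio
print(upsample_semantic(sem))               # [3 3 7 7 7 7 1 1]
```

If the conditioning stream is misaligned by even one frame, the model trains on shifted targets, which is one easy-to-miss cause of noisy output.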

Results: The output is not as desired; here is a sample (the first 6 seconds are the prompt).

This thread serves as a potential issue tracker and solution log.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 49 (26 by maintainers)

Most upvoted comments

For an experiment on a larger dataset I tried LibriTTS 100/360/500 merged together; the quality is strangely bad (50% top-10 training accuracy, while LJSpeech reaches 65%).

I have also trained on the LibriLight large subset from here: https://huggingface.co/datasets/collabora/whisperspeech/tree/main. After 100k steps at batch size 24 I got top-1 accuracy of ~27% to 30% and top-10 accuracy of ~55% to 63%, but the generated audio is abysmal, nothing in it but noise. The model also performs very poorly on LibriTTS.
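For readers comparing against the top-1/top-10 numbers quoted above, here is a small sketch of how top-k accuracy over masked positions is typically computed (the `topk_accuracy` helper is illustrative, not taken from the repo):

```python
import numpy as np

def topk_accuracy(logits, targets, k=10):
    """Fraction of positions whose target token is among the k highest logits.

    logits:  (num_positions, vocab_size) array of unnormalized scores
    targets: (num_positions,) array of ground-truth token ids
    """
    topk = np.argsort(-logits, axis=-1)[:, :k]          # k best ids per position
    hits = (topk == targets[:, None]).any(axis=-1)      # target in top-k?
    return float(hits.mean())
```

In masked-token training this is usually evaluated only on the masked positions, so a 55% top-10 figure means the correct codebook entry is in the model's ten best guesses at just over half of the masked frames.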

Hi Rishikksh20, your training code works well on my data pipeline (I modified it a little to fit my data). For inference I made a new version that combines your and lucidrains' inference code, and it gives samples even slightly better than what I already had. Code for reference: https://github.com/feng-yufei/shared_debugging_code/blob/main/soundstorm2.py

Below I provide the core code and one sample, which I think is very close to the paper's description: https://github.com/feng-yufei/shared_debugging_code/blob/main/soundstorm.py. I hope it is useful.
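For context on what the linked inference code implements: SoundStorm decodes each RVQ level with MaskGIT-style confidence-based parallel decoding. A minimal single-level sketch, under simplifying assumptions (greedy sampling, a cosine unmasking schedule, and a hypothetical `predict_fn` that returns per-position probabilities), might look like:

```python
import numpy as np

def iterative_decode(predict_fn, seq_len, vocab_size, steps=8, mask_id=-1):
    """Confidence-based parallel decoding over one codebook level (sketch).

    predict_fn(tokens) -> (seq_len, vocab_size) probabilities, where masked
    positions in `tokens` carry `mask_id`.
    """
    tokens = np.full(seq_len, mask_id)
    for step in range(steps):
        probs = predict_fn(tokens)
        sampled = probs.argmax(-1)                      # greedy for simplicity
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != mask_id] = np.inf                # never re-mask fixed tokens
        # Cosine schedule: fraction of positions still masked after this step.
        frac = np.cos(np.pi / 2 * (step + 1) / steps)
        n_mask = int(np.floor(frac * seq_len))
        keep = np.argsort(-conf)[: seq_len - n_mask]    # most confident first
        new = np.full(seq_len, mask_id)
        new[keep] = sampled[keep]
        new[tokens != mask_id] = tokens[tokens != mask_id]
        tokens = new
    return tokens
```

The real implementation additionally samples with temperature, conditions on the semantic tokens and the prompt, and runs this loop level by level over all 8 codebooks; the sketch only shows the unmasking mechanics.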