WaveGlow: gradient overflow when training
I am training a WaveGlow model from scratch on an 8 kHz sampling-rate dataset and got this error after 1 epoch.
Since PyTorch doesn't fully support RTX GPUs yet, I have to use my old 1050 Ti and set `batch_size` to 1; this is my `config.json`.
Is the `batch_size` being too small causing the problem, or am I using the wrong audio params?
```json
{
  "train_config": {
    "fp16_run": true,
    "output_directory": "checkpoints",
    "epochs": 100000,
    "learning_rate": 1e-4,
    "sigma": 1.0,
    "iters_per_checkpoint": 2000,
    "batch_size": 1,
    "seed": 1234,
    "checkpoint_path": "",
    "with_tensorboard": true
  },
  "data_config": {
    "training_files": "train_files.txt",
    "segment_length": 16000,
    "sampling_rate": 8000,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 4000.0
  },
  "dist_config": {
    "dist_backend": "nccl",
    "dist_url": "tcp://localhost:54321"
  },
  "waveglow_config": {
    "n_mel_channels": 80,
    "n_flows": 18,
    "n_group": 8,
    "n_early_every": 4,
    "n_early_size": 2,
    "WN_config": {
      "n_layers": 4,
      "n_channels": 256,
      "kernel_size": 3
    }
  }
}
```
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 18
I noticed that `sigma` is set to 1.0 in `config.json`, but it is 0.666 in the Tacotron2 inference file. Is it supposed to be like that, or do they have to be the same value? Moreover, what is sigma?

@EuphoriaCelestial I picked something similar to the original; the original uses a `segment_length` of 16000 for a 22.05 kHz sample-rate audio file. So I know that the original WaveGlow worked well with segments a little over half a second long. 6144 is a little over half of 8000, and 6144 is divisible by the `hop_length` of 256, so there's no extra padding. It does not have to be perfect, but too small makes it hard to learn low frequencies, and multiples of the `hop_length` waste less compute.

@CookiePPP cool, thank you for the knowledge!
@EuphoriaCelestial I would use a little over half a second again (and a multiple of `hop_length`); anything between 8192 and 12288 would be cool. You can use anything you want, but just don't make it too small is the main thing.
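The suggestions above boil down to two constraints: the segment should be roughly half a second or more of audio, and it should be an exact multiple of `hop_length` so no padding is wasted. A minimal sketch of that arithmetic (the helper name is illustrative, not part of the WaveGlow codebase):

```python
# Pick a segment_length that is (a) close to a target duration in seconds
# and (b) an exact multiple of hop_length, so the STFT frames tile the
# segment with no extra padding.
def pick_segment_length(sampling_rate, hop_length, target_seconds):
    raw = int(sampling_rate * target_seconds)
    # Round down to the nearest multiple of hop_length.
    return (raw // hop_length) * hop_length

# For an 8 kHz dataset with hop_length 256, a ~0.77 s target lands on 6144,
# the value suggested in this thread.
print(pick_segment_length(8000, 256, 0.77))  # -> 6144
```

The same arithmetic explains why values like 8192 or 12288 also work: both are multiples of 256 and correspond to roughly 1 to 1.5 seconds at 8 kHz.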
https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/waveglow/arg_parser.py#L53

The DeepLearningExamples version uses a `segment_length` of 4000 for 22.05 kHz, so it seems to be all over the place…

https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/scripts/train_waveglow.sh

Or does it use a `segment_length` of 8000? I have no idea what's considered normal. I use half a second with my models and it works well enough for me.
okay, I will wait and report later
@EuphoriaCelestial loss scale around 256 is normal. I worry once it goes under 64.
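The loss-scale numbers mentioned here come from dynamic loss scaling in fp16 training: overflowed steps are skipped and the scale is halved, and after a run of clean steps the scale is grown back. A minimal sketch of that behavior (assumed typical behavior, not the exact scaler this repo uses; names are illustrative):

```python
# Dynamic loss scaling, as commonly used for fp16 training:
# on gradient overflow the step is skipped and the scale is halved;
# after growth_interval consecutive clean steps the scale doubles.
class DynamicLossScaler:
    def __init__(self, init_scale=2**15, growth_interval=2000):
        self.scale = float(init_scale)
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, overflowed):
        if overflowed:
            self.scale /= 2          # back off after an overflow
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= 2      # cautiously grow back
                self._good_steps = 0

scaler = DynamicLossScaler(init_scale=512)
for overflow in [False, True, True, False]:
    scaler.update(overflow)
print(scaler.scale)  # 128.0 after two consecutive overflows
```

This is why an occasional "Gradient overflow" message is harmless, but a scale that keeps halving (dropping below ~64) suggests the gradients themselves are diverging.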
@EuphoriaCelestial With a `sampling_rate` of 8000, your `segment_length` of 16000 is 2 seconds long. I don't know what dataset you use, but you might be training on a lot of padded data (files under 2 seconds long will be padded with zeros). You should probably decrease `segment_length` to 6144 or something along those lines and increase `batch_size` to 4.

Just need to match Tacotron2 and you're good.
You have 18 flows. You start with 8 channels, and every 4 flows you output 2 of the channels.
At flow 0 you have 8 channels; after the early outputs at flows 4, 8, 12, and 16, you'd be down to 0 channels for the last flows.
You cannot have 0 channels…
I’m pretty sure this config is not the config you used, as I don’t think this config can start.
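The channel bookkeeping above can be checked with a few lines. This is a sketch of the accounting (assumed to mirror WaveGlow's early-output scheme: after flow 0, every `n_early_every` flows, `n_early_size` channels are split off to the output), not the actual model code:

```python
# Track how many channels each flow operates on, given WaveGlow's
# early-output scheme: at every n_early_every-th flow (except flow 0),
# n_early_size channels are diverted to the output before the flow runs.
def channels_per_flow(n_group, n_flows, n_early_every, n_early_size):
    channels = n_group
    trace = []
    for k in range(n_flows):
        if k % n_early_every == 0 and k > 0:
            channels -= n_early_size
        trace.append(channels)
    return trace

# The config in question: 18 flows, 8 channels, output 2 every 4 flows.
trace = channels_per_flow(8, 18, 4, 2)
print(trace)  # flows 16 and 17 would have to operate on 0 channels
```

With the stock config (12 flows instead of 18) the same accounting leaves 4 channels at the final flow, which is why that config can start and this one cannot.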