TTS: [Bug] Noticing some anomalies in the training eval with YourTTS training

Describe the bug

The TensorBoard plots under EvalFigures/evalspectrogram/real show that, in some instances, the first two or last two columns of the spectrogram appear to be “multiplied”. This does not happen every time, but when it does, it also shows up in the ‘diff’ plot. It may affect the training output and cause artifacts.

The images show that the multiplication occurs uniformly across the entire frequency band. I am investigating the issue and trying to determine its impact. I estimate the affected area at roughly 6% (two out of 32 columns) or 12% (four out of 32 columns). I cannot tell how many frames are affected overall, since TensorBoard only shows a snapshot at a particular step, but the issue appears frequently across different datasets.

[Three screenshots of the affected spectrograms attached]

Maybe it is related to the beginning or end of a file, or something along those lines, and some padding has to be inserted to complete the segment? Curious to know what it is 😃 I am using mixed precision; I’ll test later without it to see if that makes any difference.
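To sanity-check the padding hypothesis, here is a minimal sketch (my own code, not from the repo) that compares the STFT of the same random segment under “reflect” vs “constant” padding, using the same pad-size formula as the training code. If padding mode is the culprit, only the frames whose window overlaps the padded region — i.e. the first and last couple of columns — should differ:

```python
import torch

n_fft, hop, win = 1024, 256, 1024
y = torch.randn(1, n_fft * 8)          # dummy audio, 8192 samples
window = torch.hann_window(win)
pad = (n_fft - hop) // 2               # same pad size as the training code

def spec(pad_mode):
    yp = torch.nn.functional.pad(y.unsqueeze(1), (pad, pad), mode=pad_mode).squeeze(1)
    s = torch.stft(yp, n_fft, hop_length=hop, win_length=win, window=window,
                   center=False, return_complex=True)
    return s.abs()

# Mean absolute difference per time frame between the two padding modes;
# interior frames see identical samples, so only edge frames can differ.
per_frame = (spec("reflect") - spec("constant")).abs().mean(dim=1).squeeze(0)
print(per_frame[:3], per_frame[-3:])
```

With these parameters the segment yields 32 frames, and exactly the first two and last two differ between padding modes, which matches the “two out of 32 columns” pattern seen in the eval figures.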

To Reproduce

Training on a YourTTS (and possibly VITS) model

Expected behavior

I would expect that neither the ‘real’ data nor the ‘diff’ would show a section of the spectrogram with an intermittent but consistent difference, as that suggests the problem is not coming from the training data itself.

Logs

No response

Environment

env_info.py
{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 2080 Ti"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "TTS": "0.13.0",
        "numpy": "1.23.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.6",
        "version": "#42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2"
    }
}

Additional context

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

You may find the answers you are seeking on the Coqui Discord; it’s outside the scope of this issue.

On Thu, Aug 3, 2023, 20:51 phamkhactu @.***> wrote:

@neural-loop https://github.com/neural-loop, I will try it out; if I get any results, I will share them.

@erogol https://github.com/erogol @Edresson https://github.com/Edresson And now I have tried out glow-tts and vits. Some pros and cons:

  • Glow-TTS: speech is very natural but has a buzzing noise
  • VITS: no buzzing noise, but speech is not as natural. Have you tried other models? Do you have any suggestions?

— Reply to this email directly, view it on GitHub https://github.com/coqui-ai/TTS/issues/2569#issuecomment-1664860516, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE75ENFH5XTFTXMVMM4H7DXTRIR3ANCNFSM6AAAAAAXRXSHME . You are receiving this because you were mentioned. Message ID: @.***>

@phamkhactu I’ve experimented with a couple of solution attempts but haven’t solved it yet.

One attempt was with ChatGPT (it didn’t seem to solve it; I’m thinking about trying something similar with the mel functions that also use reflect padding):

It seems that the issue you’re observing might be related to the use of padding with “reflect” mode in the wav_to_spec function. The reflection padding is often used to handle edge effects when calculating the short-time Fourier transform (STFT) spectrogram, but it can sometimes introduce artifacts at the boundaries.

To avoid these artifacts, you can try using “constant” padding with a small constant value (e.g., zero) instead of “reflect”. This way, the padding won’t affect the STFT calculation and should not introduce any unwanted artifacts. You can modify the wav_to_spec function to use “constant” padding as follows:

import torch

# Module-level cache of Hann windows, keyed by window size, dtype, and device
hann_window = {}


def wav_to_spec(y, n_fft, hop_length, win_length, center=False):
    """
    Args Shapes:
        - y : :math:`[B, 1, T]`

    Return Shapes:
        - spec : :math:`[B, C, T]`
    """
    y = y.squeeze(1)

    if torch.min(y) < -1.0:
        print("min value is ", torch.min(y))
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))

    global hann_window
    dtype_device = str(y.dtype) + "_" + str(y.device)
    wnsize_dtype_device = str(win_length) + "_" + dtype_device
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_length).to(dtype=y.dtype, device=y.device)

    # Zero-pad ("constant") instead of "reflect" so that no mirrored audio
    # content leaks into the edge STFT frames
    pad_size = int((n_fft - hop_length) / 2)
    y = torch.nn.functional.pad(y.unsqueeze(1), (pad_size, pad_size), mode="constant", value=0)
    y = y.squeeze(1)

    spec = torch.stft(
        y,
        n_fft,
        hop_length=hop_length,
        win_length=win_length,
        window=hann_window[wnsize_dtype_device],
        center=center,
        pad_mode="constant",
        normalized=False,
        onesided=True,
        return_complex=False,
    )

    # Magnitude spectrogram; the small epsilon avoids sqrt(0) gradients
    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
    return spec
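One property worth checking on any variant of this function: with pad size (n_fft - hop_length) / 2 on each side and center=False, the padding replaces torch.stft’s own centering, so a T-sample input should produce exactly T // hop_length frames and n_fft // 2 + 1 frequency bins. A quick self-contained check of that arithmetic (my own sketch, not from the repo):

```python
import torch

n_fft, hop, win = 1024, 256, 1024
T = hop * 40                          # dummy length divisible by hop
y = torch.randn(1, T)
pad = (n_fft - hop) // 2
yp = torch.nn.functional.pad(y.unsqueeze(1), (pad, pad),
                             mode="constant", value=0).squeeze(1)
spec = torch.stft(yp, n_fft, hop_length=hop, win_length=win,
                  window=torch.hann_window(win), center=False,
                  return_complex=True).abs()
print(tuple(spec.shape))  # expect (1, n_fft // 2 + 1, T // hop)
```

If a run produces a different frame count, the padding is not lining up with the hop, which would be a separate bug from the boundary-content question.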