FreeVC: Error with 24khz data utils

When trying to resume one of the existing 24 kHz models I am getting the following error:

File "Z:\FreeVCtrain24khz\train_24.py", line 65, in run train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps) File "Z:\FreeVCtrain24khz\data_utils_24.py", line 34, in __init__ self._filter() File "Z:\FreeVCtrain24khz\data_utils_24.py", line 46, in _filter lengths.append(os.path.getsize(audiopath[0]) // (2 * self.hop_length)) File "C:\Users\steven\AppData\Local\Programs\Python\Python37\lib\genericpath.py", line 50, in getsize return os.stat(filename).st_size FileNotFoundError: [WinError 3] The system cannot find the path specified: 'DUMMY\\p337\\p337_014.wav'

I noticed this line is missing in the 24 kHz data utils:

audiopath = audiopath[0].replace("\\","/").replace("DUMMY", "dataset/vctk-16k")
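For reference, a minimal sketch of _filter() with that remapping applied, assuming the 24 kHz wavs live under dataset/vctk-24k (the directory that get_audio() in the file posted further down maps DUMMY to):

def _filter(self):
    """Store spectrogram lengths for bucketing."""
    # wav_length ~= file_size / (channels * bytes per sample) = file_size / (1 * 2)
    # spec_length = wav_length // hop_length
    lengths = []
    for audiopath in self.audiopaths:
        # Remap the DUMMY placeholder to the real dataset location
        # before stat-ing the file ("dataset/vctk-24k" is an assumption).
        path = audiopath[0].replace("\\", "/").replace("DUMMY", "dataset/vctk-24k")
        lengths.append(os.path.getsize(path) // (2 * self.hop_length))
    self.lengths = lengths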

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (14 by maintainers)

Most upvoted comments

@OlaWod Alright, I have figured out the problem… The original preprocess is incorrect for both the 24 and 16 kHz versions. The problem is this line: wav, sr = librosa.load(wav_path). librosa.load loads the file with a default sampling rate of 22050 Hz. From the librosa documentation:

librosa.load: Load an audio file as a floating point time series. Audio will be automatically resampled to the given rate (default sr=22050). To preserve the native sampling rate of the file, use sr=None.
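This is easy to verify; a minimal sketch, assuming a 48 kHz VCTK source file (the path is illustrative):

import librosa

wav_path = "dataset/vctk/wav48/p225/p225_001.wav"  # hypothetical example path

# Default behaviour: librosa resamples everything to 22050 Hz on load.
wav_default, sr_default = librosa.load(wav_path)
print(sr_default)  # 22050, regardless of the file's native rate

# sr=None preserves the native sampling rate (48000 Hz for VCTK wav48).
wav_native, sr_native = librosa.load(wav_path, sr=None)
print(sr_native)  # 48000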

So the audio is loaded at only 22.05 kHz in both the 16 and 24 kHz preprocess. This is theoretically fine when doing the 16 kHz preprocess, but it is obviously a problem when trying to save at 24 kHz. I noticed this when I first ran the preprocess and changed the loading of the wav file, because the output of the downsampled files was blank above 10 kHz. So the 24 kHz files were saved at 24 kHz but only contained the audio information of a 22.05 kHz file. See the image below of two 24 kHz files. The top file was loaded with wav, sr = librosa.load(wav_path); the bottom file was loaded with wav, sr = librosa.load(wav_path, sr=None).

[Screenshot SC1: spectrograms of the two 24 kHz files] You can see the upper frequency range is blank on the file that used the default load. So this is a problem when trying to train a 24 kHz model, as you are actually training on a 22.05 kHz file. The other problem is the way the trim feature works, as its result changes with the sample rate the file is loaded at. I said it was theoretically fine for the 16 kHz files, but it isn't, because of the trim feature. If you look at the two files, you can see that the file loaded with the correct SR is slightly shorter than the one loaded at 22.05 kHz. Well, the same thing happens with the 16 kHz files: they are longer than they should be. This image shows a 16 kHz file from the original preprocess, along with the two versions of the 24 kHz files. [Screenshot SC2: the 16 kHz file next to both 24 kHz versions]

So here you can see that the files loaded at 22.05 kHz are longer than the one loaded at 48 kHz. This length variance is what causes the tensor size errors. But also look closely at the circled section: the trimming on the 16 kHz file is stretching and mashing words together. You no longer have clean breaks on formants between sounds; everything gets blurred together, and this hurts the intelligibility of the model. This is even more apparent when we compare to the original file below. [Screenshot SC3: trimmed file vs. the original] The phrase is "please call stella", and what we see is that the trim has removed the beginning of the word "please": it has removed the P. This explains why the model has a hard time with S and P pronunciations, as the trim during preprocess removes these sounds. If you listen to the two files below, you can clearly hear how the trimmed file has a very weak P sound. https://drive.google.com/file/d/1r-pra0feL3aWpWUQDWxGJ7QNf9f4xxj3/view?usp=share_link

https://drive.google.com/file/d/1q4m_to27nYUqpu9EOWpi7Ol5UwsqF2PX/view?usp=sharing

So the trim function needs to be made less aggressive or left out completely, as it is removing key sounds from words that the model needs in order to learn properly. The preprocess scripts for both 16 and 24 kHz also need to be adjusted to use wav, sr = librosa.load(wav_path, sr=None) so that the files load at the correct sampling rate. After rerunning the preprocess for both 16 and 24 kHz with sr=None, the 24 kHz training now works. Of course, when resuming the original model the mel loss is over 30, as the new files actually contain audio data above 10 kHz while the existing model was trained on files with that region blank. @OlaWod, you may want to train a new 24 kHz model.
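A minimal sketch of what the corrected preprocessing could look like; the function name, paths, and top_db value here are assumptions, not the repo's exact code. The two fixes are loading with sr=None and trimming less aggressively (a higher top_db removes less audio):

import librosa
import soundfile as sf

def process_wav(in_path, out_path, target_sr=24000, top_db=30):
    # Load at the file's native rate (48 kHz for VCTK), not librosa's 22050 default.
    wav, sr = librosa.load(in_path, sr=None)
    # Trim leading/trailing silence; a higher top_db is less aggressive and
    # less likely to eat soft consonants like the P in "please".
    wav, _ = librosa.effects.trim(wav, top_db=top_db)
    # Resample from the native rate to the target rate.
    wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    sf.write(out_path, wav, target_sr)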

@Likkkez Can you post your full data_utils_24? I changed the hop to 480 manually, but I'm still getting the same error.

import time
import os
import random
import numpy as np
import torch
import torch.utils.data

import commons 
from mel_processing import spectrogram_torch, spec_to_mel_torch
from utils import load_wav_to_torch, load_filepaths_and_text, transform
#import h5py


"""Multi speaker version"""
class TextAudioSpeakerLoader(torch.utils.data.Dataset):
    """
        1) loads audio, speaker_id, text pairs
        2) normalizes text and converts them to sequences of integers
        3) computes spectrograms from audio files.
    """
    def __init__(self, audiopaths, hparams):
        self.audiopaths = load_filepaths_and_text(audiopaths)
        self.max_wav_value = hparams.data.max_wav_value
        self.sampling_rate = hparams.data.sampling_rate
        self.filter_length  = hparams.data.filter_length
        self.hop_length     = hparams.data.hop_length
        self.win_length     = hparams.data.win_length
        self.use_sr = hparams.train.use_sr
        self.use_spk = hparams.model.use_spk
        self.spec_len = hparams.train.max_speclen

        random.seed(1234)
        random.shuffle(self.audiopaths)
        self._filter()

    def _filter(self):
        """
        Filter text & store spec lengths
        """
        # Store spectrogram lengths for Bucketing
        # wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2)
        # spec_length = wav_length // hop_length

        lengths = []
        for audiopath in self.audiopaths:
            lengths.append(os.path.getsize(audiopath[0]) // (2 * self.hop_length))
        self.lengths = lengths

    def get_audio(self, filename):
        # Load the 24 kHz wav; paths in the filelist use the DUMMY placeholder.
        audio, sampling_rate = load_wav_to_torch(filename.replace("DUMMY", "dataset/vctk-24k"))
        if sampling_rate != 24000:
            raise ValueError("{} SR doesn't match target {} SR".format(
                sampling_rate, self.sampling_rate))
        audio_norm = audio / self.max_wav_value
        audio_norm = audio_norm.unsqueeze(0)
        spec_filename = filename.replace(".wav", ".spec.pt")
        if os.path.exists(spec_filename):
            spec = torch.load(spec_filename)
        else:
            spec = spectrogram_torch(audio_norm, self.filter_length,
                self.sampling_rate, 480, self.win_length,  # hop hardcoded to 480 for 24 kHz
                center=False)
            spec = torch.squeeze(spec, 0)
            torch.save(spec, spec_filename)
            
        if self.use_spk:
            spk_filename = filename.replace(".wav", ".npy")
            spk_filename = spk_filename.replace("DUMMY", "dataset/spk")
            spk = torch.from_numpy(np.load(spk_filename))
        
        if not self.use_sr:
            c_filename = filename.replace(".wav", ".pt")
            c_filename = c_filename.replace("DUMMY", "dataset/wavlm")
            c = torch.load(c_filename).squeeze(0)
        else:
            i = random.randint(68,92)
            '''
            basename = os.path.basename(filename)[:-4]
            spkname = basename[:4]
            #print(basename, spkname)
            with h5py.File(f"dataset/rs/wavlm/{spkname}/{i}.hdf5","r") as f:
                c = torch.from_numpy(f[basename][()]).squeeze(0)
            #print(c)
            '''
            c_filename = filename.replace(".wav", f"_{i}.pt")
            c_filename = c_filename.replace("DUMMY", "dataset/sr/wavlm")
            c = torch.load(c_filename).squeeze(0)
            
        '''
        lmin = min(c.size(-1), spec.size(-1))
        spec, c = spec[:, :lmin], c[:, :lmin]
        audio_norm = audio_norm[:, :lmin*480]
        _spec, _c, _audio_norm = spec, c, audio_norm
        while spec.size(-1) < self.spec_len:
            spec = torch.cat((spec, _spec), -1)
            c = torch.cat((c, _c), -1)
            audio_norm = torch.cat((audio_norm, _audio_norm), -1)
        start = random.randint(0, spec.size(-1) - self.spec_len)
        end = start + self.spec_len
        spec = spec[:, start:end]
        c = c[:, start:end]
        audio_norm = audio_norm[:, start*480:end*480]
        '''
        
        if self.use_spk:
            return c, spec, audio_norm, spk
        else:
            return c, spec, audio_norm

    def __getitem__(self, index):
        return self.get_audio(self.audiopaths[index][0])

    def __len__(self):
        return len(self.audiopaths)


class TextAudioSpeakerCollate():
    """ Zero-pads model inputs and targets
    """
    def __init__(self, hps):
        self.hps = hps
        self.use_sr = hps.train.use_sr
        self.use_spk = hps.model.use_spk

    def __call__(self, batch):
        """Collate's training batch from normalized text, audio and speaker identities
        PARAMS
        ------
        batch: [text_normalized, spec_normalized, wav_normalized, sid]
        """
        # Right zero-pad all one-hot text sequences to max input length
        _, ids_sorted_decreasing = torch.sort(
            torch.LongTensor([x[0].size(1) for x in batch]),
            dim=0, descending=True)

        max_spec_len = max([x[1].size(1) for x in batch])
        max_wav_len = max([x[2].size(1) for x in batch])

        spec_lengths = torch.LongTensor(len(batch))
        wav_lengths = torch.LongTensor(len(batch))
        if self.use_spk:
            spks = torch.FloatTensor(len(batch), batch[0][3].size(0))
        else:
            spks = None
        
        c_padded = torch.FloatTensor(len(batch), batch[0][0].size(0), max_spec_len)
        spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0), max_spec_len)
        wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
        c_padded.zero_()
        spec_padded.zero_()
        wav_padded.zero_()
        
        for i in range(len(ids_sorted_decreasing)):
            row = batch[ids_sorted_decreasing[i]]
            
            c = row[0]
            c_padded[i, :, :c.size(1)] = c

            spec = row[1]
            spec_padded[i, :, :spec.size(1)] = spec
            spec_lengths[i] = spec.size(1)

            wav = row[2]
            wav_padded[i, :, :wav.size(1)] = wav
            wav_lengths[i] = wav.size(1)
            
            if self.use_spk:
                spks[i] = row[3]
        
        # One spectrogram frame corresponds to 480 audio samples at 24 kHz.
        spec_seglen = spec_lengths[-1] if spec_lengths[-1] < self.hps.train.max_speclen + 1 else self.hps.train.max_speclen + 1
        wav_seglen = spec_seglen * 480

        spec_padded, ids_slice = commons.rand_spec_segments(spec_padded, spec_lengths, spec_seglen)
        wav_padded = commons.slice_segments(wav_padded, ids_slice * 480, wav_seglen)
        
        c_padded = commons.slice_segments(c_padded, ids_slice, spec_seglen)[:,:,:-1]
    
        spec_padded = spec_padded[:,:,:-1]
        wav_padded = wav_padded[:,:,:-480]

        if self.use_spk:
          return c_padded, spec_padded, wav_padded, spks
        else:
          return c_padded, spec_padded, wav_padded
          

class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler):
    """
    Maintain similar input lengths in a batch.
    Length groups are specified by boundaries.
    Ex) boundaries = [b1, b2, b3] -> any batch is included either {x | b1 < length(x) <=b2} or {x | b2 < length(x) <= b3}.
  
    It removes samples which are not included in the boundaries.
    Ex) boundaries = [b1, b2, b3] -> any x s.t. length(x) <= b1 or length(x) > b3 are discarded.
    """
    def __init__(self, dataset, batch_size, boundaries, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)
        self.lengths = dataset.lengths
        self.batch_size = batch_size
        self.boundaries = boundaries
  
        self.buckets, self.num_samples_per_bucket = self._create_buckets()
        self.total_size = sum(self.num_samples_per_bucket)
        self.num_samples = self.total_size // self.num_replicas
  
    def _create_buckets(self):
        buckets = [[] for _ in range(len(self.boundaries) - 1)]
        for i in range(len(self.lengths)):
            length = self.lengths[i]
            idx_bucket = self._bisect(length)
            if idx_bucket != -1:
                buckets[idx_bucket].append(i)
  
        for i in range(len(buckets) - 1, 0, -1):
            if len(buckets[i]) == 0:
                buckets.pop(i)
                self.boundaries.pop(i+1)
  
        num_samples_per_bucket = []
        for i in range(len(buckets)):
            len_bucket = len(buckets[i])
            total_batch_size = self.num_replicas * self.batch_size
            rem = (total_batch_size - (len_bucket % total_batch_size)) % total_batch_size
            num_samples_per_bucket.append(len_bucket + rem)
        return buckets, num_samples_per_bucket
  
    def __iter__(self):
      # deterministically shuffle based on epoch
      g = torch.Generator()
      g.manual_seed(self.epoch)
  
      indices = []
      if self.shuffle:
          for bucket in self.buckets:
              indices.append(torch.randperm(len(bucket), generator=g).tolist())
      else:
          for bucket in self.buckets:
              indices.append(list(range(len(bucket))))
  
      batches = []
      for i in range(len(self.buckets)):
          bucket = self.buckets[i]
          len_bucket = len(bucket)
          ids_bucket = indices[i]
          num_samples_bucket = self.num_samples_per_bucket[i]
  
          # add extra samples to make it evenly divisible
          rem = num_samples_bucket - len_bucket
          ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
  
          # subsample
          ids_bucket = ids_bucket[self.rank::self.num_replicas]
  
          # batching
          for j in range(len(ids_bucket) // self.batch_size):
              batch = [bucket[idx] for idx in ids_bucket[j*self.batch_size:(j+1)*self.batch_size]]
              batches.append(batch)
  
      if self.shuffle:
          batch_ids = torch.randperm(len(batches), generator=g).tolist()
          batches = [batches[i] for i in batch_ids]
      self.batches = batches
  
      assert len(self.batches) * self.batch_size == self.num_samples
      return iter(self.batches)
  
    def _bisect(self, x, lo=0, hi=None):
      if hi is None:
          hi = len(self.boundaries) - 1
  
      if hi > lo:
          mid = (hi + lo) // 2
          if self.boundaries[mid] < x and x <= self.boundaries[mid+1]:
              return mid
          elif x <= self.boundaries[mid]:
              return self._bisect(x, lo, mid)
          else:
              return self._bisect(x, mid + 1, hi)
      else:
          return -1

    def __len__(self):
        return self.num_samples // self.batch_size

Make sure that before you run this you remove all existing spec.pt files from your dataset, so they are regenerated with the new hop length:

find . -name "*.spec.pt" -type f -delete
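The find command assumes a Unix shell; on Windows, where the original error occurred, a small cross-platform sketch in Python does the same thing (assuming the caches live under dataset/):

import pathlib

# Delete every cached *.spec.pt file under the dataset directory so the
# spectrograms are regenerated with the new 480 hop length.
for f in pathlib.Path("dataset").rglob("*.spec.pt"):
    f.unlink()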

If I set the hop_length to 480 manually:

spec = spectrogram_torch(audio_norm, self.filter_length,
                self.sampling_rate, 480, self.win_length,
                center=False)

It seems that the sizes almost match:

RuntimeError: The expanded size of the tensor (297) must match the existing size (298) at non-singleton dimension 1. Target sizes: [1024, 297]. Tensor sizes: [1024, 298]

self.hop_length seems to be 320 by default. Also, the audio_norm is actually the 24 kHz version of the audio. Is that supposed to be like this? I've tried importing the 16 kHz versions and doing the spectrograms for them instead, but the error persists. Sorry, I'm not really smart enough for all this, especially this 16 to 24 kHz hack thing.
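For what it's worth, the off-by-one in that error is consistent with the two feature streams running at the same nominal frame rate but being padded differently at the edges. A rough sanity check (a sketch only; the exact padding behavior of spectrogram_torch and WavLM is an assumption):

# Both feature streams are nominally 50 frames per second:
sr_content, hop_content = 16000, 320   # WavLM content features on the 16 kHz wav
sr_spec, hop_spec = 24000, 480         # linear spectrogram on the 24 kHz wav

duration = 5.95  # seconds (illustrative)
n_content = int(duration * sr_content) // hop_content   # 297
n_spec = int(duration * sr_spec) // hop_spec            # 297

# The two pipelines window and pad the signal ends differently, so the counts
# can come out one frame apart (297 vs 298). That is why the loader trims both
# streams to min(c.size(-1), spec.size(-1)) before slicing -- the lmin block
# that is commented out in the file above.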

It works! Thanks a lot


I get that error regardless of what folder I use. Well, a similar one, with a different size listed:

RuntimeError: The expanded size of the tensor (61920) must match the existing size (33824) at non-singleton dimension 1. Target sizes: [1, 61920]. Tensor sizes: [33824]