speechbrain: interfaces.py separate_file() not working properly?

Specs: Windows Version 10.0.19042 Build 19042, Python 3.8.3

I’m following the Source Separation tutorial that can be accessed from this page.

Since the audio I have is already mixed, I tried to use model.separate_file(), based on the speechbrain/sepformer-wsj02mix example code on Hugging Face. There are two issues with that:

  1. CPU, RAM and disk usage on my PC shoot up into the high nineties, causing the computer to freeze.
  2. After the computer unfreezes (presumably, after the calculations are finished), an error is thrown:
Traceback (most recent call last):
  ...
    est_sources = model.separate_file(path='data/test/audio/speech.wav')
  File "C:\...\venv\lib\site-packages\speechbrain\pretrained\interfaces.py", line 710, in separate_file
    est_sources = self.separate_batch(batch)
  File "C:\...\venv\lib\site-packages\speechbrain\pretrained\interfaces.py", line 669, in separate_batch
    est_mask = self.modules.masknet(mix_w)
  File "C:\...\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\...\venv\lib\site-packages\speechbrain\lobes\models\dual_path.py", line 1124, in forward
    x = self.dual_mdl[i](x)
  File "C:\...\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\...\venv\lib\site-packages\speechbrain\lobes\models\dual_path.py", line 975, in forward
    inter = self.inter_mdl(inter)
  File "C:\...\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\...\venv\lib\site-packages\speechbrain\lobes\models\dual_path.py", line 597, in forward
    return self.mdl(x + pos_enc)[0]
RuntimeError: The size of tensor a (2830) must match the size of tensor b (2500) at non-singleton dimension 1

My guess is that separate_file() passes incorrectly prepared data to separate_batch():

source, fl = split_path(path)
path = fetch(fl, source=source, savedir=savedir)
batch, _ = torchaudio.load(path)  # loads at the file's native sample rate; the rate is discarded
est_sources = self.separate_batch(batch)  # no resampling or mono conversion before this call

Does resampling the data have something to do with it?

If, instead of using separate_file(), I write:

mix, fs = torchaudio.load('data/test/audio/speech.wav')
resampler = torchaudio.transforms.Resample(fs, 8000)
mix = resampler(mix)
est_sources = model.separate_batch(mix)

as suggested in the aforementioned tutorial, the computer doesn’t freeze and est_sources is a torch.Tensor of torch.Size([1, 471272, 2]), which looks like the expected behavior to me.
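For reference on that layout, here is a torch-only sketch (the random tensor is a hypothetical stand-in for the model output; the [batch, time, n_sources] convention is taken from the tutorial):

```python
import torch

# Hypothetical stand-in for model.separate_batch() output:
# [batch, time, n_sources], e.g. torch.Size([1, 471272, 2]).
est_sources = torch.randn(1, 471272, 2)

# Slice the last dimension to get one [batch, time] signal per speaker.
source1 = est_sources[:, :, 0]
source2 = est_sources[:, :, 1]
print(source1.shape, source2.shape)  # torch.Size([1, 471272]) twice
```

Each [1, time] slice has the right shape to be written out with torchaudio.save at the model’s 8 kHz rate.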

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17

Most upvoted comments

Yes, I think you can process the long chunks and concatenate. In the transition you might have artifacts, but you can definitely try. Best, Cem

Thank you, Cem! I am in the process of this. Hope it works well. Happy early holidays!
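A minimal sketch of that chunk-and-concatenate idea (separate_in_chunks, chunk_len, and the dummy separator are my own names, not SpeechBrain API; plain concatenation can leave audible artifacts at chunk boundaries, as noted above):

```python
import torch

def separate_in_chunks(mix, separate, chunk_len=8000 * 30):
    """Run a separation callable on fixed-size chunks and concatenate.

    mix:      [1, time] mono mixture
    separate: callable mapping [1, t] -> [1, t, n_sources]
    """
    outputs = []
    for start in range(0, mix.shape[1], chunk_len):
        outputs.append(separate(mix[:, start:start + chunk_len]))
    return torch.cat(outputs, dim=1)  # stitch back along the time axis

# Dummy separator standing in for model.separate_batch (hypothetical).
dummy = lambda x: torch.stack([x, -x], dim=-1)
mix = torch.randn(1, 8000 * 70)  # 70 s of 8 kHz audio
est = separate_in_chunks(mix, dummy)
print(est.shape)  # torch.Size([1, 560000, 2])
```

Overlap-add with a short crossfade between consecutive chunks would reduce the transition artifacts, at the cost of a little extra bookkeeping.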

Thanks for your reply @rogermiranda1000 @yc-li20, please check what’s happening here:

https://github.com/speechbrain/speechbrain/blob/f285f19e50567d5ba25f06c83c9bd9a31742b386/speechbrain/pretrained/interfaces.py#L2223

the file is automatically converted to mono, and the input audio is automatically resampled to the model’s sampling frequency.

I think in your case it’s due to the signal being too long, which clashes with the positional embeddings. Can you try feeding a shorter audio?

Thank you for your prompt answer, Cem! @ycemsubakan I noticed this problem and am wondering if it is proper to simply chop the audio and then concatenate.


@rogermiranda1000 Hi Roger, did you find a solution to this?

I actually don’t remember. I’ve seen this error multiple times, and usually it was one of these three things:

  1. The audio is stereo (multiple channels); it has to have one channel
  2. The audio needs to be downsampled (I see in this issue that I tried MODEL_SAMPLE_RATE = 8000)
  3. The audio is too long

You can check whether any of those solves the issue, and if one does, please state it here (don’t be like me XD). Hope it helps!
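Those three checks can be folded into one preprocessing step; a minimal sketch (preprocess, MAX_SECONDS, and the truncation strategy are assumptions, and chunking as discussed above is the alternative to truncating):

```python
import torch

MODEL_SAMPLE_RATE = 8000   # the rate used with this model in the thread
MAX_SECONDS = 30           # hypothetical cap to sidestep the length issue

def preprocess(mix, fs):
    """Apply the three checks above to a [channels, time] waveform."""
    # 1. Stereo -> mono by averaging the channels.
    if mix.shape[0] > 1:
        mix = mix.mean(dim=0, keepdim=True)
    # 2. Resample to the model's rate (torchaudio imported lazily so the
    #    mono/truncate path works with torch alone).
    if fs != MODEL_SAMPLE_RATE:
        import torchaudio
        mix = torchaudio.transforms.Resample(fs, MODEL_SAMPLE_RATE)(mix)
    # 3. Truncate audio that is too long; chunking (discussed above) is
    #    the alternative if the whole signal is needed.
    return mix[:, :MODEL_SAMPLE_RATE * MAX_SECONDS]

mix = torch.randn(2, 8000 * 40)   # 40 s of stereo 8 kHz audio
out = preprocess(mix, 8000)
print(out.shape)  # torch.Size([1, 240000])
```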

@mravanelli @ycemsubakan maybe the ticket needs to be re-opened?

Yes, this is a good suggestion, actually. We will add this to the code. Thank you, @UrosOgrizovic!