uis-rnn: Does uis-rnn not work for datasets of long utterances?

Describe the question

For a diarization task, I train on the AMI train and dev sets plus the ICSI corpus, and test on the AMI test set. Both datasets contain audios of 3-5 speakers, each 50-70 minutes long. My d-vector embedding is trained on VoxCeleb 1 and 2, with EER = 4.55%. I train UIS-RNN with a window size of 240 ms, 50% overlap, and a segment size of 400 ms. The results are poor on both the train and test sets. I also read all of your UIS-RNN code, and there are two things I don't understand:

1. Why do you split up the original utterances, concatenate them by speaker, and then use that as the training input?
2. Why does the input ignore which audio each utterance belongs to, merging all utterances as if they came from one single audio?

This process seems completely different from the inference process, and it also reduces the ability to use batching effectively if one speaker talks too much. For a 1-hour audio, the output has 20-30 speakers instead of 3-5 speakers, no matter how small crp_alpha is.
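For context, here is a minimal sketch of how I understand the library's intended usage, following its README. The array shapes, toy data, and the choice of `observation_dim = 256` are my own placeholders, not values from the repo's demo:

```python
import numpy as np
import uisrnn

# Parse the default model/training/inference arguments.
model_args, training_args, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256  # must match the d-vector dimension

model = uisrnn.UISRNN(model_args)

# Training: a sequence of d-vectors with a parallel array of
# per-segment speaker labels. (My question above is about how this
# sequence is constructed: the demo concatenates segments by speaker
# rather than keeping each audio's natural segment order.)
train_sequence = np.random.rand(1000, 256)               # (num_segments, observation_dim)
train_cluster_id = np.array(['A'] * 500 + ['B'] * 500)   # speaker label per segment
model.fit(train_sequence, train_cluster_id, training_args)

# Inference: decode a test audio's d-vector sequence into speaker IDs.
test_sequence = np.random.rand(200, 256)
predicted_cluster_ids = model.predict(test_sequence, inference_args)
```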

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Comments: 19 (7 by maintainers)

Most upvoted comments

@wq2012

In your paper, UIS-RNN includes 3 steps: a) Speaker Change Detection (SCD), b) Speaker Assignment Process (SAP), c) Sequence Generation. Step c) is necessary to complete the probability distribution.

Yes, you use P(X, Y, Z), a generative approach. Other researchers use a discriminative approach: P(Y|X) = P(Y|Z, X) * P(Z|X) = SAP * SCD. I think the generative approach P(X, Y, Z) is nearly optimal when you can train it on an extremely big dataset, as with Transformer-based models such as BERT and GPT-2.
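For readers following the thread, this is the factorization as I understand it from the paper (X: observation sequence of embeddings, Y: speaker assignments, Z: speaker-change indicators). This is my own transcription, not a quote from the paper:

```latex
% Generative factorization (UIS-RNN); each factor matches one step above.
p(X, Y, Z) = \prod_{t}
    \underbrace{p(x_t \mid x_{1:t-1}, y_{1:t})}_{\text{c) sequence generation}} \,
    \underbrace{p(y_t \mid z_t, y_{1:t-1})}_{\text{b) speaker assignment (ddCRP)}} \,
    \underbrace{p(z_t \mid z_{1:t-1})}_{\text{a) speaker change}}

% Discriminative alternative discussed above:
P(Y \mid X) = P(Y \mid Z, X) \, P(Z \mid X) = \text{SAP} \cdot \text{SCD}
```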

> UIS-RNN is our initial effort to solve the problem of clustering sequential data in a supervised way.

In the unsupervised setting, I found that your spectral clustering algorithm works quite well on many audios.

> Yes, you use P(X, Y, Z), a generative approach. Other researchers use a discriminative approach: P(Y|X) = P(Y|Z, X) * P(Z|X) = SAP * SCD. I think the generative approach P(X, Y, Z) is nearly optimal when you can train it on an extremely big dataset, as with Transformer-based models such as BERT and GPT-2.

It’s a good point. I think that’s an interesting direction for future efforts.

> In the unsupervised setting, I found that your spectral clustering algorithm works quite well on many audios.

Indeed, spectral clustering is by far the best unsupervised approach that we found. The only drawback is that it’s a bit sensitive to its parameters. So we usually tune the parameters for specific domains that we want to deploy the system to.
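If it helps anyone reading this thread, below is a rough sketch of calling the companion SpectralCluster library (github.com/wq2012/SpectralCluster) on a matrix of d-vectors. It assumes the library's original constructor style with `p_percentile` and `gaussian_blur_sigma` (newer releases moved these into a refinement-options object), and the parameter values are illustrative placeholders, not tuned recommendations:

```python
import numpy as np
from spectralcluster import SpectralClusterer

# X: one d-vector per segment, shape (num_segments, embedding_dim).
X = np.random.rand(500, 256)

# p_percentile and gaussian_blur_sigma are the sensitive parameters
# mentioned above; they are typically tuned per deployment domain.
clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    p_percentile=0.95,
    gaussian_blur_sigma=1)

labels = clusterer.predict(X)  # integer speaker label per segment
```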