uis-rnn: Does uis-rnn not work on datasets with long utterances?
Describe the question
In a diarization task, I train on the AMI train-dev set and the ICSI corpus, and I test on the AMI test set. Both datasets contain recordings of 3-5 speakers that are 50-70 minutes long. My d-vector embedding is trained on VoxCeleb1 and VoxCeleb2, with EER = 4.55%. I train uis-rnn with a window size of .24ms, 50% overlap, and a segment size of .4ms. The results are poor on both the train and test sets. I also read all of your uis-rnn code, and I don't understand two things:
1> Why do you split up the original utterances, concatenate them by speaker, and then use that as the training input?
2> Why does the input ignore which audio each utterance belongs to and simply merge all utterances into a single audio?
This process seems completely different from the inference process, and it also limits how large a batch size can be used if one speaker talks too much. For a 1-hour audio, the output has 20-30 speakers instead of 3-5, no matter how small I set crp_alpha.
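For context on what crp_alpha controls, here is a minimal sketch of the speaker-assignment prior as I understand it from the Fully Supervised Speaker Diarization paper: at a speaker change, an existing speaker is picked with probability proportional to the number of contiguous blocks it has produced so far, and a new speaker with probability proportional to α. The function name and the toy block counts are hypothetical, this is not code from the uis-rnn repository, and it assumes crp_alpha plays the role of α.

```python
def speaker_change_prior(block_counts, current_speaker, crp_alpha):
    """Prior over the next speaker at a speaker change (illustrative only).

    block_counts: dict mapping speaker id -> number of contiguous blocks
        that speaker has produced so far (toy input, not library data).
    Returns a dict mapping speaker id (or the key "new") to a probability.
    """
    if crp_alpha <= 0:
        raise ValueError("crp_alpha must be positive")
    # A change means switching away from the current speaker, so exclude it.
    weights = {spk: n for spk, n in block_counts.items() if spk != current_speaker}
    # A brand-new speaker gets weight alpha.
    weights["new"] = crp_alpha
    total = sum(weights.values())
    return {spk: w / total for spk, w in weights.items()}


# Toy example: three known speakers, speaker "A" is currently talking.
prior = speaker_change_prior({"A": 6, "B": 3, "C": 1}, current_speaker="A", crp_alpha=0.5)
print(prior)
```

Note that this prior is only one factor in the model. If the sequence-generation term strongly prefers opening a new speaker for an embedding, a small crp_alpha may not be enough on its own to keep the speaker count low.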
My background
Have I read the README.md file?
- yes
Have I searched for similar questions from closed issues?
- yes
Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
- yes
Have I tried to find the answers in the reference Speaker Diarization with LSTM?
- yes
Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?
- yes
About this issue
- State: open
- Created 5 years ago
- Comments: 19 (7 by maintainers)
@wq2012
Yes, you use P(X,Y,Z), a generative approach. Other researchers use a discriminative approach: P(Y|X) = P(Y|Z,X) * P(Z|X) = SAP * SCD. I think the generative approach P(X,Y,Z) is nearly optimal when you can train it on an extremely large dataset, as is done for Transformer-based models such as BERT and GPT-2.
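For reference, the generative joint mentioned above can be written out as follows. This is my reading of the factorization in the Fully Supervised Speaker Diarization paper, with X the embedding sequence, Y the speaker labels, and Z the speaker-change indicators; the subscript [t] denotes the prefix up to time t.

```latex
p(\mathbf{X}, \mathbf{Y}, \mathbf{Z})
  = p(\mathbf{x}_1, y_1) \prod_{t=2}^{T}
    p(\mathbf{x}_t, y_t, z_t \mid \mathbf{x}_{[t-1]}, y_{[t-1]}, z_{[t-1]})

p(\mathbf{x}_t, y_t, z_t \mid \mathbf{x}_{[t-1]}, y_{[t-1]}, z_{[t-1]})
  = \underbrace{p(\mathbf{x}_t \mid \mathbf{x}_{[t-1]}, y_{[t]})}_{\text{sequence generation}}
    \cdot \underbrace{p(y_t \mid z_t, y_{[t-1]})}_{\text{speaker assignment}}
    \cdot \underbrace{p(z_t \mid z_{[t-1]})}_{\text{speaker change}}
```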
In the unsupervised setting, I found that your spectral clustering algorithm works quite well on many audio recordings.
It’s a good point. I think that’s an interesting direction for future efforts.
Indeed, spectral clustering is by far the best unsupervised approach that we found. The only drawback is that it’s a bit sensitive to its parameters. So we usually tune the parameters for specific domains that we want to deploy the system to.
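One place where this kind of sensitivity can show up is the estimate of the number of speakers. Below is a minimal sketch, assuming the eigen-gap heuristic (pick the k that maximizes the ratio between consecutive eigenvalues of the affinity matrix, sorted in descending order). The function name is hypothetical and the eigenvalues are made up for illustration; this is not the maintainers' implementation.

```python
def estimate_num_speakers(eigenvalues, max_speakers=None):
    """Return the k that maximizes eigenvalue[k] / eigenvalue[k+1] (1-indexed)."""
    vals = sorted(eigenvalues, reverse=True)
    limit = len(vals) - 1
    if max_speakers is not None:
        limit = min(limit, max_speakers)
    best_k, best_ratio = 1, 0.0
    for k in range(1, limit + 1):
        # vals[k - 1] is the k-th largest eigenvalue; clamp the denominator
        # so tiny or negative eigenvalues do not cause a division error.
        ratio = vals[k - 1] / max(vals[k], 1e-10)
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


# Made-up spectra: a small shift in the third eigenvalue (for example from a
# different affinity threshold) changes the estimated speaker count.
print(estimate_num_speakers([4.1, 3.8, 3.5, 0.3, 0.2, 0.1]))  # 3
print(estimate_num_speakers([4.1, 3.8, 1.7, 0.9, 0.5, 0.3]))  # 2
```

In this toy case the largest gap moves from after the third eigenvalue to after the second, which is the sort of behavior that per-domain parameter tuning is meant to stabilize.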