Failing diarization, debugging tips #1869

mo22 · 2025-05-05T10:32:55Z

Good day,

I'm learning pyannote and have some clean audio recordings on which diarization fails. Could you suggest how to debug or improve this?

I'm using:

pyannote.audio 3.3.2
pyannote/speaker-diarization-3.1
pyannote/segmentation-3.0
pyannote/wespeaker-voxceleb-resnet34-LM

In the hook I store the embeddings (e.g. for an audio input of 300 seconds):

292 arrays of shape (32, 256), stored as slice_embeddings ("Slice")
-> are these the voxceleb embeddings of 10-second windows taken with a 1-second step over the input file?
and the final embeddings of shape (317, 3, 256), stored as all_embeddings ("Channel 0, 1, 2")
-> what are these?
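
A minimal sketch of such a hook, assuming the pipeline calls it as hook(step_name, step_artifact, file=..., total=..., completed=...). The class name ArtifactCollector and the summary helper are illustrative only; this is not the code that produced the numbers above.

```python
from collections import defaultdict


class ArtifactCollector:
    """Callable hook that records every artifact it receives, keyed by step name."""

    def __init__(self):
        # step name -> list of {"artifact", "total", "completed"} records, in call order
        self.artifacts = defaultdict(list)

    def __call__(self, step_name, step_artifact, file=None, total=None, completed=None):
        # Assumed call signature; a step that reports progress per batch
        # will show up here as several records under the same step name.
        self.artifacts[step_name].append(
            {"artifact": step_artifact, "total": total, "completed": completed}
        )

    def summary(self):
        """Return {step_name: [shape or type name, ...]} for a quick overview."""
        overview = {}
        for name, records in self.artifacts.items():
            overview[name] = [
                # Arrays expose .shape; anything else is described by its type name.
                getattr(r["artifact"], "shape", type(r["artifact"]).__name__)
                for r in records
            ]
        return overview


# Hypothetical usage (requires pyannote.audio and model access, so left commented out):
# collector = ArtifactCollector()
# diarization = pipeline("audio.wav", hook=collector)
# print(collector.summary())
```

Keeping the total and completed values next to each artifact makes it easier to tell per-batch intermediate results apart from the final array reported for the same step.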

I plot a UMAP scatter plot for each channel, as well as a 1D plot of the channels together with the diarization result. The diarization results are stacked: the bottom row is the pyannote result, the middle row is the known-correct diarization, and the top row is the result of the pyannote pro API.

For the audio file test_42 this works really well. For test_47, however, the whole recording is assigned to SPEAKER_1, with only short segments (<1 second) assigned to SPEAKER_2, and the pyannote pro result does not look correct either. Both audio files have two speakers.

This works: