Speaker Diarization of Known Speakers #1667

tobiasschmidt89 · 2024-03-08T08:57:16Z

Hi,

I really enjoy using this library for speaker diarization to create labeled transcripts in combination with Whisper: Speaker 1: ..., Speaker 2: ..., Speaker 1: ...
Currently, I then search for the anonymous speaker labels and replace them with the real names.

Some of my meetings always have the same speakers (D&D game sessions). I am therefore looking for a way to create "voice embeddings" of each speaker by recording them in isolation for a minute or so. I then want to run speaker diarization using these embeddings to get labels like: Max: ..., Maria: ..., Tobi: ..., Max: ..., Unknown: ...

I would be interested if someone has some pointers on how I could achieve this with Pyannote. I expect I need to do the following:

1. Use an audio embedding model to create a normalized embedding for each speaker (with an isolated audio file).
2. Do a speaker diarization of the entire audio file to get the audio segments.
3. Create an audio embedding of each audio segment.
4. Calculate the similarity score of each audio segment embedding with the precalculated embedding of each speaker to identify the most likely speaker.
5. Use Whisper to transcribe each audio segment, attaching the label of the speaker.

Crossed items I know how to do.
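The matching step (comparing each segment embedding with the precalculated speaker embeddings) can be sketched as follows. This is a minimal sketch with assumptions: it presumes the enrollment and segment embeddings have already been extracted by some embedding model and uses random toy vectors in their place. The function names, the 192-dimensional toy vectors and the 0.5 threshold for "Unknown" are illustrative choices, not values from pyannote, and a real threshold would need tuning on actual recordings.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def label_segments(segment_embs, enrolled, threshold=0.5):
    """Return one label per segment: the best-matching enrolled name or 'Unknown'.

    `threshold` is a hypothetical cut-off that has to be tuned on real data.
    """
    names = list(enrolled)
    ref = l2_normalize(np.stack([enrolled[n] for n in names]))  # (speakers, dim)
    seg = l2_normalize(segment_embs)                            # (segments, dim)
    sims = seg @ ref.T                                          # cosine similarities
    best = sims.argmax(axis=1)
    best_score = sims[np.arange(len(seg)), best]
    return [names[i] if s >= threshold else "Unknown"
            for i, s in zip(best, best_score)]

# Toy stand-ins for real embeddings (assumption: one vector per speaker/segment).
rng = np.random.default_rng(0)
enrolled = {name: rng.normal(size=192) for name in ("Max", "Maria", "Tobi")}
segments = np.stack([
    enrolled["Max"] + 0.1 * rng.normal(size=192),   # sounds like Max
    enrolled["Tobi"] + 0.1 * rng.normal(size=192),  # sounds like Tobi
    rng.normal(size=192),                           # an unrelated voice
])
labels = label_segments(segments, enrolled)
print(labels)
```

Each label can then be attached to the Whisper transcript of the corresponding segment.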

I am very comfortable with text embeddings. Audio embedding is a new topic for me.
I would really appreciate any pointers or example scripts.

Thank you
T.

desicochrane · 2024-03-13T13:36:08Z

Hey @tobiasschmidt89, I've just started working on this too (for podcasts). Digging into pyannote, I see it uses the speechbrain "speechbrain/spkrec-ecapa-voxceleb" embedding model: https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb

I can share the approach I have in mind:

1. I am probably going to store k representative embeddings per speaker (I'll likely do k-means clustering of all samples of each speaker).
2. Run pyannote diarization. I might provide a high number of speakers: since I'll be mapping them later anyway, I am OK with low precision at this stage in exchange for higher recall.
3. Take these and perform dense retrieval against the embedding DB, probably just cosine distance, taking, say, the 20 nearest neighbours.
4. Then, I'll do a rerank (cross encoder pattern) to get a probability score.
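The first and third steps of that plan (k representative embeddings per speaker, then nearest-neighbour retrieval) could look roughly like this. It is a sketch under assumptions: the k-means is a plain NumPy implementation rather than any particular library's, the embeddings are random toy data, and `build_db`, `retrieve`, the speaker names and all sizes are hypothetical. The rerank step is not shown.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids of shape (k, dim)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]  # copy of k samples
    for _ in range(n_iter):
        # Assign each sample to its nearest centroid (squared Euclidean distance).
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def build_db(samples_by_speaker, k=3):
    """Stack k unit-length centroids per speaker and remember who owns each row."""
    vecs, owners = [], []
    for name, samples in samples_by_speaker.items():
        c = kmeans(np.asarray(samples, dtype=np.float32), k)
        vecs.append(c / np.linalg.norm(c, axis=1, keepdims=True))
        owners += [name] * k
    return np.vstack(vecs), owners

def retrieve(query, db, owners, top_n=5):
    """Return the top_n (speaker, cosine similarity) pairs for one query embedding."""
    q = query / np.linalg.norm(query)
    sims = db @ q
    order = np.argsort(-sims)[:top_n]
    return [(owners[i], float(sims[i])) for i in order]

# Toy data: 30 noisy samples around one base vector per speaker.
rng = np.random.default_rng(42)
bases = {"host_a": rng.normal(size=64), "host_b": rng.normal(size=64)}
samples = {n: b + 0.3 * rng.normal(size=(30, 64)) for n, b in bases.items()}
db, owners = build_db(samples, k=3)

query = bases["host_a"] + 0.3 * rng.normal(size=64)
candidates = retrieve(query, db, owners)
print(candidates[0])
```

The candidates returned here are what a later rerank stage would score in more detail.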

I'll see how that goes.
Likely I can do more here: I have the text of the audio intervals, so I can also compare text embeddings along with the audio embeddings. Also, in my case, I know something about the prior probabilities ahead of time (the hosts of the podcast are the most likely speakers). You might have prior knowledge in your meeting use case too.
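One hedged way to fold such prior knowledge into the decision is to treat the similarities as unnormalized log-scores and add the log-priors before normalizing. This is only an illustrative sketch: the `temperature` parameter, the example similarities and the prior values are all made up, and the resulting numbers are not calibrated probabilities.

```python
import numpy as np

def posterior(similarities, priors, temperature=0.1):
    """Combine similarity scores with prior probabilities into normalized scores.

    `temperature` is a hypothetical knob controlling how strongly the audio
    evidence outweighs the prior; it would need tuning on real data.
    """
    sims = np.asarray(similarities, dtype=np.float64)
    logits = sims / temperature + np.log(np.asarray(priors, dtype=np.float64))
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

names = ["host_a", "host_b", "guest"]
sims = [0.60, 0.58, 0.61]    # nearly tied audio evidence, guest slightly ahead
priors = [0.45, 0.45, 0.10]  # assumption: the hosts speak most of the time

post = posterior(sims, priors)
best = names[int(post.argmax())]
print(best, post.round(3))
```

With evidence this close, the prior tips the decision toward a host even though the guest has the highest raw similarity.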

Let me know how you get on with it. Feel free to hmu if you wanna collaborate a bit on this.

4 replies

PhilipAmadasun Mar 15, 2024

@desicochrane Could you please explain this "Rerank Cross Encoder Pattern" thing? I'm doing something similar to what you guys are trying and might need it. Also, if you take multiple isolated voice samples (embeddings) of the people you know will be in the conversation, won't you then have to compare the cosine distance against every single voice sample you have for each person? I ask because, say you have 10 voice samples per person, that's 100 cosine distance comparisons, no?

desicochrane Mar 19, 2024

Hey mate, sorry for the late response.

> "Rerank Cross Encoder Pattern" thing?

Check out the retrieve and rerank pattern here: https://www.sbert.net/examples/applications/retrieve_rerank/README.html

> if you take multiple isolated voice samples(embeddings) of the people you know will be in the conversation, will then have to compare cosine distance on every single one of the voice samples you have on one individual person. I ask because let's say you have 10 voice samples per person, that's 100 cosine distance comparisons no?

You could do that, sure. 100 cosine distances is lightning fast and vectorizable. 1M cosine distances can be done in milliseconds.
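As a rough illustration of "vectorizable", the sketch below assumes 10 people with 10 samples each (which is how the figure of 100 would arise) and computes all 100 cosine similarities in a single product, then keeps the best-matching sample per person. The embeddings are random toy data, and taking the maximum per person is just one possible aggregation; a mean would work similarly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_samples, dim = 10, 10, 192  # toy sizes, chosen for illustration

# Toy enrollment set: each person's samples scatter around their own "voice" vector.
voices = rng.normal(size=(n_people, 1, dim))
samples = voices + 0.2 * rng.normal(size=(n_people, n_samples, dim))
samples /= np.linalg.norm(samples, axis=-1, keepdims=True)

# A query segment that should belong to person 7.
query = voices[7, 0] + 0.2 * rng.normal(size=dim)
query /= np.linalg.norm(query)

# samples has shape (10, 10, 192); one product yields all 100 cosine similarities.
sims = samples @ query          # shape (10, 10)
per_person = sims.max(axis=1)   # best-matching sample for each person
best_person = int(per_person.argmax())
print(best_person)
```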

PhilipAmadasun Mar 27, 2024

@desicochrane How can you calculate many cosine distances in real time as you mentioned? Would you happen to have some example code?

desicochrane Mar 27, 2024

You could put the 1M vectors into a matrix and do a dot product (if they're normalised), or you could put them in a vector index like Faiss. Maybe something like this:

!python -m pip install faiss-gpu

import faiss
import numpy as np
import time

# Number of vectors and dimensions
num_vectors = 1000000
dimensions = 1000

# Generate random vectors (1M vectors + 1 query vector)
database_vectors = np.random.rand(num_vectors, dimensions).astype('float32')
query_vector = np.random.rand(1, dimensions).astype('float32')

# Normalize the vectors (FAISS requires normalized vectors for cosine similarity via the inner product)
faiss.normalize_L2(database_vectors)
faiss.normalize_L2(query_vector)

# Create a GPU FAISS index (inner product for cosine similarity, as vectors are normalized)
res = faiss.StandardGpuResources()  # Use all GPU resources
index = faiss.IndexFlatIP(dimensions)  # Inner Product (IP) index
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)  # Move index to GPU 0

# Add vectors to the index
gpu_index.add(database_vectors)

# Perform the search
k = 10  # Number of nearest neighbors to find
start_time = time.time()
D, I = gpu_index.search(query_vector, k)  # D = inner-product scores (cosine similarities here), I = indices of the nearest neighbors
end_time = time.time()

print("FAISS-GPU search time:", end_time - start_time)
print("Distances:", D)
print("Indices:", I)

Running that in Google Colab, I got:
FAISS-GPU search time: 0.05401325225830078
Distances: [[0.7987112 0.79602313 0.79593 0.7954348 0.79437023 0.7937976
0.7922267 0.7918892 0.79174185 0.79161 ]]
Indices: [[388597 671101 817522 743371 101155 474046 17489 622588 665845 341719]]
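The other option mentioned above, putting the normalised vectors into a matrix and taking a dot product, needs nothing beyond NumPy. The following is a small CPU-only sketch of that idea with assumed toy sizes (100,000 vectors of 128 dimensions, smaller than the FAISS example so it fits comfortably in memory); it is not a benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database of unit-length vectors (sizes are arbitrary assumptions).
db = rng.standard_normal((100_000, 128), dtype=np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# Query: a slightly perturbed copy of row 12345, so the expected best match is known.
query = db[12345] + 0.05 * rng.standard_normal(128, dtype=np.float32)
query /= np.linalg.norm(query)

k = 10
scores = db @ query                      # all cosine similarities in one product
top = np.argpartition(-scores, k)[:k]    # indices of the k best scores, unordered
top = top[np.argsort(-scores[top])]      # order them by descending similarity

print("Indices:", top)
print("Similarities:", scores[top])
```

`np.argpartition` avoids sorting the whole score array when only the top k entries are needed.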

ymednis · 2025-03-16T23:23:43Z

Hi @desicochrane, @tobiasschmidt89.

I'm exploring ways to improve speaker diarization in recordings with multiple speakers, but my primary goal is to accurately detect a specific target speaker. I have additional enrollment samples for that speaker (which I suppose I can use to generate an embedding), and I'm wondering if anyone here has achieved high precision for this particular use case.
Thanks in advance.

1 reply

tobiasschmidt89 Mar 17, 2025
Author

No, not yet. For my use case (D&D session recordings), I am using LLMs for the speaker assignment. It is like a riddle for the LLM: here are X speakers, here is the transcript, and here are the possible speakers. What is the name of each speaker?
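A prompt along those lines could be assembled as below. This is purely a hypothetical sketch: the wording of the prompt, the `build_prompt` helper and the toy transcript are made up for illustration, and the call to an actual LLM is left out.

```python
def build_prompt(transcript, candidate_names):
    """Build the 'riddle' prompt.

    transcript: list of (anonymous_label, text) tuples in spoken order.
    candidate_names: real names that might be behind the labels.
    """
    labels = sorted({label for label, _ in transcript})
    lines = [f"{label}: {text}" for label, text in transcript]
    return (
        f"The transcript below has {len(labels)} anonymous speakers: "
        f"{', '.join(labels)}.\n"
        f"Possible real names: {', '.join(candidate_names)}.\n"
        "For each anonymous label, give the most likely real name, "
        "or 'Unknown' if the transcript does not make it clear.\n\n"
        + "\n".join(lines)
    )

# Toy transcript (invented for illustration).
transcript = [
    ("Speaker 1", "Okay Maria, roll for initiative."),
    ("Speaker 2", "That's a 17. Max, you're up next."),
    ("Speaker 1", "Thanks. The goblins attack first."),
]
prompt = build_prompt(transcript, ["Max", "Maria", "Tobi"])
print(prompt)
```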

alunkingusw · 2025-06-13T19:58:11Z

Hi, here is my gist that computes an embedding from a WAV file of a known speaker and compares it to the output of the diarisation. I am building an API that manages meetings and transcribes the audio from them.
Let me know if this is helpful.
https://gist.github.com/alunkingusw/4275a26ba79cefd49f5a2e91d91c4da2

1 reply

alunkingusw Jun 13, 2025

Once I have this working, my API (along with a frontend) will accept audio files and transcribe them, automatically labelling known speakers. The sample audio used for the embeddings could be provided by the user, or manually selected from a recording. Currently I'm looking at how I can take the segments from a meeting and order them by the most accurate detections of speech (https://pyannote.github.io/pyannote-metrics/reference.html), so that the comparisons of embeddings can be more accurate.
