Computer Science > Computer Vision and Pattern Recognition

arXiv:2209.04093 (cs)

[Submitted on 9 Sep 2022 (v1), last revised 26 Oct 2022 (this version, v2)]

Title: Learning Audio-Visual embedding for Person Verification in the Wild

Authors: Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu


Abstract: It has already been observed that audio-visual embedding is more robust than uni-modality embedding for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. We find that a strong correlation exists between modalities during pooling, so we propose joint attentive pooling, which uses cycle consistency to learn the implicit inter-frame weight. Finally, the modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
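The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of the metric only, and not of the authors' evaluation code, the sketch below computes EER from a list of trial scores. The function name `equal_error_rate`, the convention that higher scores mean "same person", and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for verification trials.

    scores: similarity scores, higher meaning "more likely the same person"
            (an assumed convention for this sketch).
    labels: 1 for target trials (same identity), 0 for impostor trials.
    """
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    for threshold in sorted(set(scores)) + [float("inf")]:
        # A trial is accepted when its score is >= threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_impostor
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest,
        # and report their midpoint as the EER estimate.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy trial list (hypothetical scores, not taken from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

This brute-force sweep is quadratic in the number of trials, which is fine for a toy list; for trial lists of realistic size, a sorted single-pass sweep or an ROC-based routine would be the more practical choice.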

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2209.04093 [cs.CV]
	(or arXiv:2209.04093v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2209.04093

Submission history

From: Peiwen Sun
[v1] Fri, 9 Sep 2022 02:29:47 UTC (2,525 KB)
[v2] Wed, 26 Oct 2022 13:55:55 UTC (2,769 KB)
