DOI: 10.1145/3548608.3559301

A speaker verification method using frame-level self-attention

Published: 14 October 2022

Abstract

This paper presents a new way of applying the self-attention mechanism to text-independent speaker verification. Speech frames differ in how much speaker information they carry, and these differences affect the performance of a speaker verification system. To capture them, systems typically compute a weighted average of the frame-level outputs in the pooling stage when extracting the speaker embedding, with the weights learned through self-attention. In this work, the self-attention mechanism is instead introduced into the frame-level model itself, providing a new way of capturing the differences between speech frames: a self-attention layer is added between the frame-level layers to obtain differentiated frame-level features directly. These features are then combined into more discriminative speaker embeddings, improving system performance. Experiments on the VoxCeleb1 dataset show that the proposed system outperforms the baselines, and the improvement is consistent across different speech durations.
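
To make the described architecture concrete, the following is a minimal PyTorch sketch of the idea in the abstract: a scaled dot-product self-attention layer inserted between two frame-level TDNN layers, followed by attention-weighted average pooling to form the speaker embedding. The layer sizes, the single attention head, and the TDNN kernel settings (feat_dim, dim, emb_dim, kernel sizes, dilations) are illustrative assumptions, not the authors' published configuration.

```python
# Sketch only: a self-attention layer between frame-level layers,
# followed by attentive pooling. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the frame axis."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, frames, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(1, 2) / x.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ v


class AttentivePooling(nn.Module):
    """Weighted average over frames; weights learned by attention."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, x):                        # x: (batch, frames, dim)
        w = torch.softmax(self.attn(x), dim=1)   # (batch, frames, 1)
        return (w * x).sum(dim=1)                # (batch, dim)


class SpeakerEmbedder(nn.Module):
    def __init__(self, feat_dim=40, dim=512, emb_dim=192):
        super().__init__()
        # Frame-level TDNN layers realised as 1-D convolutions.
        self.tdnn1 = nn.Conv1d(feat_dim, dim, kernel_size=5, dilation=1)
        self.attn = FrameSelfAttention(dim)      # between frame-level layers
        self.tdnn2 = nn.Conv1d(dim, dim, kernel_size=3, dilation=2)
        self.pool = AttentivePooling(dim)
        self.embed = nn.Linear(dim, emb_dim)     # speaker embedding

    def forward(self, feats):                    # feats: (batch, frames, feat_dim)
        h = F.relu(self.tdnn1(feats.transpose(1, 2)))     # (B, dim, T')
        h = self.attn(h.transpose(1, 2)).transpose(1, 2)  # frame-level attention
        h = F.relu(self.tdnn2(h)).transpose(1, 2)         # (B, T'', dim)
        return self.embed(self.pool(h))


# Usage: a batch of 2 utterances, 200 frames of 40-dim filterbank features.
model = SpeakerEmbedder()
emb = model(torch.randn(2, 200, 40))
print(emb.shape)  # torch.Size([2, 192])
```

In this sketch the FrameSelfAttention layer lets every frame attend to every other frame before pooling, which is the departure the abstract describes from applying attention only at the pooling stage.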




Published In

ICCIR '22: Proceedings of the 2022 2nd International Conference on Control and Intelligent Robotics
June 2022
905 pages
ISBN: 9781450397179
DOI: 10.1145/3548608
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICCIR 2022

Acceptance Rates

Overall Acceptance Rate 131 of 239 submissions, 55%

