Research Article

Denoised Senone I-Vectors for Robust Speaker Verification

Published: 01 April 2018

Abstract

Recently, it has been shown that senone i-vectors, whose posteriors are produced by senone deep neural networks (DNNs), outperform conventional Gaussian mixture model (GMM) i-vectors in both speaker and language recognition tasks. The success of senone i-vectors relies on the capability of the DNN to incorporate phonetic information into the i-vector extraction process. In this paper, we argue that to apply senone i-vectors in noisy environments, it is important to robustify both the phonetically discriminative acoustic features and the senone posteriors estimated by the DNN. To this end, we propose a deep architecture formed by stacking a deep belief network on top of a denoising autoencoder (DAE). After backpropagation fine-tuning, the network, referred to as a denoising autoencoder–deep neural network (DAE–DNN), facilitates the extraction of robust, phonetically discriminative bottleneck (BN) features and senone posteriors for i-vector extraction. We refer to the resulting i-vectors as denoised BN-based senone i-vectors. Results on NIST 2012 SRE show that senone i-vectors outperform conventional GMM i-vectors. More interestingly, the BN features are not only phonetically discriminative; our results suggest that they also contain sufficient speaker information to produce BN-based senone i-vectors that outperform conventional senone i-vectors. This work also shows that DAE training is more beneficial to BN feature extraction than to senone posterior estimation.
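The pipeline the abstract describes — a denoising-autoencoder front end whose cleaned representation feeds a senone classifier, with bottleneck (BN) features tapped from a narrow intermediate layer and senone posteriors taken from the softmax output — can be sketched as follows. This is a minimal NumPy illustration with made-up layer sizes and random (untrained) weights; the paper's actual architecture, pre-training/fine-tuning procedure, and senone inventory are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """Randomly initialised affine layer (stand-in for trained weights)."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

N_ACOUSTIC = 39    # e.g. MFCCs + deltas (illustrative)
N_BN = 60          # bottleneck width (illustrative)
N_SENONES = 1000   # senone inventory size (illustrative)

# DAE front end: encodes noisy acoustic frames into a "denoised" code.
W_enc, b_enc = layer(N_ACOUSTIC, 512)
# Classifier stack on top of the DAE code, with a narrow BN layer.
W_h1, b_h1 = layer(512, 512)
W_bn, b_bn = layer(512, N_BN)
W_out, b_out = layer(N_BN, N_SENONES)

def forward(noisy_frames):
    """Return (BN features, senone posteriors) for a batch of frames."""
    code = sigmoid(noisy_frames @ W_enc + b_enc)    # DAE encoding
    h1 = sigmoid(code @ W_h1 + b_h1)
    bn = h1 @ W_bn + b_bn                           # linear BN features
    post = softmax(sigmoid(bn) @ W_out + b_out)     # senone posteriors
    return bn, post

frames = rng.standard_normal((10, N_ACOUSTIC))      # 10 noisy frames
bn_feats, senone_post = forward(frames)
print(bn_feats.shape)      # (10, 60)
print(senone_post.shape)   # (10, 1000)
```

Both outputs then feed the i-vector extractor: the senone posteriors play the role of GMM mixture occupancies when accumulating Baum–Welch sufficient statistics, and the BN features serve as the acoustic features over which those statistics are computed.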


Cited By

  • (2022) "A novel hybrid feature method based on Caelen auditory model and gammatone filterbank for robust speaker recognition under noisy environment and speech coding distortion," Multimedia Tools and Applications, 82(11):16195–16212, 27 Oct 2022. doi:10.1007/s11042-022-14068-4
  • (2022) "Convolutional denoising autoencoder based SSVEP signal enhancement to SSVEP-based BCIs," Microsystem Technologies, 28(1):237–244, 1 Jan 2022. doi:10.1007/s00542-019-04654-2
  • (2020) "An efficient text-independent speaker verification for short utterance data from mobile devices," Multimedia Tools and Applications, 79(3–4):3049–3074, 1 Jan 2020. doi:10.1007/s11042-019-08196-7

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 26, Issue 4
April 2018
147 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press
