Research Article

Denoised Senone I-Vectors for Robust Speaker Verification

Published: 01 April 2018

Abstract

Recently, it has been shown that senone i-vectors, whose posteriors are produced by senone deep neural networks (DNNs), outperform conventional Gaussian mixture model (GMM) i-vectors in both speaker and language recognition tasks. The success of senone i-vectors relies on the capability of the DNN to incorporate phonetic information into the i-vector extraction process. In this paper, we argue that to apply senone i-vectors in noisy environments, it is important to robustify both the phonetically discriminative acoustic features and the senone posteriors estimated by the DNN. To this end, we propose a deep architecture formed by stacking a deep belief network on top of a denoising autoencoder (DAE). After backpropagation fine-tuning, the network, referred to as a denoising autoencoder–deep neural network (DAE–DNN), facilitates the extraction of robust, phonetically discriminative bottleneck (BN) features and senone posteriors for i-vector extraction. We refer to the resulting i-vectors as denoised BN-based senone i-vectors. Results on NIST 2012 SRE show that senone i-vectors outperform conventional GMM i-vectors. More interestingly, the BN features are not only phonetically discriminative; our results suggest that they also contain sufficient speaker information to produce BN-based senone i-vectors that outperform conventional senone i-vectors. This work also shows that DAE training is more beneficial to BN feature extraction than to senone posterior estimation.
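The pipeline the abstract describes — a denoising-autoencoder front end whose cleaned representation feeds a senone classifier, with bottleneck (BN) features tapped from a narrow intermediate layer and senone posteriors taken from the softmax output — can be sketched as follows. This is a minimal NumPy illustration with made-up layer sizes and random (untrained) weights; the paper's actual architecture, pre-training/fine-tuning procedure, and senone inventory are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """Randomly initialised affine layer (stand-in for trained weights)."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

N_ACOUSTIC = 39    # e.g. MFCCs + deltas (illustrative)
N_BN = 60          # bottleneck width (illustrative)
N_SENONES = 1000   # senone inventory size (illustrative)

# DAE front end: encodes noisy acoustic frames into a "denoised" code.
W_enc, b_enc = layer(N_ACOUSTIC, 512)
# Classifier stack on top of the DAE code, with a narrow BN layer.
W_h1, b_h1 = layer(512, 512)
W_bn, b_bn = layer(512, N_BN)
W_out, b_out = layer(N_BN, N_SENONES)

def forward(noisy_frames):
    """Return (BN features, senone posteriors) for a batch of frames."""
    code = sigmoid(noisy_frames @ W_enc + b_enc)    # DAE encoding
    h1 = sigmoid(code @ W_h1 + b_h1)
    bn = h1 @ W_bn + b_bn                           # linear BN features
    post = softmax(sigmoid(bn) @ W_out + b_out)     # senone posteriors
    return bn, post

frames = rng.standard_normal((10, N_ACOUSTIC))      # 10 noisy frames
bn_feats, senone_post = forward(frames)
print(bn_feats.shape)      # (10, 60)
print(senone_post.shape)   # (10, 1000)
```

Both outputs then feed the i-vector extractor: the senone posteriors play the role of GMM mixture occupancies when accumulating Baum–Welch sufficient statistics, and the BN features serve as the acoustic features over which those statistics are computed.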


Cited By

  • (2022) "A novel hybrid feature method based on Caelen auditory model and gammatone filterbank for robust speaker recognition under noisy environment and speech coding distortion," Multimedia Tools and Applications, 82(11):16195–16212, 27 Oct 2022. doi:10.1007/s11042-022-14068-4
  • (2022) "Convolutional denoising autoencoder based SSVEP signal enhancement to SSVEP-based BCIs," Microsystem Technologies, 28(1):237–244, 1 Jan 2022. doi:10.1007/s00542-019-04654-2
  • (2020) "An efficient text-independent speaker verification for short utterance data from mobile devices," Multimedia Tools and Applications, 79(3–4):3049–3074, 1 Jan 2020. doi:10.1007/s11042-019-08196-7

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 26, Issue 4
April 2018
147 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press
