Low-variance Multitaper Mel-frequency Cepstral Coefficient Features for Speech and Speaker Recognition Systems

Md. Jahangir Alam¹,²,
Patrick Kenny² &
Douglas O’Shaughnessy¹

Abstract

In this paper, we investigate low-variance multitaper spectrum estimation methods for computing mel-frequency cepstral coefficient (MFCC) features for robust speech and speaker recognition systems. In speech and speaker recognition, MFCC features are usually computed from a single-tapered (e.g., Hamming-windowed) direct spectrum estimate, that is, the squared magnitude of the Fourier transform of the observed signal. Compared with the periodogram, a power spectrum estimate that uses a smooth window function, such as the Hamming window, can reduce spectral leakage. Windowing may thus help to reduce spectral bias, but the variance of the estimate often remains high. A multitaper spectrum estimation method that uses well-selected tapers can gain from the bias-variance trade-off, giving an estimate with small bias compared with a single-taper spectrum estimate but with substantially lower variance. Speech recognition experiments on the AURORA-2 and AURORA-4 corpora and speaker verification experiments on the NIST 2010 speaker recognition evaluation corpus (telephone as well as microphone speech) show that the multitaper methods outperform the Hamming-windowed spectrum estimation method. In the speaker verification task, compared with the Hamming window technique, the sinusoidal weighted cepstrum estimator, multi-peak, and Thomson multitaper techniques provide relative improvements of 20.25%, 18.73%, and 12.83%, respectively, in equal error rate.
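
To make the contrast between the two estimators concrete, the following is a minimal sketch, assuming NumPy and SciPy are available. It is not the authors' implementation: the function names, the frame length, and the parameters `n_tapers` and `nw` are illustrative assumptions, and only a Thomson-style estimator (DPSS tapers with uniform weights) is shown. The multi-peak and sinusoidal weighted cepstrum estimator variants use different tapers and weights that are not reproduced here.

```python
import numpy as np
from scipy.signal.windows import dpss


def single_taper_spectrum(frame):
    """Hamming-windowed direct spectrum estimate: |FFT(w * x)|^2."""
    w = np.hamming(len(frame))
    w = w / np.sqrt(np.sum(w ** 2))  # scale the taper to unit energy
    return np.abs(np.fft.rfft(w * frame)) ** 2


def multitaper_spectrum(frame, n_tapers=6, nw=3.5, weights=None):
    """Weighted average of n_tapers tapered spectra (Thomson-style sketch).

    n_tapers and nw are illustrative values, not taken from the paper.
    """
    # With Kmax given, dpss returns an array of shape (n_tapers, len(frame))
    # whose rows are unit-energy tapers.
    tapers = dpss(len(frame), nw, Kmax=n_tapers)
    if weights is None:
        # Uniform weights are an assumption made for this sketch.
        weights = np.full(n_tapers, 1.0 / n_tapers)
    # One tapered spectrum per row, then a weighted sum across tapers.
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ spectra
```

Applied to many frames of white noise, both functions return spectra of the same length, and the multitaper estimate should fluctuate noticeably less from frame to frame because each frequency bin averages several nearly uncorrelated tapered spectra. Either spectrum can then be passed to the usual mel filterbank, logarithm, and discrete cosine transform stages to obtain MFCCs.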

Author information

Authors and Affiliations

INRS-EMT, University of Quebec, Montreal, Canada
Md. Jahangir Alam & Douglas O’Shaughnessy
CRIM, Montreal, Canada
Md. Jahangir Alam & Patrick Kenny

Corresponding author

Correspondence to Md. Jahangir Alam.

About this article

Cite this article

Alam, M.J., Kenny, P. & O’Shaughnessy, D. Low-variance Multitaper Mel-frequency Cepstral Coefficient Features for Speech and Speaker Recognition Systems. Cogn Comput 5, 533–544 (2013). https://doi.org/10.1007/s12559-012-9197-5

Received: 11 January 2012
Accepted: 09 November 2012
Published: 07 December 2012
Issue Date: December 2013
DOI: https://doi.org/10.1007/s12559-012-9197-5
