Abstract
In this paper, we investigate low-variance multitaper spectrum estimation methods for computing mel-frequency cepstral coefficient (MFCC) features in robust speech and speaker recognition systems. In speech and speaker recognition, MFCC features are usually computed from a single-tapered (e.g., Hamming-windowed) direct spectrum estimate, that is, the squared magnitude of the Fourier transform of the windowed signal. Compared with the periodogram, a power spectrum estimate that uses a smooth window function, such as the Hamming window, reduces spectral leakage. Windowing therefore helps to reduce spectral bias, but the variance of the estimate often remains high. A multitaper spectrum estimation method that uses well-selected tapers can exploit the bias-variance trade-off, giving an estimate whose bias is small compared with that of a single-taper spectrum estimate and whose variance is substantially lower. Speech recognition experiments on the AURORA-2 and AURORA-4 corpora and speaker verification experiments on the NIST 2010 speaker recognition evaluation corpus (telephone as well as microphone speech) show that the multitaper methods outperform the Hamming-windowed spectrum estimation method. In the speaker verification task, the sinusoidal weighted cepstrum estimator, multi-peak, and Thomson multitaper techniques provide relative improvements in equal error rate of 20.25, 18.73, and 12.83 %, respectively, over the Hamming window technique.
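The variance reduction described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes orthonormal sine tapers with uniform weights, whereas the sinusoidal weighted cepstrum estimator, multi-peak, and Thomson methods each use their own tapers and weights. All function names (`sine_tapers`, `multitaper_spectrum`, `variance_at_bin`, and so on) are hypothetical, and the white-noise test signal is chosen only because its true spectrum is flat.

```python
import cmath
import math
import random


def sine_tapers(n_samples, n_tapers):
    """Return n_tapers orthonormal sine tapers of length n_samples."""
    scale = math.sqrt(2.0 / (n_samples + 1))
    return [
        [scale * math.sin(math.pi * j * (n + 1) / (n_samples + 1))
         for n in range(n_samples)]
        for j in range(1, n_tapers + 1)
    ]


def power_spectrum(frame, taper):
    """Squared DFT magnitude of the tapered frame for bins 0..N/2."""
    n = len(frame)
    tapered = [x * w for x, w in zip(frame, taper)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                  for t, v in enumerate(tapered))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hamming_spectrum(frame):
    """Single-taper estimate using a unit-energy Hamming window."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1))
              for t in range(n)]
    norm = math.sqrt(sum(w * w for w in window))
    return power_spectrum(frame, [w / norm for w in window])


def multitaper_spectrum(frame, n_tapers, weights=None):
    """Weighted average of the per-taper power spectra.

    Assumption: uniform weights 1/K unless others are supplied.
    """
    if weights is None:
        weights = [1.0 / n_tapers] * n_tapers
    estimate = [0.0] * (len(frame) // 2 + 1)
    for lam, taper in zip(weights, sine_tapers(len(frame), n_tapers)):
        for k, p in enumerate(power_spectrum(frame, taper)):
            estimate[k] += lam * p
    return estimate


def variance_at_bin(estimator, n_trials=200, n_samples=32, bin_index=8, seed=0):
    """Sample variance of one spectral bin over white-noise frames."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_trials):
        frame = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        values.append(estimator(frame)[bin_index])
    mean = sum(values) / n_trials
    return sum((v - mean) ** 2 for v in values) / (n_trials - 1)


if __name__ == "__main__":
    single = variance_at_bin(hamming_spectrum)
    multi = variance_at_bin(lambda f: multitaper_spectrum(f, n_tapers=4))
    print(f"Hamming-window variance: {single:.3f}")
    print(f"multitaper variance:     {multi:.3f}")
```

Because the tapers are orthonormal, the per-taper spectra of a white-noise frame are approximately uncorrelated away from the lowest and highest frequencies, so averaging K of them lowers the variance by roughly a factor of K. In an MFCC front end, the averaged spectrum would simply replace the Hamming-windowed spectrum before the mel filterbank is applied.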
Cite this article
Alam, M.J., Kenny, P. & O’Shaughnessy, D. Low-variance Multitaper Mel-frequency Cepstral Coefficient Features for Speech and Speaker Recognition Systems. Cogn Comput 5, 533–544 (2013). https://doi.org/10.1007/s12559-012-9197-5
DOI: https://doi.org/10.1007/s12559-012-9197-5