Abstract
A range of speech extraction techniques has been applied to improve speech recognition when signals are mixed with noise. Speech recognition performance degrades when the recognition environment differs from the model training environment, because voice versus non-voice classification becomes inaccurate at low signal-to-noise ratios (SNRs); voice activity detection likewise becomes unreliable when the noise conditions change inconsistently between the recognition environment and the learning model. One remedy is to extract a noise-robust speech feature, removing the noise before recognition. This study extracted such a feature using an equivalent rectangular bandwidth (ERB) filter bank cepstrum, examined within a computational auditory scene analysis (CASA) system that analyzes the properties of the speech signal, and constructed a learning model on the acoustic model to improve the speech recognition rate. The proposed model was evaluated using train and train-station noises. Distortion was measured after noise reduction at SNRs of \(-10\) and \(-5\) dB, showing improvements of 1.67 and 1.74 dB, respectively.
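To make the front end concrete, the following is a minimal sketch of an ERB filter bank cepstrum. The abstract does not specify the paper's exact configuration, so this sketch assumes the standard Glasberg-Moore ERB-rate scale and uses ERB-spaced triangular filters on the power spectrum (a CASA front end would more typically use a gammatone filterbank); the function names, frame length, and filter/coefficient counts are illustrative, not the authors' implementation.

```python
# Illustrative ERB filter bank cepstrum (NOT the paper's exact method):
# ERB-spaced triangular filters on the power spectrum, log energies, then DCT.
import numpy as np
from scipy.fft import dct

def hz_to_erb(f):
    # Glasberg & Moore (1990) ERB-rate scale
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_to_hz(e):
    # Inverse of the ERB-rate scale
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def erb_cepstrum(frame, fs, n_filters=32, n_ceps=13, n_fft=512):
    """Cepstral coefficients from ERB-spaced triangular filters (illustrative)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    # Filter edges equally spaced on the ERB-rate scale, mapped back to Hz
    erb_pts = np.linspace(hz_to_erb(50.0), hz_to_erb(fs / 2.0), n_filters + 2)
    hz_pts = erb_to_hz(erb_pts)
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        # Triangular weighting between adjacent center frequencies
        w = np.maximum(0.0, np.minimum((freqs - lo) / (mid - lo),
                                       (hi - freqs) / (hi - mid)))
        energies[i] = np.log(w @ spectrum + 1e-10)
    # DCT decorrelates the log filter bank energies into cepstral coefficients
    return dct(energies, type=2, norm='ortho')[:n_ceps]

# Example: one 25 ms frame of white noise at 16 kHz
fs = 16000
coeffs = erb_cepstrum(np.random.randn(int(0.025 * fs)), fs)
```

Spacing the filters on the ERB scale concentrates resolution at low frequencies, matching auditory frequency selectivity, which is the property the CASA analysis exploits for noise-robust features.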
Acknowledgments
This work was supported by the Gachon University research fund of 2013 (GCU-2013-R366).
Cite this article
Oh, S.-Y., & Chung, K. (2014). Improvement of speech detection using ERB feature extraction. Wireless Personal Communications, 79, 2439–2451. https://doi.org/10.1007/s11277-014-1752-9