Speech Recognition Combining MFCCs and Image Features

Stamatis Karlos, Nikos Fazakis, Katerina Karanikola, Sotiris Kotsiantis & Kyriakos Sgarbas

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9811)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

The automatic speech recognition (ASR) task is a well-known problem across fields such as Natural Language Processing (NLP), Digital Signal Processing (DSP) and Machine Learning (ML). In this work, a robust supervised classification model (MFCCs + autocor + SVM) is presented for feature extraction from solo speech signals. Mel Frequency Cepstral Coefficients (MFCCs) are combined with Content Based Image Retrieval (CBIR) features extracted from the spectrogram produced by each frame of the speech signal. The improvement in classification accuracy obtained with these extended feature vectors is examined against using MFCCs alone, with several classifiers, in three scenarios with different numbers of speakers.
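As a rough illustration of the extended feature vector described in the abstract, the sketch below computes MFCCs for each frame and appends a simplified auto-correlogram taken from a quantised spectrogram of the same frame. It is not the authors' implementation: the frame sizes, filter count, grey-level quantisation, the 4-neighbour correlogram (a stand-in for the "autocor" CBIR descriptor) and all function names are assumptions chosen for illustration, and only the Python standard library is used.

```python
import cmath
import math

def frames(signal, size, hop):
    """Split a signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def power_spectrum(frame):
    """Hamming-windowed power spectrum via a naive DFT (adequate for short frames)."""
    n = len(frame)
    w = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
         for i, x in enumerate(frame)]
    return [abs(sum(w[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sr / 2)
    pts = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * p / sr)) for p in pts]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def mfcc(frame, bank, n_coeffs):
    """Log mel-filterbank energies followed by an unnormalised DCT-II."""
    spec = power_spectrum(frame)
    log_e = [math.log(sum(f * p for f, p in zip(row, spec)) + 1e-10) for row in bank]
    m = len(log_e)
    return [sum(e * math.cos(math.pi * c * (i + 0.5) / m) for i, e in enumerate(log_e))
            for c in range(n_coeffs)]

def spectrogram_image(frame, win, hop, levels):
    """Quantise the frame's log-power spectrogram into `levels` grey levels."""
    cols = [[math.log(p + 1e-10) for p in power_spectrum(f)]
            for f in frames(frame, win, hop)]
    lo = min(min(c) for c in cols)
    hi = max(max(c) for c in cols)
    scale = (levels - 1) / (hi - lo) if hi > lo else 0.0
    return [[int(round((v - lo) * scale)) for v in c] for c in cols]

def auto_correlogram(img, levels, distances):
    """Simplified auto-correlogram: for each distance d and grey level c, the
    fraction of 4-neighbours at offset d that share level c (assumed stand-in)."""
    h, w = len(img), len(img[0])
    feats = []
    for d in distances:
        same, total = [0] * levels, [0] * levels
        for y in range(h):
            for x in range(w):
                c = img[y][x]
                for ny, nx in ((y + d, x), (y - d, x), (y, x + d), (y, x - d)):
                    if 0 <= ny < h and 0 <= nx < w:
                        total[c] += 1
                        same[c] += img[ny][nx] == c
        feats += [s / t if t else 0.0 for s, t in zip(same, total)]
    return feats

def extended_features(frame, bank, n_coeffs=13, levels=4, distances=(1, 2)):
    """Concatenate MFCCs with image features from the frame's own spectrogram."""
    img = spectrogram_image(frame, win=64, hop=32, levels=levels)
    return mfcc(frame, bank, n_coeffs) + auto_correlogram(img, levels, distances)

# Synthetic two-tone signal in place of real speech (assumption for the demo).
SR = 8000
signal = [math.sin(2 * math.pi * 440 * t / SR) + 0.5 * math.sin(2 * math.pi * 1200 * t / SR)
          for t in range(512)]
bank = mel_filterbank(n_filters=20, n_fft=256, sr=SR)
vectors = [extended_features(f, bank) for f in frames(signal, size=256, hop=128)]
```

Each entry of `vectors` holds 13 MFCCs followed by 8 correlogram values (4 grey levels × 2 distances). In a setup like the one the abstract outlines, such vectors would be labelled per speaker and passed to a classifier such as an SVM, with an MFCC-only baseline obtained by keeping just the first 13 components.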

Author information

Authors and Affiliations

University of Patras, Patras, Greece
Stamatis Karlos, Nikos Fazakis, Katerina Karanikola, Sotiris Kotsiantis & Kyriakos Sgarbas

Corresponding author

Correspondence to Stamatis Karlos.

Editor information

Editors and Affiliations

SPIIRAS, Saint-Petersburg, Russia
Andrey Ronzhin
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova
Budapest University of Technology and Economics, Budapest, Hungary
Géza Németh

About this paper

Cite this paper

Karlos, S., Fazakis, N., Karanikola, K., Kotsiantis, S., Sgarbas, K. (2016). Speech Recognition Combining MFCCs and Image Features. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol 9811. Springer, Cham. https://doi.org/10.1007/978-3-319-43958-7_79

DOI: https://doi.org/10.1007/978-3-319-43958-7_79
Published: 13 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43957-0
Online ISBN: 978-3-319-43958-7
eBook Packages: Computer Science, Computer Science (R0)
