research-article

Image Approach to Speech Recognition on CNN

Published: 06 June 2020

Abstract

This paper discusses recognition of Uzbek spoken digits using spectrogram images and a deep convolutional neural network (CNN). Spectrogram images were generated from the speech signals and used to train the deep CNN. The presented CNN model contains 3 convolutional layers and 2 fully connected layers, which extract and evaluate discriminative features of the spectrogram images. During the research, a dataset of Uzbek spoken digits was created and used to train the presented CNN model. Test results show that the proposed approach classified Uzbek spoken digits with 100% accuracy.
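
The spectrogram-image step described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the abstract does not give the frame length, hop size, window type, or image scaling, so the Hann window, `frame_len=256`, `hop=128`, and the function names `log_spectrogram` and `to_uint8_image` are all assumptions, and a synthetic tone stands in for a recorded digit.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames).

    frame_len, hop and the Hann window are assumed values, not taken
    from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Magnitude of the one-sided FFT of each frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression (dB); transpose so that rows are frequency bins.
    return 20.0 * np.log10(magnitude + eps).T


def to_uint8_image(spec):
    """Min-max scale a spectrogram to an 8-bit grayscale image array."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
    return (scaled * 255).astype(np.uint8)


# Synthetic stand-in for a recorded digit: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(tone)
image = to_uint8_image(spec)
```

With these assumed settings the resulting array has one row per frequency bin and one column per frame, and could be saved as a grayscale image or passed to a CNN as a single-channel input.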

Cited By

  • (2025) Multi-modal deep learning for credit rating prediction using text and numerical data streams. Applied Soft Computing, 112771. DOI: 10.1016/j.asoc.2025.112771. Online publication date: Jan-2025
  • (2024) Amharic spoken digits recognition using convolutional neural network. Journal of Big Data 11:1. DOI: 10.1186/s40537-024-00910-z. Online publication date: 4-May-2024
  • (2024) Unlocking the Potential of Spiking Neural Networks: Understanding the What, Why, and Where. IEEE Transactions on Cognitive and Developmental Systems 16:5 (1648-1663). DOI: 10.1109/TCDS.2023.3329747. Online publication date: Oct-2024

    Information & Contributors

    Information

    Published In

    ACM Other conferences
    ISCSIC 2019: Proceedings of the 2019 3rd International Symposium on Computer Science and Intelligent Control
    September 2019
    397 pages
    ISBN:9781450376617
    DOI:10.1145/3386164

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Convolutional Neural Network
    2. Spectrogram Image
    3. Speech Classification
    4. Speech Recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ISCSIC 2019

    Acceptance Rates

    ISCSIC 2019 Paper Acceptance Rate 77 of 152 submissions, 51%;
    Overall Acceptance Rate 192 of 401 submissions, 48%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 53
    • Downloads (Last 6 weeks): 3
    Reflects downloads up to 25 Feb 2025

    Citations

    Cited By

    • (2024) Dynamic Attention Fusion Decoder for Speech Recognition. 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (3125-3132). DOI: 10.1109/SMC54092.2024.10831705. Online publication date: 6-Oct-2024
    • (2024) Comparison review of image classification techniques for early diagnosis of diabetic retinopathy. Biomedical Physics & Engineering Express 10:6 (062001). DOI: 10.1088/2057-1976/ad7267. Online publication date: 5-Sep-2024
    • (2024) Subject dependent speech verification approach for assistive special education. Education and Information Technologies 29:13 (16157-16175). DOI: 10.1007/s10639-024-12474-9. Online publication date: 7-Feb-2024
    • (2023) Automated Battery Making Fault Classification Using Over-Sampled Image Data CNN Features. Sensors 23:4 (1927). DOI: 10.3390/s23041927. Online publication date: 8-Feb-2023
    • (2023) Voice-Controlled Intelligent Personal Assistant for Call-Center Automation in the Uzbek Language. Electronics 12:23 (4850). DOI: 10.3390/electronics12234850. Online publication date: 30-Nov-2023
    • (2023) Parallel Approaches in Deep Learning: Use Parallel Computing. Proceedings of the 7th International Conference on Future Networks and Distributed Systems (192-201). DOI: 10.1145/3644713.3644738. Online publication date: 21-Dec-2023
    • (2023) Identifying bird species by their calls in Soundscapes. Applied Intelligence 53:19 (21485-21499). DOI: 10.1007/s10489-023-04486-8. Online publication date: 20-Mar-2023
