research-article

Automatic speech recognition: a survey

Published: 01 March 2021

Abstract

Recently, great strides have been made in the field of automatic speech recognition (ASR) by using various deep learning techniques. In this study, we present a thorough comparison of the cutting-edge techniques currently used in this area, with a special focus on deep learning methods. The study explores different feature extraction methods and state-of-the-art classification models, and compares their impact on ASR performance. Because deep learning techniques are highly data-dependent, the speech datasets available online are also discussed in detail. Finally, the online toolkits, resources, and language models that can help in building an ASR system are presented. This study covers the aspects that can affect the performance of an ASR system; hence, we believe this work is a good starting point for academics interested in ASR research.

Cited By

  • (2025) A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models. EURASIP Journal on Audio, Speech, and Music Processing 2025:1. DOI: 10.1186/s13636-024-00388-w. Online publication date: 21-Jan-2025
  • (2025) A novel approach to enhancing biomedical signal recognition via hybrid high-order information bottleneck driven spiking neural networks. Neural Networks 183:C. DOI: 10.1016/j.neunet.2024.106976. Online publication date: 1-Mar-2025
  • (2024) Optimizing Video Queries with Declarative Clues. Proceedings of the VLDB Endowment 17:11 (3256-3268). DOI: 10.14778/3681954.3681998. Online publication date: 1-Jul-2024
Published In

Multimedia Tools and Applications  Volume 80, Issue 6
Mar 2021
1586 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 March 2021
Accepted: 13 October 2020
Revision received: 04 September 2020
Received: 31 May 2020

Author Tags

  1. Speech recognition
  2. ASR
  3. Automatic speech recognition
  4. Feature extraction
  5. Classification models
  6. Language models

Qualifiers

  • Research-article

Citations

  • (2024) Convolutional Neural Networks to Facilitate the Continuous Recognition of Arabic Speech with Independent Speakers. Journal of Electrical and Computer Engineering 2024. DOI: 10.1155/2024/4976944. Online publication date: 1-Jan-2024
  • (2024) Integrating Attention Mechanisms with Bidirectional Long Short-term Memory Recurrent Neural Networks for Improved Speech Recognition. Proceedings of the 2024 3rd International Conference on Algorithms, Data Mining, and Information Technology (224-227). DOI: 10.1145/3701100.3701147. Online publication date: 27-Sep-2024
  • (2024) Speech Recognition of Tagalog Talisay Batangueño Accent in the Philippines using Wav2Vec2.0. Proceedings of the 2024 15th International Conference on E-Education, E-Business, E-Management and E-Learning (416-421). DOI: 10.1145/3670013.3670031. Online publication date: 18-Mar-2024
  • (2024) Disambiguation of Isolated Manipuri Tonal Contrast Word Pairs Using Acoustic Features. ACM Transactions on Asian and Low-Resource Language Information Processing 23:3 (1-18). DOI: 10.1145/3643830. Online publication date: 9-Mar-2024
  • (2024) Multimodal Prediction of Obsessive-Compulsive Disorder and Comorbid Depression Severity and Energy Delivered by Deep Brain Electrodes. IEEE Transactions on Affective Computing 15:4 (2025-2041). DOI: 10.1109/TAFFC.2024.3395117. Online publication date: 30-Apr-2024
  • (2024) Uncertainty-based knowledge distillation for Bayesian deep neural network compression. International Journal of Approximate Reasoning 175:C. DOI: 10.1016/j.ijar.2024.109301. Online publication date: 1-Dec-2024
  • (2024) Robust Speech Enhancement Using Dabauchies Wavelet Based Adaptive Wavelet Thresholding for the Development of Robust Automatic Speech Recognition: A Comprehensive Review. Wireless Personal Communications: An International Journal 137:4 (2085-2119). DOI: 10.1007/s11277-024-11448-x. Online publication date: 1-Aug-2024