This book summarizes recent advances in automatic speech recognition, with a focus on discriminative and hierarchical models. It is the first automatic speech recognition book to offer comprehensive coverage of recent developments such as conditional random fields and deep learning techniques. It presents the insights behind, and theoretical foundations of, a series of recent models for sequential learning, including conditional random fields, semi-Markov and hidden conditional random fields, deep neural networks, deep belief networks, and deep stacking models. It also discusses practical considerations in applying these models to both acoustic and language modeling for continuous speech recognition.
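As an illustration of the hybrid DNN-HMM acoustic modeling covered in the book, the sketch below (not taken from the book) shows a feed-forward network mapping acoustic feature frames to posteriors over senones (tied HMM states). All layer sizes, feature dimensions, and names are hypothetical choices made for the example.

```python
# Minimal illustrative sketch of a hybrid DNN-HMM acoustic model forward pass.
# Randomly initialized weights stand in for a trained network; sizes are assumed.
import numpy as np

rng = np.random.default_rng(0)

NUM_FEATS = 40      # e.g. 40 log-mel filterbank coefficients per frame (assumed)
HIDDEN = 256        # hidden layer width (assumed)
NUM_SENONES = 3000  # number of tied HMM states (assumed)

W1 = rng.standard_normal((NUM_FEATS, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, NUM_SENONES)) * 0.01
b2 = np.zeros(NUM_SENONES)

def senone_posteriors(frames: np.ndarray) -> np.ndarray:
    """Map (T, NUM_FEATS) acoustic frames to (T, NUM_SENONES) senone posteriors."""
    h = np.maximum(frames @ W1 + b1, 0.0)          # ReLU hidden layer
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)    # softmax with numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# In a hybrid system these posteriors are typically converted to scaled
# likelihoods, p(x|s) proportional to p(s|x) / p(s), before HMM decoding.
frames = rng.standard_normal((100, NUM_FEATS))     # ~1 second of 10 ms frames
posteriors = senone_posteriors(frames)
print(posteriors.shape, posteriors[0].sum())       # (100, 3000) 1.0
```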
Cited By
- Yu J, Ye N, Du X, Han L and Jain D (2022). Automated English Speech Recognition Using Dimensionality Reduction with Deep Learning Approach, Wireless Communications & Mobile Computing, 2022, Online publication date: 1-Jan-2022.
- Pan C and Chen J (2022). A Framework of Directional-Gain Beamforming and a White-Noise-Gain-Controlled Solution, IEEE/ACM Transactions on Audio, Speech and Language Processing, 30, (2875-2887), Online publication date: 1-Jan-2022.
- Jorge J, Giménez A, Silvestre-Cerdà J, Civera J, Sanchis A and Juan A (2021). Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models, IEEE/ACM Transactions on Audio, Speech and Language Processing, 30, (148-161), Online publication date: 1-Jan-2022.
- Berrada L, Dathathri S, Dvijotham K, Stanforth R, Bunel R, Uesato J, Gowal S and Kumar M Make sure you're unsure Proceedings of the 35th International Conference on Neural Information Processing Systems, (11136-11147)
- Spolaôr N, Lee H, Takaki W, Ensina L, Parmezan A, Oliva J, Coy C and Wu F (2021). A video indexing and retrieval computational prototype based on transcribed speech, Multimedia Tools and Applications, 80:25, (33971-34017), Online publication date: 1-Oct-2021.
- Ouisaadane A and Safi S (2021). A comparative study for Arabic speech recognition system in noisy environments, International Journal of Speech Technology, 24:3, (761-770), Online publication date: 1-Sep-2021.
- Jones R, Zamani H, Schedl M, Chen C, Reddy S, Clifton A, Karlgren J, Hashemi H, Pappu A, Nazari Z, Yang L, Semerci O, Bouchard H and Carterette B Current Challenges and Future Directions in Podcast Information Access Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, (1554-1565)
- Tan K, Feng L and Jiang M (2021). Evolutionary Transfer Optimization - A New Frontier in Evolutionary Computation Research, IEEE Computational Intelligence Magazine, 16:1, (22-33), Online publication date: 1-Feb-2021.
- Shamshirband S, Fathi M, Dehzangi A, Chronopoulos A and Alinejad-Rokny H (2021). A review on deep learning approaches in healthcare systems, Journal of Biomedical Informatics, 113:C, Online publication date: 1-Jan-2021.
- Becerra A, Rosa J, González E, Pedroza A, Escalante N and Santos E (2020). A comparative case study of neural network training by using frame-level cost functions for automatic speech recognition purposes in Spanish, Multimedia Tools and Applications, 79:27-28, (19669-19715), Online publication date: 1-Jul-2020.
- Kolesau A and Šešok D (2020). Voice Activation Systems for Embedded Devices, Informatica, 31:1, (65-88), Online publication date: 1-Jan-2020.
- Wang Z, Wang P and Wang D (2020). Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR, IEEE/ACM Transactions on Audio, Speech and Language Processing, 28, (1778-1787), Online publication date: 1-Jan-2020.
- Gosztolya G (2019). Posterior-thresholding feature extraction for paralinguistic speech classification, Knowledge-Based Systems, 186:C, Online publication date: 15-Dec-2019.
- Dimitrova-Grekow T and Konopko P New Parameters for Improving Emotion Recognition in Human Voice 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), (4205-4210)
- Shahnawazuddin S, Adiga N, Sai B, Ahmad W and Kathania H (2019). Developing speaker independent ASR system using limited data through prosody modification based on fuzzy classification of spectral bins, Digital Signal Processing, 93:C, (34-42), Online publication date: 1-Oct-2019.
- Hou B, Chen Q, Shen J, Liu X, Zhong P, Wang Y, Chen Z and Li Z Gradual Machine Learning for Entity Resolution The World Wide Web Conference, (3526-3530)
- Ma Y, Hao Y, Chen M, Chen J, Lu P and Košir A (2019). Audio-visual emotion fusion (AVEF), Information Fusion, 46:C, (184-192), Online publication date: 1-Mar-2019.
- Schluter R, Beck E and Ney H (2019). Upper and Lower Tight Error Bounds for Feature Omission with an Extension to Context Reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2, (502-514), Online publication date: 1-Feb-2019.
- Prajongjai S, Triyason T and Mongkolnam P Satja Proceedings of the 10th International Conference on Advances in Information Technology, (1-7)
- Chakraborty D, Garg D, Ghosh A and Chan J Trigger Detection System for American Sign Language using Deep Convolutional Neural Networks Proceedings of the 10th International Conference on Advances in Information Technology, (1-6)
- Yuan X, Chen Y, Wang A, Chen K, Zhang S, Huang H and Molloy I All Your Alexa Are Belong to Us: A Remote Voice Control Attack against Echo 2018 IEEE Global Communications Conference (GLOBECOM), (1-6)
- Dwijayanti S, Yamamori K and Miyoshi M (2018). Enhancement of speech dynamics for voice activity detection using DNN, EURASIP Journal on Audio, Speech, and Music Processing, 2018:1, (1-15), Online publication date: 1-Dec-2018.
- Long Y, Li Y and Zhang B (2018). Offline to online speaker adaptation for real-time deep neural network based LVCSR systems, Multimedia Tools and Applications, 77:21, (28101-28119), Online publication date: 1-Nov-2018.
- Becerra A, Rosa J, González E, Pedroza A and Escalante N (2018). Training deep neural networks with non-uniform frame-level cost function for automatic speech recognition, Multimedia Tools and Applications, 77:20, (27231-27267), Online publication date: 1-Oct-2018.
- Sarria-Paja M and Falk T (2018). Fusion of bottleneck, spectral and modulation spectral features for improved speaker verification of neutral and whispered speech, Speech Communication, 102:C, (78-86), Online publication date: 1-Sep-2018.
- Yin P, Xin J and Qi Y (2018). Linear Feature Transform and Enhancement of Classification on Deep Neural Network, Journal of Scientific Computing, 76:3, (1396-1406), Online publication date: 1-Sep-2018.
- Xu C, Xie L and Xiao X (2018). A Bidirectional LSTM Approach with Word Embeddings for Sentence Boundary Detection, Journal of Signal Processing Systems, 90:7, (1063-1075), Online publication date: 1-Jul-2018.
- Liu J, Ling Z, Wei S, Hu G and Dai L (2018). Improving the Decoding Efficiency of Deep Neural Network Acoustic Models by Cluster-Based Senone Selection, Journal of Signal Processing Systems, 90:7, (999-1011), Online publication date: 1-Jul-2018.
- Becerra A, De La Rosa J and González E (2018). Speech recognition in a dialog system, Multimedia Tools and Applications, 77:12, (15875-15911), Online publication date: 1-Jun-2018.
- Mao S, Li X, Li K, Wu Z, Liu X and Meng H Unsupervised Discovery of an Extended Phoneme Set in L2 English Speech for Mispronunciation Detection and Diagnosis 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (6244-6248)
- Li L, Wang D, Chen Y, Shi Y, Tang Z and Zheng T Deep Factorization for Speech Signal 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (5094-5098)
- Mao S, Wu Z, Li R, Li X, Meng H and Cai L Applying Multitask Learning to Acoustic-Phonemic Model for Mispronunciation Detection and Diagnosis in L2 English Speech 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (6254-6258)
- Kobayashi T Trainable Co-Occurrence Activation Unit for Improving Convnet 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (1273-1277)
- Ogawa A, Delcroix M, Karita S and Nakatani T Rescoring N-Best Speech Recognition List Based on One-on-One Hypothesis Comparison Using Encoder-Classifier Model 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (6099-6103)
- Wang S, Li Z, Ding C, Yuan B, Qiu Q, Wang Y and Liang Y C-LSTM Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, (11-20)
- Vegesna V, Gurugubelli K, Vydana H, Pulugandla B, Shrivastava M and Vuppala A DNN-HMM Acoustic Modeling for Large Vocabulary Telugu Speech Recognition Mining Intelligence and Knowledge Exploration, (189-197)
- Huang Z, Pan Z, Liu Q, Long B, Ma H and Chen E An Ad CTR Prediction Method Based on Feature Learning of Deep and Shallow Layers Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, (2119-2122)
- Rodomagoulakis I, Katsamanis A, Potamianos G, Giannoulis P, Tsiami A and Maragos P (2017). Room-localized spoken command recognition in multi-room, multi-microphone environments, Computer Speech and Language, 46:C, (419-443), Online publication date: 1-Nov-2017.
- Glasser A, Kushalnagar K and Kushalnagar R Deaf, Hard of Hearing, and Hearing Perspectives on Using Automatic Speech Recognition in Conversation Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, (427-432)
- Dauphin Y, Fan A, Auli M and Grangier D Language modeling with gated convolutional networks Proceedings of the 34th International Conference on Machine Learning - Volume 70, (933-941)
- Lu X, Shen P, Tsao Y and Kawai H (2017). Regularization of neural network model with distance metric learning for i-vector based spoken language identification, Computer Speech and Language, 44:C, (48-60), Online publication date: 1-Jul-2017.
- Potamianos G, Marcheret E, Mroueh Y, Goel V, Koumbaroulis A, Vartholomaios A and Thermos S Audio and visual modality combination in speech processing applications The Handbook of Multimodal-Multisensor Interfaces, (489-543)
- Kennedy J, Lemaignan S, Montassier C, Lavalade P, Irfan B, Papadopoulos F, Senft E and Belpaeme T Child Speech Recognition in Human-Robot Interaction Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, (82-90)
- Higuchi T, Yoshioka T, Kinoshita K and Nakatani T Unsupervised utterance-wise beamformer estimation with speech recognition-level criterion 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (5170-5174)
- Li Y, Zhang X, Li X, Feng X, Yang J, Chen A and He Q Mobile phone clustering from acquired speech recordings using deep Gaussian supervector and spectral clustering 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2137-2141)
- Vaughan J (2017). Making better use of the crowd, The Journal of Machine Learning Research, 18:1, (7026-7071), Online publication date: 1-Jan-2017.
- Ogawa A, Hori T and Nakamura A (2016). Estimating Speech Recognition Accuracy Based on Error Type Classification, IEEE/ACM Transactions on Audio, Speech and Language Processing, 24:12, (2400-2413), Online publication date: 1-Dec-2016.
- Samarakoon L and Sim K (2016). Factorized Hidden Layer Adaptation for Deep Neural Network Based Acoustic Modeling, IEEE/ACM Transactions on Audio, Speech and Language Processing, 24:12, (2241-2250), Online publication date: 1-Dec-2016.
- Le D, Licata K, Persad C and Provost E (2016). Automatic Assessment of Speech Intelligibility for Individuals With Aphasia, IEEE/ACM Transactions on Audio, Speech and Language Processing, 24:11, (2187-2199), Online publication date: 1-Nov-2016.
- Palangi H, Ward R and Deng L (2016). Distributed Compressive Sensing: A Deep Learning Approach, IEEE Transactions on Signal Processing, 64:17, (4504-4518), Online publication date: 1-Sep-2016.
- (2016). Combination of multiple acoustic models with unsupervised adaptation for lecture speech transcription, Speech Communication, 82:C, (1-13), Online publication date: 1-Sep-2016.
- Kontschieder P, Fiterau M, Criminisi A and Bulò S Deep neural decision forests Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, (4190-4194)
- Chen K and Huo Q (2016). Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach, IEEE/ACM Transactions on Audio, Speech and Language Processing, 24:7, (1185-1193), Online publication date: 1-Jul-2016.
- Nguyen H, Lee S, Tian X, Dong M and Chng E (2016). High quality voice conversion using prosodic and high-resolution spectral features, Multimedia Tools and Applications, 75:9, (5265-5285), Online publication date: 1-May-2016.
- Hung J, Hsieh H and Chen B (2016). Robust speech recognition via enhancing the complex-valued acoustic spectrum in modulation domain, IEEE/ACM Transactions on Audio, Speech and Language Processing, 24:2, (236-251), Online publication date: 1-Feb-2016.
- Mesnil G, Dauphin Y, Yao K, Bengio Y, Deng L, Hakkani-Tur D, He X, Heck L, Tur G, Yu D and Zweig G (2015). Using recurrent neural networks for slot filling in spoken language understanding, IEEE/ACM Transactions on Audio, Speech and Language Processing, 23:3, (530-539), Online publication date: 1-Mar-2015.
- Chen K and Huo Q Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (5880-5884)
- Zhang G and Heusdens R On simplifying the primal-dual method of multipliers 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (4826-4830)