Abstract
The Grid Long Short-Term Memory (Grid-LSTM) network, which consists of three steps (two-dimensional grid splitting, local feature projection, and grid sequence modeling), has been widely used in Automatic Speech Recognition (ASR) tasks because of its strong time-frequency modeling ability. However, the network has a serious drawback: it requires heavy computation time. An analysis of its process shows that the cause lies in the last step, where two cross-working LSTMs are employed to model the time-frequency features in the grid. We therefore speed up the Grid-LSTM by using a smaller grid and propose two enhanced models, the Convolutional Grid-LSTM (ConvGrid-LSTM) and the Multichannel ConvGrid-LSTM (MCConvGrid-LSTM), which reduce the grid size along the frequency and time dimensions of the Grid-LSTM, respectively. Along the frequency axis, we use a large frequency stride and embed a CNN in the Grid-LSTM to prevent the resulting performance loss. Along the time axis, we model several adjacent frames jointly using the multichannel processing ability of the CNN. Our method achieves a \(54\%\) relative reduction in training time and a \(19\%\) relative reduction in Word Error Rate (WER) on a character-level End-to-End ASR task.
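To make the grid-size argument concrete, the following is a minimal NumPy sketch of how a larger frequency stride and the grouping of adjacent frames into channels shrink the grid that the two LSTMs must traverse. It illustrates only the grid-splitting arithmetic; it does not implement the LSTMs or the embedded CNN. The function names and all sizes (100 frames, 80 frequency bins, block size 8, strides 2 and 8, 4 channels) are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def split_frequency_blocks(frame, block_size, stride):
    """Split one frame's feature vector into frequency blocks (hypothetical helper)."""
    n_blocks = (len(frame) - block_size) // stride + 1
    return np.stack(
        [frame[i * stride : i * stride + block_size] for i in range(n_blocks)]
    )


def stack_adjacent_frames(features, n_channels):
    """Group n_channels consecutive frames into one multichannel time step."""
    n_steps = features.shape[0] // n_channels
    trimmed = features[: n_steps * n_channels]  # drop leftover frames at the end
    return trimmed.reshape(n_steps, n_channels, features.shape[1])


# Assumed input: 100 frames of 80-dimensional filterbank features.
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 80))

# Frequency axis: a larger stride yields fewer blocks per frame.
small_stride = split_frequency_blocks(features[0], block_size=8, stride=2)
large_stride = split_frequency_blocks(features[0], block_size=8, stride=8)

# Time axis: stacking 4 adjacent frames as channels yields fewer time steps.
stacked = stack_adjacent_frames(features, n_channels=4)

# Number of grid cells the sequence models have to visit in each setting.
baseline_cells = features.shape[0] * small_stride.shape[0]
reduced_cells = stacked.shape[0] * large_stride.shape[0]
print(small_stride.shape, large_stride.shape, stacked.shape)
print(baseline_cells, reduced_cells)
```

With these assumed sizes, the number of frequency blocks per frame drops from 37 to 10 and the number of time steps from 100 to 25, so the grid shrinks from 3700 cells to 250. The speed-up reported in the abstract depends on the authors' actual configuration, which this sketch does not reproduce.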
Acknowledgements
This research was supported by the National Key Research and Development Program of China under Grant 2017YFB1002102 and the National Natural Science Foundation of China under Grant U1736210.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Xue, J., Zheng, T., Han, J. (2019). Convolutional Grid Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_76
DOI: https://doi.org/10.1007/978-3-030-36802-9_76
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36801-2
Online ISBN: 978-3-030-36802-9
eBook Packages: Computer Science, Computer Science (R0)