
Convolutional Grid Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition

  • Conference paper
Neural Information Processing (ICONIP 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1143)


Abstract

The Grid Long Short-Term Memory (Grid-LSTM), which consists of three steps (two-dimensional grid splitting, local feature projection, and grid sequence modeling), has been widely used in Automatic Speech Recognition (ASR) tasks because of its strong time-frequency modeling ability. However, the network requires heavy computation time. An analysis of its process shows that the cause lies in the last step, where two cross-working LSTMs are employed to model the time-frequency features in the grid. We therefore speed up the Grid-LSTM by using a smaller grid, and propose two enhanced Grid-LSTM models, Convolutional Grid-LSTM (ConvGrid-LSTM) and Multichannel ConvGrid-LSTM (MCConvGrid-LSTM), which reduce the grid size along the two dimensions of the Grid-LSTM, respectively. Along the frequency axis, we use a large frequency stride and embed a CNN in the Grid-LSTM to prevent the resulting performance loss. Along the time axis, we model several adjacent frames jointly using the multichannel processing ability of the CNN. Our method achieves a 54% relative reduction in training time and a 19% relative reduction in Word Error Rate (WER) on a character-level end-to-end ASR task.
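To illustrate why a larger frequency stride and grouped frames shrink the grid, the following minimal sketch counts the grid cells that the sequence-modeling step would have to visit. It is not the authors' implementation: it assumes the grid is formed by a plain sliding window over the frequency axis, and the function names, the 40-dimensional feature vector, the block size of 8, and the `frames_per_step` parameter are all hypothetical values chosen for illustration.

```python
def num_frequency_blocks(feature_dim, block_size, stride):
    """Number of frequency blocks produced by a sliding window.

    Assumption: blocks of `block_size` bins are taken every `stride` bins,
    and incomplete trailing blocks are dropped.
    """
    if block_size <= 0 or stride <= 0:
        raise ValueError("block_size and stride must be positive")
    if block_size > feature_dim:
        raise ValueError("block_size must not exceed feature_dim")
    return (feature_dim - block_size) // stride + 1


def split_frequency_grid(frame, block_size, stride):
    """Split one feature frame (a list of bins) into frequency blocks."""
    count = num_frequency_blocks(len(frame), block_size, stride)
    return [frame[i * stride:i * stride + block_size] for i in range(count)]


def grid_cells(num_frames, feature_dim, block_size, stride, frames_per_step=1):
    """Total grid cells: time steps times frequency blocks.

    `frames_per_step` is a hypothetical knob: grouping adjacent frames
    into one step (e.g. as input channels) divides the number of time steps.
    """
    time_steps = num_frames // frames_per_step
    return time_steps * num_frequency_blocks(feature_dim, block_size, stride)


if __name__ == "__main__":
    # Hypothetical setup: 100 frames of 40-dimensional features, block size 8.
    print(grid_cells(100, 40, 8, stride=2))                     # small stride
    print(grid_cells(100, 40, 8, stride=8))                     # large stride
    print(grid_cells(100, 40, 8, stride=8, frames_per_step=2))  # grouped frames
```

Under these assumed values, raising the stride from 2 to 8 reduces the frequency blocks per frame from 17 to 5, and grouping two frames per step halves the number of time steps. The sketch only counts cells; how much training time this saves in practice depends on the model and hardware.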



Acknowledgements

This research was supported by the National Key Research and Development Program of China under Grant 2017YFB1002102 and National Natural Science Foundation of China under Grant U1736210.

Author information

Corresponding author

Correspondence to Jiqing Han.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Xue, J., Zheng, T., Han, J. (2019). Convolutional Grid Long Short-Term Memory Recurrent Neural Network for Automatic Speech Recognition. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_76


  • DOI: https://doi.org/10.1007/978-3-030-36802-9_76


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36801-2

  • Online ISBN: 978-3-030-36802-9

  • eBook Packages: Computer Science, Computer Science (R0)
