Research article · DOI: 10.1145/3123939.3124542

UNFOLD: a memory-efficient speech recognizer using on-the-fly WFST composition

Published: 14 October 2017

Abstract

Accurate, real-time Automatic Speech Recognition (ASR) requires huge amounts of memory and computational power. The main bottleneck in state-of-the-art ASR systems is the Viterbi search on a Weighted Finite State Transducer (WFST). The WFST is a graph-based model created by composing an Acoustic Model (AM) and a Language Model (LM) offline. Offline composition simplifies the implementation of a speech recognizer, as only one WFST has to be searched. However, the composed WFST is huge, typically larger than a Gigabyte, resulting in a large memory footprint and high memory bandwidth requirements.
In this paper, we take a completely different approach and propose a hardware accelerator for speech recognition that composes the AM and LM graphs on-the-fly. In our ASR system, the fully-composed WFST is never generated in main memory. Instead, only the subset required for decoding each input speech fragment is dynamically generated from the AM and LM models. Beyond the direct benefits of on-the-fly composition, the resulting approach is also more amenable to further reductions in storage requirements through compression techniques.
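The idea can be illustrated with a lazy composition sketch: composed states are (AM state, LM state) pairs, and their outgoing arcs are generated on demand by matching AM output labels against LM input labels, so the full product graph never materializes. The graph encoding and names below are hypothetical, not UNFOLD's actual format:

```python
# Lazy (on-the-fly) WFST composition: arcs of a composed state are
# produced on demand instead of precomputing the full product graph.
# Arcs are (input_label, output_label, weight, next_state); an AM arc
# with output None is an epsilon that does not advance the LM.

def lazy_arcs(am, lm, state):
    am_s, lm_s = state
    for in_lab, out_lab, w1, am_d in am.get(am_s, []):
        if out_lab is None:            # epsilon: only the AM moves
            yield in_lab, None, w1, (am_d, lm_s)
        else:
            for in2, out2, w2, lm_d in lm.get(lm_s, []):
                if in2 == out_lab:     # labels match: both graphs advance
                    yield in_lab, out2, w1 + w2, (am_d, lm_d)

# Toy models: the AM emits the word "hello"; the LM accepts it.
am = {0: [("ph1", "hello", 1.0, 1)]}
lm = {0: [("hello", "hello", 2.0, 1)]}
print(list(lazy_arcs(am, lm, (0, 0))))  # [('ph1', 'hello', 3.0, (1, 1))]
```

A decoder built this way only ever expands the pair states its active search hypotheses reach, which is why the fully composed WFST never needs to exist in memory.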
The resulting accelerator, called UNFOLD, performs the decoding in real-time using the compressed AM and LM models, and reduces the size of the datasets from more than one Gigabyte to less than 40 Megabytes, which can be very important in small form factor mobile and wearable devices.
In addition, UNFOLD improves energy efficiency by orders of magnitude with respect to CPUs and GPUs. Compared to a state-of-the-art Viterbi search accelerator, the proposed ASR system provides a 31x reduction in memory footprint and 28% energy savings on average.




Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN:9781450349529
DOI:10.1145/3123939
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. WFST
  2. automatic-speech-recognition (ASR)
  3. hardware accelerator
  4. memory-efficient
  5. on-the-fly composition
  6. viterbi search

Qualifiers

  • Research-article


Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 3
Reflects downloads up to 10 Dec 2024

Cited By

  • (2024) Exploiting beam search confidence for energy-efficient speech recognition. The Journal of Supercomputing 80:17, 24908-24937. DOI: 10.1007/s11227-024-06351-y
  • (2023) I-TAINTED: Identification of Turmeric Adulteration Using the CavIty PerturbatioN Technique and Technology OptimizED Machine Learning. IEEE Access 11, 66456-66466. DOI: 10.1109/ACCESS.2023.3289717
  • (2022) Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition. ACM Transactions on Embedded Computing Systems 21:3, 1-18. DOI: 10.1145/3510028
  • (2020) LAWS: Locality-AWare Scheme for Automatic Speech Recognition. IEEE Transactions on Computers. DOI: 10.1109/TC.2020.2991002
  • (2019) MnnFast. Proceedings of the 46th International Symposium on Computer Architecture, 250-263. DOI: 10.1145/3307650.3322214
  • (2019) Leveraging Run-Time Feedback for Efficient ASR Acceleration. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 462-463. DOI: 10.1109/PACT.2019.00046
  • (2019) MASR: A Modular Accelerator for Sparse RNNs. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1-14. DOI: 10.1109/PACT.2019.00009
  • (2018) The dark side of DNN pruning. Proceedings of the 45th Annual International Symposium on Computer Architecture, 790-801. DOI: 10.1109/ISCA.2018.00071
  • (2018) Computation reuse in DNNs by exploiting input similarity. Proceedings of the 45th Annual International Symposium on Computer Architecture, 57-68. DOI: 10.1109/ISCA.2018.00016
