Research article · DOI: 10.1145/3123939.3124542

UNFOLD: a memory-efficient speech recognizer using on-the-fly WFST composition

Published: 14 October 2017

Abstract

Accurate, real-time Automatic Speech Recognition (ASR) requires huge amounts of memory and computational power. The main bottleneck in state-of-the-art ASR systems is the Viterbi search on a Weighted Finite State Transducer (WFST). The WFST is a graph-based model created by composing an Acoustic Model (AM) and a Language Model (LM) offline. Offline composition simplifies the implementation of a speech recognizer, as only one WFST has to be searched. However, the composed WFST is huge, typically larger than a Gigabyte, resulting in a large memory footprint and high memory bandwidth requirements.
In this paper, we take a completely different approach and propose a hardware accelerator for speech recognition that composes the AM and LM graphs on-the-fly. In our ASR system, the fully-composed WFST is never generated in main memory. Instead, only the subset required for decoding each input speech fragment is dynamically generated from the AM and LM models. Beyond the direct benefits of on-the-fly composition, the resulting approach is also more amenable to further reductions in storage requirements through compression techniques.
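The idea can be illustrated with a lazy composition sketch: composed states are (AM state, LM state) pairs, and their outgoing arcs are generated on demand by matching AM output labels against LM input labels, so the full product graph never materializes. The graph encoding and names below are hypothetical, not UNFOLD's actual format:

```python
# Lazy (on-the-fly) WFST composition: arcs of a composed state are
# produced on demand instead of precomputing the full product graph.
# Arcs are (input_label, output_label, weight, next_state); an AM arc
# with output None is an epsilon that does not advance the LM.

def lazy_arcs(am, lm, state):
    am_s, lm_s = state
    for in_lab, out_lab, w1, am_d in am.get(am_s, []):
        if out_lab is None:            # epsilon: only the AM moves
            yield in_lab, None, w1, (am_d, lm_s)
        else:
            for in2, out2, w2, lm_d in lm.get(lm_s, []):
                if in2 == out_lab:     # labels match: both graphs advance
                    yield in_lab, out2, w1 + w2, (am_d, lm_d)

# Toy models: the AM emits the word "hello"; the LM accepts it.
am = {0: [("ph1", "hello", 1.0, 1)]}
lm = {0: [("hello", "hello", 2.0, 1)]}
print(list(lazy_arcs(am, lm, (0, 0))))  # [('ph1', 'hello', 3.0, (1, 1))]
```

A decoder built this way only ever expands the pair states its active search hypotheses reach, which is why the fully composed WFST never needs to exist in memory.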
The resulting accelerator, called UNFOLD, performs the decoding in real-time using the compressed AM and LM models, and reduces the size of the datasets from more than one Gigabyte to less than 40 Megabytes, which can be very important in small form factor mobile and wearable devices.
In addition, UNFOLD improves energy efficiency by orders of magnitude with respect to CPUs and GPUs. Compared to a state-of-the-art Viterbi search accelerator, the proposed ASR system provides a 31x reduction in memory footprint and 28% energy savings on average.




Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN:9781450349529
DOI:10.1145/3123939
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. WFST
  2. automatic-speech-recognition (ASR)
  3. hardware accelerator
  4. memory-efficient
  5. on-the-fly composition
  6. viterbi search

Qualifiers

  • Research-article


Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 3
Reflects downloads up to 10 Dec 2024

Cited By

  • (2024) Exploiting beam search confidence for energy-efficient speech recognition. The Journal of Supercomputing 80:17, 24908-24937. DOI: 10.1007/s11227-024-06351-y
  • (2023) I-TAINTED: Identification of Turmeric Adulteration Using the CavIty PerturbatioN Technique and Technology OptimizED Machine Learning. IEEE Access 11, 66456-66466. DOI: 10.1109/ACCESS.2023.3289717
  • (2022) Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition. ACM Transactions on Embedded Computing Systems 21:3, 1-18. DOI: 10.1145/3510028
  • (2020) LAWS: Locality-AWare Scheme for Automatic Speech Recognition. IEEE Transactions on Computers. DOI: 10.1109/TC.2020.2991002
  • (2019) MnnFast. Proceedings of the 46th International Symposium on Computer Architecture, 250-263. DOI: 10.1145/3307650.3322214
  • (2019) Leveraging Run-Time Feedback for Efficient ASR Acceleration. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 462-463. DOI: 10.1109/PACT.2019.00046
  • (2019) MASR: A Modular Accelerator for Sparse RNNs. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1-14. DOI: 10.1109/PACT.2019.00009
  • (2018) The dark side of DNN pruning. Proceedings of the 45th Annual International Symposium on Computer Architecture, 790-801. DOI: 10.1109/ISCA.2018.00071
  • (2018) Computation reuse in DNNs by exploiting input similarity. Proceedings of the 45th Annual International Symposium on Computer Architecture, 57-68. DOI: 10.1109/ISCA.2018.00016
