
MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks

Published: 01 July 2020

Abstract

Recurrent Neural Networks (RNNs) spend most of their execution time performing matrix-vector multiplication (MV-mul). Because the matrices in RNNs exhibit poor reusability and their ever-increasing sizes no longer fit in the on-chip storage of mobile/IoT devices, the performance and energy efficiency of MV-mul are determined by those of main-memory DRAM. Computing MV-mul within DRAM has therefore drawn much attention. However, previous studies did not consider matrix sparsity, the power constraints of DRAM devices, or concurrent DRAM accesses from processors while performing MV-mul. We propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul so that it can concurrently serve memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput than the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 alongside a memory-intensive workload.
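The MV-mul kernel at the heart of this design can be illustrated with a minimal sketch of sparse matrix-vector multiplication. Note that MViD uses its own sparse format and quantization inside DRAM banks; the CSR layout below is only a familiar stand-in for illustration, not the paper's actual format.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    values  - nonzero entries of A, row by row
    col_idx - column index of each nonzero
    row_ptr - row_ptr[r]:row_ptr[r+1] delimits row r's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # One multiply-accumulate (MAC) per stored nonzero -- the operation
        # MViD maps onto per-bank MAC units.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# A = [[2, 0, 0],
#      [0, 0, 3],
#      [1, 4, 0]]  stored in CSR:
values = [2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 0, 1]
row_ptr = [0, 1, 2, 4]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [2.0, 3.0, 5.0]
```

Skipping the zero entries is what makes sparsity pay off: the MAC count scales with the number of nonzeros rather than with the full matrix dimensions, which is why a pruned RNN weight matrix needs far less compute and bandwidth than its dense counterpart.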

References

[1]
M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[2]
D. Amodei, et al., “Deep speech 2: End-to-End speech recognition in english and mandarin,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 173–182.
[3]
A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. 38th IEEE Int. Conf. Acoust. Speech Signal Process., 2013, pp. 6645–6649.
[4]
J. Fowers, et al., “A configurable cloud-scale DNN processor for real-time AI,” in Proc. 45th ACM/IEEE Int. Symp. Comput. Archit., 2018, pp. 1–14.
[5]
N. P. Jouppi, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. 44th ACM/IEEE Int. Symp. Comput. Archit., 2017, pp. 1–12.
[6]
Z. Zhang, C. Xie, J. Wang, W. Zhang, and X. Fu, “Towards memory friendly long-short term memory networks (LSTMs) on mobile GPUs,” in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 162–174.
[7]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[8]
K. Cho, et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2014, pp. 1724–1734.
[9]
M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction,” in Proc. 22nd Int. Symp. High-Perform. Comput. Archit., 2016, pp. 64–76.
[10]
Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in Proc. IEEE Int. Solid-State Circuits Conf., 2016, pp. 262–263.
[11]
S. Han, et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. 25th ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
[12]
M. Horowitz, “Computing's energy problem (and what we can do about it),” in Proc. Int. Solid-State Circuits Conf., 2014, pp. 10–14.
[13]
H. Asghari-Moghaddam, Y. H. Son, J. Ahn, and N. S. Kim, “Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–13.
[14]
A. Farmahini-Farahani, J. Ahn, K. Morrow, and N. S. Kim, “NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules,” in Proc. 21st Int. Symp. High-Perform. Comput. Archit., 2015, pp. 283–295.
[15]
M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and efficient neural network acceleration with 3D memory,” in Proc. 22nd Int. Conf. Architect. Support Program. Lang. Operating Syst., 2017, pp. 751–764.
[16]
H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, “McDRAM: Low latency and energy-efficient matrix computations in DRAM,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 11, pp. 2613–2622, Nov. 2018.
[17]
S. Lee, H. Cho, Y. Son, Y. Ro, N. S. Kim, and J. Ahn, “Leveraging power-performance relationship of energy-efficient modern DRAM devices,” IEEE Access, vol. 6, pp. 31387–31398, 2018.
[18]
S. Han, et al., “EIE: Efficient inference engine on compressed deep neural network,” in Proc. 43rd ACM/IEEE Int. Symp. Comput. Archit., 2016, pp. 243–254.
[19]
S. Zhang, et al., “Cambricon-X: An accelerator for sparse neural networks,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–12.
[20]
F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, and F. Franchetti, “Efficient SpMV operation for large and highly sparse matrices using scalable multi-way merge parallelization,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 347–358.
[21]
J. Doweck, et al., “Inside 6th-generation intel core: New microarchitecture code-named skylake,” IEEE Micro, vol. 37, no. 2, pp. 52–62, Mar./Apr. 2017.
[22]
D. Foley and J. Danskin, “Ultra-performance pascal GPU and NVLink interconnect,” IEEE Micro, vol. 37, no. 2, pp. 7–17, Mar./Apr. 2017.
[23]
A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. II-1764–II-1772.
[24]
Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-End speech recognition using deep RNN models and WFST-based decoding,” in Proc. IEEE Workshop Autom. Speech Recognit. Understanding, 2015, pp. 167–174.
[25]
G. Saon, et al., “English conversational telephone speech recognition by humans and machines,” in Proc. Interspeech, 2017, pp. 132–136.
[26]
K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, “The CAPIO 2017 conversational speech recognition system,” 2018, arXiv:1801.00059.
[27]
C.-C. Chiu, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. 43rd IEEE Int. Conf. Acoust. Speech Signal Process., 2018, pp. 4774–4778.
[28]
N. E. Jerger, A. Kannan, Z. Li, and G. H. Loh, “NoC architectures for silicon interposer systems: Why pay for more wires when you can get them (from Your Interposer) for free?,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., 2014, pp. 458–470.
[29]
R. Oh, et al., “Design technologies for a 1.2V 2.4Gb/s/pin high capacity DDR4 SDRAM with TSVs,” in Proc. IEEE Symp. VLSI Circuits Digest Tech. Papers, 2014, pp. 1–2.
[30]
M. O'Connor, et al., “Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems,” in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchit., 2017, pp. 41–54.
[31]
JEDEC Solid State Technology Association, “Low power double data rate 4 (LPDDR4),” JESD209-4A, 2015.
[32]
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. 40th IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 5206–5210.
[33]
S. Narang, G. Diamos, S. Sengupta, and E. Elsen, “Exploring sparsity in recurrent neural networks,” in Proc. 5th Int. Conf. Learn. Representations, Toulon, France, 2017. [Online]. Available: https://openreview.net/forum?id=BylSPv9gx
[34]
N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented processors,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2009, pp. 1–11.
[35]
E. Lee, J. Chung, D. Jung, S. Lee, S. Li, and J. Ahn, “Work as a team or individual: Characterizing system-level impacts of main memory partitioning,” in Proc. IEEE Int. Symp. Workload Characterization, 2017, pp. 156–166.
[36]
NCSU, “FreePDK45,” 2011. [Online]. Available: https://www.eda.ncsu.edu/wiki/FreePDK45:Contents
[37]
S. Thoziyoor, J. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi, “A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies,” in Proc. 35th ACM/IEEE Int. Symp. Comput. Archit., 2008, pp. 51–62.
[38]
O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems,” in Proc. 35th ACM/IEEE Int. Symp. Comput. Archit., 2008, pp. 63–74.
[39]
J. Ahn, S. Li, S. O, and N. P. Jouppi, “McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2013, pp. 74–85.
[40]
A. Limaye and T. Adegbija, “A workload characterization of the SPEC CPU2017 benchmark suite,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2018, pp. 149–158.
[41]
T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Proc. 10th Int. Conf. Architect. Support Program. Lang. Operating Syst., 2002, pp. 45–57.
[42]
“Deep speech 2 pytorch,” 2017. [Online]. Available: https://github.com/SeanNaren/deepspeech.pytorch
[43]
B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, “CHoNDA: Near data acceleration with concurrent host access,” 2019.
[44]
G. Diamos, et al., “Persistent RNNs: Stashing recurrent weights On-Chip,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 2024–2033.
[45]
F. Zhu, J. Pool, M. Andersch, J. Appleyard, and F. Xie, “Sparse persistent RNNs: Squeezing Large recurrent networks on-chip,” in Proc. Int. Conf. Learn. Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HkxF5RgC-



Published In

IEEE Transactions on Computers, Volume 69, Issue 7, July 2020 (167 pages)

Publisher

IEEE Computer Society, United States

Qualifiers

• Research-article


Cited By

• (2024) “AttAcc! Unleashing the power of PIM for batched transformer-based generative model inference,” in Proc. 29th ACM Int. Conf. Architect. Support Program. Lang. Operating Syst., Vol. 2, pp. 103–119, Apr. 2024. DOI: 10.1145/3620665.3640422
• (2024) “DRAM-based acceleration of open modification search in hyperdimensional space,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 43, no. 9, pp. 2592–2605, Sep. 2024. DOI: 10.1109/TCAD.2024.3382842
• (2023) “AAPP: An accelerative and adaptive path planner for robots on GPU,” IEEE Trans. Comput., vol. 72, no. 8, pp. 2336–2349, Aug. 2023. DOI: 10.1109/TC.2023.3248274
• (2022) “fgSpMSpV: A fine-grained parallel SpMSpV framework on HPC platforms,” ACM Trans. Parallel Comput., vol. 9, no. 2, pp. 1–29, Apr. 2022. DOI: 10.1145/3512770
• (2022) “MeNDA,” in Proc. 49th Annu. Int. Symp. Comput. Archit., pp. 245–258, Jun. 2022. DOI: 10.1145/3470496.3527432
• (2022) “GCIM: Toward efficient processing of graph convolutional networks in 3D-stacked memory,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 41, no. 11, pp. 3579–3590, Nov. 2022. DOI: 10.1109/TCAD.2022.3198320
• (2021) “Accelerating bandwidth-bound deep learning inference with main-memory accelerators,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., pp. 1–14, Nov. 2021. DOI: 10.1145/3458817.3476146
