
MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks

Published: 01 July 2020

Abstract

Recurrent Neural Networks (RNNs) spend most of their execution time performing matrix-vector multiplication (MV-mul). Because the matrices in RNNs exhibit poor reusability and their ever-increasing sizes no longer fit in the on-chip storage of mobile/IoT devices, the performance and energy efficiency of MV-mul are determined by those of main-memory DRAM. Computing MV-mul within DRAM has therefore drawn much attention. However, previous studies did not consider matrix sparsity, the power constraints of DRAM devices, or concurrent DRAM accesses from processors while performing MV-mul. We propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul so that it can concurrently serve memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput than the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 alongside a memory-intensive workload.
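The MV-mul kernel at the heart of this design can be illustrated with a minimal sketch of sparse matrix-vector multiplication. Note that MViD uses its own sparse format and quantization inside DRAM banks; the CSR layout below is only a familiar stand-in for illustration, not the paper's actual format.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    values  - nonzero entries of A, row by row
    col_idx - column index of each nonzero
    row_ptr - row_ptr[r]:row_ptr[r+1] delimits row r's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # One multiply-accumulate (MAC) per stored nonzero -- the operation
        # MViD maps onto per-bank MAC units.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# A = [[2, 0, 0],
#      [0, 0, 3],
#      [1, 4, 0]]  stored in CSR:
values = [2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 0, 1]
row_ptr = [0, 1, 2, 4]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [2.0, 3.0, 5.0]
```

Skipping the zero entries is what makes sparsity pay off: the MAC count scales with the number of nonzeros rather than with the full matrix dimensions, which is why a pruned RNN weight matrix needs far less compute and bandwidth than its dense counterpart.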

References

[1]
M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[2]
D. Amodei, et al., “Deep speech 2: End-to-End speech recognition in english and mandarin,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 173–182.
[3]
A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. 38th IEEE Int. Conf. Acoust. Speech Signal Process., 2013, pp. 6645–6649.
[4]
J. Fowers, et al., “A configurable cloud-scale DNN processor for real-time AI,” in Proc. 45th ACM/IEEE Int. Symp. Comput. Archit., 2018, pp. 1–14.
[5]
N. P. Jouppi, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. 44th ACM/IEEE Int. Symp. Comput. Archit., 2017, pp. 1–12.
[6]
Z. Zhang, C. Xie, J. Wang, W. Zhang, and X. Fu, “Towards memory friendly long-short term memory networks (LSTMs) on mobile GPUs,” in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 162–174.
[7]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[8]
K. Cho, et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2014, pp. 1724–1734.
[9]
M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction,” in Proc. 22nd Int. Symp. High-Perform. Comput. Archit., 2016, pp. 64–76.
[10]
Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in Proc. IEEE Int. Solid-State Circuits Conf., 2016, pp. 262–263.
[11]
S. Han, et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. 25th ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2017, pp. 75–84.
[12]
M. Horowitz, “Computing's energy problem (and what we can do about it),” in Proc. Int. Solid-State Circuits Conf., 2014, pp. 10–14.
[13]
H. Asghari-Moghaddam, Y. H. Son, J. Ahn, and N. S. Kim, “Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–13.
[14]
A. Farmahini-Farahani, J. Ahn, K. Morrow, and N. S. Kim, “NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules,” in Proc. 21st Int. Symp. High-Perform. Comput. Archit., 2015, pp. 283–295.
[15]
M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and efficient neural network acceleration with 3D memory,” in Proc. 22nd Int. Conf. Architect. Support Program. Lang. Operating Syst., 2017, pp. 751–764.
[16]
H. Shin, D. Kim, E. Park, S. Park, Y. Park, and S. Yoo, “McDRAM: Low latency and energy-efficient matrix computations in DRAM,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 11, pp. 2613–2622, Nov. 2018.
[17]
S. Lee, H. Cho, Y. Son, Y. Ro, N. S. Kim, and J. Ahn, “Leveraging power-performance relationship of energy-efficient modern DRAM devices,” IEEE Access, vol. 6, pp. 31387–31398, 2018.
[18]
S. Han, et al., “EIE: Efficient inference engine on compressed deep neural network,” in Proc. 43rd ACM/IEEE Int. Symp. Comput. Archit., 2016, pp. 243–254.
[19]
S. Zhang, et al., “Cambricon-X: An accelerator for sparse neural networks,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–12.
[20]
F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, and F. Franchetti, “Efficient SpMV operation for large and highly sparse matrices using scalable multi-way merge parallelization,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 347–358.
[21]
J. Doweck, et al., “Inside 6th-generation intel core: New microarchitecture code-named skylake,” IEEE Micro, vol. 37, no. 2, pp. 52–62, Mar./Apr. 2017.
[22]
D. Foley and J. Danskin, “Ultra-performance pascal GPU and NVLink interconnect,” IEEE Micro, vol. 37, no. 2, pp. 7–17, Mar./Apr. 2017.
[23]
A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. II-1764–II-1772.
[24]
Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-End speech recognition using deep RNN models and WFST-based decoding,” in Proc. IEEE Workshop Autom. Speech Recognit. Understanding, 2015, pp. 167–174.
[25]
G. Saon, et al., “English conversational telephone speech recognition by humans and machines,” in Proc. Interspeech, 2017, pp. 132–136.
[26]
K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, “The CAPIO 2017 conversational speech recognition system,” 2018, arXiv:1801.00059.
[27]
C.-C. Chiu, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. 43rd IEEE Int. Conf. Acoust. Speech Signal Process., 2018, pp. 4774–4778.
[28]
N. E. Jerger, A. Kannan, Z. Li, and G. H. Loh, “NoC architectures for silicon interposer systems: Why pay for more wires when you can get them (from Your Interposer) for free?,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., 2014, pp. 458–470.
[29]
R. Oh, et al., “Design technologies for a 1.2V 2.4Gb/s/pin high capacity DDR4 SDRAM with TSVs,” in Proc. IEEE Symp. VLSI Circuits Digest Tech. Papers, 2014, pp. 1–2.
[30]
M. O'Connor, et al., “Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems,” in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchit., 2017, pp. 41–54.
[31]
JEDEC Solid State Technology Association, “Low power double data rate 4 (LPDDR4),” JESD209-4A, 2015.
[32]
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. 40th IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 5206–5210.
[33]
S. Narang, G. Diamos, S. Sengupta, and E. Elsen, “Exploring sparsity in recurrent neural networks,” in Proc. 5th Int. Conf. Learn. Representations, Toulon, France, 2017. [Online]. Available: https://openreview.net/forum?id=BylSPv9gx
[34]
N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented processors,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2009, pp. 1–11.
[35]
E. Lee, J. Chung, D. Jung, S. Lee, S. Li, and J. Ahn, “Work as a team or individual: Characterizing system-level impacts of main memory partitioning,” in Proc. IEEE Int. Symp. Workload Characterization, 2017, pp. 156–166.
[36]
NCSU, “FreePDK45,” 2011. [Online]. Available: https://www.eda.ncsu.edu/wiki/FreePDK45:Contents
[37]
S. Thoziyoor, J. Ahn, M. Monchiero, J. B. Brockman, and N. P. Jouppi, “A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies,” in Proc. 35th ACM/IEEE Int. Symp. Comput. Archit., 2008, pp. 51–62.
[38]
O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems,” in Proc. 35th ACM/IEEE Int. Symp. Comput. Archit., 2008, pp. 63–74.
[39]
J. Ahn, S. Li, S. O, and N. P. Jouppi, “McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2013, pp. 74–85.
[40]
A. Limaye and T. Adegbija, “A workload characterization of the SPEC CPU2017 benchmark suite,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2018, pp. 149–158.
[41]
T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,” in Proc. 10th Int. Conf. Architect. Support Program. Lang. Operating Syst., 2002, pp. 45–57.
[42]
“Deep speech 2 pytorch,” 2017. [Online]. Available: https://github.com/SeanNaren/deepspeech.pytorch
[43]
B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, “CHoNDA: Near data acceleration with concurrent host access,” 2019.
[44]
G. Diamos, et al., “Persistent RNNs: Stashing recurrent weights On-Chip,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 2024–2033.
[45]
F. Zhu, J. Pool, M. Andersch, J. Appleyard, and F. Xie, “Sparse persistent RNNs: Squeezing Large recurrent networks on-chip,” in Proc. Int. Conf. Learn. Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HkxF5RgC-



Published In

IEEE Transactions on Computers, Volume 69, Issue 7, July 2020 (167 pages)

Publisher

IEEE Computer Society, United States

Qualifiers

• Research-article


Cited By

• (2024) “AttAcc! Unleashing the power of PIM for batched transformer-based generative model inference,” in Proc. 29th ACM Int. Conf. Architect. Support Program. Lang. Operating Syst., Vol. 2, pp. 103–119, Apr. 2024. DOI: 10.1145/3620665.3640422
• (2024) “DRAM-based acceleration of open modification search in hyperdimensional space,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 43, no. 9, pp. 2592–2605, Sep. 2024. DOI: 10.1109/TCAD.2024.3382842
• (2023) “AAPP: An accelerative and adaptive path planner for robots on GPU,” IEEE Trans. Comput., vol. 72, no. 8, pp. 2336–2349, Aug. 2023. DOI: 10.1109/TC.2023.3248274
• (2022) “fgSpMSpV: A fine-grained parallel SpMSpV framework on HPC platforms,” ACM Trans. Parallel Comput., vol. 9, no. 2, pp. 1–29, Apr. 2022. DOI: 10.1145/3512770
• (2022) “MeNDA,” in Proc. 49th Annu. Int. Symp. Comput. Archit., pp. 245–258, Jun. 2022. DOI: 10.1145/3470496.3527432
• (2022) “GCIM: Toward efficient processing of graph convolutional networks in 3D-stacked memory,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 41, no. 11, pp. 3579–3590, Nov. 2022. DOI: 10.1109/TCAD.2022.3198320
• (2021) “Accelerating bandwidth-bound deep learning inference with main-memory accelerators,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., pp. 1–14, Nov. 2021. DOI: 10.1145/3458817.3476146
