
RiSA: A Reinforced Systolic Array for Depthwise Convolutions and Embedded Tensor Reshaping

Published: 17 September 2021

Abstract

Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computation load and the number of parameters compared to conventional convolution layers. Many deep neural network (DNN) accelerators adopt architectures that exploit the high data-reuse factor of DNN computations, such as systolic arrays. However, depthwise convolutions have a low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts PE utilization for depthwise convolutions on a systolic array with minimal overhead.
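To make the computation and reuse gap concrete, here is a minimal cost-model sketch (an illustration, not taken from the paper; the layer shape is a hypothetical example) comparing a standard convolution with the depthwise-separable combination that MobileNet-style networks use. Note that the depthwise stage also removes the input-channel reduction dimension, which is exactly the dimension a GEMM-style systolic array exploits for data reuse.

```python
# Illustrative cost model (hypothetical layer shapes, not from the paper).
# Compares MACs and weights of a standard KxK convolution against a
# depthwise KxK convolution followed by a 1x1 pointwise convolution.

def standard_conv_cost(h, w, c_in, c_out, k):
    """MACs and weight count for a k x k standard convolution."""
    macs = h * w * c_in * c_out * k * k
    params = c_in * c_out * k * k
    return macs, params

def depthwise_separable_cost(h, w, c_in, c_out, k):
    """MACs and weight count for depthwise (one k x k filter per input
    channel) plus pointwise (a 1x1 convolution that mixes channels)."""
    dw_macs = h * w * c_in * k * k
    pw_macs = h * w * c_in * c_out
    params = c_in * k * k + c_in * c_out
    return dw_macs + pw_macs, params

# Hypothetical example: 112x112 feature map, 32 -> 64 channels, 3x3 kernel.
std_macs, std_params = standard_conv_cost(112, 112, 32, 64, 3)
sep_macs, sep_params = depthwise_separable_cost(112, 112, 32, 64, 3)
print(f"standard : {std_macs:,} MACs, {std_params:,} weights")
print(f"separable: {sep_macs:,} MACs, {sep_params:,} weights")
print(f"MAC reduction: {std_macs / sep_macs:.1f}x")
```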
In addition, the PEs in a systolic array can be used efficiently only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange data items and manage data movement during DNN computations. RiSA instead provides a lightweight set of tensor management tasks within the PE array itself, eliminating the need for a separate tensor reshaping module. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models, while maintaining high area efficiency.
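For context on the reshaping problem, the sketch below (a NumPy illustration of the conventional off-array approach, not RiSA's embedded mechanism; the helper name and shapes are hypothetical) shows the im2col rearrangement that a GEMM-style systolic array typically needs before each convolution layer. RiSA's contribution is to absorb this class of data-movement task into the PE array itself.

```python
import numpy as np

def im2col_rows(x, k, stride=1):
    """Unfold a (C, H, W) activation tensor into a matrix with one row per
    output pixel, the layout a GEMM-style systolic array consumes.
    (Hypothetical helper for illustration; assumes no padding.)"""
    c, h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((out_h * out_w, c * k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Copy the k x k receptive field across all channels into one row.
            patch = x[:, i*stride:i*stride + k, j*stride:j*stride + k]
            cols[i * out_w + j] = patch.ravel()
    return cols

x = np.arange(3 * 6 * 6, dtype=np.float32).reshape(3, 6, 6)
m = im2col_rows(x, k=3)
print(m.shape)  # (16, 27): 4x4 output pixels, 3*3*3 weight taps per pixel
```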
Compared to Eyeriss v2, RiSA improves the area and energy efficiency for MobileNet-V1 inference by 1.91× and 1.31×, respectively.

References

[1]
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA).
[2]
Yu-Hsin Chen, Tien-Ju Yang, Joel S. Emer, and Vivienne Sze. 2019. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 292–308.
[3]
Hyungmin Cho, Pyeongseok Oh, Jiyoung Park, Wookeun Jung, and Jaejin Lee. 2019. FA3C: FPGA-accelerated deep reinforcement learning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[4]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018).
[5]
Hasan Genc et al. 2019. Gemmini: An agile systolic array generator enabling systematic evaluations of deep-learning architectures. arXiv:1911.09925 (2019).
[6]
Sumanth Gudaparthi et al. 2019. Wire-aware architecture and dataflow for CNN accelerators. In Proceedings of the Annual International Symposium on Microarchitecture (MICRO).
[7]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
[8]
Andrew Howard et al. 2019. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[9]
Benoit Jacob et al. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
[10]
Norman P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA).
[11]
H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2019. Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). https://doi.org/10.1145/3297858.3304028
[12]
Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, and Tushar Krishna. 2019. Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[13]
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2017. Rethinking NoCs for spatial neural network accelerators. In Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip (NOCS).
[14]
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling flexible dataflow mapping over DNN accelerators via programmable interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[15]
Renjie Liu. 2020. Higher accuracy on vision models with EfficientNet-Lite. https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html.
[16]
Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA).
[17]
NVIDIA. 2017. NVDLA Deep Learning Accelerator. http://nvdla.org.
[18]
Dianne P. O’Leary. 1987. Systolic arrays for matrix transpose and other reorderings. IEEE Trans. Comput. 36, 1 (Jan. 1987), 117–122.
[19]
Junhao Pan and Deming Chen. 2021. Accelerate non-unit stride convolutions with Winograd algorithms. In Proceedings of the 26th Asia and South Pacific Design Automation Conference (ASP-DAC).
[20]
Angshuman Parashar et al. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of the International Symposium on Computer Architecture (ISCA).
[21]
Eric Qin et al. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). https://doi.org/10.1109/HPCA47549.2020.00015
[22]
Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv:1811.02883 (2018).
[23]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
[24]
Yakun Sophia Shao et al. 2019. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25]
Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML).
[26]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS).
[27]
Xuechao Wei et al. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference (DAC). https://doi.org/10.1145/3061639.3062207
[28]
Juan Yepez and Seok-Bum Ko. 2020. Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 4 (2020), 853–863.




Published In

ACM Transactions on Embedded Computing Systems, Volume 20, Issue 5s
Special Issue ESWEEK 2021, CASES 2021, CODES+ISSS 2021 and EMSOFT 2021
October 2021, 1367 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3481713
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 17 September 2021
Accepted: 01 July 2021
Revised: 01 June 2021
Received: 01 April 2021
Published in TECS Volume 20, Issue 5s

Author Tags

  1. Accelerators
  2. deep neural networks
  3. depthwise convolution

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Institute of Information & Communications Technology Planning & Evaluation (IITP)
  • Korea government (MSIT): Research on CPU vulnerability detection and validation; Development of high-speed encryption data processing technology that guarantees privacy based on hardware
  • IC Design Education Center (IDEC), Korea


Cited By

  • (2024) Frequency-Domain and Spatial-Domain MLMVN-Based Convolutional Neural Networks. Algorithms 17, 8, Article 361. https://doi.org/10.3390/a17080361. Online publication date: 17-Aug-2024.
  • (2024) BiRD: Bi-Directional Input Reuse Dataflow for Enhancing Depthwise Convolution Performance on Systolic Arrays. IEEE Transactions on Computers 73, 12, 2708–2721. https://doi.org/10.1109/TC.2024.3449103. Online publication date: Dec-2024.
  • (2023) A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Computing Surveys 56, 1, 1–37. https://doi.org/10.1145/3604802. Online publication date: 25-Aug-2023.
  • (2023) Accelerating Attention Mechanism on FPGAs based on Efficient Reconfigurable Systolic Array. ACM Transactions on Embedded Computing Systems 22, 6, 1–22. https://doi.org/10.1145/3549937. Online publication date: 9-Nov-2023.
  • (2023) A High-Throughput Full-Dataflow MobileNetv2 Accelerator on Edge FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 5, 1532–1545. https://doi.org/10.1109/TCAD.2022.3198246. Online publication date: 1-May-2023.
  • (2023) Morphable CIM: Improving Operation Intensity and Depthwise Capability for SRAM-CIM Architecture. In 2023 60th ACM/IEEE Design Automation Conference (DAC), 1–6. https://doi.org/10.1109/DAC56929.2023.10247750. Online publication date: 9-Jul-2023.
  • (2022) FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 846–860. https://doi.org/10.1109/HPCA53966.2022.00067. Online publication date: Apr-2022.
  • (2022) U-Boost NAS: Utilization-Boosted Differentiable Neural Architecture Search. In Computer Vision – ECCV 2022, 173–190. https://doi.org/10.1007/978-3-031-19775-8_11. Online publication date: 23-Oct-2022.
