Research Article | Public Access

Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

Published: 31 January 2023

Abstract

Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck when executing lightweight CNNs on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, 74× speedup over a CPU, and even 1.4× speedup over a power-hungry GPU.
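The channel-wise mapping the abstract describes can be illustrated with a minimal NumPy sketch: because each DwC channel has its own independent filter, the per-channel computation reduces to one im2col matrix times one flattened kernel vector, i.e., a matrix-vector product that a GEMM core can execute. The function name and the valid-padding, stride-1 simplification below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def depthwise_conv_as_gemv(x, w):
    """Depthwise convolution (valid padding, stride 1) expressed as one
    matrix-vector product (GEMV) per channel, sketching the channel-wise
    mapping onto a GEMM core."""
    H, W, C = x.shape
    K, _, Cw = w.shape
    assert C == Cw, "one KxK filter per input channel"
    Ho, Wo = H - K + 1, W - K + 1
    out = np.empty((Ho, Wo, C))
    for c in range(C):  # channels are fully independent in DwC
        # im2col for this channel: one row of K*K samples per output pixel
        patches = np.empty((Ho * Wo, K * K))
        for i in range(Ho):
            for j in range(Wo):
                patches[i * Wo + j] = x[i:i + K, j:j + K, c].ravel()
        # a GEMM core runs this as a (Ho*Wo x K*K) by (K*K) GEMV
        out[:, :, c] = (patches @ w[:, :, c].ravel()).reshape(Ho, Wo)
    return out
```

The C independent GEMVs expose the parallelism the accelerator exploits; in standard convolution the im2col rows instead span all input channels, which is why the two operations map so differently onto the same hardware.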




Published In

cover image ACM Conferences
ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference
January 2023
807 pages
ISBN:9781450397834
DOI:10.1145/3566097


In-Cooperation

  • IPSJ
  • IEEE CAS
  • IEEE CEDA
  • IEICE

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning accelerator
  2. depthwise convolution
  3. lightweight CNN

Qualifiers

  • Research-article



Acceptance Rates

ASPDAC '23 paper acceptance rate: 102 of 328 submissions (31%);
overall acceptance rate: 466 of 1,454 submissions (32%)



Article Metrics

  • Downloads (Last 12 months)258
  • Downloads (Last 6 weeks)27
Reflects downloads up to 13 Dec 2024

Cited By

  • (2024) "Quantitative Performance Analysis of BLAS Libraries on GPU Architectures," Deu Muhendislik Fakultesi Fen ve Muhendislik, vol. 26, no. 76, pp. 40--48, 23 Jan. 2024. DOI: 10.21205/deufmd.2024267606
  • (2024) "MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs," 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pp. 502--512, 7 Jul. 2024. DOI: 10.1109/CLOUD62652.2024.00063
  • (2024) "A Conv-GEMM Reconfigurable Accelerator with WS-RS Dataflow for High Throughput Processing," Electronics Letters, vol. 60, no. 3, 8 Feb. 2024. DOI: 10.1049/ell2.13125
  • (2023) "The Efficiency of Convolution on Gemmini Deep Learning Hardware Accelerator," 2023 IEEE AFRICON, pp. 1--5, 20 Sep. 2023. DOI: 10.1109/AFRICON55910.2023.10293709
