Research Article | Public Access

Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

Published: 31 January 2023

Abstract

Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck when executing lightweight CNNs on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, 74× speedup over a CPU, and even 1.4× speedup over a power-hungry GPU.
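The channel-wise mapping the abstract describes can be illustrated with a minimal NumPy sketch: because each DwC channel has its own independent filter, the per-channel computation reduces to one im2col matrix times one flattened kernel vector, i.e., a matrix-vector product that a GEMM core can execute. The function name and the valid-padding, stride-1 simplification below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def depthwise_conv_as_gemv(x, w):
    """Depthwise convolution (valid padding, stride 1) expressed as one
    matrix-vector product (GEMV) per channel, sketching the channel-wise
    mapping onto a GEMM core."""
    H, W, C = x.shape
    K, _, Cw = w.shape
    assert C == Cw, "one KxK filter per input channel"
    Ho, Wo = H - K + 1, W - K + 1
    out = np.empty((Ho, Wo, C))
    for c in range(C):  # channels are fully independent in DwC
        # im2col for this channel: one row of K*K samples per output pixel
        patches = np.empty((Ho * Wo, K * K))
        for i in range(Ho):
            for j in range(Wo):
                patches[i * Wo + j] = x[i:i + K, j:j + K, c].ravel()
        # a GEMM core runs this as a (Ho*Wo x K*K) by (K*K) GEMV
        out[:, :, c] = (patches @ w[:, :, c].ravel()).reshape(Ho, Wo)
    return out
```

The C independent GEMVs expose the parallelism the accelerator exploits; in standard convolution the im2col rows instead span all input channels, which is why the two operations map so differently onto the same hardware.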




Published In

cover image ACM Conferences
ASPDAC '23: Proceedings of the 28th Asia and South Pacific Design Automation Conference
January 2023
807 pages
ISBN:9781450397834
DOI:10.1145/3566097


In-Cooperation

  • IPSJ
  • IEEE CAS
  • IEEE CEDA
  • IEICE

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning accelerator
  2. depthwise convolution
  3. lightweight CNN

Qualifiers

  • Research-article



Acceptance Rates

ASPDAC '23 paper acceptance rate: 102 of 328 submissions (31%);
overall acceptance rate: 466 of 1,454 submissions (32%)



Article Metrics

  • Downloads (Last 12 months)258
  • Downloads (Last 6 weeks)27
Reflects downloads up to 13 Dec 2024

Cited By

  • (2024) "Quantitative Performance Analysis of BLAS Libraries on GPU Architectures," Deu Muhendislik Fakultesi Fen ve Muhendislik, vol. 26, no. 76, pp. 40--48, 23 Jan. 2024. DOI: 10.21205/deufmd.2024267606
  • (2024) "MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs," 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pp. 502--512, 7 Jul. 2024. DOI: 10.1109/CLOUD62652.2024.00063
  • (2024) "A Conv-GEMM Reconfigurable Accelerator with WS-RS Dataflow for High Throughput Processing," Electronics Letters, vol. 60, no. 3, 8 Feb. 2024. DOI: 10.1049/ell2.13125
  • (2023) "The Efficiency of Convolution on Gemmini Deep Learning Hardware Accelerator," 2023 IEEE AFRICON, pp. 1--5, 20 Sep. 2023. DOI: 10.1109/AFRICON55910.2023.10293709
