
XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Published: 13 March 2024

Abstract

Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends toward higher accuracy and higher resolution produce larger networks, making computation and I/O requirements the key bottlenecks. In this article, we propose XVDPU, an AI Engine (AIE)-based CNN accelerator on Versal chips that meets heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques that improve data reuse and reduce I/O traffic. We further propose an arithmetic logic unit (ALU) that better balances resource utilization, new-feature support, and overall system efficiency. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1,653 frames per second (FPS) for ResNet50 on VCK190, 9.8× faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4,050 FPS. We also propose a tiling strategy that achieves feature-map-stationary execution for high-definition CNNs on the accelerator, yielding a 3.8× FPS improvement on the residual channel attention network and 3.1× on super-efficient super-resolution. The accelerator also handles the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all optimizations.
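The abstract describes the feature-map-stationary tiling strategy only at a high level. As a rough illustration of why such tiling helps for high-definition inputs, the sketch below is a minimal first-order off-chip-traffic model; the function name, the cost model, and the layer-fusion framing are our assumptions for illustration, not details taken from the paper:

```python
def dram_traffic(n_layers, fmap_elems, weight_elems, fused, n_tiles=1):
    """Approximate off-chip elements moved for a chain of equal-size conv layers.

    fused=False: each intermediate feature map is spilled to DRAM and read
    back between layers (a high-definition map is too large for on-chip RAM).
    fused=True : the map is split into n_tiles tiles and each tile is carried
    through all layers on chip (feature-map-stationary), at the cost of
    re-fetching the weights once per tile. Tile halo overlap is ignored.
    """
    if not fused:
        spills = 2 * (n_layers - 1) * fmap_elems  # write + read per layer boundary
        return 2 * fmap_elems + spills + n_layers * weight_elems
    return 2 * fmap_elems + n_tiles * n_layers * weight_elems


# Hypothetical workload: a 1080p feature map with 64 channels pushed through
# eight 3x3 conv layers (64 -> 64 channels), split into 32 tiles when fused.
hd_map = 1080 * 1920 * 64
weights = 64 * 64 * 3 * 3
baseline = dram_traffic(8, hd_map, weights, fused=False)
tiled = dram_traffic(8, hd_map, weights, fused=True, n_tiles=32)
print(f"traffic reduction: {baseline / tiled:.1f}x")
```

Under these assumed parameters the model predicts a several-fold traffic reduction, consistent in spirit with the 3.8× and 3.1× FPS gains the abstract reports for the super-resolution workloads, where intermediate feature maps dominate memory traffic.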


Cited By

  • (2024) Efficient I/O Performance-Focused Scheduling in High-Performance Computing. Applied Sciences 14, 21 (2024), 10043. DOI: 10.3390/app142110043. Online publication date: 4-Nov-2024.
  • (2024) A Study on Number Theoretic Transform Acceleration on AMD AI Engine. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 325–331. DOI: 10.1109/MCSoC64144.2024.00060. Online publication date: 16-Dec-2024.
  • (2024) Tensor Extreme Learning Network for Hyperspectral Image Classification. 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), 697–700. DOI: 10.1109/CISCE62493.2024.10653316. Online publication date: 10-May-2024.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 2
June 2024, 464 pages
EISSN: 1936-7414
DOI: 10.1145/3613550
Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 March 2024
Online AM: 13 September 2023
Accepted: 11 August 2023
Revised: 28 April 2023
Received: 07 December 2022
Published in TRETS Volume 17, Issue 2


Author Tags

  1. ACAP
  2. acceleration
  3. AI Engine
  4. ALU engine
  5. CNN
  6. FPGA
  7. hardware heterogeneous architecture
  8. Versal

Qualifiers

  • Research-article

