
XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Published: 13 March 2024

Abstract

Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends toward higher accuracy and higher resolution produce larger networks, making computation and I/O requirements the key bottlenecks. In this article, we propose XVDPU, an AI Engine (AIE)-based CNN accelerator on Versal chips that meets heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques that improve data reuse and reduce I/O traffic. We further propose an arithmetic logic unit (ALU) that better balances resource utilization, new-feature support, and overall system efficiency. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1,653 frames per second (FPS) for ResNet50 on VCK190, 9.8× faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4,050 FPS. We also propose a tiling strategy that achieves feature-map-stationary execution for high-definition CNNs on the accelerator, yielding a 3.8× FPS improvement on the residual channel attention network and 3.1× on super-efficient super-resolution. The accelerator also handles the 3D convolution task in disparity estimation, achieving end-to-end performance of 10.1 FPS with all optimizations.
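The abstract describes the feature-map-stationary tiling strategy only at a high level. As a rough illustration of why such tiling helps for high-definition inputs, the sketch below is a minimal first-order off-chip-traffic model; the function name, the cost model, and the layer-fusion framing are our assumptions for illustration, not details taken from the paper:

```python
def dram_traffic(n_layers, fmap_elems, weight_elems, fused, n_tiles=1):
    """Approximate off-chip elements moved for a chain of equal-size conv layers.

    fused=False: each intermediate feature map is spilled to DRAM and read
    back between layers (a high-definition map is too large for on-chip RAM).
    fused=True : the map is split into n_tiles tiles and each tile is carried
    through all layers on chip (feature-map-stationary), at the cost of
    re-fetching the weights once per tile. Tile halo overlap is ignored.
    """
    if not fused:
        spills = 2 * (n_layers - 1) * fmap_elems  # write + read per layer boundary
        return 2 * fmap_elems + spills + n_layers * weight_elems
    return 2 * fmap_elems + n_tiles * n_layers * weight_elems


# Hypothetical workload: a 1080p feature map with 64 channels pushed through
# eight 3x3 conv layers (64 -> 64 channels), split into 32 tiles when fused.
hd_map = 1080 * 1920 * 64
weights = 64 * 64 * 3 * 3
baseline = dram_traffic(8, hd_map, weights, fused=False)
tiled = dram_traffic(8, hd_map, weights, fused=True, n_tiles=32)
print(f"traffic reduction: {baseline / tiled:.1f}x")
```

Under these assumed parameters the model predicts a several-fold traffic reduction, consistent in spirit with the 3.8× and 3.1× FPS gains the abstract reports for the super-resolution workloads, where intermediate feature maps dominate memory traffic.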


Cited By

  • (2024) Efficient I/O Performance-Focused Scheduling in High-Performance Computing. Applied Sciences 14, 21 (2024), 10043. DOI: 10.3390/app142110043. Online publication date: 4-Nov-2024.
  • (2024) A Study on Number Theoretic Transform Acceleration on AMD AI Engine. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 325–331. DOI: 10.1109/MCSoC64144.2024.00060. Online publication date: 16-Dec-2024.
  • (2024) Tensor Extreme Learning Network for Hyperspectral Image Classification. 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), 697–700. DOI: 10.1109/CISCE62493.2024.10653316. Online publication date: 10-May-2024.


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 2
June 2024, 464 pages
EISSN: 1936-7414
DOI: 10.1145/3613550
Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 March 2024
Online AM: 13 September 2023
Accepted: 11 August 2023
Revised: 28 April 2023
Received: 07 December 2022
Published in TRETS Volume 17, Issue 2


Author Tags

  1. ACAP
  2. acceleration
  3. AI Engine
  4. ALU engine
  5. CNN
  6. FPGA
  7. hardware heterogeneous architecture
  8. Versal

Qualifiers

  • Research-article

