research-article

Deep Learning Inferencing with High-performance Hardware Accelerators

Published: 15 June 2023

Abstract

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum application acceleration. Benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, aim to standardize a fair comparison of different hardware architectures. However, many applications are not well represented by these standards and require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many applications, like real-time video processing, are focused on the latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Because these models are composed of fundamental neural network operations yet are architecturally different from each other, they stress devices in different yet insightful ways, from which generalizations about the performance of other models can be drawn. Many devices featuring ML-specific hardware and optimizations are analyzed, including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, ML-oriented hardware added to the Intel Xeon CPUs helps to boost throughput by 3.7× and to reduce latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices were limited in terms of throughput due to large data transfer times and were not competitive in terms of latency. The FPGA frameworks showcase the lowest latency, with the Xilinx Alveo U200 FPGA achieving 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis-AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs are able to keep critical model data closer to processing elements for lower latency. The massively parallel and high-memory GPU devices with Tensor Core accelerators achieve the best throughput. The NVIDIA Tesla A100 GPU showcases the highest throughput at 42,513 and 52,484 images/second for AlexNet and GoogLeNet, respectively.
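
The abstract distinguishes streaming latency from maximum throughput. The minimal Python sketch below illustrates one way the two metrics can be measured separately. It is not the authors' benchmarking harness: the function names, the batch size of 64, the use of the median, and the `toy_infer` stand-in for a real model are all assumptions made for illustration.

```python
import statistics
import time


def measure_latency_ms(infer, sample, warmup=5, runs=50):
    """Median wall-clock time, in ms, to infer one sample at a time (batch size 1)."""
    for _ in range(warmup):
        infer([sample])  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        infer([sample])
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings)


def measure_throughput(infer, sample, batch_size=64, batches=20):
    """Samples processed per second when inputs are grouped into batches."""
    batch = [sample] * batch_size
    infer(batch)  # warm-up call is not timed
    start = time.perf_counter()
    for _ in range(batches):
        infer(batch)
    elapsed = time.perf_counter() - start
    return batch_size * batches / elapsed


def toy_infer(batch):
    # Hypothetical stand-in for a real model: one summed value per image.
    return [sum(image) for image in batch]


image = [0.5] * 4096  # hypothetical flattened 64x64 input
latency = measure_latency_ms(toy_infer, image)
throughput = measure_throughput(toy_infer, image)
print(f"latency: {latency:.4f} ms, throughput: {throughput:.0f} images/second")
```

Timing the two cases separately matters because a device that excels on large batches can still respond slowly to a single streamed input, which is the trade-off the abstract reports between the GPU and FPGA results.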

Information & Contributors

Information

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 4
August 2023
481 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3596215
  • Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 June 2023
Online AM: 02 May 2023
Accepted: 14 April 2023
Revised: 02 March 2023
Received: 08 February 2022
Published in TIST Volume 14, Issue 4

Author Tags

  1. Neural networks
  2. machine learning
  3. FPGA
  4. inference

Qualifiers

  • Research-article

Funding Sources

  • SHREC industry and agency members and by the IUCRC Program of the National Science Foundation
