
Flydeling: Streamlined Performance Models for Hardware Acceleration of CNNs through System Identification

Published: 18 July 2023

Abstract

The introduction of deep learning algorithms such as Convolutional Neural Networks (CNNs) into many near-sensor embedded systems opens new challenges in terms of energy efficiency and hardware performance. An emerging solution to address these challenges is to use tailored heterogeneous hardware accelerators that combine processing elements of different architectural natures, such as Central Processing Unit (CPU), Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), or Application Specific Integrated Circuit (ASIC). To progress towards such heterogeneity, a great asset would be an automated design space exploration tool that chooses, for each accelerated partition of a CNN, the most appropriate architecture given the available resources. To feed such a design space exploration process, models are required that provide very fast yet precise evaluations of alternative architectures or alternative forms of CNNs. Quick estimation of a configuration can be achieved with only a few model parameters identified from representative input sequences. This article studies a solution called flydeling (a contraction of flyweight modeling) that obtains such models by drawing inspiration from the black-box System Identification (SI) domain. We refer to models derived using the proposed approach as flyweight models (flydels).
A methodology is proposed to generate these flydels, using CNN properties as predictor features together with SI techniques driven by a stochastic excitation input at the level of feature map dimensions. For an embedded CPU-FPGA-GPU heterogeneous platform, it is demonstrated that these Key Performance Indicator (KPI) flydels can be learned at an early design stage from high-level application features. For latency, energy, and resource utilization, flydels reach estimation errors between 5% and 10% with fewer model parameters than state-of-the-art solutions, and they are built automatically from platform measurements.
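To make the identification idea concrete, the following minimal sketch illustrates one plausible reading of the methodology: random convolutional layer configurations act as a stochastic excitation, a measurement function returns a KPI for each configuration, and a least-squares fit identifies a small set of model parameters mapping high-level layer features to that KPI. This is an assumption-laden illustration, not the authors' tooling; the feature set, the measure_latency_ms stub, and the linear model form are all hypothetical choices.

```python
# Illustrative flyweight-model identification sketch (assumed workflow,
# not the paper's implementation).
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_layer_config():
    """Stochastic excitation: draw random feature-map and kernel dimensions."""
    return {
        "h": int(rng.integers(8, 225)),      # feature map height
        "w": int(rng.integers(8, 225)),      # feature map width
        "c_in": int(rng.integers(3, 513)),   # input channels
        "c_out": int(rng.integers(8, 513)),  # output channels
        "k": int(rng.choice([1, 3, 5, 7])),  # square kernel size
    }

def features(cfg):
    """High-level CNN properties used as predictors (an illustrative choice):
    multiply-accumulates, input/output volumes, weight count, and a bias term."""
    macs = cfg["h"] * cfg["w"] * cfg["c_in"] * cfg["c_out"] * cfg["k"] ** 2
    in_vol = cfg["h"] * cfg["w"] * cfg["c_in"]
    out_vol = cfg["h"] * cfg["w"] * cfg["c_out"]
    weights = cfg["c_in"] * cfg["c_out"] * cfg["k"] ** 2
    return np.array([macs, in_vol, out_vol, weights, 1.0])

def measure_latency_ms(cfg):
    """Hypothetical stub: on a real platform this would run the layer on the
    target device (CPU, GPU, or FPGA) and time it. A synthetic ground truth
    with measurement noise stands in here."""
    true_coeffs = np.array([2e-9, 4e-8, 3e-8, 1e-7, 0.5])
    return float(features(cfg) @ true_coeffs + rng.normal(0.0, 0.02))

# Identification campaign: excite, measure, then solve least squares.
configs = [sample_layer_config() for _ in range(200)]
X = np.stack([features(c) for c in configs])
y = np.array([measure_latency_ms(c) for c in configs])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Use the identified flydel to predict the KPI of an unseen configuration.
test = sample_layer_config()
print(f"predicted: {features(test) @ coeffs:.3f} ms, "
      f"measured: {measure_latency_ms(test):.3f} ms")
```

A real campaign would replace the stub with timed runs on the target platform and could substitute a nonlinear regressor for the linear fit; the sketch only shows the simplest SI instance of excite, measure, and identify.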


Cited By

  • Automatic CNN Model Partitioning for GPU/FPGA-based Embedded Heterogeneous Accelerators using Geometric Programming. Journal of Signal Processing Systems 95, 10 (2023), 1203–1218. DOI: 10.1007/s11265-023-01898-0. Online publication date: 2 November 2023.



Published In

ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Volume 8, Issue 3
September 2023
140 pages
ISSN: 2376-3639
EISSN: 2376-3647
DOI: 10.1145/3592472

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 July 2023
Online AM: 12 May 2023
Accepted: 04 April 2023
Revised: 04 February 2023
Received: 12 May 2022
Published in TOMPECS Volume 8, Issue 3


Author Tags

  1. Heterogeneous computing
  2. Convolutional Neural Networks
  3. Model performance estimation

Qualifiers

  • Research-article

Funding Sources

  • European Union’s Horizon 2020
  • Marie Skłodowska-Curie


