research-article
Open access

Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC

Published: 29 May 2020

Abstract

Recent trends in deep convolutional neural networks (DCNNs) have established hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks with DSPs and on-chip memory resources to enable energy-efficient DCNN designs. The main advantage of the Orlando platform is its runtime-configurable convolutional accelerators, which can adapt to different DCNN workloads. This flexibility opens new challenges in mapping the computation onto the accelerators and in managing the on-chip resources efficiently. In this work, we propose a runtime design space exploration and mapping methodology for managing on-chip memory, convolutional accelerators, and external bandwidth at runtime. Experimental results are reported in terms of power/performance scalability, Pareto analysis, mapping adaptivity, and accelerator utilization for the Orlando architecture mapping the VGG-16, Tiny-YOLO (v2), and MobileNet topologies.
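To give a feel for the Pareto analysis and runtime resource budgeting mentioned above, here is a minimal, purely illustrative Python sketch. It is not the paper's methodology or the Orlando toolchain: it assumes each candidate layer mapping has already been characterized by latency, power, on-chip memory footprint, and external bandwidth (all names and numbers below are hypothetical), filters the candidates down to a Pareto front, and then picks the fastest mapping that fits the currently available memory and bandwidth budgets.

```python
# Hypothetical sketch (not the authors' tool): runtime selection of a
# Pareto-optimal DCNN layer mapping under resource budgets.
from dataclasses import dataclass
from typing import List

@dataclass
class Mapping:
    name: str
    latency_ms: float   # execution time of the layer
    power_mw: float     # average power while running
    sram_kb: float      # on-chip memory footprint
    ext_bw_mbs: float   # external (DRAM) bandwidth demand

def dominates(a: Mapping, b: Mapping) -> bool:
    """True if `a` is no worse than `b` on every metric and strictly
    better on at least one (all metrics are minimized)."""
    av = (a.latency_ms, a.power_mw, a.sram_kb, a.ext_bw_mbs)
    bv = (b.latency_ms, b.power_mw, b.sram_kb, b.ext_bw_mbs)
    return all(x <= y for x, y in zip(av, bv)) and any(x < y for x, y in zip(av, bv))

def pareto_front(candidates: List[Mapping]) -> List[Mapping]:
    """Keep only the non-dominated mappings (the Pareto front)."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

def pick_under_budget(front: List[Mapping], sram_budget_kb: float,
                      bw_budget_mbs: float) -> Mapping:
    """Pick the lowest-latency mapping that fits the currently
    available on-chip memory and external-bandwidth budgets."""
    feasible = [m for m in front
                if m.sram_kb <= sram_budget_kb and m.ext_bw_mbs <= bw_budget_mbs]
    if not feasible:
        raise RuntimeError("no mapping fits the current resource budget")
    return min(feasible, key=lambda m: m.latency_ms)

if __name__ == "__main__":
    # Illustrative, made-up numbers for one convolutional layer.
    candidates = [
        Mapping("all-accelerators",  1.2, 310.0, 512.0, 900.0),
        Mapping("half-accelerators", 2.1, 180.0, 256.0, 450.0),
        Mapping("dsp-fallback",      6.5, 140.0, 128.0, 200.0),
        Mapping("wasteful",          2.5, 400.0, 512.0, 950.0),  # dominated
    ]
    front = pareto_front(candidates)
    best = pick_under_budget(front, sram_budget_kb=300.0, bw_budget_mbs=500.0)
    print([m.name for m in front], "->", best.name)
```

In a real runtime manager, the per-mapping characterization would come from the platform's performance and power models rather than hard-coded numbers; the sketch only shows the dominance filtering and budget-constrained selection step.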



Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 2
June 2020
169 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3403597
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 May 2020
Online AM: 07 May 2020
Accepted: 01 January 2020
Revised: 01 November 2019
Received: 01 March 2019
Published in TACO Volume 17, Issue 2


Author Tags

  1. Ultra low-power embedded systems
  2. convolutional neural networks
  3. design space exploration
  4. hardware acceleration

Qualifiers

  • Research-article
  • Research
  • Refereed


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 238
  • Downloads (Last 6 weeks): 36
Reflects downloads up to 11 Dec 2024


Citations

Cited By

  • (2024) Layer-wise Exploration of a Neural Processing Unit Compiler's Optimization Space. Proceedings of the 2024 10th International Conference on Computer Technology Applications, 20-26. https://doi.org/10.1145/3674558.3674562. Online publication date: 15-May-2024.
  • (2024) Digital In-Memory Computing to Accelerate Deep Learning Inference on the Edge. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 130-133. https://doi.org/10.1109/IPDPSW63119.2024.00037. Online publication date: 27-May-2024.
  • (2024) MEPAD: A Memory-Efficient Parallelized Direct Convolution Algorithm for Deep Neural Networks. Euro-Par 2024: Parallel Processing, 167-181. https://doi.org/10.1007/978-3-031-69766-1_12. Online publication date: 26-Aug-2024.
  • (2023) Performance Modeling and Estimation of a Configurable Output Stationary Neural Network Accelerator. 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 89-97. https://doi.org/10.1109/SBAC-PAD59825.2023.00018. Online publication date: 17-Oct-2023.
