
Binary Precision Neural Network Manycore Accelerator

Published: 05 April 2021

Abstract

This article presents a low-power, programmable, domain-specific manycore accelerator, the Binarized Neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary-precision weight/activation neural network models. Such networks have compact models in which weights are constrained to a single bit, so several weights can be packed into one memory entry, minimizing the memory footprint. Packed weights also lend themselves to single-instruction, multiple-data (SIMD) execution with simple circuitry, maximizing performance and efficiency. The proposed BiNMAC has lightweight cores that support domain-specific instructions and a router-based memory-access architecture that enables efficient implementation of layers in appropriately sized binary-precision weight/activation neural networks. With only 3.73% area and 1.98% average power overhead, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the BiNMAC instruction set architecture; each replaces a frequently used function with a single clock cycle that would otherwise have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory at the bit level, which expedites reshaping intermediate data so that it is well aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm² with an average power of 232 mW at 1-GHz clock frequency and 1.1 V. The 64-cluster architecture occupies 36.5 mm² and, if fully exploited, consumes a total power of 16.4 W and performs 1,360 giga operations per second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies were implemented on BiNMAC: ResNet-20 and LeNet-5 for high-performance image classification, and a ConvNet and a multilayer perceptron for low-power physiological applications. The implementation results indicate that the population-count instruction alone speeds up performance by approximately 5×. When the other new instructions are added to a RISC machine that already has a population-count instruction, performance increases by 58% on average. To compare BiNMAC with commercial off-the-shelf platforms, the case studies, with double-precision floating-point models, are also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%–9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. In low-power settings, and within a margin of ∼3.7%–5.5% accuracy loss compared to the ARM Cortex-A57 CPU implementation, BiNMAC is roughly 9.7×–17.2× (or 38.8×–68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
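
The arithmetic BiNMAC accelerates most heavily is the binary dot product: with weights and activations constrained to +1/-1 and packed as bits, a multiply-accumulate over a whole word of operands reduces to an XNOR followed by a population count, which is the pattern the Combined Population-Count-XNOR instruction collapses into one cycle. The C sketch below is an illustrative software model of that pattern only; the function name, the 16-bit word width, and the use of GCC's __builtin_popcount as a stand-in for the fused instruction are assumptions made for the example, not BiNMAC's actual microarchitecture.

    #include <stdint.h>

    /* Illustrative sketch (not BiNMAC code): dot product of two bit-packed
     * +1/-1 vectors. Bit value 1 encodes +1 and 0 encodes -1, so each word
     * contributes 2*popcount(XNOR(a, b)) - 16, i.e., matches minus mismatches.
     * GCC's __builtin_popcount stands in for the fused Population-Count-XNOR
     * instruction described in the abstract. */
    static inline int binary_dot(const uint16_t *a, const uint16_t *b, int n_words)
    {
        int dot = 0;
        for (int i = 0; i < n_words; i++) {
            uint16_t matches = (uint16_t)~(a[i] ^ b[i]);     /* XNOR */
            dot += 2 * __builtin_popcount(matches) - 16;     /* matches minus mismatches */
        }
        return dot;
    }

On a plain RISC core without a population-count instruction, the per-word count alone becomes a shift-and-mask loop of tens of cycles, which is consistent with the roughly 5× speedup the abstract attributes to the population-count instruction by itself.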
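
The per-core 16×16 bit-block transpose mentioned above reshapes intermediate data so that bits that must be combined end up in the same word. A software reference model of that operation is the classic block-swap bit-matrix transpose (swap 8×8 sub-blocks, then 4×4, 2×2, and 1×1, as in Hacker's Delight); the sketch below is such a reference model, assuming rows are packed most-significant-bit first (bit 15 holds column 0), and is not the hardware implementation itself.

    #include <stdint.h>

    /* Reference model (an assumption, not the BiNMAC circuit) of a 16x16
     * bit-block transpose. A[i] holds row i, packed MSB-first (bit 15 is
     * column 0). Sub-blocks of size 8, 4, 2, 1 are swapped across the
     * diagonal, so the full transpose takes log2(16) = 4 passes. */
    static void transpose16x16(uint16_t A[16])
    {
        uint16_t m = 0x00FF;                 /* selects the lower half of each block */
        for (int j = 8; j != 0; j >>= 1, m ^= (uint16_t)(m << j)) {
            for (int k = 0; k < 16; k = (k + j + 1) & ~j) {
                uint16_t t = (uint16_t)((A[k] ^ (A[k + j] >> j)) & m);
                A[k]     ^= t;
                A[k + j] ^= (uint16_t)(t << j);
            }
        }
    }

Transposing an activation tile this way turns row-wise packed bits into column-wise packed words, so a routine such as the binary_dot sketch above can then stream over contiguous memory entries.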


Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 17, Issue 2
Hardware and Algorithms for Efficient Machine Learning
April 2021, 360 pages
ISSN: 1550-4832
EISSN: 1550-4840
DOI: 10.1145/3446841
Editor: Ramesh Karri

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 April 2021
Accepted: 01 September 2020
Revised: 01 August 2020
Received: 01 May 2020
Published in JETC Volume 17, Issue 2


Author Tags

  1. ASIC
  2. BiNMAC
  3. CPU-GPU
  4. binarized neural network
  5. deep learning
  6. low-power manycore accelerator

Qualifiers

  • Research-article
  • Research
  • Refereed
