DOI: 10.1145/3289602.3293904
FPGA Conference Proceedings · Research article · Public Access

REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs

Published: 20 February 2019

Abstract

Deep neural networks (DNNs), as the basis of object detection, will play a key role in the development of future autonomous systems with full autonomy. Such autonomous systems have special requirements for real-time, energy-efficient implementations of DNNs on a power-budgeted system. Two research thrusts are dedicated to performance and energy efficiency enhancement of the inference phase of DNNs. The first is model compression techniques; the second is efficient hardware implementations. Recent research on extremely-low-bit CNNs such as the binary neural network (BNN) and XNOR-Net replaces traditional floating-point operations with binary bit operations, significantly reducing memory bandwidth and storage requirements, but suffers non-negligible accuracy loss and wastes digital signal processing (DSP) blocks on FPGAs. To overcome these limitations, this paper proposes REQ-YOLO, a resource-aware, systematic weight quantization framework for object detection, considering both algorithm and hardware resource aspects. We adopt the block-circulant matrix method and propose a heterogeneous weight quantization using the Alternating Direction Method of Multipliers (ADMM), an effective optimization technique for general, non-convex optimization problems. To achieve real-time, highly efficient implementations on FPGA, we present the detailed hardware implementation of block-circulant matrices on CONV layers and develop an efficient processing element (PE) structure supporting the heterogeneous weight quantization, CONV dataflow and pipelining techniques, design optimization, and a template-based automatic synthesis framework to optimally exploit hardware resources. Experimental results show that our proposed REQ-YOLO framework can significantly compress the YOLO model while introducing very small accuracy degradation. The related code is available at: https://github.com/Anonymous788/heterogeneous_ADMM_YOLO.
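The block-circulant compression mentioned above works because a k×k circulant block is fully described by its first column, and multiplying it by a vector reduces to an FFT-based circular convolution: O(k log k) operations instead of O(k²), and k weights stored instead of k². As an illustration of that arithmetic only (the function names and block layout below are ours, not taken from the paper's FPGA implementation), a minimal NumPy sketch:

```python
import numpy as np

def circ(first_col):
    """Dense circulant matrix from its first column: C[i, j] = c[(i - j) % k]."""
    k = len(first_col)
    i, j = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    return first_col[(i - j) % k]

def block_circulant_matvec(C, x):
    """y = W x where W is block-circulant.

    C has shape (p, q, k): C[i, j] holds the first column of the (i, j)
    circulant block; x has length q*k. Each k x k block product is a
    circular convolution, computed via FFT in O(k log k).
    """
    p, q, k = C.shape
    X = np.fft.fft(x.reshape(q, k), axis=1)                 # FFT of each input segment
    Y = np.einsum("ijk,jk->ik", np.fft.fft(C, axis=2), X)   # pointwise products, summed over j
    return np.fft.ifft(Y, axis=1).real.reshape(p * k)

rng = np.random.default_rng(0)
p, q, k = 2, 3, 4
C = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)

# Reference: expand to the dense (p*k) x (q*k) matrix and multiply directly.
W = np.block([[circ(C[i, j]) for j in range(q)] for i in range(p)])
assert np.allclose(block_circulant_matvec(C, x), W @ x)
```

The FFT → pointwise multiply → accumulate → IFFT pipeline above is presumably what the paper's PE structure realizes in hardware for CONV layers; the NumPy version serves only as the reference arithmetic.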
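ADMM makes the non-convex quantization constraint tractable by splitting the problem: an unconstrained update of the weights W, a Euclidean projection of an auxiliary copy Z onto the quantization set, and a dual update U tying the two together. The sketch below is a generic single-loss illustration of that pattern, not the paper's heterogeneous per-layer scheme; the toy loss, the level set, and all hyperparameters are invented for the example:

```python
import numpy as np

def project_quantized(V, levels):
    """Euclidean projection: snap each entry of V to the nearest allowed level."""
    levels = np.asarray(levels)
    idx = np.abs(V[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def admm_quantize(W0, loss_grad, levels, rho=1e-2, lr=1e-2, iters=300, inner=20):
    """Sketch of ADMM weight quantization.

    Alternates:
      W-update: gradient steps on f(W) + (rho/2)||W - Z + U||^2
      Z-update: projection of W + U onto the quantization set (exact)
      U-update: dual ascent, U += W - Z
    """
    W = W0.copy()
    Z = project_quantized(W, levels)
    U = np.zeros_like(W)
    for _ in range(iters):
        for _ in range(inner):                        # approximate W-minimization
            W -= lr * (loss_grad(W) + rho * (W - Z + U))
        Z = project_quantized(W + U, levels)          # Z-minimization
        U += W - Z                                    # dual update
    return project_quantized(W, levels)               # hard-quantize at the end

# Toy problem: quadratic loss pulling weights toward a target vector.
target = np.array([0.53, -0.24, 0.06, -0.98])
levels = [0.0, 0.25, -0.25, 0.5, -0.5, 1.0, -1.0]
Wq = admm_quantize(np.zeros(4), lambda W: W - target, levels)
# Every entry of Wq lies in `levels` while tracking the unconstrained optimum.
```

With a small penalty rho the W-update stays close to the unconstrained minimizer, so the final hard projection recovers the nearest level per weight; the paper's contribution is choosing different quantization sets per layer under FPGA resource constraints, which this toy omits.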



Published In

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2019
360 pages
ISBN:9781450361378
DOI:10.1145/3289602
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. admm
  2. compression
  3. fpga
  4. object detection
  5. yolo

Qualifiers

  • Research-article

Conference

FPGA '19

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 578
  • Downloads (last 6 weeks): 80
Reflects downloads up to 22 Jan 2025

Cited By

  • (2024) Hardware Acceleration for Object Detection using YOLOv5 Deep Learning Algorithm on Xilinx Zynq FPGA Platform. Engineering, Technology & Applied Science Research 14(1), 13066–13071. DOI: 10.48084/etasr.6761. Published 8 Feb 2024.
  • (2024) Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2440–2449. DOI: 10.1109/WACV57701.2024.00244. Published 3 Jan 2024.
  • (2024) EDCompress: Energy-Aware Model Compression for Dataflows. IEEE Transactions on Neural Networks and Learning Systems 35(1), 208–220. DOI: 10.1109/TNNLS.2022.3172941. Published Jan 2024.
  • (2024) A Low-Latency FPGA Accelerator for YOLOv3-Tiny With Flexible Layerwise Mapping and Dataflow. IEEE Transactions on Circuits and Systems I: Regular Papers 71(3), 1158–1171. DOI: 10.1109/TCSI.2023.3335949. Published Mar 2024.
  • (2024) Enhancing Real-time Inference Performance for Time-Critical Software-Defined Vehicles. IEEE International Conference on Mobility, Operations, Services and Technologies (MOST), 101–113. DOI: 10.1109/MOST60774.2024.00019. Published 1 May 2024.
  • (2024) Improvement and Hardware Design of Image Denoising Algorithm Based on Deep Learning. 9th International Conference on Integrated Circuits and Microsystems (ICICM), 671–676. DOI: 10.1109/ICICM63644.2024.10814594. Published 25 Oct 2024.
  • (2024) Gelan-SE: Squeeze and Stimulus Attention Based Target Detection Network for Gelan Architecture. IEEE Access 12, 182259–182273. DOI: 10.1109/ACCESS.2024.3462725. Published 2024.
  • (2024) Near-Edge Computing Aware Object Detection: A Review. IEEE Access 12, 2989–3011. DOI: 10.1109/ACCESS.2023.3347548. Published 2024.
  • (2024) Global to multi-scale local architecture with hardwired CNN for 1-ms tomato defect detection. IET Image Processing 18(8), 2078–2092. DOI: 10.1049/ipr2.13084. Published 19 Mar 2024.
  • (2024) EfficientBioAI: making bioimaging AI models efficient in energy and latency. Nature Methods 21(3), 368–369. DOI: 10.1038/s41592-024-02167-z. Published 24 Jan 2024.
