Early-Stage Neural Network Hardware Performance Analysis
Figure 1. Area vs. bitwidth for a 3×3 PE with a single input and output channel. All of the weights and activations use the same bitwidth, and the accumulator width is four bits larger, which is enough to store the result. The quadratic fit is A = 12.39b² + 86.07b − 14.02 with a goodness of fit R² = 0.9999877, where A is the area and b is the bitwidth of the PE.

Figure 2. Our 3×3 kernel 8-bit processing engine (PE) layout using TSMC 28 nm technology. The carry-save adder can fit 12-bit numbers, which is large enough to store the output of the convolution.

Figure 3. Area vs. BOPS for a 3×3 PE with a single input and output channel and variable bitwidth. The linear fit is A = 1.694B + 153.46 with a goodness of fit R² = 0.998, where A is the area and B is BOPS.
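For quick early-stage estimates, the two fits above can be evaluated directly. The snippet below is a minimal sketch using only the coefficients quoted in the captions of Figures 1 and 3; the area unit is whatever the synthesis flow reported, and the fits are only meaningful over the bitwidths that were actually synthesized.

```python
# Evaluate the PE area fits quoted in the captions of Figures 1 and 3.
# Coefficients come from the captions; area units follow the synthesis reports.

def pe_area_from_bitwidth(b: float) -> float:
    """Quadratic fit of Figure 1: A = 12.39*b^2 + 86.07*b - 14.02."""
    return 12.39 * b ** 2 + 86.07 * b - 14.02


def pe_area_from_bops(bops: float) -> float:
    """Linear fit of Figure 3: A = 1.694*B + 153.46."""
    return 1.694 * bops + 153.46


if __name__ == "__main__":
    for bits in (2, 4, 8, 16):
        print(f"{bits:2d}-bit 3x3 PE: estimated area {pe_area_from_bitwidth(bits):8.1f}")
```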
Figure 4. Area vs. BOPS for a 3×3 PE with variable input (n) and output (m) feature dimensions, and variable bitwidth. Weights and activations use the same bitwidth, and the accumulator width is set to log2(9m) · b_w · b_a.

Figure 5. Per-layer memory access pattern.

Figure 6. SRAM area as a function of memory bits. The data were taken from the Synopsys 28 nm Educational Design Kit SRAM specifications. (a) Single-port RAM area (A) vs. number of data bits (B). The linear fit is A = 2.94B + 3065 with a goodness of fit R² = 0.986. (b) Dual-port RAM area (A) vs. number of data bits (B). The linear fit is A = 4.16B + 3535 with a goodness of fit R² = 0.916.
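The SRAM fits in Figure 6 lend themselves to the same treatment. The sketch below simply evaluates the two linear models from the caption; the 16 KiB buffer size is a made-up example, not a design point from the paper.

```python
# Evaluate the SRAM area fits of Figure 6 (Synopsys 28 nm Educational Design Kit).

def sram_area(bits: int, dual_port: bool = False) -> float:
    """Estimated SRAM macro area for a given number of data bits."""
    if dual_port:
        return 4.16 * bits + 3535  # Figure 6b: dual-port fit
    return 2.94 * bits + 3065      # Figure 6a: single-port fit


if __name__ == "__main__":
    buffer_bits = 16 * 1024 * 8  # hypothetical 16 KiB on-chip buffer
    print(f"single-port: {sram_area(buffer_bits):,.0f}")
    print(f"dual-port:   {sram_area(buffer_bits, dual_port=True):,.0f}")
```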
Figure 7. Roofline example. In the case of App1, memory bandwidth prevents the program from achieving its expected performance. In the case of App2, the same happens due to limited computational resources. Finally, App3 represents a program that could achieve its maximum performance on a given system.

Figure 8. OPS roofline: 3×3 kernel, input and output have 256 features of 14×14 pixels, a 1 mm² accelerator with an 800 MHz frequency, and a 2.4 GHz DDR with a 64-bit data bus.

Figure 9. OPS roofline: 3×3 kernel, input and output have 64 features of 56×56 pixels, a 6 mm² accelerator with a 100 MHz frequency, and a 2.4 GHz DDR with a 64-bit data bus.
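The rooflines of Figures 7–9 follow the standard formulation: attainable throughput is the minimum of the compute peak and the product of memory bandwidth and operational intensity. The sketch below is a generic illustration of that model, not the authors' tooling; the bandwidth is derived from the 2.4 GHz, 64-bit DDR bus quoted in the captions (treating 2.4 GHz as the transfer rate), and the compute peak is left as a free parameter because it depends on the chosen PE count and clock.

```python
# Textbook roofline in the spirit of Figures 7-9:
# attainable = min(compute peak, memory bandwidth * operational intensity).

def ddr_bandwidth_bytes_per_s(transfer_rate_hz: float, bus_bits: int) -> float:
    """Peak DRAM bandwidth from transfer rate and bus width."""
    return transfer_rate_hz * bus_bits / 8


def attainable_ops(peak_ops: float, bandwidth: float, intensity: float) -> float:
    """Roofline-limited throughput; intensity is operations per byte of DRAM traffic."""
    return min(peak_ops, bandwidth * intensity)


if __name__ == "__main__":
    bw = ddr_bandwidth_bytes_per_s(2.4e9, 64)  # 19.2 GB/s, as in Figures 8 and 9
    peak = 5e12                                # example compute peak: 5 TOPS
    for oi in (1, 10, 100, 1000):              # operations per byte
        print(f"intensity {oi:4d} ops/byte -> {attainable_ops(peak, bw, oi) / 1e9:8.1f} GOPS")
```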
Figure 10. All-to-all topology with n×m processing elements.

Figure 11. Systolic array of PEs.

Figure 12. Area (A) vs. BOPS (B) for a systolic array of 3×3 PEs with variable input (n) and output (m) feature dimensions, and variable bitwidth. Weights and activations use the same bitwidth, and the accumulator width is set to log2(9m) · b_w · b_a.

Figure 13. ResNet-18 roofline analysis for all layers. Red dots are the performance required by the layer, and green dots are the equivalent performance using partial-sum computation. The blue curves connect points corresponding to the same layer and are displayed only for convenience.

Figure 14. VGG-16 on Eyeriss [20] hardware. Red dots are the performance required by the layer, and green dots are the equivalent performance using partial-sum computation. The blue curves connect points corresponding to the same layer and are displayed only for convenience.
Abstract
1. Introduction
1.1. Contribution
1.2. Related Work
2. Method
2.1. The Impact of Quantization on Hardware Implementation
2.2. Data Path
2.3. Communication
2.4. Local Memory
2.5. Roofline Analysis
Roofline Analysis Examples
3. Results
3.1. Experimental Methodology
3.2. System-Level Design Methodology
3.3. Evaluation of Eyeriss Architecture
4. Discussion
4.1. Conclusions
4.2. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
ASIC | Application-Specific Integrated Circuit
CAD | Computer-Aided Design
CNN | Convolutional Neural Network
DDR | Double Data Rate (Memory)
DL | Deep Learning
EDA | Electronic Design Automation
FLOPS | Floating-Point Operations
FMA | Fused Multiply-Add
FPGA | Field Programmable Gate Array
GOPS | Giga Operations
HDL | Hardware Description Language
IC | Integrated Circuit
IP | Intellectual Property
MAC | Multiply Accumulate
NN | Neural Network
OPS | Operations
PE | Processing Engine
RAM | Random Access Memory
SoC | System on a Chip
SRAM | Static Random Access Memory
TOPS | Tera Operations
TSMC | Taiwan Semiconductor Manufacturing Company
VLSI | Very Large-Scale Integration
References
- Qi, W.; Su, H.; Aliverti, A. A Smartphone-Based Adaptive Recognition and Real-Time Monitoring System for Human Activities. IEEE Trans. Hum.-Mach. Syst. 2020, 50, 414–423.
- Su, H.; Hu, Y.; Karimi, H.R.; Knoll, A.C.; Ferrigno, G.; Momi, E.D. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Netw. 2020, 131, 291–299.
- Su, H.; Qi, W.; Yang, C.; Sandoval, J.; Ferrigno, G.; Momi, E.D. Deep Neural Network Approach in Robot Tool Dynamics Identification for Bilateral Teleoperation. IEEE Robot. Autom. Lett. 2020, 5, 2943–2949.
- Su, H.; Qi, W.; Hu, Y.; Karimi, H.R.; Ferrigno, G.; De Momi, E. An Incremental Learning Framework for Human-like Redundancy Optimization of Anthropomorphic Manipulators. IEEE Trans. Ind. Inform. 2020.
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
- Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10726–10734.
- Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Ridnik, T.; Lawen, H.; Noy, A.; Friedman, I. TResNet: High Performance GPU-Dedicated Architecture. arXiv 2020, arXiv:2003.13630.
- Gysel, P.; Motamedi, M.; Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv 2016, arXiv:1604.03168.
- Yang, J.; Shen, X.; Xing, J.; Tian, X.; Li, H.; Deng, B.; Huang, J.; Hua, X.S. Quantization Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Jin, Q.; Yang, L.; Liao, Z. Towards Efficient Training for Neural Network Quantization. arXiv 2019, arXiv:1912.10207.
- Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020.
- Zhao, X.; Wang, Y.; Cai, X.; Liu, C.; Zhang, L. Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020.
- Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.A.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 1–12.
- Raihan, M.A.; Goli, N.; Aamodt, T.M. Modeling deep learning accelerator enabled GPUs. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, WI, USA, 24–26 March 2019; pp. 79–92.
- Chen, Y.H.; Yang, T.J.; Emer, J.; Sze, V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 292–308.
- Jiao, Y.; Han, L.; Jin, R.; Su, Y.J.; Ho, C.; Yin, L.; Li, Y.; Chen, L.; Chen, Z.; Liu, L.; et al. A 12 nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 136–140.
- Abts, D.; Ross, J.; Sparling, J.; Wong-VanHaren, M.; Baker, M.; Hawkins, T.; Bell, A.; Thompson, J.; Kahsai, T.; Kimmell, G.; et al. Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 30 May–3 June 2020; pp. 145–158.
- Jouppi, N.P.; Yoon, D.H.; Kurian, G.; Li, S.; Patil, N.; Laudon, J.; Young, C.; Patterson, D. A domain-specific supercomputer for training deep neural networks. Commun. ACM 2020, 63, 67–78.
- Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138.
- Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016; pp. 243–254.
- Rivas-Gomez, S.; Pena, A.J.; Moloney, D.; Laure, E.; Markidis, S. Exploring the Vision Processing Unit as Co-Processor for Inference. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018.
- Reddi, V.J.; Cheng, C.; Kanter, D.; Mattson, P.; Schmuelling, G.; Wu, C.J.; Anderson, B.; Breughe, M.; Charlebois, M.; Chou, W.; et al. MLPerf Inference Benchmark. arXiv 2019, arXiv:1911.02549.
- Baskin, C.; Liss, N.; Zheltonozhskii, E.; Bronstein, A.M.; Mendelson, A. Streaming architecture for large-scale quantized neural networks on an FPGA-based dataflow platform. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, 21–25 May 2018; pp. 162–169.
- Ankit, A.; Hajj, I.E.; Chalamalasetti, S.R.; Ndu, G.; Foltin, M.; Williams, R.S.; Faraboschi, P.; Hwu, W.W.; Strachan, J.P.; Roy, K.; et al. PUMA: A Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’19, Providence, RI, USA, 13–17 April 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 715–731.
- Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA’17, Monterey, CA, USA, 22–24 February 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 65–74.
- Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. How to Evaluate Deep Neural Network Processors: TOPS/W (Alone) Considered Harmful. IEEE Solid-State Circuits Mag. 2020, 12, 28–41.
- Lee, J.; Won, T.; Lee, T.K.; Lee, H.; Gu, G.; Hong, K. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network. arXiv 2020, arXiv:2001.06268.
- Baskin, C.; Schwartz, E.; Zheltonozhskii, E.; Liss, N.; Giryes, R.; Bronstein, A.M.; Mendelson, A. UNIQ: Uniform Noise Injection for Non-Uniform Quantization of Neural Networks. arXiv 2018, arXiv:1804.10969.
- Williams, S.; Waterman, A.; Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 2009, 52, 65–76.
- McMahon, F.H. The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range; Technical Report; Lawrence Livermore National Lab.: Livermore, CA, USA, 1986.
- Wang, L.; Zhan, J.; Gao, W.; Jiang, Z.; Ren, R.; He, X.; Luo, C.; Lu, G.; Li, J. BOPS, Not FLOPS! A New Metric and Roofline Performance Model For Datacenter Computing. arXiv 2018, arXiv:1801.09212.
- Parashar, A.; Raina, P.; Shao, Y.S.; Chen, Y.H.; Ying, V.A.; Mukkara, A.; Venkatesan, R.; Khailany, B.; Keckler, S.W.; Emer, J. Timeloop: A systematic approach to DNN accelerator evaluation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Madison, WI, USA, 24–26 March 2019; pp. 304–315.
- Wu, Y.N.; Sze, V. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Westminster, CO, USA, 4–7 November 2019.
- Mishra, A.; Nurvitadhi, E.; Cook, J.J.; Marr, D. WRPN: Wide Reduced-Precision Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Jiang, Z.; Li, J.; Zhan, J. The Pitfall of Evaluating Performance on Emerging AI Accelerators. arXiv 2019, arXiv:1911.02987.
- Shafiee, A.; Nag, A.; Muralimanohar, N.; Balasubramonian, R.; Strachan, J.P.; Hu, M.; Williams, R.S.; Srikumar, V. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA’16, Seoul, Korea, 18–22 June 2016; pp. 14–26.
- Morcel, R.; Hajj, H.; Saghir, M.A.R.; Akkary, H.; Artail, H.; Khanna, R.; Keshavamurthy, A. FeatherNet: An Accelerated Convolutional Neural Network Design for Resource-constrained FPGAs. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 2019, 12, 6:1–6:27.
- Wang, E.; Davis, J.J.; Cheung, P.Y.; Constantinides, G.A. LUTNet: Rethinking Inference in FPGA Soft Logic. arXiv 2019, arXiv:1904.00938.
- Baskin, C.; Chmiel, B.; Zheltonozhskii, E.; Banner, R.; Bronstein, A.M.; Mendelson, A. CAT: Compression-Aware Training for bandwidth reduction. arXiv 2019, arXiv:1909.11481.
- Chmiel, B.; Baskin, C.; Banner, R.; Zheltonozhskii, E.; Yermolin, Y.; Karbachevsky, A.; Bronstein, A.M.; Mendelson, A. Feature Map Transform Coding for Energy-Efficient CNN Inference. arXiv 2019, arXiv:1905.10830.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Gong, R.; Liu, X.; Jiang, S.; Li, T.; Hu, P.; Lin, J.; Yu, F.; Yan, J. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
- Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
- Cavigelli, L.; Rutishauser, G.; Benini, L. EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 723–734.
| Multiplier | Gates | Cells | Area | Power (Internal) | Power (Switching) | Power (Leakage) | Power (Dynamic) |
|---|---|---|---|---|---|---|---|
| Floating-Point | 40,090 | 17,175 | 11,786 | 2.76 | 1.31 | 0.43 | 10.53 |
| Fixed-Point | 5065 | 1726 | 1489 | 0.49 | 0.32 | 0.04 | 1.053 |
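A quick ratio check on the multiplier table above, using only the table's own numbers, makes the gap explicit; this is a convenience snippet, not part of the authors' flow.

```python
# Ratio of 32-bit floating-point to 32-bit fixed-point multiplier metrics
# taken from the table above.
floating = {"gates": 40_090, "cells": 17_175, "area": 11_786, "dynamic power": 10.53}
fixed = {"gates": 5_065, "cells": 1_726, "area": 1_489, "dynamic power": 1.053}

for metric in floating:
    ratio = floating[metric] / fixed[metric]
    print(f"{metric}: floating-point is {ratio:.1f}x the fixed-point value")
```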
| | 32-Bit Float | 32-Bit Fixed | 16-Bit Quant. | 8-Bit Quant. |
|---|---|---|---|---|
| PEs | 9 | 60 | 220 | 683 |
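A rough way to see where PE counts like these come from is to divide an area budget by the per-PE area model. The sketch below assumes the Figure 1 fit reports area in μm² and a 1 mm² budget (the accelerator size used in Figure 8); the fit covers the fixed-point configurations only, so the counts land near, but not exactly on, the table's values.

```python
# Rough PE-count estimate: divide an assumed 1 mm^2 budget by the per-PE area
# from the Figure 1 fit (fixed-point PEs; the 32-bit float column needs its own model).

def pe_area_um2(bits: int) -> float:
    return 12.39 * bits ** 2 + 86.07 * bits - 14.02  # Figure 1 quadratic fit

budget_um2 = 1_000_000  # assumed 1 mm^2 accelerator area
for bits in (32, 16, 8):
    print(f"{bits:2d}-bit PEs in 1 mm^2: ~{int(budget_um2 // pe_area_um2(bits))}")
```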
| | 32-Bit Float | 32-Bit Fixed | 16-Bit Quant. | 8-Bit Quant. |
|---|---|---|---|---|
| GOPS/s | 1568 | 5408 | | |
| OPS/bit | | | | |
| | 32-Bit Float | 32-Bit Fixed | 16-Bit Quant. | 8-Bit Quant. | 4-Bit Quant. |
|---|---|---|---|---|---|
| GOPS/s | 1296 | 3969 | | | |
| OPS/bit | | | | | |
Language | Verilog HDL |
Logic Simulation | ModelSim 19.1 |
Synthesis | Synopsys Design Compiler 2017.09-SP3 |
Place and route | Cadence Innovus 2019.11 |
| Layer | Latency [ms] | Latency from Roofline [ms] |
|---|---|---|
| conv1-1 | 7.7 | 158.9 (+1963.6%) |
| conv1-2 | 165.2 | 191.4 (+15.9%) |
| conv2-1 | 82.6 | 117.3 (+42%) |
| conv2-2 | 165.2 | 165.2 |
| conv3-1 | 82.6 | 82.6 |
| conv3-2 | 165.2 | 165.2 |
| conv3-3 | 165.2 | 165.2 |
| conv4-1 | 82.6 | 84.2 |
| conv4-2 | 165.2 | 165.2 |
| conv4-3 | 165.2 | 165.2 |
| conv5-1 | 41.3 | 120.9 (+192.7%) |
| conv5-2 | 41.3 | 120.9 (+192.7%) |
| conv5-3 | 41.3 | 120.9 (+192.7%) |
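The percentage annotations in the right-hand column follow directly from the two latency values; the short check below reproduces a few of them from the table's own numbers.

```python
# Reproduce the percentage annotations in the VGG-16 latency table:
# overhead = (roofline latency / reported latency - 1) * 100.
rows = {"conv1-1": (7.7, 158.9), "conv2-1": (82.6, 117.3), "conv5-1": (41.3, 120.9)}

for layer, (latency_ms, roofline_ms) in rows.items():
    overhead = (roofline_ms / latency_ms - 1) * 100
    print(f"{layer}: +{overhead:.1f}%")
```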
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).