Tutorial | Open access

In-Datacenter Performance Analysis of a Tensor Processing Unit

Published: 24 June 2017

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X--30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X--80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
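The peak-throughput figure in the abstract follows from simple arithmetic: a 256 x 256 systolic array holds 65,536 MACs, and each MAC performs a multiply and an add every cycle. The short Python sketch below reproduces the 92 TeraOps/second number; the ~700 MHz clock rate comes from the body of the paper rather than this page, so treat it as an assumption here.

    # Back-of-the-envelope check of the TPU peak throughput quoted in the abstract.
    # Assumption: ~700 MHz clock, as reported in the paper body (not stated on this page).
    MATRIX_UNIT_DIM = 256                # 256 x 256 systolic array
    MACS = MATRIX_UNIT_DIM ** 2          # 65,536 8-bit multiply-accumulate units
    OPS_PER_MAC_PER_CYCLE = 2            # one multiply plus one add
    CLOCK_HZ = 700e6                     # assumed clock rate

    peak_tops = MACS * OPS_PER_MAC_PER_CYCLE * CLOCK_HZ / 1e12
    print(f"Peak throughput: {peak_tops:.1f} TeraOps/s")  # ~91.8, i.e. the ~92 TOPS above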


    Published In

    ACM SIGARCH Computer Architecture News, Volume 45, Issue 2 (ISCA '17), May 2017, 715 pages
    ISSN: 0163-5964
    DOI: 10.1145/3140659

    ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture, June 2017, 736 pages
    ISBN: 9781450348928
    DOI: 10.1145/3079856
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 24 June 2017
    Published in SIGARCH Volume 45, Issue 2

    Author Tags

    1. CNN
    2. DNN
    3. GPU
    4. LSTM
    5. MLP
    6. RNN
    7. TPU
    8. TensorFlow
    9. accelerator
    10. deep learning
    11. domain-specific architecture
    12. neural network

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited
