[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3079856.3080246acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article
Open access

In-Datacenter Performance Analysis of a Tensor Processing Unit

Published: 24 June 2017 Publication History

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

References

[1]
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M. Ghemawat, S., et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467.
[2]
Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N.E. and Moshovos, A., 2016 Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. Proc. Int'l Symp. on Computer Architecture.
[3]
Adolf, R., Rama, S., Reagen, B., Wei, G.Y. and Brooks, D., 2016, September. Fathom: reference workloads for modern deep learning methods. IEEE Int'l Symp. on Workload Characterization (IISWC).
[4]
Asanović, K. 2002. Programmable Neurocomputing, in The Handbook of Brain Theory and Neural Networks: Second Edition, M. A. Arbib (Ed.), MIT Press, ISBN 0-262-01197-2, November 2002.
[5]
Asanović, K. 1998. Asanović, K., Beck, Johnson, J., Wawrzynek, J., Kingsbury, B. and Morgan, N., November 1998. Training Neural Networks with Spert-II. Chapter 11 in Parallel Architectures for Artificial Networks: Paradigms and Implementations, N. Sundararajan and P. Saratchandran (Eds.), IEEE Computer Society Press, ISBN 0-8186-8399-6. https://people.eecs.berkeley.edu/~krste/papers/annbook.pdf
[6]
Barroso, L.A. and Hölzle, U., 2007. The case for energy-proportional computing. IEEE Computer, vol. 40.
[7]
Barr, J. September 29, 2016, New P2 Instance Type for Amazon EC2 -- Up to 16 GPUs. https://aws.amazon.com/blogs/aws/new-p2-instance-type-for-amazon-ec2-up-to-16-gpus/
[8]
Brooks, D. November 4, 2016. Private communication.
[9]
Caulfield, A.M., Chung, E.S., Putnam, A., Angepat, H., Fowers, J., Haselman, M., Heil, S., Humphrey, M., Kaur, P., Kim, J.Y. and Lo, D.2016. A Cloud-Scale Acceleration Architecture. MICRO-49 conference.
[10]
Cavigelli, L., Gschwend, D., Mayer, C., Willi, S., Muheim, B. and Benini, L., 2015, May. Origami: A convolutional network accelerator. Proc. 25th edition on Great Lakes Symp. on VLSI.
[11]
Chen, Y.H., Emer, J. and Sze, V., 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. Proc. Int'l Symp. on Computer Architecture.
[12]
Chen, Y., Chen, T., Xu, Z., Sun, N., and Teman, O., 2016. DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning, Research Highlight, CACM, 59(11).
[13]
Chi, P., Li, S., Qi, Z., Gu, P., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y. and Xie, Y., 2016. PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory. Proc. Int'l Symp. on Computer Architecture.
[14]
Clark, J. October 26, 2015, Google Turning Its Lucrative Web Search Over to AI Machines. Bloomberg Technology, http://www.bloomberg.com.
[15]
Dally, W. February 9, 2016. High Performance Hardware for Machine Learning, Cadence ENN Summit.
[16]
Dean, J. and Barroso, L.A., 2013. The tail at scale. CACM, 56(2).
[17]
Dean, J. July 7, 2016 Large-Scale Deep Learning with TensorFlow for Building Intelligent Systems, ACM Webinar.
[18]
Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P., 2015, July. Deep Learning with Limited Numerical Precision. ICML.
[19]
Hammerstrom, D., 1990, June. A VLSI architecture for high-performance, low-cost, on-chip learning. 1990 IJCNN Int'l Joint Conference on Neural Networks.
[20]
Han, S.; Pool, J.; Tran, J.; and Dally, W., 2015. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems.
[21]
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A. and Dally, W.J., 2016. EIE: efficient inference engine on compressed deep neural network. Proc. Int'l Symp. on Computer Architecture.
[22]
He, K., Zhang, X., Ren, S. and Sun, J., 2016. Identity mappings in deep residual networks. Also in arXiv preprint arXiv:1603.05027.
[23]
Hennessy, J.L. and Patterson, D.A., 2018. Computer architecture: a quantitative approach, 6th edition, Elsevier.
[24]
Hölzle, U. and Barroso, L., 2009. The datacenter as a computer. Morgan and Claypool.
[25]
Ienne, P., Cornu, T. and Kuhn, G., 1996. Special-purpose digital hardware for neural networks: An architectural survey. Journal of VLSI signal processing systems for signal, image and video technology, 13(1).
[26]
Intel, 2016, Intel® Xeon® Processor E5-4669 v3, http://ark.intel.com/products/85766/Intel-Xeon-Processor-E5-4669-v3-45M-Cache-2_10-GHz.
[27]
Jouppi, N. May 18, 2016. Google supercharges machine learning tasks with TPU custom chip. https://cloudplatform.googleblog.com.
[28]
Keutzer, K., 2016. If I could only design one circuit...: technical perspective. CACM, 59(11).
[29]
Kim, D., Kung, J.H., Chai, S., Yalamanchili, S. and Mukhopadhyay, S., 2016. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. Proc. Int'l Symp. on Computer Architecture.
[30]
Krizhevsky, A., Sutskever, I. and Hinton, G., 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems.
[31]
Kung, H.T. and Leiserson, C.E., 1980. Algorithms for VLSI processor arrays. Introduction to VLSI systems.
[32]
Lange, K.D., 2009. Identifying shades of green: The SPECpower benchmarks. IEEE Computer, 42(3).
[33]
Larabel, M. March 10, 2016, Google Looks To Open Up StreamExecutor To Make GPGPU Programming Easier, Phoronix, https://www.phoronix.com/scan.php?page=news_item&px=Google-StreamExec-Parallel.
[34]
LiKamWa, R., Hou, Y., Gao, J., Polansky, M. and Zhong, L., 2016. RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. Proc. Int'l Symp. on Computer Architecture.
[35]
Liu, S., Du, Z.D., Tao, J.H., Han, D., Luo, T., Xie, Y., Chen, Y. and Chen, T., 2016. Cambricon: An instruction set architecture for neural networks. Proc. Int'l Symp. on Computer Architecture.
[36]
Metz, C. September 26, 2016, Microsoft Bets Its Future On A Reprogrammable Computer Chip, Wired Magazine, https://www.wired.com/2016/09/microsoft-bets-future-chip-reprogram-fly/
[37]
Nvidia, January 2015. Tesla K80 GPU Accelerator. Board Specification https://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05.pdf.
[38]
Nvidia, 2016. Tesla GPU Accelerators For Servers. http://www.nvidia.com/object/tesla-servers.html.
[39]
Ovtcharov, K., Ruwase, O., Kim, J.Y., Fowers, J., Strauss, K. and Chung, E.S., February 2, 2015. Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper. www.microsoft.com/en-us/research/publication/accelerating-deep-convolutional-neural-networks-using-specialized-hardware/
[40]
Ovtcharov, K., Ruwase, O., Kim, J.Y., Fowers, J., Strauss, K. and Chung, E.S., 2015, August. Toward accelerating deep learning at scale using specialized hardware in the datacenter. 2015 IEEE Hot Chips 27 Symp.
[41]
Patterson, D.A. and Ditzel, D.R., 1980. The case for the reduced instruction set computer. ACM SIGARCH Computer Architecture News, 8(6), pp. 25--33.
[42]
Putnam, A., Caulfield, A.M., Chung, E.S., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G.P., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J-Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P.Y., Burger, D. 2016. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. CACM, 59(11).
[43]
Qadeer, W., Hameed, R., Shacham, O., Venkatesan, P., Kozyrakis, C. and Horowitz, M.A., 2013, June. Convolution engine: balancing efficiency & flexibility in specialized computing. Proc. Int'l Symp. on Computer Architecture.
[44]
Ramacher, U., Beichter, J., Raab, W., Anlauf, J., Bruels, N., Hachmann, U. and Wesseling, M., 1991. Design of a 1st Generation Neurocomputer. In VLSI Design of Neural Networks. Springer US.
[45]
Reagen, B., Whatmough, P., Adolf, R., Rama, S., Lee, H., Lee, S.K., Hernández-Lobato, J.M., Wei, G.Y. and Brooks, D., 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. Proc. Int'l Symp. on Computer Architecture.
[46]
Ross, J., Jouppi, N., Phelps, A., Young, C., Norrie, T., Thorson, G., Luu, D., 2015. Neural Network Processor, Patent Application No. 62/164,931.
[47]
Ross, J., Phelps, A., 2015. Computing Convolutions Using a Neural Network Processor, Patent Application No. 62/164,902.
[48]
Ross, J., 2015. Prefetching Weights for a Neural Network Processor, Patent Application No. 62/164,981.
[49]
Ross, J., Thorson, G., 2015. Rotating Data for Neural Network Computations, Patent Application No. 62/164,908.
[50]
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. and Berg, A.C., 2015. Imagenet large scale visual recognition challenge. Int'l Journal of Computer Vision, 115(3).
[51]
Schurman, E. and Brutlag, J., 2009, June. The user and business impact of server delays, additional bytes, and HTTP chunking in web search. In Velocity Web Performance and Operations Conference.
[52]
Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, R.S. and Srikumar, V., 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. Proc. Int'l Symp. on Computer Architecture.
[53]
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587).
[54]
Smith, J.E., 1982, April. Decoupled access/execute computer architectures. Proc. Int'l Symp. on Computer Architecture.
[55]
Steinberg, D., 2015. Full-Chip Simulations, Keys to Success. Proc. Synopsys Users Group (SNUG) Silicon Valley 2015.
[56]
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[57]
Thorson, G., Clark, C., Luu, D., 2015. Vector Computation Unit in a Neural Network Processor, Patent Application No. 62/165,022.
[58]
Williams, S., Waterman, A. and Patterson, D., 2009. Roofline: an insightful visual performance model for multicore architectures. CACM, 52(4).
[59]
Wu, Y., Schuster, M., Chen, Z., Le, Q., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. September 26, 2016, Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, http://arxiv.org/abs/1609.08144.
[60]
Young, C., 2015. Batch Processing in a Neural Network Processor, Patent Application No. 62/165,020.
[61]
Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B. and Cong, J., 2015, February. Optimizing FPGA-based accelerator design for deep convolutional neural networks. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.

Cited By

View all
  • (2025)AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGAElectronics10.3390/electronics1401016814:1(168)Online publication date: 3-Jan-2025
  • (2025)DiMO-CNN: Deep Learning Toolkit-Accelerated Analytical Modeling and Optimization of CNN Hardware and DataflowIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.342941944:1(251-265)Online publication date: Jan-2025
  • (2025)A Deep Investigation on Stealthy DVFS Fault Injection Attacks at DNN Hardware AcceleratorsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.342636444:1(39-51)Online publication date: Jan-2025
  • Show More Cited By

Index Terms

  1. In-Datacenter Performance Analysis of a Tensor Processing Unit

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
    June 2017
    736 pages
    ISBN:9781450348928
    DOI:10.1145/3079856
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 June 2017

    Check for updates

    Author Tags

    1. CNN
    2. DNN
    3. GPU
    4. LSTM
    5. MLP
    6. RNN
    7. TPU
    8. TensorFlow
    9. accelerator
    10. deep learning
    11. domain-specific architecture
    12. neural network

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ISCA '17
    Sponsor:

    Acceptance Rates

    ISCA '17 Paper Acceptance Rate 54 of 322 submissions, 17%;
    Overall Acceptance Rate 543 of 3,203 submissions, 17%

    Upcoming Conference

    ISCA '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)8,551
    • Downloads (Last 6 weeks)1,064
    Reflects downloads up to 04 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)AFHRE: An Accurate and Fast Hardware Resources Estimation Method for Convolutional Accelerator with Systolic Array Structure on FPGAElectronics10.3390/electronics1401016814:1(168)Online publication date: 3-Jan-2025
    • (2025)DiMO-CNN: Deep Learning Toolkit-Accelerated Analytical Modeling and Optimization of CNN Hardware and DataflowIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.342941944:1(251-265)Online publication date: Jan-2025
    • (2025)A Deep Investigation on Stealthy DVFS Fault Injection Attacks at DNN Hardware AcceleratorsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.342636444:1(39-51)Online publication date: Jan-2025
    • (2025)Two-Dimensional Materials for Brain-Inspired Computing HardwareChemical Reviews10.1021/acs.chemrev.4c00631Online publication date: 2-Jan-2025
    • (2025)UIC: A unified and scalable chip integrating neuromorphic computation and general purpose processorMicroelectronics Journal10.1016/j.mejo.2024.106449155(106449)Online publication date: Jan-2025
    • (2025)Balanced segmentation of CNNs for multi-TPU inferenceThe Journal of Supercomputing10.1007/s11227-024-06605-981:1Online publication date: 1-Jan-2025
    • (2024)Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCsIACR Communications in Cryptology10.62056/a3txommolOnline publication date: 9-Apr-2024
    • (2024)Perspective Chapter: Dynamic Timing Enhanced Computing for Microprocessor and Deep Learning AcceleratorsDeep Learning - Recent Findings and Research10.5772/intechopen.113296Online publication date: 29-May-2024
    • (2024)SoKProceedings of the 33rd USENIX Conference on Security Symposium10.5555/3698900.3699091(3403-3422)Online publication date: 14-Aug-2024
    • (2024)BiEProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3694679(62978-62992)Online publication date: 21-Jul-2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media