Tutorial | Open access

In-Datacenter Performance Analysis of a Tensor Processing Unit

Published: 24 June 2017

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X--30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X--80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
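The peak-throughput figure in the abstract follows from simple arithmetic: a 256 x 256 systolic array holds 65,536 MACs, and each MAC performs a multiply and an add every cycle. The short Python sketch below reproduces the 92 TeraOps/second number; the ~700 MHz clock rate comes from the body of the paper rather than this page, so treat it as an assumption here.

    # Back-of-the-envelope check of the TPU peak throughput quoted in the abstract.
    # Assumption: ~700 MHz clock, as reported in the paper body (not stated on this page).
    MATRIX_UNIT_DIM = 256                # 256 x 256 systolic array
    MACS = MATRIX_UNIT_DIM ** 2          # 65,536 8-bit multiply-accumulate units
    OPS_PER_MAC_PER_CYCLE = 2            # one multiply plus one add
    CLOCK_HZ = 700e6                     # assumed clock rate

    peak_tops = MACS * OPS_PER_MAC_PER_CYCLE * CLOCK_HZ / 1e12
    print(f"Peak throughput: {peak_tops:.1f} TeraOps/s")  # ~91.8, i.e. the ~92 TOPS above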


    Published In

    ACM SIGARCH Computer Architecture News, Volume 45, Issue 2 (ISCA '17), May 2017, 715 pages
    ISSN: 0163-5964
    DOI: 10.1145/3140659

    ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture, June 2017, 736 pages
    ISBN: 9781450348928
    DOI: 10.1145/3079856
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 24 June 2017
    Published in SIGARCH Volume 45, Issue 2

    Author Tags

    1. CNN
    2. DNN
    3. GPU
    4. LSTM
    5. MLP
    6. RNN
    7. TPU
    8. TensorFlow
    9. accelerator
    10. deep learning
    11. domain-specific architecture
    12. neural network

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited
