Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

Published: 21 June 2021

Abstract

For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th-generation HPC-grade GPU, the Ampere 100 (A100), claiming significant performance improvements over previous generations, particularly for AI workloads, and introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with a particular focus on empirically quantifying our derived performance expectations. We find that the A100 delivers a smaller performance increase than previous generations on the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark and demonstrate where (and, more importantly, how) they should be used.
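The asynchronous data movement mentioned above refers to Ampere's hardware-accelerated global-to-shared-memory copies, exposed in CUDA through the cooperative-groups `memcpy_async` API. As a minimal sketch of how such a transfer is expressed (the kernel, its names, the 256-thread tile size, and the doubling workload are illustrative assumptions, not code from the paper):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Assumes a launch with blockDim.x == 256 and n a multiple of 256.
__global__ void scale_kernel(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Collectively copy one tile from global to shared memory. On the A100
    // this can lower to the dedicated asynchronous copy path that bypasses
    // registers; on older GPUs it falls back to a synchronous staged copy.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);  // block until the staged tile has landed

    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}
```

Because the same source compiles on pre-Ampere GPUs (where it degrades to an ordinary staged copy), this API is a natural vehicle for the kind of cross-generation microbenchmarking the paper describes.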



Published In

HEART '21: Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
June 2021, 76 pages
ISBN: 9781450385497
DOI: 10.1145/3468044

In cooperation with the German Research Foundation. Published by the Association for Computing Machinery, New York, NY, United States.

Conference: HEART '21. Overall acceptance rate: 22 of 50 submissions (44%).
