A scalable cross-platform infrastructure for application performance tuning using hardware counters

Authors:

P. Mucci

SC '00: Proceedings of the 2000 ACM/IEEE conference on Supercomputing

Pages 42 - es

Published: 01 November 2000

Abstract

The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count “events”, which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis and tuning. The PAPI project has proposed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware. The PAPI library has been or is in the process of being implemented on all major HPC platforms. The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data. PAPI support is also being incorporated into a number of third-party tools.
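
As a hedged illustration of the correlation the abstract describes, the sketch below turns raw event counts for two versions of a code region into derived metrics that can be compared. The event names in the comments (PAPI_TOT_CYC, PAPI_TOT_INS, PAPI_L1_DCM) are PAPI preset names, but the abstract does not list specific events, so their use here is an assumption. The `CounterSample` type, the helper functions, and every count are hypothetical rather than measured data, and the sketch does not call the PAPI library itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Raw event counts for one measured code region (hypothetical values)."""
    total_cycles: int        # assumed to come from an event such as PAPI_TOT_CYC
    total_instructions: int  # assumed to come from an event such as PAPI_TOT_INS
    l1_data_misses: int      # assumed to come from an event such as PAPI_L1_DCM


def instructions_per_cycle(sample: CounterSample) -> float:
    """Instructions completed per cycle; higher usually means better utilization."""
    if sample.total_cycles == 0:
        raise ValueError("no cycles recorded")
    return sample.total_instructions / sample.total_cycles


def l1_misses_per_kilo_instruction(sample: CounterSample) -> float:
    """L1 data cache misses per 1000 instructions; lower is better."""
    if sample.total_instructions == 0:
        raise ValueError("no instructions recorded")
    return 1000.0 * sample.l1_data_misses / sample.total_instructions


# Made-up counts for the same loop before and after a cache-blocking rewrite.
naive = CounterSample(total_cycles=4_000_000,
                      total_instructions=2_000_000,
                      l1_data_misses=100_000)
blocked = CounterSample(total_cycles=2_000_000,
                        total_instructions=2_000_000,
                        l1_data_misses=10_000)

for label, sample in (("naive", naive), ("blocked", blocked)):
    print(label,
          instructions_per_cycle(sample),
          l1_misses_per_kilo_instruction(sample))
```

With these invented counts, the blocked version retires the same number of instructions in half the cycles and with a tenth of the L1 misses, which is the kind of source-to-architecture correlation that counter data is meant to expose.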

Information & Contributors

Information

Published In

SC '00: Proceedings of the 2000 ACM/IEEE conference on Supercomputing

November 2000

889 pages

ISBN:0780398025

Conference Chair:
Louis Turcotte
Rose-Hulman Institute of Technology

In-Cooperation

SIAM: Society for Industrial and Applied Mathematics

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Conference

SC '00

Sponsor:

SIGARCH
IEEE-CS

SC '00: International Conference for High Performance Computing, Networking, Storage and Analysis

November 4 - 10, 2000

Dallas, Texas, USA

Acceptance Rates

SC '00 Paper Acceptance Rate 62 of 179 submissions, 35%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Article Metrics

Total Citations: 77
Total Downloads: 526
Downloads (Last 12 months): 45
Downloads (Last 6 weeks): 10

Reflects downloads up to 11 Dec 2024

Cited By

Javanmard M, Ahmad Z, Kong M, Pouchet L, Chowdhury R, Harrison R, Mars J, Tang L, Xue J, Wu P (2020). Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilers. Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, pp. 317-329. Online publication date: 22-Feb-2020.
https://dl.acm.org/doi/10.1145/3368826.3377916
Dey M, Nazari A, Zajic A, Prvulovic M, Oskin M, Inoue K (2018). TEMProf. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 881-893. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00076
Millán E, Wolovick N, Piccoli M, Garino C, Bringa E (2017). Performance analysis and comparison of cellular automata GPU implementations. Cluster Computing 20:3, pp. 2763-2777. Online publication date: 1-Sep-2017.
https://dl.acm.org/doi/10.1007/s10586-017-0850-3
Venkataraman S, Yang Z, Franklin M, Recht B, Stoica I, Argyraki K, Isaacs R (2016). Ernest. Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation, pp. 363-378. Online publication date: 16-Mar-2016.
https://dl.acm.org/doi/10.5555/2930611.2930635
Zellweger G, Lin D, Roscoe T, Cui H, Lau F, Bansal S, Zhong L (2016). So many performance events, so little time. Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 1-9. Online publication date: 4-Aug-2016.
https://dl.acm.org/doi/10.1145/2967360.2967375
Breslow A, Porter L, Tiwari A, Laurenzano M, Carrington L, Tullsen D, Snavely A (2016). The case for colocation of high performance computing workloads. Concurrency and Computation: Practice & Experience 28:2, pp. 232-251. Online publication date: 1-Feb-2016.
https://dl.acm.org/doi/10.1002/cpe.3187
Haydel N, Madey G, Gesing S, Dakkak A, de Gonzalo S, Taylor I, Hwu W, Anjum A, Papadopoulos G (2015). Enhancing the usability and utilization of accelerated architectures via docker. Proceedings of the 8th International Conference on Utility and Cloud Computing, pp. 361-367. Online publication date: 7-Dec-2015.
https://dl.acm.org/doi/10.5555/3233397.3233456
Tang Y, You R, Kan H, Tithi J, Ganapathi P, Chowdhury R (2015). Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency. ACM SIGPLAN Notices 50:8, pp. 205-214. Online publication date: 24-Jan-2015.
https://dl.acm.org/doi/10.1145/2858788.2688514
Tang Y, You R, Kan H, Tithi J, Ganapathi P, Chowdhury R, Cohen A, Grove D (2015). Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency. Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 205-214. Online publication date: 24-Jan-2015.
https://dl.acm.org/doi/10.1145/2688500.2688514
Lin J, Nukada A, Matsuoka S, Balaji P, Xu C (2015). Modeling gather and scatter with hardware performance counters for Xeon Phi. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 713-716. Online publication date: 4-May-2015.
https://dl.acm.org/doi/10.1109/CCGrid.2015.59
