research-article
Open access

Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions

Published: 05 December 2017

Abstract

Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), thus only a small fraction of hardware events can be monitored simultaneously. We present new techniques to acquire counts for all available hardware events with high accuracy by multiplexing PMCs across multiple executions of the same program, then carefully reconciling and merging the multiple profiles into a single, coherent profile. We present a new metric for assessing the similarity of statistical distributions of event counts and show that our execution profiling approach performs significantly better than Hardware Event Multiplexing.
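
For context on the baseline named in the abstract, the following is a minimal sketch of how time-sliced Hardware Event Multiplexing commonly extrapolates a count, assuming the enabled-time/running-time scaling that interfaces such as Linux perf_event expose. It is not the paper's Fuse method; `CounterReading` and `scale_multiplexed_count` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class CounterReading:
    """One multiplexed counter reading (hypothetical container for this sketch)."""
    raw_count: int       # events counted while the event held a PMC
    time_enabled: float  # time the event was requested (seconds)
    time_running: float  # time the event actually occupied a PMC (seconds)


def scale_multiplexed_count(reading: CounterReading) -> float:
    """Extrapolate a full-run count from a time-sliced reading.

    Assumes the event rate while unobserved matches the rate while observed,
    which is the usual source of error in time-based multiplexing.
    """
    if reading.time_running <= 0:
        raise ValueError("event was never scheduled onto a counter")
    return reading.raw_count * (reading.time_enabled / reading.time_running)


# An event observed for a quarter of a 2-second run.
reading = CounterReading(raw_count=1_000, time_enabled=2.0, time_running=0.5)
estimate = scale_multiplexed_count(reading)
print(estimate)  # 4000.0
```

The estimate is only as good as the uniform-rate assumption: if the event occurs in bursts that fall outside the observed time slices, the scaled count can be far from the true value. Collecting each event over a full execution, as the abstract proposes, avoids this extrapolation at the cost of having to reconcile several runs.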

Supplementary Material

TACO1404-43 (taco1404-43.pdf)
Slide deck associated with this paper

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 4
December 2017
600 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3154814
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 December 2017
Accepted: 01 October 2017
Revised: 01 September 2017
Received: 01 June 2017
Published in TACO Volume 14, Issue 4

Author Tags

  1. Hardware event monitoring
  2. hardware event multiplexing
  3. performance monitoring counters
  4. task-parallel performance analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Royal Academy of Engineering (RAEng)
  • EPSRC
  • European Commission H2020-FETHPC-2014

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 65
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 14 Dec 2024

Cited By

  • (2023) Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping. ACM Transactions on Architecture and Code Optimization 21:1 (1-26). DOI: 10.1145/3629525. Online publication date: 21-Oct-2023
  • (2023) Strategies and software support for the management of hardware performance counters. Software: Practice and Experience 53:10 (1928-1957). DOI: 10.1002/spe.3236. Online publication date: 17-Jul-2023
  • (2022) Profile-driven memory bandwidth management for accelerators and CPUs in QoS-enabled platforms. Real-Time Systems 58:3 (235-274). DOI: 10.1007/s11241-022-09382-x. Online publication date: 1-Sep-2022
  • (2021) BayesPerf: minimizing performance monitoring errors using Bayesian statistics. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (832-844). DOI: 10.1145/3445814.3446739. Online publication date: 19-Apr-2021
  • (2021) MUCH. Proceedings of the 36th Annual ACM Symposium on Applied Computing (511-520). DOI: 10.1145/3412841.3441931. Online publication date: 22-Mar-2021
  • (2020) HRM: Merging Hardware Event Monitors for Improved Timing Analysis of Complex MPSoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (1-1). DOI: 10.1109/TCAD.2020.3013051. Online publication date: 2020
  • (2020) E-WarP: A System-wide Framework for Memory Bandwidth Profiling and Management. 2020 IEEE Real-Time Systems Symposium (RTSS) (345-357). DOI: 10.1109/RTSS49844.2020.00039. Online publication date: Dec-2020
  • (2018) Automated Analysis of Task-Parallel Execution Behavior Via Artificial Neural Networks. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (647-656). DOI: 10.1109/IPDPSW.2018.00105. Online publication date: May-2018
  • (2017) Accurate and Complete Hardware Profiling for OpenMP. Scaling OpenMP for Exascale Performance and Portability (266-280). DOI: 10.1007/978-3-319-65578-9_18. Online publication date: 17-Aug-2017
