research-article

Open access

Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions

Authors:

Antoniu Pop

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 4

Article No.: 43, Pages 1 - 26

https://doi.org/10.1145/3148054

Published: 05 December 2017

Abstract

Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), thus only a small fraction of hardware events can be monitored simultaneously. We present new techniques to acquire counts for all available hardware events with high accuracy by multiplexing PMCs across multiple executions of the same program, then carefully reconciling and merging the multiple profiles into a single, coherent profile. We present a new metric for assessing the similarity of statistical distributions of event counts and show that our execution profiling approach performs significantly better than Hardware Event Multiplexing.
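The workflow the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the function names `plan_runs`, `merge_profiles`, and `emd_1d` are hypothetical, the merge step is a naive union that assumes every execution yields the same sequence of profiled regions (the paper's reconciliation is described as careful, and its details are not given in the abstract), and the one-dimensional earth mover's distance stands in for the paper's new similarity metric, which the abstract does not define.

```python
from typing import Dict, List, Sequence


def plan_runs(events: Sequence[str], num_counters: int) -> List[List[str]]:
    """Split events into groups of at most num_counters, one group per execution."""
    if num_counters < 1:
        raise ValueError("num_counters must be at least 1")
    return [list(events[i:i + num_counters])
            for i in range(0, len(events), num_counters)]


def merge_profiles(profiles: Sequence[Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Naively union per-execution profiles (event name -> per-region counts).

    Assumption: region i in one execution corresponds to region i in every
    other execution. Real executions vary, which is why a reconciliation
    step is needed in practice.
    """
    merged: Dict[str, List[int]] = {}
    for profile in profiles:
        for event, counts in profile.items():
            if event in merged:
                raise ValueError(f"event {event!r} measured in more than one run")
            merged[event] = list(counts)
    return merged


def emd_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D earth mover's distance between two equal-sized samples.

    For equal-sized, equally weighted samples this is the mean absolute
    difference between the sorted values.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


if __name__ == "__main__":
    # Hypothetical event names; a machine with 2 counters needs 3 executions.
    groups = plan_runs(["CYCLES", "INSTRUCTIONS", "L1_MISS", "L2_MISS", "BRANCH_MISS"], 2)
    print(groups)
    # Compare a merged event distribution against a reference measurement.
    print(emd_1d([10, 12, 11], [11, 13, 12]))
```

A lower distance between the merged counts for an event and a reference measurement of that event suggests the merged profile preserves the event's statistical distribution more faithfully.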

Index Terms

Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions
1. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
    2. Performance
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 4

December 2017

600 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3154814

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 December 2017

Accepted: 01 October 2017

Revised: 01 September 2017

Received: 01 June 2017

Published in TACO Volume 14, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Royal Academy of Engineering (RAEng)
EPSRC
European Commission H2020-FETHPC-2014
