research-article

The Cilkprof Scalability Profiler

Authors: Tao B. Schardl, Bradley C. Kuszmaul, I-Ting Angelina Lee, William M. Leiserson, Charles E. Leiserson

SPAA '15: Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures

Pages 89 - 100

https://doi.org/10.1145/2755573.2755603

Published: 13 June 2015

Abstract

Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the overall work and span. Profiling work and span in this way enables a programmer to quickly diagnose scalability bottlenecks in a Cilk program. Despite the detail and quantity of information required to collect these measurements, Cilkprof runs with only constant asymptotic slowdown over the serial running time of the parallel computation. As an example of Cilkprof's usefulness, we used Cilkprof to diagnose a scalability bottleneck in an 1800-line parallel breadth-first search (PBFS) code. By examining Cilkprof's output in tandem with the source code, we were able to zero in on a call site within the PBFS routine that imposed a scalability bottleneck. A minor code modification then improved the parallelism of PBFS by a factor of 5. Using Cilkprof, it took us less than two hours to find and fix a scalability bug which had, until then, eluded us for months. This paper describes the Cilkprof algorithm and proves theoretically using an amortization argument that Cilkprof incurs only constant overhead compared with the application's native serial running time. Cilkprof was implemented by compiler instrumentation, that is, by modifying the LLVM compiler to insert instrumentation into user programs. On a suite of 16 application benchmarks, Cilkprof incurs a geometric-mean multiplicative overhead of only 1.9 and a maximum multiplicative overhead of only 7.4 compared with running the benchmarks without instrumentation.
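The work and span quantities described in the abstract can be illustrated with a small sketch. This is not the Cilkprof algorithm from the paper, and it ignores recursion and the constant-overhead bookkeeping the paper proves; it only shows, for a toy fork-join computation, how work and span might be accumulated per call site. The step encoding (`"strand"`, `"call"`, `"spawn"`, `"sync"`), the function name `measure`, and the call-site labels are all hypothetical names chosen for this example.

```python
from collections import defaultdict


def measure(body, profile):
    """Return (work, span) for one function instance in a toy fork-join model.

    body is a list of steps (a hypothetical encoding, not Cilkprof's):
      ("strand", cost)        serial instructions costing `cost`
      ("call", site, callee)  ordinary call made from call site `site`
      ("spawn", site, callee) spawned call made from call site `site`
      ("sync",)               wait for all outstanding spawned children
    profile maps each call site to [total work, total span] over its invocations.
    """
    work = 0
    cont = 0      # span along the continuation of this function so far
    longest = 0   # longest path that ends in a not-yet-synced spawned child
    for step in body:
        kind = step[0]
        if kind == "strand":
            work += step[1]
            cont += step[1]
        elif kind == "sync":
            cont = max(cont, longest)
            longest = 0
        else:  # "call" or "spawn"
            _, site, callee = step
            w, s = measure(callee, profile)
            profile[site][0] += w
            profile[site][1] += s
            work += w
            if kind == "call":
                cont += s                         # callee runs in series
            else:
                longest = max(longest, cont + s)  # child runs in parallel
    # Implicit sync when the function returns.
    return work, max(cont, longest)


# Toy computation: root spawns a and b, syncs, then runs a short strand.
leaf = [("strand", 5)]
a = [("strand", 8)]
b = [("call", "b:call_leaf", leaf), ("call", "b:call_leaf", leaf)]
root = [
    ("strand", 2),
    ("spawn", "root:spawn_a", a),
    ("spawn", "root:spawn_b", b),
    ("sync",),
    ("strand", 1),
]

profile = defaultdict(lambda: [0, 0])
total_work, total_span = measure(root, profile)
parallelism = total_work / total_span  # 21 / 13 for this toy input
```

In this toy input, the call site `b:call_leaf` has equal work and span, so it contributes no parallelism and sits on the longest path; a per-call-site breakdown of this kind is what lets a programmer look for bottlenecks site by site rather than only at the whole-program level.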

Index Terms

The Cilkprof Scalability Profiler
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software creation and management
    1. Designing software
      1. Software implementation planning
        Software design techniques
    2. Software development process management
  2. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

SPAA '15: Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures

June 2015

362 pages

ISBN: 9781450335881

DOI: 10.1145/2755573

General Chair: Guy Blelloch, Carnegie Mellon University, USA

Program Chair: Kunal Agrawal, Washington University in St. Louis, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SPAA '15: 27th ACM Symposium on Parallelism in Algorithms and Architectures

June 13 - 15, 2015

Portland, Oregon, USA

Acceptance Rates

SPAA '15 Paper Acceptance Rate 31 of 131 submissions, 24%;

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics

Total Citations: 30

Total Downloads: 328

Downloads (last 12 months): 8

Downloads (last 6 weeks): 0

Reflects downloads up to 13 Dec 2024

Citations

Cited By

Mohammadi J, Shirazi M, Kargahi M (2024). Energy-harvesting-aware federated scheduling of parallel real-time tasks. The Journal of Supercomputing 81:1. Online publication date: 1-Dec-2024
https://doi.org/10.1007/s11227-024-06685-7
Adnan (2023). Performance evaluation on work-stealing featured parallel programs on asymmetric performance multicore processors. Array 19 (100311). Online publication date: Sep-2023
https://doi.org/10.1016/j.array.2023.100311
Carratala-Saez R, Gonzalez-Escribano A, Iliopoulos A, Leiserson C, Park C, Rosa I, Schardl T, Torres Y, Bunde D (2022). Peachy Parallel Assignments (EduHPC 2022). 2022 IEEE/ACM International Workshop on Education for High Performance Computing (EduHPC) (50-56). Online publication date: Nov-2022
https://doi.org/10.1109/EduHPC56719.2022.00012
Li J, Agrawal K, Lu C (2022). Parallel Real-Time Scheduling. Handbook of Real-Time Computing (447-467). Online publication date: 9-Aug-2022
https://doi.org/10.1007/978-981-287-251-7_28
Basso M, Rosales E, Schiavio F, Rosà A, Binder W (2022). Accurate Fork-Join Profiling on the Java Virtual Machine. Euro-Par 2022: Parallel Processing (35-50). Online publication date: 22-Aug-2022
https://dl.acm.org/doi/10.1007/978-3-031-12597-3_3
Li J, Agrawal K, Lu C (2021). Parallel Real-Time Scheduling. Handbook of Real-Time Computing (1-21). Online publication date: 31-Oct-2021
https://doi.org/10.1007/978-981-4585-87-3_28-1
Rosà A, Rosales E, Binder W (2019). Analysis and Optimization of Task Granularity on the Java Virtual Machine. ACM Transactions on Programming Languages and Systems 41:3 (1-47). Online publication date: 16-Jul-2019
https://dl.acm.org/doi/10.1145/3338497
Schardl T, Denniston T, Doucet D, Kuszmaul B, Lee I, Leiserson C (2019). The CSI Framework for Compiler-Inserted Program Instrumentation. ACM SIGMETRICS Performance Evaluation Review 46:1 (100-102). Online publication date: 17-Jan-2019
https://doi.org/10.1145/3308809.3308860
Boushehrinejadmoradi N, Yoga A, Nagarakatte S (2018). A parallelism profiler with what-if analyses for OpenMP programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-14). Online publication date: 11-Nov-2018
https://dl.acm.org/doi/10.5555/3291656.3291678
Schardl T, Denniston T, Doucet D, Kuszmaul B, Lee I, Leiserson C (2018). The CSI Framework for Compiler-Inserted Program Instrumentation. ACM SIGMETRICS Performance Evaluation Review 46:1 (100-102). Online publication date: 12-Jun-2018
https://dl.acm.org/doi/10.1145/3292040.3219657
