More Web Proxy on the site http://driver.im/

research-article

Mapping parallelism to multi-cores: a machine learning based approach

Authors:

Michael F.P. O'BoyleAuthors Info & Claims

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 75 - 84

https://doi.org/10.1145/1504176.1504189

Published: 14 February 2009 Publication History

Abstract

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors: a data sensitive and a data insensitive predictor to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. By using low-cost profiling runs, they predict the mapping for a new unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). Performance of our technique is stable across programs and architectures. On average, it delivers above 96% performance of the maximum available on both platforms. It achieve, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance with a significant lower profiling cost.

References

[1]

D. H. Bailey, E. Barszcz, et al. The NAS parallel benchmarks. The International Journal of Supercomputer Applications, 5(3):63--73, 1991.

Digital Library

[2]

B. Barnes, B. Rountree, et al. A regression-based approach to scalability prediction. In ICS'08, 2008.

Digital Library

[3]

E. B. Bernhard, M. G. Isabelle, et al. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, 1992.

Digital Library

[4]

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, U. K., 1996.

Digital Library

[5]

F. Blagojevic, X. Feng, et al. Modeling multi-grain parallelism on heterogeneous multicore processors: A case study of the Cell BE. In HiPEAC'08, 2008.

Digital Library

[6]

J. Cavazos, G. Fursin, et al. Rapidly selecting good compiler optimizations using performance counters. In CGO'07, 2007.

Digital Library

[7]

K. D. Cooper, P. J. Schielke, et al. Optimizing for reduced code space using genetic algorithms. In LCTES'99, 1999.

Digital Library

[8]

J. Corbalan, X. Martorell, et al. Performance-driven processor allocation. IEEE Transaction Parallel Distribution System, 16(7):599--611, 2005.

Digital Library

[9]

L. Dagum and R. Menon. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng., 5(1):46--55, 1998.

Digital Library

[10]

E. C. David, M. K. Richard, et al. LogP: a practical model of parallel computation. Communications of the ACM, 39(11):78--85, 1996.

Digital Library

[11]

M. Gabriel and M. John. Cross-architecture performance predictions for scientific applications using parameterized models. In SIGMETRICS'04, 2004.

Digital Library

[12]

M. R. Guthaus, J. S. Ringenberg, et al. Mibench: A free, commercially representative embedded benchmark suite, 2001.

[13]

H. Hofstee. Future microprocessors and off-chip SOP interconnect. Advanced Packaging, IEEE Transactions on, 27(2):301--303, May 2004.

[14]

S. Ilya, K. Robert, et al. A case study in top-down performance estimation for a large-scale parallel application. In PPoPP'06, 2006.

Digital Library

[15]

E. Ipek, B. R. de Supinski, et al. An approach to performance prediction for parallel applications. In Euro-Par'05, 2005.

Digital Library

[16]

Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput. Surv., 31(4):406--471, 1999.

Digital Library

[17]

C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO'04, 2004.

Digital Library

[18]

C. Lee. UTDSP benchmark suite, http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html.

[19]

G. V. Leslie. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990.

Digital Library

[20]

C. Liao and B. Chapman. A compile-time cost model for OpenMP. In IPDPS'07, 2007.

[21]

C. L. Liu and W. L. James. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, 20(1):46--61, 1973.

Digital Library

[22]

S. Long, G. Fursin, et al. A cost-aware parallel workload allocation approach based on machine learning. In NPC '07, 2007.

Digital Library

[23]

C. K. Luk, Robert Cohn, et al. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI'05, 2005.

Digital Library

[24]

B. S. Macey and A. Y. Zomaya. A performance evaluation of CP list scheduling heuristics for communication intensive task graphs. In IPPS/SPDP'98, 1998.

Digital Library

[25]

Z. Qin, C. Ioana, et al. Pipa: pipelined profiling and analysis on multicore systems. In CGO'08, 2008.

Digital Library

[26]

J. Ramanujam and P. Sadayappan. A methodology for parallelizing programs for multicomputers and complex memory multiprocessors. In SuperComputing'89, 1989.

Digital Library

[27]

T. Xinmin, G. Milind, et al. Compiler and Runtime Support for Running OpenMP Programs on Pentium-and Itanium-Architectures. In IPDPS'03, 2003.

Digital Library

[28]

Z. Yun and V. Michael. Runtime empirical selection of loop schedulers on hyperthreaded SMPs. In IPDPS'05, 2005.

Digital Library

Cited By

S. F. X. Teixeira THenzinger AYadav RAiken AMohror KArnold DBadia R(2023)Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous MachinesProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607079(1-13)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607079
Usman SMehmood RKatib IAlbeshri A(2022)Data Locality in High Performance Computing, Big Data, and Converged Systems: An Analysis of the Cutting Edge and a Future System ArchitectureElectronics10.3390/electronics1201005312:1(53)Online publication date: 23-Dec-2022
https://doi.org/10.3390/electronics12010053
Jia CRen JGao LLi ZZheng JWang Z(2022)Adaptive Model Selection for Video Super Resolution2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys)10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00172(1088-1094)Online publication date: Dec-2022
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00172
Show More Cited By

Index Terms

Mapping parallelism to multi-cores: a machine learning based approach
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages

Recommendations

Partitioning streaming parallelism for multi-cores: a machine learning based approach
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques

Stream based languages are a popular approach to expressing parallelism in modern applications. The efficient mapping of streaming parallelism to multi-core processors is, however, highly dependent on the program and underlying architecture. We address ...
Mapping parallelism to multi-cores: a machine learning based approach
PPoPP '09

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It ...
Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Intel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

February 2009

322 pages

ISBN:9781605583976

DOI:10.1145/1504176

General Chair:
Daniel Reed
Microsoft Research, USA
,
Program Chair:
Vivek Sarkar
Rice University, USA

ACM SIGPLAN Notices Volume 44, Issue 4
PPoPP '09
April 2009
294 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1594835
Issue’s Table of Contents

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 February 2009

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PPoPP09

Sponsor:

PPoPP09: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 14 - 18, 2009

NC, Raleigh, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

174
Total Citations
View Citations
1,866
Total Downloads

Downloads (Last 12 months)52
Downloads (Last 6 weeks)3

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

S. F. X. Teixeira THenzinger AYadav RAiken AMohror KArnold DBadia R(2023)Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous MachinesProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607079(1-13)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607079
Usman SMehmood RKatib IAlbeshri A(2022)Data Locality in High Performance Computing, Big Data, and Converged Systems: An Analysis of the Cutting Edge and a Future System ArchitectureElectronics10.3390/electronics1201005312:1(53)Online publication date: 23-Dec-2022
https://doi.org/10.3390/electronics12010053
Jia CRen JGao LLi ZZheng JWang Z(2022)Adaptive Model Selection for Video Super Resolution2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys)10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00172(1088-1094)Online publication date: Dec-2022
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00172
Hou ZShen HZhou XGu JWang YZhao T(2022)Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directionsFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-022-0625-816:5Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.1007/s11704-022-0625-8
Alcaraz JTehraniJamsaz ADutta ASikora AJannesari ASorribes JCesar E(2022)Predicting number of threads using balanced datasets for openMP regionsComputing10.1007/s00607-022-01081-6105:5(999-1017)Online publication date: 30-Apr-2022
https://doi.org/10.1007/s00607-022-01081-6
Qiu SYou LWang Z(2022)Optimizing Sparse Matrix Multiplications for Graph Neural NetworksLanguages and Compilers for Parallel Computing10.1007/978-3-030-99372-6_7(101-117)Online publication date: 24-Mar-2022
https://doi.org/10.1007/978-3-030-99372-6_7
Denoyelle NPerarnau SVideau BBeckman PJeannot E(2021)Narrowing the Search Space of Applications Mapping on Hierarchical Topologies2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)10.1109/PMBS54543.2021.00018(118-128)Online publication date: Nov-2021
https://doi.org/10.1109/PMBS54543.2021.00018
Shreyas Madhav ASingaravel SKarmel A(2021)Memory Utilization and Machine Learning Techniques for Compiler OptimizationITM Web of Conferences10.1051/itmconf/2021370102137(01021)Online publication date: 17-Mar-2021
https://doi.org/10.1051/itmconf/20213701021
Sánchez Barrera IBlack-Schaffer DCasas MMoretó MStupnikova APopov MAyguadé EHwu WBadia RHofstee H(2020)Modeling and optimizing NUMA effects and prefetching with machine learningProceedings of the 34th ACM International Conference on Supercomputing10.1145/3392717.3392765(1-13)Online publication date: 29-Jun-2020
https://dl.acm.org/doi/10.1145/3392717.3392765
Grelck CBlom C(2020)Resource-Aware Data Parallel Array ProcessingInternational Journal of Parallel Programming10.1007/s10766-020-00664-0Online publication date: 9-Jun-2020
https://doi.org/10.1007/s10766-020-00664-0
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten