research-article

Pandia: comprehensive contention-sensitive thread placement

Authors: Daniel Goodman, Georgios Varisteas, Tim Harris

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

Pages 254 - 269

https://doi.org/10.1145/3064176.3064177

Published: 23 April 2017

Abstract

Pandia is a system for modeling the performance of in-memory parallel workloads. It generates a description of a workload from a series of profiling runs, and combines this with a description of the machine's hardware to model the workload's performance over different thread counts and different placements of those threads.

The approach is "comprehensive" in that it accounts for contention at multiple resources, such as processor functional units and memory channels. The points of contention for a workload can shift between resources as the degree of parallelism and the thread placement change. Pandia accounts for these changes and provides a close correspondence between predicted and actual performance. Testing a set of 22 benchmarks on 2-socket Intel machines fitted with chips ranging from Sandy Bridge to Haswell, we see median differences of 1.05% to 0% between the fastest predicted placement and the fastest measured placement, and median errors of 8% to 4% across all placements.
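
To make the two evaluation quantities concrete, here is a minimal sketch of how they could be computed for a single workload. The placement keys and run times are hypothetical and are not data from the paper, and the exact error definitions used by Pandia are an assumption here. Taking the median of each quantity over a benchmark set would give figures of the kind quoted above.

```python
from statistics import median

def best_placement_gap(predicted, measured):
    """Relative slowdown from running the placement the model predicts to be
    fastest instead of the placement that is actually fastest (0.0 = no loss)."""
    chosen = min(predicted, key=predicted.get)  # placement the model would pick
    best_time = min(measured.values())          # truly fastest measured time
    return (measured[chosen] - best_time) / best_time

def median_prediction_error(predicted, measured):
    """Median of |predicted - measured| / measured over all placements."""
    return median(abs(predicted[p] - measured[p]) / measured[p] for p in measured)

# Hypothetical run times in seconds, keyed by
# (sockets, cores per socket, threads per core). Illustrative values only.
measured = {(1, 4, 1): 10.0, (1, 4, 2): 8.0, (2, 4, 1): 6.0, (2, 4, 2): 6.5}
predicted = {(1, 4, 1): 10.5, (1, 4, 2): 8.4, (2, 4, 1): 6.3, (2, 4, 2): 6.2}

gap = best_placement_gap(predicted, measured)
err = median_prediction_error(predicted, measured)
print(f"gap to best placement: {gap:.1%}, median error: {err:.1%}")
```

In this made-up example the model picks the placement with two threads per core on both sockets, which is slightly slower in practice than the true best, so the gap is nonzero even though every individual prediction is within about 5%.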

Pandia can be used to optimize the performance of a given workload, for instance by identifying whether multiple processor sockets should be used and whether the workload benefits from using multiple threads per core. In addition, Pandia can be used to identify opportunities for reducing resource consumption where additional resources are not matched by additional performance, for instance by limiting a workload to a small number of cores when its scaling is poor.
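
As an illustration of the second use, one simple way to act on predicted run times is to pick the smallest core count whose predicted time is within a tolerance of the best prediction. This is a sketch under stated assumptions: the 5% tolerance, the function name, and the timing table are hypothetical, and the selection rule is not claimed to be Pandia's own policy.

```python
def smallest_adequate_allocation(predicted_time, tolerance=0.05):
    """Return the fewest cores whose predicted run time is within `tolerance`
    (a fraction) of the best predicted run time over all core counts."""
    best = min(predicted_time.values())
    adequate = [cores for cores, t in predicted_time.items()
                if t <= best * (1.0 + tolerance)]
    return min(adequate)

# Hypothetical predicted run times (seconds) for a poorly scaling workload,
# keyed by number of cores. Illustrative values only.
predicted_time = {1: 20.0, 2: 11.0, 4: 7.0, 8: 6.8, 16: 6.9}

cores = smallest_adequate_allocation(predicted_time)
print(f"use {cores} cores")
```

With these numbers, going from 4 to 8 or 16 cores buys less than 5% improvement, so the rule settles on 4 cores and leaves the rest free for other work.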

Information & Contributors

Information

Published In

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

April 2017

648 pages

ISBN: 9781450349383

DOI: 10.1145/3064176

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

EuroSys '17

Sponsor:

SIGOPS

EuroSys '17: Twelfth EuroSys Conference 2017

April 23 - 26, 2017

Belgrade, Serbia

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Bibliometrics & Citations

Article Metrics

Total Citations: 14
Total Downloads: 391
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Dec 2024

Citations

Cited By

Wyzykowski A, Sousa G, Coelho B, Santos L, Anschau D (2024). Optimizing Geophysical Workloads in High-Performance Computing: Leveraging Machine Learning and Transformer Models for Enhanced Parallelism and Processor Allocation. 2024 Third International Conference on Distributed Computing and High Performance Computing (DCHPC), pages 1-14. Online publication date: 14-May-2024.
https://doi.org/10.1109/DCHPC60845.2024.10454084

Denis A, Jeannot E, Swartvagher P (2023). Predicting Performance of Communications and Computations under Memory Contention in Distributed HPC Systems. International Journal of Networking and Computing, 13:1, pages 62-91. Online publication date: 2023.
https://doi.org/10.15803/ijnc.13.1_62

Huang H, Zhao Y, Rao J, Wu S, Jin H, Wang D, Kun S, Pan L (2023). Adapt Burstable Containers to Variable CPU Resources. IEEE Transactions on Computers, 72:3, pages 614-626. Online publication date: 1-Mar-2023.
https://doi.org/10.1109/TC.2022.3174480

Ashry N, Attia R, Nashaat H, Rizk R (2023). CO2 Emission Mitigation in Container-Based Cloud Computing by the Power of Resource Management. Proceedings of the 9th International Conference on Advanced Intelligent Systems and Informatics 2023, pages 97-111. Online publication date: 18-Sep-2023.
https://doi.org/10.1007/978-3-031-43247-7_9

Srikanthan S, Chakraborti S, Ferro P, Dwarkadas S (2022). MAPPER: Managing Application Performance via Parallel Efficiency Regulation*. ACM Transactions on Architecture and Code Optimization, 19:2, pages 1-26. Online publication date: 24-Mar-2022.
https://dl.acm.org/doi/10.1145/3501767

Luan G, Pang P, Chen Q, Xue S, Song Z, Guo M (2022). Online Thread Auto-Tuning for Performance Improvement and Resource Saving. IEEE Transactions on Parallel and Distributed Systems, 33:12, pages 3746-3759. Online publication date: 1-Dec-2022.
https://doi.org/10.1109/TPDS.2022.3169410

Cho Y, Oh S, Egger B (2020). Performance Modeling of Parallel Loops on Multi-Socket Platforms Using Queueing Systems. IEEE Transactions on Parallel and Distributed Systems, 31:2, pages 318-331. Online publication date: 1-Feb-2020.
https://doi.org/10.1109/TPDS.2019.2938172

Gureya D, Neto J, Karimi R, Barreto J, Bhatotia P, Quema V, Rodrigues R, Romano P, Vlassov V (2020). Bandwidth-Aware Page Placement in NUMA. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 546-556. Online publication date: May-2020.
https://doi.org/10.1109/IPDPS47924.2020.00063

Antoniadis K, Guerraoui R, Trigonakis V (2020). Thread-Placement Learning. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pages 877-887. Online publication date: Nov-2020.
https://doi.org/10.1109/ICDCS47774.2020.00050

Khan T, Zhao Y, Pokam G, Mozafari B, Kasikci B, McKinley K, Fisher K (2019). Huron: hybrid false sharing detection and repair. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 453-468. Online publication date: 8-Jun-2019.
https://dl.acm.org/doi/10.1145/3314221.3314644
