DOI:10.1145/3064176.3064177
research-article

Pandia: comprehensive contention-sensitive thread placement

Published: 23 April 2017

Abstract

Pandia is a system for modeling the performance of in-memory parallel workloads. It generates a description of a workload from a series of profiling runs, and combines this with a description of the machine's hardware to model the workload's performance over different thread counts and different placements of those threads.
The approach is "comprehensive" in that it accounts for contention at multiple resources such as processor functional units and memory channels. The points of contention for a workload can shift between resources as the degree of parallelism and the thread placement change. Pandia accounts for these shifts and provides a close correspondence between predicted and actual performance. Testing a set of 22 benchmarks on two-socket Intel machines fitted with chips ranging from Sandy Bridge to Haswell, we see median differences of 1.05% to 0% between the fastest predicted placement and the fastest measured placement, and median errors of 8% to 4% across all placements.
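As a rough illustration of how the limiting resource can move with thread placement, here is a minimal sketch of a bottleneck model. It is not Pandia's model: the `Machine` and `Workload` fields, the two resources, and the rule that the most overloaded resource sets the slowdown are all simplifying assumptions made for this example, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    # Hypothetical capacities, in arbitrary "work units per second".
    core_capacity: float     # functional-unit throughput of one core
    channel_capacity: float  # memory bandwidth available to one socket


@dataclass(frozen=True)
class Workload:
    # Hypothetical per-thread demands, standing in for profiled values.
    core_demand: float
    memory_demand: float


def predict_speedup(w, m, threads_per_core, cores_per_socket, sockets):
    """Return (speedup over one thread, limiting resource) for a placement.

    Assumes perfectly parallel work whose only slowdown comes from the
    single most oversubscribed resource (a toy rule, not Pandia's).
    """
    threads = threads_per_core * cores_per_socket * sockets
    loads = {
        # Threads sharing a core compete for its functional units.
        "core": threads_per_core * w.core_demand / m.core_capacity,
        # Threads sharing a socket compete for its memory channels.
        "memory": threads_per_core * cores_per_socket * w.memory_demand
        / m.channel_capacity,
    }
    limiter = max(loads, key=loads.get)
    if loads[limiter] <= 1.0:
        return float(threads), "none"  # no resource is oversubscribed
    return threads / loads[limiter], limiter


if __name__ == "__main__":
    machine = Machine(core_capacity=1.0, channel_capacity=4.0)
    workload = Workload(core_demand=0.8, memory_demand=0.5)
    for placement in [(1, 8, 1), (2, 4, 1), (2, 8, 1), (1, 4, 2)]:
        print(placement, predict_speedup(workload, machine, *placement))
```

In this toy setting, packing two threads onto each of four cores makes the core the limiter, while two threads on each of eight cores on one socket shifts the limit to memory bandwidth.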
Pandia can be used to optimize the performance of a given workload, for instance by identifying whether multiple processor sockets should be used and whether the workload benefits from using multiple threads per core. In addition, Pandia can identify opportunities to reduce resource consumption where additional resources are not matched by additional performance, for instance by limiting a workload to a small number of cores when its scaling is poor.
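The two uses described above can be sketched as a selection over predicted run times. This is an illustrative assumption about how such predictions might be consumed, not an interface from the paper: the `(sockets, cores per socket, threads per core)` placement encoding, the function names, and the tolerance parameter are all hypothetical, and the run times are made up.

```python
from typing import Dict, Tuple

# Hypothetical encoding: (sockets, cores per socket, threads per core).
Placement = Tuple[int, int, int]


def cores_used(p: Placement) -> int:
    """Number of physical cores a placement occupies."""
    return p[0] * p[1]


def fastest(predicted: Dict[Placement, float]) -> Placement:
    """Placement with the lowest predicted run time."""
    return min(predicted, key=predicted.get)


def cheapest_within(predicted: Dict[Placement, float],
                    tolerance: float = 0.05) -> Placement:
    """Fewest-core placement whose predicted run time is within
    `tolerance` (as a fraction) of the best prediction."""
    best = min(predicted.values())
    acceptable = [p for p, t in predicted.items()
                  if t <= best * (1.0 + tolerance)]
    # Break ties on core count by preferring the faster prediction.
    return min(acceptable, key=lambda p: (cores_used(p), predicted[p]))


if __name__ == "__main__":
    # Invented predictions (seconds) for a poorly scaling workload.
    predictions = {
        (1, 4, 1): 10.3,
        (1, 8, 1): 10.0,
        (2, 8, 1): 9.9,
        (2, 8, 2): 10.8,
    }
    print("fastest:", fastest(predictions))
    print("cheapest within 5%:", cheapest_within(predictions))
```

With these invented numbers the fastest placement uses both sockets, but a four-core placement on one socket is predicted to be within 5% of it, which is the kind of resource saving the abstract describes.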

Published In

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems
April 2017
648 pages
ISBN:9781450349383
DOI:10.1145/3064176
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroSys '17
EuroSys '17: Twelfth EuroSys Conference 2017
April 23 - 26, 2017
Belgrade, Serbia

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Cited By

  • (2024) Optimizing Geophysical Workloads in High-Performance Computing: Leveraging Machine Learning and Transformer Models for Enhanced Parallelism and Processor Allocation. 2024 Third International Conference on Distributed Computing and High Performance Computing (DCHPC), pages 1-14. DOI:10.1109/DCHPC60845.2024.10454084. Online publication date: 14-May-2024
  • (2023) Predicting Performance of Communications and Computations under Memory Contention in Distributed HPC Systems. International Journal of Networking and Computing, 13:1, pages 62-91. DOI:10.15803/ijnc.13.1_62. Online publication date: 2023
  • (2023) Adapt Burstable Containers to Variable CPU Resources. IEEE Transactions on Computers, 72:3, pages 614-626. DOI:10.1109/TC.2022.3174480. Online publication date: 1-Mar-2023
  • (2023) CO2 Emission Mitigation in Container-Based Cloud Computing by the Power of Resource Management. Proceedings of the 9th International Conference on Advanced Intelligent Systems and Informatics 2023, pages 97-111. DOI:10.1007/978-3-031-43247-7_9. Online publication date: 18-Sep-2023
  • (2022) MAPPER: Managing Application Performance via Parallel Efficiency Regulation. ACM Transactions on Architecture and Code Optimization, 19:2, pages 1-26. DOI:10.1145/3501767. Online publication date: 24-Mar-2022
  • (2022) Online Thread Auto-Tuning for Performance Improvement and Resource Saving. IEEE Transactions on Parallel and Distributed Systems, 33:12, pages 3746-3759. DOI:10.1109/TPDS.2022.3169410. Online publication date: 1-Dec-2022
  • (2020) Performance Modeling of Parallel Loops on Multi-Socket Platforms Using Queueing Systems. IEEE Transactions on Parallel and Distributed Systems, 31:2, pages 318-331. DOI:10.1109/TPDS.2019.2938172. Online publication date: 1-Feb-2020
  • (2020) Bandwidth-Aware Page Placement in NUMA. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 546-556. DOI:10.1109/IPDPS47924.2020.00063. Online publication date: May-2020
  • (2020) Thread-Placement Learning. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pages 877-887. DOI:10.1109/ICDCS47774.2020.00050. Online publication date: Nov-2020
  • (2019) Huron: hybrid false sharing detection and repair. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 453-468. DOI:10.1145/3314221.3314644. Online publication date: 8-Jun-2019
