DOI: 10.1145/3392717.3392765
Research article · Open access

Modeling and optimizing NUMA effects and prefetching with machine learning

Published: 29 June 2020

Abstract

Both NUMA thread/data placement and hardware prefetcher configuration have significant impacts on HPC performance. Optimizing both together leads to a large and complex design space that has previously been impractical to explore at runtime.
In this work we deliver the performance benefits of optimizing both NUMA thread/data placement and prefetcher configuration at runtime through careful modeling and online profiling. To address the large design space, we propose a prediction model that reduces the amount of input information needed and the complexity of the prediction required. We do so by selecting a subset of performance counters and application configurations that provide the richest profile information as inputs, and by limiting the output predictions to a subset of configurations that cover most of the performance.
Our model is robust and can choose near-optimal NUMA+prefetcher configurations for applications from only two profile runs. We further demonstrate how to profile online with low overhead, resulting in a technique that delivers an average of 1.68X performance improvement over a locality-optimized NUMA baseline with all prefetchers enabled.
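
To make the approach concrete, here is a minimal sketch of the prediction step described above: an application is profiled under two fixed reference configurations, the resulting hardware-counter readings are concatenated into a feature vector, and a trained classifier selects one configuration from a reduced NUMA+prefetcher subset. The sketch uses Python with scikit-learn; the counter names, placement policies, prefetcher masks, classifier choice, and training data are illustrative assumptions rather than the paper's actual selections.

from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical reduced design space: a few NUMA thread/data placement policies
# crossed with a few hardware-prefetcher settings (here a 4-bit enable mask).
NUMA_POLICIES = ["local", "interleave", "spread"]
PREFETCH_MASKS = [0b0000, 0b0011, 0b1111]        # all off, two on, all on
CONFIGS = list(product(NUMA_POLICIES, PREFETCH_MASKS))

# Hypothetical profile features: counters measured during two profile runs
# under two reference configurations, concatenated into one feature vector.
COUNTERS = ["llc_misses", "remote_dram_accesses", "stall_cycles", "ipc"]

def feature_vector(run_a, run_b):
    """Concatenate counter readings from the two profile runs."""
    return np.array([run_a[c] for c in COUNTERS] + [run_b[c] for c in COUNTERS])

# Offline training set: applications profiled exhaustively and labelled with
# the index of their best-performing configuration (placeholder random data).
rng = np.random.default_rng(0)
X_train = rng.random((64, 2 * len(COUNTERS)))
y_train = rng.integers(0, len(CONFIGS), size=64)

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X_train, y_train)

# Online: two cheap profile runs of a new application -> predicted configuration.
run_a = {"llc_misses": 0.42, "remote_dram_accesses": 0.31, "stall_cycles": 0.55, "ipc": 1.2}
run_b = {"llc_misses": 0.38, "remote_dram_accesses": 0.12, "stall_cycles": 0.47, "ipc": 1.5}
policy, mask = CONFIGS[model.predict([feature_vector(run_a, run_b)])[0]]
print(f"suggested placement policy: {policy}, prefetcher enable mask: {mask:04b}")

In the paper's pipeline, the input counters and the output configuration subset are themselves chosen so that a small number of profile runs covers most of the attainable performance; this sketch only illustrates how such a reduced predictor could be queried at runtime.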

Published In

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing
June 2020
499 pages
ISBN:9781450379830
DOI:10.1145/3392717
  • General Chairs: Eduard Ayguadé, Wen-mei Hwu
  • Program Chairs: Rosa M. Badia, H. Peter Hofstee
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2020

Author Tags

  1. NUMA
  2. machine learning model
  3. page mapping
  4. performance optimization
  5. prefetching
  6. thread mapping

Qualifiers

  • Research-article

Conference

ICS '20
ICS '20: 2020 International Conference on Supercomputing
June 29 - July 2, 2020
Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 464
  • Downloads (Last 6 weeks): 62
Reflects downloads up to 04 Jan 2025

Cited By

  • (2024) Global-State Aware Automatic NUMA Balancing. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 317-326. DOI: 10.1145/3671016.3671380. Online publication date: 24-Jul-2024.
  • (2024) MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 156-167. DOI: 10.1145/3656019.3676895. Online publication date: 14-Oct-2024.
  • (2024) Enhancing Graph Execution for Performance and Energy Efficiency on NUMA Machines. 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 143-148. DOI: 10.1109/ISVLSI61997.2024.00036. Online publication date: 1-Jul-2024.
  • (2024) Reinforcement Learning-Driven Co-Scheduling and Diverse Resource Assignments on NUMA Systems. 2024 IEEE 42nd International Conference on Computer Design (ICCD), 170-178. DOI: 10.1109/ICCD63220.2024.00034. Online publication date: 18-Nov-2024.
  • (2024) Predictive Web Prefetching: A Combined Approach Using Clustering Algorithms and WEKA in High-Traffic Settings. Artificial Intelligence in Internet of Things (IoT): Key Digital Trends, 221-231. DOI: 10.1007/978-981-97-5786-2_17. Online publication date: 17-Oct-2024.
  • (2023) PERFOGRAPH. Proceedings of the 37th International Conference on Neural Information Processing Systems, 57783-57794. DOI: 10.5555/3666122.3668640. Online publication date: 10-Dec-2023.
  • (2023) An Energy-Efficient Tuning Method for Cloud Servers Combining DVFS and Parameter Optimization. IEEE Transactions on Cloud Computing 11(4), 3643-3655. DOI: 10.1109/TCC.2023.3308927. Online publication date: Oct-2023.
  • (2023) Optimizing Single-Source Graph Execution on NUMA Machines. 2023 XIII Brazilian Symposium on Computing Systems Engineering (SBESC), 1-6. DOI: 10.1109/SBESC60926.2023.10324068. Online publication date: 21-Nov-2023.
  • (2023) Power Constrained Autotuning using Graph Neural Networks. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 535-545. DOI: 10.1109/IPDPS54959.2023.00060. Online publication date: May-2023.
  • (2023) Optimizing Performance and Energy Across Problem Sizes Through a Search Space Exploration and Machine Learning. Journal of Parallel and Distributed Computing, 104720. DOI: 10.1016/j.jpdc.2023.104720. Online publication date: Jun-2023.
