[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

A Library for Portable and Composable Data Locality Optimizations for NUMA Systems

Published: 02 March 2017 Publication History

Abstract

Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) Today’s programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software).
This article presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer’s insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB. But more important, optimized programs are portable across different NUMA architectures and preserve data locality also when composed with other parallel computations sharing the same resource management layer.

References

[1]
Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. 2000. The data locality of work stealing. In Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’00). ACM, New York, NY, 1--12.
[2]
Zachary Anderson. 2012. Efficiently combining parallel software using fine-grained, language-level, hierarchical resource management policies. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA’12). ACM, New York, NY, 717--736.
[3]
Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fedorova. 2011. A case for NUMA-aware contention management on multicore systems. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (USENIXATC’11). USENIX Association, Berkeley, CA, 1.
[4]
William Bolosky, Robert Fitzgerald, and Michael Scott. 1989. Simple but effective techniques for NUMA memory management. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP’89). ACM, New York, NY, 19--31.
[5]
Quan Chen, Minyi Guo, and Zhiyi Huang. 2012. CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 163--172.
[6]
Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). ACM, New York, NY, 381--394.
[7]
Tanima Dey, Wei Wang, Jack W. Davidson, and Mary Lou Soffa. 2011. Characterizing multi-threaded applications based on shared-resource contention. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’11). IEEE Computer Society, Washington, DC, 76--86.
[8]
Tanima Dey, Wei Wang, Jack W. Davidson, and Mary Lou Soffa. 2013. ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity. ACM Trans. Archit. Code Optim. 10, 4, Article 41 (Dec. 2013), 25 pages.
[9]
Dave Dice, Virendra J. Marathe, and Nir Shavit. 2011. Flat-combining NUMA locks. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). ACM, New York, NY, 65--74.
[10]
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI’98). ACM, New York, NY, 212--223.
[11]
Elad Gidron, Idit Keidar, Dmitri Perelman, and Yonathan Perez. 2012. SALSA: Scalable and low synchronization NUMA-aware algorithm for producer-consumer pools. In Proceedinbgs of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’12). ACM, New York, NY, 151--160.
[12]
Yi Guo, Jisheng Zhao, Vincent Cave, and Vivek Sarkar. 2010. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’10). ACM, New York, NY, 341--342.
[13]
Daniel Hackenberg, Daniel Molka, and Wolfgang E. Nagel. 2009. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, 413--422.
[14]
Tim Harris and Stefan Kaestle. 2015. Callisto-RTS: Fine-grain parallel loops. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 45--56.
[15]
Tim Harris, Martin Maas, and Virendra J. Marathe. 2014. Callisto: Co-scheduling parallel runtime systems. In Proceedings of the 9h European Conference on Computer Systems (EuroSys’14). ACM, New York, NY, Article 24, 14 pages.
[16]
Intel Corp. 2012. Intel(R) Threading Building Blocks Reference Manual. Intel Corporation.
[17]
Alex Kogan and Erez Petrank. 2011. Wait-free queues with multiple enqueuers and dequeuers. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, New York, NY, 223--234.
[18]
Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. 2012. MemProf: A memory profiler for NUMA multicore systems. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC’12). USENIX Association, Berkeley, CA, 5--5.
[19]
Xu Liu and John Mellor-Crummey. 2014. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’14). ACM, New York, NY, 259--272.
[20]
Zoltan Majo and Thomas R. Gross. 2013. (Mis)understanding the NUMA memory system performance of multithreaded workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Washington, DC, 11--22.
[21]
Zoltan Majo and Thomas R. Gross. 2015. A library for portable and composable data locality optimizations for NUMA systems. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM, New York, NY, 227--238.
[22]
Jaydeep Marathe, Vivek Thakkar, and Frank Mueller. 2010. Feedback-directed page placement for ccNUMA via hardware-generated memory traces. J. Parallel Distrib. Comput. 70, 12 (Dec. 2010), 1204--1219.
[23]
John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995. Retrieved from https://www.researchgate.net/publication/213876927_ Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers.
[24]
Collin McCurdy and Jeffrey Vetter. 2010. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In Performance Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium on. IEEE Computer Society, Washington, DC, 87--96.
[25]
Adam Morrison and Yehuda Afek. 2013. Fast concurrent queues for x86 processors. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’13). ACM, New York, NY, 103--112.
[26]
Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000. A case for user-level dynamic page migration. In Proceedings of the 2000 International Conference on Parallel Processing (ICPP’00). IEEE Computer Society, Washington, DC, 95.
[27]
Takeshi Ogasawara. 2009. NUMA-aware memory manager with dominant-thread-based copying GC. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA’09). ACM, New York, NY, 377--390.
[28]
OpenMP ARB. 2013. OpenMP Application Programming Interface, Version 4.0. OpenMP Architecture Review Board.
[29]
OpenMP ARB. 2015. OpenMP Application Programming Interface, Version 4.5. OpenMP Architecture Review Board.
[30]
Heidi Pan, Benjamin Hindman, and Krste Asanović. 2010. Composing parallel software efficiently with lithe. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’10). ACM, New York, NY, 376--387.
[31]
Aleksey Pesterev, Nickolai Zeldovich, and Robert T. Morris. 2010. Locating cache performance bottlenecks using data profiling. In Proceedings of the 5th European Conference on Computer Systems (EuroSys’10). ACM, New York, NY, 335--348.
[32]
Chandan Reddy and Uday Bondhugula. 2014. Effective automatic computation placement and dataallocation for parallelization of regular programs. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS’14). ACM, New York, NY, 13--22.
[33]
Eric C. Reed, Nicholas Chen, and Ralph E. Johnson. 2011. Expressing pipeline parallelism using TBB constructs: A case study on what works and what doesn’t. In Proceedings of the Compilation of the Co-located Workshops on DSM’11, TMC’11, AGERE!’11, AOOPES’11, NEAT’11, 8#38; VMIL’11 (SPLASH’11 Workshops). ACM, New York, NY, 133--138.
[34]
Arch Robison, Michael Voss, and Alexey Kukanov. 2008. Optimization via reflection on work stealing in TBB. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS’08). IEEE Computer Society, Washington, DC, 1--8.
[35]
Harsha Vardhan Simhadri, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Aapo Kyrola. 2014. Experimental analysis of space-bounded schedulers. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’14). ACM, New York, NY, 30--41.
[36]
David Tam, Reza Azimi, and Michael Stumm. 2007. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys’07). ACM, New York, NY, 47--58.
[37]
Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2008. Hardware monitors for dynamic page migration. J. Parallel Distrib. Comput. 68, 9 (Sept. 2008), 1186--1200.
[38]
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). ACM, New York, NY, 279--289.
[39]
Eddy Z. Zhang, Yunlian Jiang, and Xipeng Shen. 2010. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’10). ACM, New York, NY, 203--212.

Cited By

View all
  • (2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
  • (2020)Bandwidth-Aware Page Placement in NUMA2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00063(546-556)Online publication date: May-2020
  • (2019)Mozart : Efficient Composition of Library Functions for Heterogeneous ExecutionLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_13(182-202)Online publication date: 15-Nov-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Parallel Computing
ACM Transactions on Parallel Computing  Volume 3, Issue 4
Special Issue on PPoPP 2015 and Regular Papers
March 2017
130 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3057718
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 March 2017
Accepted: 01 January 2017
Revised: 01 November 2016
Received: 01 February 2016
Published in TOPC Volume 3, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. NUMA
  2. data placement
  3. scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • SNF

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)2
Reflects downloads up to 24 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
  • (2020)Bandwidth-Aware Page Placement in NUMA2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00063(546-556)Online publication date: May-2020
  • (2019)Mozart : Efficient Composition of Library Functions for Heterogeneous ExecutionLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_13(182-202)Online publication date: 15-Nov-2019
  • (2018)Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication TreeComputers10.3390/computers70400667:4(66)Online publication date: 3-Dec-2018
  • (2018)Cooperative NV-NUMAProceedings of the International Symposium on Memory Systems10.1145/3240302.3240308(67-78)Online publication date: 1-Oct-2018
  • (2018)swSpTRSVACM SIGPLAN Notices10.1145/3200691.317851353:1(338-353)Online publication date: 10-Feb-2018
  • (2018)swSpTRSVProceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3178487.3178513(338-353)Online publication date: 10-Feb-2018
  • (2018)A NUMA-Aware Provably-Efficient Task-Parallel Platform Based on the Work-First Principle2018 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC.2018.8573486(59-70)Online publication date: Sep-2018

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media