More Web Proxy on the site http://driver.im/

research-article

A Library for Portable and Composable Data Locality Optimizations for NUMA Systems

Authors:

Thomas R. GrossAuthors Info & Claims

ACM Transactions on Parallel Computing (TOPC), Volume 3, Issue 4

Article No.: 20, Pages 1 - 32

https://doi.org/10.1145/3040222

Published: 02 March 2017 Publication History

Abstract

Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) Today’s programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software).

This article presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer’s insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB. But more important, optimized programs are portable across different NUMA architectures and preserve data locality also when composed with other parallel computations sharing the same resource management layer.

References

[1]

Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. 2000. The data locality of work stealing. In Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’00). ACM, New York, NY, 1--12.

Digital Library

[2]

Zachary Anderson. 2012. Efficiently combining parallel software using fine-grained, language-level, hierarchical resource management policies. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA’12). ACM, New York, NY, 717--736.

Digital Library

[3]

Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fedorova. 2011. A case for NUMA-aware contention management on multicore systems. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (USENIXATC’11). USENIX Association, Berkeley, CA, 1.

Digital Library

[4]

William Bolosky, Robert Fitzgerald, and Michael Scott. 1989. Simple but effective techniques for NUMA memory management. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP’89). ACM, New York, NY, 19--31.

Digital Library

[5]

Quan Chen, Minyi Guo, and Zhiyi Huang. 2012. CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 163--172.

Digital Library

[6]

Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). ACM, New York, NY, 381--394.

Digital Library

[7]

Tanima Dey, Wei Wang, Jack W. Davidson, and Mary Lou Soffa. 2011. Characterizing multi-threaded applications based on shared-resource contention. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’11). IEEE Computer Society, Washington, DC, 76--86.

Digital Library

[8]

Tanima Dey, Wei Wang, Jack W. Davidson, and Mary Lou Soffa. 2013. ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity. ACM Trans. Archit. Code Optim. 10, 4, Article 41 (Dec. 2013), 25 pages.

Digital Library

[9]

Dave Dice, Virendra J. Marathe, and Nir Shavit. 2011. Flat-combining NUMA locks. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). ACM, New York, NY, 65--74.

Digital Library

[10]

Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI’98). ACM, New York, NY, 212--223.

Digital Library

[11]

Elad Gidron, Idit Keidar, Dmitri Perelman, and Yonathan Perez. 2012. SALSA: Scalable and low synchronization NUMA-aware algorithm for producer-consumer pools. In Proceedinbgs of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’12). ACM, New York, NY, 151--160.

Digital Library

[12]

Yi Guo, Jisheng Zhao, Vincent Cave, and Vivek Sarkar. 2010. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multi-core systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’10). ACM, New York, NY, 341--342.

Digital Library

[13]

Daniel Hackenberg, Daniel Molka, and Wolfgang E. Nagel. 2009. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, 413--422.

Digital Library

[14]

Tim Harris and Stefan Kaestle. 2015. Callisto-RTS: Fine-grain parallel loops. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 45--56.

Digital Library

[15]

Tim Harris, Martin Maas, and Virendra J. Marathe. 2014. Callisto: Co-scheduling parallel runtime systems. In Proceedings of the 9h European Conference on Computer Systems (EuroSys’14). ACM, New York, NY, Article 24, 14 pages.

Digital Library

[16]

Intel Corp. 2012. Intel(R) Threading Building Blocks Reference Manual. Intel Corporation.

[17]

Alex Kogan and Erez Petrank. 2011. Wait-free queues with multiple enqueuers and dequeuers. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, New York, NY, 223--234.

Digital Library

[18]

Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. 2012. MemProf: A memory profiler for NUMA multicore systems. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC’12). USENIX Association, Berkeley, CA, 5--5.

Digital Library

[19]

Xu Liu and John Mellor-Crummey. 2014. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’14). ACM, New York, NY, 259--272.

Digital Library

[20]

Zoltan Majo and Thomas R. Gross. 2013. (Mis)understanding the NUMA memory system performance of multithreaded workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, Washington, DC, 11--22.

[21]

Zoltan Majo and Thomas R. Gross. 2015. A library for portable and composable data locality optimizations for NUMA systems. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM, New York, NY, 227--238.

Digital Library

[22]

Jaydeep Marathe, Vivek Thakkar, and Frank Mueller. 2010. Feedback-directed page placement for ccNUMA via hardware-generated memory traces. J. Parallel Distrib. Comput. 70, 12 (Dec. 2010), 1204--1219.

Digital Library

[23]

John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995. Retrieved from https://www.researchgate.net/publication/213876927_ Memory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers.

[24]

Collin McCurdy and Jeffrey Vetter. 2010. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In Performance Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium on. IEEE Computer Society, Washington, DC, 87--96.

[25]

Adam Morrison and Yehuda Afek. 2013. Fast concurrent queues for x86 processors. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’13). ACM, New York, NY, 103--112.

Digital Library

[26]

Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000. A case for user-level dynamic page migration. In Proceedings of the 2000 International Conference on Parallel Processing (ICPP’00). IEEE Computer Society, Washington, DC, 95.

[27]

Takeshi Ogasawara. 2009. NUMA-aware memory manager with dominant-thread-based copying GC. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA’09). ACM, New York, NY, 377--390.

Digital Library

[28]

OpenMP ARB. 2013. OpenMP Application Programming Interface, Version 4.0. OpenMP Architecture Review Board.

[29]

OpenMP ARB. 2015. OpenMP Application Programming Interface, Version 4.5. OpenMP Architecture Review Board.

[30]

Heidi Pan, Benjamin Hindman, and Krste Asanović. 2010. Composing parallel software efficiently with lithe. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’10). ACM, New York, NY, 376--387.

Digital Library

[31]

Aleksey Pesterev, Nickolai Zeldovich, and Robert T. Morris. 2010. Locating cache performance bottlenecks using data profiling. In Proceedings of the 5th European Conference on Computer Systems (EuroSys’10). ACM, New York, NY, 335--348.

Digital Library

[32]

Chandan Reddy and Uday Bondhugula. 2014. Effective automatic computation placement and dataallocation for parallelization of regular programs. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS’14). ACM, New York, NY, 13--22.

Digital Library

[33]

Eric C. Reed, Nicholas Chen, and Ralph E. Johnson. 2011. Expressing pipeline parallelism using TBB constructs: A case study on what works and what doesn’t. In Proceedings of the Compilation of the Co-located Workshops on DSM’11, TMC’11, AGERE&excl;’11, AOOPES’11, NEAT’11, 8#38; VMIL’11 (SPLASH’11 Workshops). ACM, New York, NY, 133--138.

Digital Library

[34]

Arch Robison, Michael Voss, and Alexey Kukanov. 2008. Optimization via reflection on work stealing in TBB. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS’08). IEEE Computer Society, Washington, DC, 1--8.

[35]

Harsha Vardhan Simhadri, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Aapo Kyrola. 2014. Experimental analysis of space-bounded schedulers. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’14). ACM, New York, NY, 30--41.

Digital Library

[36]

David Tam, Reza Azimi, and Michael Stumm. 2007. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys’07). ACM, New York, NY, 47--58.

Digital Library

[37]

Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2008. Hardware monitors for dynamic page migration. J. Parallel Distrib. Comput. 68, 9 (Sept. 2008), 1186--1200.

Digital Library

[38]

Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). ACM, New York, NY, 279--289.

Digital Library

[39]

Eddy Z. Zhang, Yunlian Jiang, and Xipeng Shen. 2010. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’10). ACM, New York, NY, 203--212.

Digital Library

Cited By

Cruz EDiener MPilla LNavaux P(2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
https://dl.acm.org/doi/10.1145/3433687
Gureya DNeto JKarimi RBarreto JBhatotia PQuema VRodrigues RRomano PVlassov V(2020)Bandwidth-Aware Page Placement in NUMA2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00063(546-556)Online publication date: May-2020
https://doi.org/10.1109/IPDPS47924.2020.00063
Barik RShpeisman TRong HHu CLee VAnderson THenry GLiu HWu YPetersen PLowney G(2019)Mozart : Efficient Composition of Library Functions for Heterogeneous ExecutionLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_13(182-202)Online publication date: 15-Nov-2019
https://doi.org/10.1007/978-3-030-35225-7_13
Show More Cited By

Index Terms

A Library for Portable and Composable Data Locality Optimizations for NUMA Systems
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
    2. General programming languages
      1. Language features
        Concurrent programming structures
      2. Language types
        Parallel programming languages
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multiprocessing / multiprogramming / multitasking
    2. Extra-functional properties
      1. Software performance

Recommendations

A library for portable and composable data locality optimizations for NUMA systems
PPoPP '15

Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three ...
A library for portable and composable data locality optimizations for NUMA systems
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three ...
Matching memory access patterns and data placement for NUMA systems
CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Parallel Computing

ACM Transactions on Parallel Computing Volume 3, Issue 4

Special Issue on PPoPP 2015 and Regular Papers

March 2017

130 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/3057718

Editor:
Phillip B. Gibbons
Carnegie Mellon University, Pittsburgh, USA

Issue’s Table of Contents

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 March 2017

Accepted: 01 January 2017

Revised: 01 November 2016

Received: 01 February 2016

Published in TOPC Volume 3, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

SNF

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
308
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)2

Reflects downloads up to 24 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Cruz EDiener MPilla LNavaux P(2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
https://dl.acm.org/doi/10.1145/3433687
Gureya DNeto JKarimi RBarreto JBhatotia PQuema VRodrigues RRomano PVlassov V(2020)Bandwidth-Aware Page Placement in NUMA2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00063(546-556)Online publication date: May-2020
https://doi.org/10.1109/IPDPS47924.2020.00063
Barik RShpeisman TRong HHu CLee VAnderson THenry GLiu HWu YPetersen PLowney G(2019)Mozart : Efficient Composition of Library Functions for Heterogeneous ExecutionLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_13(182-202)Online publication date: 15-Nov-2019
https://doi.org/10.1007/978-3-030-35225-7_13
Știrb I(2018)Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication TreeComputers10.3390/computers70400667:4(66)Online publication date: 3-Dec-2018
https://doi.org/10.3390/computers7040066
Jokar MZhang LChong FJacob B(2018)Cooperative NV-NUMAProceedings of the International Symposium on Memory Systems10.1145/3240302.3240308(67-78)Online publication date: 1-Oct-2018
https://dl.acm.org/doi/10.1145/3240302.3240308
Wang XLiu WXue WWu L(2018)swSpTRSVACM SIGPLAN Notices10.1145/3200691.317851353:1(338-353)Online publication date: 10-Feb-2018
https://dl.acm.org/doi/10.1145/3200691.3178513
Wang XLiu WXue WWu LKrall AGross T(2018)swSpTRSVProceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3178487.3178513(338-353)Online publication date: 10-Feb-2018
https://dl.acm.org/doi/10.1145/3178487.3178513
Deters JWu JXu YLee I(2018)A NUMA-Aware Provably-Efficient Task-Parallel Platform Based on the Work-First Principle2018 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC.2018.8573486(59-70)Online publication date: Sep-2018
https://doi.org/10.1109/IISWC.2018.8573486

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents