More Web Proxy on the site http://driver.im/

research-article

SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools

Authors:

Dmitri Perelman,

Yonathan PerezAuthors Info & Claims

SPAA '12: Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures

Pages 151 - 160

https://doi.org/10.1145/2312005.2312035

Published: 25 June 2012 Publication History

Abstract

We present a highly-scalable non-blocking producer-consumer task pool, designed with a special emphasis on lightweight synchronization and data locality. The core building block of our pool is SALSA, Scalable And Low Synchronization Algorithm for a single-consumer container with task stealing support. Each consumer operates on its own SALSA container, stealing tasks from other containers if necessary. We implement an elegant self-tuning policy for task insertion, which does not push tasks to overloaded SALSA containers, thus decreasing the likelihood of stealing.

SALSA manages large chunks of tasks, which improves locality and facilitates stealing. SALSA uses a novel approach for coordination among consumers, without strong atomic operations or memory barriers in the fast path. It invokes only two CAS operations during a chunk steal.

Our evaluation demonstrates that a pool built using SALSA containers scales linearly with the number of threads and significantly outperforms other FIFO and non-FIFO alternatives.

References

[1]

http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization.txt.

[2]

www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

[3]

Y. Afek, G. Korland, M. Natanzon, and N. Shavit. Scalable producer-consumer pools based on elimination-diffraction trees. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II}, Euro-Par'10, pages 151--162, 2010.

Digital Library

[4]

Y. Afek, G. Korland, and E. Yanovsky. Quasi-linearizability: Relaxed consistency for improved concurrency. In Principles of Distributed Systems, Lecture Notes in Computer Science, pages 395--410.

[5]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures}, SPAA '98, pages 119--129, 1998.

Digital Library

[6]

D. Basin. Cafe: Scalable task pools with adjustable fairness and contention. Master's thesis, Technion, 2011.

[7]

D. Basin, R. Fan, I. Keidar, O. Kiselov, and D. Perelman. Cafe: scalable task pools with adjustable fairness and contention. In Proceedings of the 25th international conference on Distributed computing}, DISC'11, pages 475--488, 2011.

Digital Library

[8]

S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A case for numa-aware contention management on multicore systems. In Proceedings of the 2011 USENIX conference on USENIX annual technical conference, USENIXATC'11, 2011.

Digital Library

[9]

R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46:720--748, September 1999.

Digital Library

[10]

A. Braginsky and E. Petrank. Locality-conscious lock-free linked lists. In Proceedings of the 12th international conference on Distributed computing and networking, ICDCN'11, pages 107--118, 2011.

Digital Library

[11]

D. Dice, V. J. Marathe, and N. Shavit. Flat-combining numa locks. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, 2011.

Digital Library

[12]

A. Gidenstam, H. Sundell, and P. Tsigas. Cache-aware lock-free queues for multiple producers/consumers and weak memory consistency. In Proceedings of the 14th international conference on Principles of distributed systems}, OPODIS'10, pages 302--317, 2010.

Digital Library

[13]

E. Gidron, I. Keidar, D. Perelman, and Y. Perez. SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-Consumer Pools. Technical report, Technion, 2012.

Digital Library

[14]

D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Distrib. Comput., 18:189--207, February 2006.

Digital Library

[15]

D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the twenty-first annual symposium on Principles of distributed computing, PODC '02, pages 280--289, 2002.

Digital Library

[16]

D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '04, pages 206--215, 2004.

Digital Library

[17]

M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In Proceedings of the 11th international conference on Principles of distributed systems, OPODIS'07, pages 401--414, 2007.

Digital Library

[18]

E. Ladan-Mozes, I.-T. A. Lee, and D. Vyukov. Location-based memory fences. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 75--84, 2011.

Digital Library

[19]

L. Lamport. How to make a multiprocessor computer that correctly execute multiprocess programs. IEEE Trans. Comput., pages 690--691, 1979.

Digital Library

[20]

M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. Parallel Distrib. Syst., 15:491--504, June 2004.

Digital Library

[21]

M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC '96, pages 267--275, 1996.

Digital Library

[22]

M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using elimination to implement scalable and lock-free fifo queues. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '05, pages 253--262, 2005.

Digital Library

[23]

P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-tso: a rigorous and usable programmer's model for x86 multiprocessors. Commun. ACM, pages 89--97, 2010.

Digital Library

[24]

H. Sundell, A. Gidenstam, M. Papatriantafilou, and P. Tsigas. A lock-free algorithm for concurrent bags. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 335--344, 2011.

Digital Library

Cited By

Thomas SHayne RPulaj JMendes H(2022)Using skip graphs for increased NUMA localityJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.04.006167:C(31-49)Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.1016/j.jpdc.2022.04.006
Wang XLiu WXue WWu L(2018)swSpTRSVACM SIGPLAN Notices10.1145/3200691.317851353:1(338-353)Online publication date: 10-Feb-2018
https://dl.acm.org/doi/10.1145/3200691.3178513
Wang XLiu WXue WWu LKrall AGross T(2018)swSpTRSVProceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3178487.3178513(338-353)Online publication date: 10-Feb-2018
https://dl.acm.org/doi/10.1145/3178487.3178513
Show More Cited By

Index Terms

SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Recommendations

LOFT: lock-free transactional data structures
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming

Concurrent data structures are widely used in modern multicore architectures, providing atomicity (linearizability) for each concurrent operation. However, it is often desirable to execute several operations on multiple data structures atomically. We ...
Transactional Lock Elision Meets Combining
PODC '17: Proceedings of the ACM Symposium on Principles of Distributed Computing

Flat combining (FC) and transactional lock elision (TLE) are two techniques that facilitate efficient multi-thread access to a sequentially implemented data structure protected by a lock. FC allows threads to delegate their operations to another (...
Predicate RCU: an RCU for scalable concurrent updates
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Read-copy update (RCU) is a shared memory synchronization mechanism with scalable synchronization-free reads that nevertheless execute correctly with concurrent updates. To guarantee the consistency of such reads, an RCU update transitioning the data ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SPAA '12: Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures

June 2012

348 pages

ISBN:9781450312134

DOI:10.1145/2312005

General Chair:
Guy Blelloch
Carnegie Mellon University, USA
,
Program Chair:
Maurice Herlihy
Brown University, USA

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 June 2012

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SPAA '12

Sponsor:

SPAA '12: 24th ACM Symposium on Parallelism in Algorithms and Architectures

June 25 - 27, 2012

Pennsylvania, Pittsburgh, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Upcoming Conference

SPAA '25

Sponsor:
sigact
sigact

37th ACM Symposium on Parallelism in Algorithms and Architectures

July 28 - August 1, 2025

Portland , OR , USA

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
252
Total Downloads

Downloads (Last 12 months)10
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Thomas SHayne RPulaj JMendes H(2022)Using skip graphs for increased NUMA localityJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.04.006167:C(31-49)Online publication date: 1-Sep-2022
https://dl.acm.org/doi/10.1016/j.jpdc.2022.04.006
Wang XLiu WXue WWu L(2018)swSpTRSVACM SIGPLAN Notices10.1145/3200691.317851353:1(338-353)Online publication date: 10-Feb-2018
https://dl.acm.org/doi/10.1145/3200691.3178513
Wang XLiu WXue WWu LKrall AGross T(2018)swSpTRSVProceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3178487.3178513(338-353)Online publication date: 10-Feb-2018
https://dl.acm.org/doi/10.1145/3178487.3178513
Li MLiu YYang HLuan ZQian D(2018)Multi-role SpTRSV on Sunway Many-Core Architecture2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC/SmartCity/DSS.2018.00109(594-601)Online publication date: Jun-2018
https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00109
Majo ZGross T(2017)A Library for Portable and Composable Data Locality Optimizations for NUMA SystemsACM Transactions on Parallel Computing10.1145/30402223:4(1-32)Online publication date: 2-Mar-2017
https://dl.acm.org/doi/10.1145/3040222
Dice DHerlihy MKogan A(2016)Fast non-intrusive memory reclamation for highly-concurrent data structuresACM SIGPLAN Notices10.1145/3241624.292669951:11(36-45)Online publication date: 14-Jun-2016
https://dl.acm.org/doi/10.1145/3241624.2926699
Dice DHerlihy MKogan AFlood CZhang Z(2016)Fast non-intrusive memory reclamation for highly-concurrent data structuresProceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management10.1145/2926697.2926699(36-45)Online publication date: 14-Jun-2016
https://dl.acm.org/doi/10.1145/2926697.2926699
Shavit NTaubenfeld G(2016)The computability of relaxed data structuresDistributed Computing10.1007/s00446-016-0272-029:5(395-407)Online publication date: 1-Oct-2016
https://dl.acm.org/doi/10.1007/s00446-016-0272-0
Majo ZGross T(2015)A library for portable and composable data locality optimizations for NUMA systemsACM SIGPLAN Notices10.1145/2858788.268850950:8(227-238)Online publication date: 24-Jan-2015
https://dl.acm.org/doi/10.1145/2858788.2688509
Majo ZGross TCohen AGrove D(2015)A library for portable and composable data locality optimizations for NUMA systemsProceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/2688500.2688509(227-238)Online publication date: 24-Jan-2015
https://dl.acm.org/doi/10.1145/2688500.2688509
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents