[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2312005.2312035acmconferencesArticle/Chapter ViewAbstractPublication PagesspaaConference Proceedingsconference-collections
research-article

SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools

Published: 25 June 2012 Publication History

Abstract

We present a highly-scalable non-blocking producer-consumer task pool, designed with a special emphasis on lightweight synchronization and data locality. The core building block of our pool is SALSA, Scalable And Low Synchronization Algorithm for a single-consumer container with task stealing support. Each consumer operates on its own SALSA container, stealing tasks from other containers if necessary. We implement an elegant self-tuning policy for task insertion, which does not push tasks to overloaded SALSA containers, thus decreasing the likelihood of stealing.
SALSA manages large chunks of tasks, which improves locality and facilitates stealing. SALSA uses a novel approach for coordination among consumers, without strong atomic operations or memory barriers in the fast path. It invokes only two CAS operations during a chunk steal.
Our evaluation demonstrates that a pool built using SALSA containers scales linearly with the number of threads and significantly outperforms other FIFO and non-FIFO alternatives.

References

[1]
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization.txt.
[2]
www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.
[3]
Y. Afek, G. Korland, M. Natanzon, and N. Shavit. Scalable producer-consumer pools based on elimination-diffraction trees. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II}, Euro-Par'10, pages 151--162, 2010.
[4]
Y. Afek, G. Korland, and E. Yanovsky. Quasi-linearizability: Relaxed consistency for improved concurrency. In Principles of Distributed Systems, Lecture Notes in Computer Science, pages 395--410.
[5]
N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures}, SPAA '98, pages 119--129, 1998.
[6]
D. Basin. Cafe: Scalable task pools with adjustable fairness and contention. Master's thesis, Technion, 2011.
[7]
D. Basin, R. Fan, I. Keidar, O. Kiselov, and D. Perelman. Cafe: scalable task pools with adjustable fairness and contention. In Proceedings of the 25th international conference on Distributed computing}, DISC'11, pages 475--488, 2011.
[8]
S. Blagodurov, S. Zhuravlev, M. Dashti, and A. Fedorova. A case for numa-aware contention management on multicore systems. In Proceedings of the 2011 USENIX conference on USENIX annual technical conference, USENIXATC'11, 2011.
[9]
R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46:720--748, September 1999.
[10]
A. Braginsky and E. Petrank. Locality-conscious lock-free linked lists. In Proceedings of the 12th international conference on Distributed computing and networking, ICDCN'11, pages 107--118, 2011.
[11]
D. Dice, V. J. Marathe, and N. Shavit. Flat-combining numa locks. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, 2011.
[12]
A. Gidenstam, H. Sundell, and P. Tsigas. Cache-aware lock-free queues for multiple producers/consumers and weak memory consistency. In Proceedings of the 14th international conference on Principles of distributed systems}, OPODIS'10, pages 302--317, 2010.
[13]
E. Gidron, I. Keidar, D. Perelman, and Y. Perez. SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-Consumer Pools. Technical report, Technion, 2012.
[14]
D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Distrib. Comput., 18:189--207, February 2006.
[15]
D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the twenty-first annual symposium on Principles of distributed computing, PODC '02, pages 280--289, 2002.
[16]
D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '04, pages 206--215, 2004.
[17]
M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In Proceedings of the 11th international conference on Principles of distributed systems, OPODIS'07, pages 401--414, 2007.
[18]
E. Ladan-Mozes, I.-T. A. Lee, and D. Vyukov. Location-based memory fences. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 75--84, 2011.
[19]
L. Lamport. How to make a multiprocessor computer that correctly execute multiprocess programs. IEEE Trans. Comput., pages 690--691, 1979.
[20]
M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. Parallel Distrib. Syst., 15:491--504, June 2004.
[21]
M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC '96, pages 267--275, 1996.
[22]
M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using elimination to implement scalable and lock-free fifo queues. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '05, pages 253--262, 2005.
[23]
P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-tso: a rigorous and usable programmer's model for x86 multiprocessors. Commun. ACM, pages 89--97, 2010.
[24]
H. Sundell, A. Gidenstam, M. Papatriantafilou, and P. Tsigas. A lock-free algorithm for concurrent bags. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA '11, pages 335--344, 2011.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SPAA '12: Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
June 2012
348 pages
ISBN:9781450312134
DOI:10.1145/2312005
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 June 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. concurrent data structures
  2. multi-core

Qualifiers

  • Research-article

Conference

SPAA '12

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Upcoming Conference

SPAA '25
37th ACM Symposium on Parallelism in Algorithms and Architectures
July 28 - August 1, 2025
Portland , OR , USA

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)1
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Using skip graphs for increased NUMA localityJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.04.006167:C(31-49)Online publication date: 1-Sep-2022
  • (2018)swSpTRSVACM SIGPLAN Notices10.1145/3200691.317851353:1(338-353)Online publication date: 10-Feb-2018
  • (2018)swSpTRSVProceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3178487.3178513(338-353)Online publication date: 10-Feb-2018
  • (2018)Multi-role SpTRSV on Sunway Many-Core Architecture2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC/SmartCity/DSS.2018.00109(594-601)Online publication date: Jun-2018
  • (2017)A Library for Portable and Composable Data Locality Optimizations for NUMA SystemsACM Transactions on Parallel Computing10.1145/30402223:4(1-32)Online publication date: 2-Mar-2017
  • (2016)Fast non-intrusive memory reclamation for highly-concurrent data structuresACM SIGPLAN Notices10.1145/3241624.292669951:11(36-45)Online publication date: 14-Jun-2016
  • (2016)Fast non-intrusive memory reclamation for highly-concurrent data structuresProceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management10.1145/2926697.2926699(36-45)Online publication date: 14-Jun-2016
  • (2016)The computability of relaxed data structuresDistributed Computing10.1007/s00446-016-0272-029:5(395-407)Online publication date: 1-Oct-2016
  • (2015)A library for portable and composable data locality optimizations for NUMA systemsACM SIGPLAN Notices10.1145/2858788.268850950:8(227-238)Online publication date: 24-Jan-2015
  • (2015)A library for portable and composable data locality optimizations for NUMA systemsProceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/2688500.2688509(227-238)Online publication date: 24-Jan-2015
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media