research-article

Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory

Authors:

Stefanos Kaxiras,

David Klaftenegger,

Magnus Norgren,

Konstantinos Sagonas

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Pages 3 - 14

https://doi.org/10.1145/2749246.2749250

Published: 15 June 2015

Abstract

A coherent global address space in a distributed system enables shared-memory programming at a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical-section execution, we propose a DSM system with a novel coherence protocol and a novel hierarchical queue delegation locking approach. More specifically, we propose an approach, suitable for Data-Race-Free programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo, which localizes as many decisions as possible and allows high parallel performance with little overhead on synchronization when compared to prior DSM implementations.
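The coherence idea in the abstract can be illustrated with a small single-process toy model. This is only a sketch under stated assumptions, not Argo's implementation: the names `Directory`, `Node`, `acquire`, and `release` are hypothetical, a Python dict stands in for the global address space, and pages are classified simply as private (touched by one node) or shared. It shows a node writing back its own dirty pages at a release (self-downgrade) and dropping its own possibly stale shared pages at an acquire (self-invalidation), with the directory being plain data that no remote handler has to service.

```python
# Toy model (an assumption for illustration, not Argo's code).

class Directory:
    """Passive directory: plain data that the accessing node updates itself."""

    def __init__(self):
        self.sharers = {}  # page -> set of node ids that have accessed it

    def register(self, page, node_id):
        self.sharers.setdefault(page, set()).add(node_id)

    def is_private(self, page, node_id):
        # A page is private if no node other than node_id has touched it.
        return self.sharers.get(page, set()) <= {node_id}


class Node:
    def __init__(self, node_id, memory, directory):
        self.id = node_id
        self.memory = memory        # stand-in for the global address space
        self.directory = directory
        self.cache = {}             # page -> locally cached value
        self.dirty = set()          # pages written since the last release

    def read(self, page):
        if page not in self.cache:  # miss: classify the page, then fetch it
            self.directory.register(page, self.id)
            self.cache[page] = self.memory.get(page, 0)
        return self.cache[page]

    def write(self, page, value):
        self.directory.register(page, self.id)
        self.cache[page] = value
        self.dirty.add(page)

    def release(self):
        """Self-downgrade: write this node's dirty pages back to memory."""
        for page in self.dirty:
            self.memory[page] = self.cache[page]
        self.dirty.clear()

    def acquire(self):
        """Self-invalidate: drop cached pages that other nodes may have changed."""
        self.release()  # simplification: never discard unwritten local data
        for page in list(self.cache):
            if not self.directory.is_private(page, self.id):
                del self.cache[page]


memory, directory = {}, Directory()
a, b = Node("A", memory, directory), Node("B", memory, directory)

a.write("x", 1)
a.release()            # A's write becomes visible in memory
b.acquire()
first = b.read("x")    # 1
a.write("x", 2)
a.release()
stale = b.read("x")    # still 1: B has not synchronized since A's update
b.acquire()
fresh = b.read("x")    # 2: the shared page was self-invalidated and refetched

b.write("y", 7)        # only B ever touches "y", so it stays private
b.release()
b.acquire()
private_kept = "y" in b.cache  # True: private pages survive self-invalidation
```

In this sketch a reader only sees a new value after its own acquire, which is acceptable for data-race-free programs because every conflicting access is separated by synchronization. Skipping invalidation for private pages is the point of the classification step: it keeps data cached that no other node can have modified.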

Published In

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

June 2015

296 pages

ISBN: 9781450335508

DOI: 10.1145/2749246

General Chair: Thilo Kielmann (VU University Amsterdam, The Netherlands)

Program Chairs: Dean Hildebrand (IBM Research Almaden), Michela Taufer (University of Delaware)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing

June 15 - 19, 2015

Portland, Oregon, USA

Acceptance Rates

HPDC '15 Paper Acceptance Rate 19 of 116 submissions, 16%;

Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

Total Citations: 32

Total Downloads: 585

Downloads (Last 12 months): 48

Downloads (Last 6 weeks): 11

Reflects downloads up to 11 Dec 2024

Cited By

Suetterlein J, Manzano J, Marquez A (2024). Synchronization for CXL Based Memory. Proceedings of the International Symposium on Memory Systems, pages 178-185. DOI: 10.1145/3695794.3695810. Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3695794.3695810
Sha M, Cai Y, Wang S, Phan L, Li F, Tan K (2024). Object-oriented Unified Encrypted Memory Management for Heterogeneous Memory Architectures. Proceedings of the ACM on Management of Data, 2:3, pages 1-29. DOI: 10.1145/3654958. Online publication date: 30-May-2024
https://doi.org/10.1145/3654958
Calciu I, Imran M, Puddu I, Kashyap S, Al Maruf H, Mutlu O, Kolli A (2023). Using Local Cache Coherence for Disaggregated Memory Systems. ACM SIGOPS Operating Systems Review, 57:1, pages 21-28. DOI: 10.1145/3606557.3606561. Online publication date: 28-Jun-2023
https://dl.acm.org/doi/10.1145/3606557.3606561
Ding B, Han M, Chen R (2023). DArray: A High Performance RDMA-Based Distributed Array. Proceedings of the 52nd International Conference on Parallel Processing, pages 715-724. DOI: 10.1145/3605573.3605608. Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3605573.3605608
Shiina S, Taura K, Mohror K, Arnold D, Badia R (2023). Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3581784.3607049. Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607049
Ewais M, Chow P (2023). Disaggregated Memory in the Datacenter: A Survey. IEEE Access, 11, pages 20688-20712. DOI: 10.1109/ACCESS.2023.3250407. Online publication date: 2023
https://doi.org/10.1109/ACCESS.2023.3250407
Hideshima T, Sato S, Taura K (2022). Cost-aware Programming on Page-based Distributed Shared Memory. Journal of Information Processing, 30, pages 464-475. DOI: 10.2197/ipsjjip.30.464. Online publication date: 2022
https://doi.org/10.2197/ipsjjip.30.464
Ziegler T, Binnig C, Leis V, Ives Z, Bonifati A, El Abbadi A (2022). ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. Proceedings of the 2022 International Conference on Management of Data, pages 685-699. DOI: 10.1145/3514221.3526187. Online publication date: 11-Jun-2022
https://doi.org/10.1145/3514221.3526187
Zhang J, Yu X, Qi Z, Guan H (2022). Falcon: A Timestamp-based Protocol to Maximize the Cache Efficiency in the Distributed Shared Memory. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 974-984. DOI: 10.1109/IPDPS53621.2022.00099. Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00099
Hadzilacos V, Hu X, Toueg S (2022). On atomic registers and randomized consensus in M&M systems. Distributed Computing, 35:1, pages 81-103. DOI: 10.1007/s00446-021-00405-7. Online publication date: 1-Feb-2022
https://dl.acm.org/doi/10.1007/s00446-021-00405-7
