DOI: 10.1145/2749246.2749250
research-article

Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory

Published: 15 June 2015

Abstract

A coherent global address space in a distributed system enables shared-memory programming at a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both the network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical-section execution, we propose a DSM system with a novel coherence protocol and a novel hierarchical queue delegation locking approach. More specifically, we propose a coherence approach, suitable for Data-Race-Free programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo, which localizes as many decisions as possible and allows high parallel performance with little synchronization overhead compared to prior DSM implementations.

Published In

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
June 2015
296 pages
ISBN: 9781450335508
DOI: 10.1145/2749246
Publisher

Association for Computing Machinery

New York, NY, United States

Conference

HPDC'15
Acceptance Rates

HPDC '15 paper acceptance rate: 19 of 116 submissions (16%)
Overall acceptance rate: 166 of 966 submissions (17%)

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 11
Reflects downloads up to 11 Dec 2024.

Cited By

  • (2024) Synchronization for CXL Based Memory. Proceedings of the International Symposium on Memory Systems, pages 178-185. DOI: 10.1145/3695794.3695810. Online publication date: 30-Sep-2024.
  • (2024) Object-oriented Unified Encrypted Memory Management for Heterogeneous Memory Architectures. Proceedings of the ACM on Management of Data, 2:3, pages 1-29. DOI: 10.1145/3654958. Online publication date: 30-May-2024.
  • (2023) Using Local Cache Coherence for Disaggregated Memory Systems. ACM SIGOPS Operating Systems Review, 57:1, pages 21-28. DOI: 10.1145/3606557.3606561. Online publication date: 28-Jun-2023.
  • (2023) DArray: A High Performance RDMA-Based Distributed Array. Proceedings of the 52nd International Conference on Parallel Processing, pages 715-724. DOI: 10.1145/3605573.3605608. Online publication date: 7-Aug-2023.
  • (2023) Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1145/3581784.3607049. Online publication date: 12-Nov-2023.
  • (2023) Disaggregated Memory in the Datacenter: A Survey. IEEE Access, 11, pages 20688-20712. DOI: 10.1109/ACCESS.2023.3250407. Online publication date: 2023.
  • (2022) Cost-aware Programming on Page-based Distributed Shared Memory. Journal of Information Processing, 30, pages 464-475. DOI: 10.2197/ipsjjip.30.464. Online publication date: 2022.
  • (2022) ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. Proceedings of the 2022 International Conference on Management of Data, pages 685-699. DOI: 10.1145/3514221.3526187. Online publication date: 11-Jun-2022.
  • (2022) Falcon: A Timestamp-based Protocol to Maximize the Cache Efficiency in the Distributed Shared Memory. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 974-984. DOI: 10.1109/IPDPS53621.2022.00099. Online publication date: May-2022.
  • (2022) On atomic registers and randomized consensus in M&M systems. Distributed Computing, 35:1, pages 81-103. DOI: 10.1007/s00446-021-00405-7. Online publication date: 1-Feb-2022.
