research-article

Spandex: a flexible interface for efficient heterogeneous coherence

Authors:

Johnathan Alsop,

Matthew D. Sinclair,

Sarita V. Adve

ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture

Pages 261 - 274

https://doi.org/10.1109/ISCA.2018.00031

Published: 02 June 2018

Abstract

Recent heterogeneous architectures have trended toward tighter integration and shared memory largely due to the efficient communication and programmability enabled by this shift. However, such integration is complex, because accelerators have widely disparate methods for accessing and keeping data coherent. Some processors use caches backed by hardware coherence protocols like MESI, while others prefer lightweight software coherence protocols or use specialized memories like scratchpads with differing state and communication granularities. Modern solutions tend to build interfaces that extend existing MESI-style CPU coherence protocols, often by adding hierarchical indirection through intermediate shared caches. Although functionally correct, these strategies lack flexibility and generally suffer from performance limitations that make them sub-optimal for some emerging accelerators and workloads.

Instead, we need a flexible interface that can efficiently integrate existing and future devices without requiring intrusive changes to their memory structure. We introduce Spandex, an improved coherence interface based on the simple and scalable DeNovo coherence protocol. Spandex (which takes its name from the flexible material commonly used in one-size-fits-all textiles) directly interfaces devices with diverse coherence properties and memory demands, enabling each device to communicate in a manner appropriate for its specific access properties. We demonstrate the importance of this flexibility by comparing this strategy against a more conventional MESI-based hierarchical solution for a diverse range of heterogeneous applications. On average for the applications studied, Spandex reduces execution time by 16% (max 29%) and network traffic by 27% (max 58%) relative to the MESI-based hierarchical solution.
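As a reading aid for the percentages quoted above, the following minimal sketch converts a fractional reduction into the fraction that remains and into the equivalent improvement factor over the baseline. The function and variable names are hypothetical and do not come from the paper. Applying the conversion to an average treats that average as if it were a single run, so the result is only an approximation of per-application behavior.

```python
def improvement_factor(reduction: float) -> float:
    """Return baseline/new for a fractional reduction (0.16 means 16% lower)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Reductions quoted in the abstract, relative to the MESI-based hierarchical
# baseline. The dictionary layout is an illustrative assumption.
REPORTED = {
    "execution time": {"average": 0.16, "max": 0.29},
    "network traffic": {"average": 0.27, "max": 0.58},
}


def summarize(reported):
    """Build (metric, case, remaining fraction, improvement factor) rows."""
    rows = []
    for metric, cases in reported.items():
        for case, reduction in cases.items():
            rows.append(
                (metric, case, 1.0 - reduction, improvement_factor(reduction))
            )
    return rows


if __name__ == "__main__":
    for metric, case, remaining, factor in summarize(REPORTED):
        print(f"{metric:16s} {case:8s} remaining={remaining:.2f} factor={factor:.2f}x")
```

Under this conversion, a 16% reduction in execution time corresponds to a speedup of roughly 1.19x over the baseline, and a 58% reduction in network traffic leaves 42% of the baseline traffic.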

Index Terms

Spandex: a flexible interface for efficient heterogeneous coherence
1. Computer systems organization
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture

June 2018

884 pages

ISBN:9781538659847

Publisher

IEEE Press

Conference

ISCA '18

ISCA '18: The 45th Annual International Symposium on Computer Architecture

June 2 - 6, 2018

Los Angeles, California

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Article Metrics

Total Citations: 10
Total Downloads: 290
Downloads (Last 12 months): 21
Downloads (Last 6 weeks): 2

Reflects downloads up to 30 Dec 2024

Citations

Cited By

Suresh V, Mishra B, Jing Y, Zhu Z, Jin N, Block C, Mantovani P, Giri D, Zuckerman J, Carloni L, Adve S (2024) Mozart: Taming Taxes and Composing Accelerators with Shared-Memory. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 183-200. DOI: 10.1145/3656019.3676896. Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3676896
Tadepalli S, Wu Z, Patel H (2023) PASoC: A Predictable Accelerator-rich SoC. Proceedings of Cyber-Physical Systems and Internet of Things Week 2023, pp. 325-330. DOI: 10.1145/3576914.3587496. Online publication date: 9-May-2023
https://dl.acm.org/doi/10.1145/3576914.3587496
Oswald N, Nagarajan V, Sorin D, Gavrielatos V, Olausson T, Carr R (2023) HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols. IEEE Micro 43:4, pp. 62-70. DOI: 10.1109/MM.2023.3274993. Online publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1109/MM.2023.3274993
Loughlin K, Saroiu S, Wolman A, Manerkar Y, Kasikci B, Salapura V, Zahran M, Chong F, Tang L (2022) MOESI-prime. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 670-684. DOI: 10.1145/3470496.3527427. Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3527427
Zuckerman J, Giri D, Kwon J, Mantovani P, Carloni L (2021) Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 350-365. DOI: 10.1145/3466752.3480065. Online publication date: 18-Oct-2021
https://dl.acm.org/doi/10.1145/3466752.3480065
Puthoor S, Lipasti M (2021) Systems-on-Chip with Strong Ordering. ACM Transactions on Architecture and Code Optimization 18:1, pp. 1-27. DOI: 10.1145/3428153. Online publication date: 20-Jan-2021
https://dl.acm.org/doi/10.1145/3428153
Balkind J, Lim K, Schaffner M, Gao F, Chirkov G, Li A, Lavrov A, Nguyen T, Fu Y, Zaruba F, Gulati K, Benini L, Wentzlaff D, Larus J, Ceze L, Strauss K (2020) BYOC. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 699-714. DOI: 10.1145/3373376.3378479. Online publication date: 9-Mar-2020
https://dl.acm.org/doi/10.1145/3373376.3378479
Barbalace A, Olivier P, Ravindran B (2019) Rethinking Communication in Multiple-kernel OSes for New Shared Memory Interconnects. Proceedings of the 10th Workshop on Programming Languages and Operating Systems, pp. 45-52. DOI: 10.1145/3365137.3365399. Online publication date: 27-Oct-2019
https://dl.acm.org/doi/10.1145/3365137.3365399
Huang S, Chang L, El Hajj I, Garcia de Gonzalo S, Gómez-Luna J, Chalamalasetti S, El-Hadedy M, Milojicic D, Mutlu O, Chen D, Hwu W, Apte V, Di Marco A, Litoiu M, Merseguer J (2019) Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, pp. 79-90. DOI: 10.1145/3297663.3310305. Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297663.3310305
Giri D, Mantovani P, Carloni L, Shibuya T (2019) Runtime reconfigurable memory hierarchy in embedded scalable platforms. Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 719-726. DOI: 10.1145/3287624.3288755. Online publication date: 21-Jan-2019
https://dl.acm.org/doi/10.1145/3287624.3288755
