DOI: 10.1109/ISCA.2018.00031
research-article

Spandex: a flexible interface for efficient heterogeneous coherence

Published: 02 June 2018

Abstract

Recent heterogeneous architectures have trended toward tighter integration and shared memory largely due to the efficient communication and programmability enabled by this shift. However, such integration is complex, because accelerators have widely disparate methods for accessing and keeping data coherent. Some processors use caches backed by hardware coherence protocols like MESI, while others prefer lightweight software coherence protocols or use specialized memories like scratchpads with differing state and communication granularities. Modern solutions tend to build interfaces that extend existing MESI-style CPU coherence protocols, often by adding hierarchical indirection through intermediate shared caches. Although functionally correct, these strategies lack flexibility and generally suffer from performance limitations that make them sub-optimal for some emerging accelerators and workloads.
Instead, we need a flexible interface that can efficiently integrate existing and future devices without requiring intrusive changes to their memory structure. We introduce Spandex, an improved coherence interface based on the simple and scalable DeNovo coherence protocol. Spandex (which takes its name from the flexible material commonly used in one-size-fits-all textiles) directly interfaces devices with diverse coherence properties and memory demands, enabling each device to communicate in a manner appropriate for its specific access properties. We demonstrate the importance of this flexibility by comparing this strategy against a more conventional MESI-based hierarchical solution for a diverse range of heterogeneous applications. On average for the applications studied, Spandex reduces execution time by 16% (max 29%) and network traffic by 27% (max 58%) relative to the MESI-based hierarchical solution.

      Information

      Published In

      ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture
      June 2018
      884 pages
      ISBN: 9781538659847

      Publisher

      IEEE Press

      Qualifiers

      • Research-article

      Conference

      ISCA '18

      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 21
      • Downloads (Last 6 weeks): 2
      Reflects downloads up to 30 Dec 2024

      Cited By

      • (2024) Mozart: Taming Taxes and Composing Accelerators with Shared-Memory. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 183-200. DOI: 10.1145/3656019.3676896. Online publication date: 14-Oct-2024
      • (2023) PASoC: A Predictable Accelerator-rich SoC. Proceedings of Cyber-Physical Systems and Internet of Things Week 2023, pp. 325-330. DOI: 10.1145/3576914.3587496. Online publication date: 9-May-2023
      • (2023) HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols. IEEE Micro 43:4, pp. 62-70. DOI: 10.1109/MM.2023.3274993. Online publication date: 1-Jul-2023
      • (2022) MOESI-prime. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 670-684. DOI: 10.1145/3470496.3527427. Online publication date: 18-Jun-2022
      • (2021) Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 350-365. DOI: 10.1145/3466752.3480065. Online publication date: 18-Oct-2021
      • (2021) Systems-on-Chip with Strong Ordering. ACM Transactions on Architecture and Code Optimization 18:1, pp. 1-27. DOI: 10.1145/3428153. Online publication date: 20-Jan-2021
      • (2020) BYOC. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 699-714. DOI: 10.1145/3373376.3378479. Online publication date: 9-Mar-2020
      • (2019) Rethinking Communication in Multiple-kernel OSes for New Shared Memory Interconnects. Proceedings of the 10th Workshop on Programming Languages and Operating Systems, pp. 45-52. DOI: 10.1145/3365137.3365399. Online publication date: 27-Oct-2019
      • (2019) Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures. Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, pp. 79-90. DOI: 10.1145/3297663.3310305. Online publication date: 4-Apr-2019
      • (2019) Runtime reconfigurable memory hierarchy in embedded scalable platforms. Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 719-726. DOI: 10.1145/3287624.3288755. Online publication date: 21-Jan-2019
