research-article
Open access

Efficient Data Supply for Parallel Heterogeneous Architectures

Published: 26 April 2019

Abstract

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This article explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability attributes. Additionally, we introduce some microarchitectural improvements for data supply units to efficiently handle long-latency indirect loads.
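The general idea of decoupled data supply can be illustrated with a small software analogy. The sketch below is not Mercury's design or the authors' implementation; it only assumes a generic split in which a supply thread resolves indirect loads of the form `values[indices[i]]` and passes operands through a bounded queue to a compute thread, with several such pairs running in parallel. All names (`supply`, `compute`, `decoupled_sum`, `num_pairs`, `depth`) are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of one operand stream


def supply(values, indices, q):
    """Supply side (hypothetical): resolve indirect loads and enqueue operands."""
    for idx in indices:
        q.put(values[idx])  # blocks while the bounded queue is full
    q.put(SENTINEL)


def compute(q, out, slot):
    """Compute side (hypothetical): consume operands, never compute addresses."""
    total = 0
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        total += item
    out[slot] = total


def decoupled_sum(values, indices, num_pairs=2, depth=8):
    """Split the index stream across num_pairs supply/compute thread pairs."""
    out = [0] * num_pairs
    threads = []
    for p in range(num_pairs):
        q = queue.Queue(maxsize=depth)   # one private queue per pair
        chunk = indices[p::num_pairs]    # interleaved partition of the work
        threads.append(threading.Thread(target=supply, args=(values, chunk, q)))
        threads.append(threading.Thread(target=compute, args=(q, out, p)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out)


if __name__ == "__main__":
    print(decoupled_sum([10, 20, 30, 40], [3, 0, 2, 2, 1]))
```

The bounded queue stands in for the hardware buffer between supply and compute units: the supply side can run ahead by at most `depth` operands, which is what lets it hide load latency from the consumer in a real decoupled design.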

Cited By

  • (2022) An Architecture Interface and Offload Model for Low-Overhead, Near-Data, Distributed Accelerators. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, 1160-1177. DOI: 10.1109/MICRO56248.2022.00083. Online publication date: 1-Oct-2022
  • (2020) MosaicSim: A Lightweight, Modular Simulator for Heterogeneous Systems. 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 136-148. DOI: 10.1109/ISPASS48437.2020.00029. Online publication date: Aug-2020

      Published In

      ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 2
      June 2019
      317 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/3325131
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 26 April 2019
      Accepted: 01 January 2019
      Revised: 01 December 2018
      Received: 01 September 2018
      Published in TACO Volume 16, Issue 2

      Author Tags

      1. Heterogeneous architecture
      2. data access optimization
      3. decoupled architecture

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Article Metrics

      • Downloads (Last 12 months): 161
      • Downloads (Last 6 weeks): 15
      Reflects downloads up to 26 Dec 2024
