Open access

Efficient Data Supply for Parallel Heterogeneous Architectures

Authors:

Juan L. Aragón,

Margaret Martonosi

ACM Transactions on Architecture and Code Optimization (TACO), Volume 16, Issue 2

Article No.: 9, Pages 1 - 23

https://doi.org/10.1145/3310332

Published: 26 April 2019

Abstract

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This article explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability attributes. Additionally, we introduce microarchitectural improvements for data supply units to efficiently handle long-latency indirect loads.

Cited By

Baskaran S, Kandemir M, Sampson J, Hardavellas N, Campanoni S, Grot B, Karpuzcu U (2022). An Architecture Interface and Offload Model for Low-Overhead, Near-Data, Distributed Accelerators. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1160-1177. DOI: 10.1109/MICRO56248.2022.00083. Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.1109/MICRO56248.2022.00083

Matthews O, Manocha A, Giri D, Orenes-Vera M, Tureci E, Sorensen T, Ham T, Aragon J, Carloni L, Martonosi M (2020). MosaicSim: A Lightweight, Modular Simulator for Heterogeneous Systems. 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 136-148. DOI: 10.1109/ISPASS48437.2020.00029. Online publication date: Aug-2020
https://doi.org/10.1109/ISPASS48437.2020.00029

Index Terms

Efficient Data Supply for Parallel Heterogeneous Architectures
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
    2. Parallel architectures

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 16, Issue 2

June 2019

317 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3325131

Editor:
Koen De Bosschere
Ghent University, Belgium

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 April 2019

Accepted: 01 January 2019

Revised: 01 December 2018

Received: 01 September 2018

Published in TACO Volume 16, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Spanish State Research Agency
Center for Future Architecture Research
National Science Foundation

Bibliometrics

Total Citations: 2
Total Downloads: 897
Downloads (Last 12 months): 159
Downloads (Last 6 weeks): 13

Reflects downloads up to 25 Dec 2024
