Article

Compiling and optimizing for decoupled architectures

Authors:

Alasdair Rawsthorne,

Muriel Mewissen,

Peter Bird

Supercomputing '95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing

Pages 40 - es

https://doi.org/10.1145/224170.224301

Published: 08 December 1995

Abstract

Decoupled architectures provide a key to the problem of sustained supercomputer performance through their ability to hide large memory latencies. When a program executes in a decoupled mode, the perceived memory latency at the processor is zero; effectively, the entire physical memory has an access time equivalent to that of the processor's register file, and latency is completely hidden. However, the asynchronous functional units within a decoupled architecture must occasionally synchronize, incurring a high penalty. The goal of compiling and optimizing for decoupled architectures is to partition the program between the asynchronous functional units in such a way that latencies are hidden but synchronization events are executed infrequently. This paper describes a model for decoupled compilation and explains the effectiveness of compilation for decoupled systems. A number of new compiler optimizations are introduced and evaluated quantitatively using the Perfect Club scientific benchmarks. We show that, with a suitable repertoire of optimizations, it is possible to hide large latencies most of the time for most of the programs in the Perfect Club.
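
The trade-off described in the abstract, hidden latency against the cost of synchronization, can be illustrated with a toy timing model. The sketch below is not the model used in the paper; it is a minimal illustration under stated assumptions: an access unit issues at most one load per cycle, an execute unit consumes one operand per cycle in order, every load takes `latency` cycles, and at a synchronization point the access unit must wait for the execute unit to finish the previous iteration before issuing its next load. The names `simulate`, `blocking`, and `sync_every` are hypothetical.

```python
def blocking(n_iters: int, latency: int) -> int:
    """Cycles for a non-decoupled loop: every load stalls the processor."""
    return n_iters * (latency + 1)


def simulate(n_iters: int, latency: int, sync_every: int = 0) -> int:
    """Cycles for a toy decoupled loop (assumed model, not the paper's).

    sync_every = 0 means the units never synchronize; sync_every = k means
    every k-th iteration needs a value from the execute unit before the
    access unit can issue that iteration's load.
    """
    issue = -1  # cycle at which the previous load was issued
    done = 0    # cycle at which the execute unit finished the previous iteration
    for i in range(n_iters):
        next_issue = issue + 1  # access unit issues one load per cycle
        if sync_every and i > 0 and i % sync_every == 0:
            # Synchronization: the access unit loses its run-ahead distance.
            next_issue = max(next_issue, done)
        issue = next_issue
        arrival = issue + latency       # operand reaches the execute unit
        done = max(done, arrival) + 1   # one cycle of compute per operand
    return done


if __name__ == "__main__":
    n, lat = 1000, 100
    print("blocking:           ", blocking(n, lat))
    print("decoupled, no sync: ", simulate(n, lat))
    print("decoupled, sync/10: ", simulate(n, lat, sync_every=10))
    print("decoupled, sync/1:  ", simulate(n, lat, sync_every=1))
```

In this toy model the fully decoupled loop pays the latency once (`n_iters + latency` cycles), while a loop that synchronizes on every iteration degenerates to the blocking case. That is the intuition behind partitioning a program so that synchronization events are rare.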

Published In

Supercomputing '95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing

December 1995

875 pages

ISBN: 0897918169

DOI: 10.1145/224170

Chairman:
Sid Karin
San Diego Supercomputer Center, San Diego, CA

Copyright © 1995 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '95: International Conference for High Performance Computing, Networking, Storage and Analysis

December 4 - 8, 1995

San Diego, California, USA

Acceptance Rates

Supercomputing '95 Paper Acceptance Rate 69 of 241 submissions, 29%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

Crago N, Damani S, Sankaralingam K, Keckler S (2024). WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1-16. Online publication date: 2-Mar-2024.
https://doi.org/10.1109/HPCA57654.2024.00086
Talati N, May K, Behroozi A, Yang Y, Kaszyk K, Vasiladiotis C, Verma T, Li L, Nguyen B, Sun J, Morton J, Ahmadi A, Austin T, O'Boyle M, Mahlke S, Mudge T, Dreslinski R (2021). Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 654-667. Online publication date: Feb-2021.
https://doi.org/10.1109/HPCA51647.2021.00061
Ham T, Aragón J, Martonosi M (2019). Efficient Data Supply for Parallel Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization 16:2, pp. 1-23. Online publication date: 26-Apr-2019.
https://dl.acm.org/doi/10.1145/3310332
Ham T, Aragón J, Martonosi M (2017). Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization 14:2, pp. 1-27. Online publication date: 28-Jun-2017.
https://dl.acm.org/doi/10.1145/3075620
Ham T, Aragón J, Martonosi M, Prvulovic M (2015). DeSC. Proceedings of the 48th International Symposium on Microarchitecture, pp. 191-203. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830800
Crago N, Patel S (2011). OUTRIDER. ACM SIGARCH Computer Architecture News 39:3, pp. 117-128. Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/2024723.2000079
Crago N, Patel S, Iyer R, Yang Q, González A (2011). OUTRIDER. Proceedings of the 38th annual international symposium on Computer architecture, pp. 117-128. Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/2000064.2000079
Howes L, Lokhmotov A, Donaldson A, Kelly P (2008). Deriving Efficient Data Movement from Decoupled Access/Execute Specifications. Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers, pp. 168-182. Online publication date: 24-Dec-2008.
https://dl.acm.org/doi/10.1007/978-3-540-92990-1_14
Jones G, Topham N (2005). A limitation study into access decoupling. Euro-Par'97 Parallel Processing, pp. 1102-1111. Online publication date: 26-Sep-2005.
https://doi.org/10.1007/BFb0002859
Sung M, Krashinsky R, Asanović K (2001). Multithreading decoupled architectures for complexity-effective general purpose computing. ACM SIGARCH Computer Architecture News 29:5, pp. 56-61. Online publication date: 1-Dec-2001.
https://dl.acm.org/doi/10.1145/563647.563658