
research-article

Decoupling loads for nano-instruction set computers

Authors:

Andrew D. Hilton,

Benjamin C. Lee

ACM SIGARCH Computer Architecture News, Volume 44, Issue 3

Pages 406 - 417

https://doi.org/10.1145/3007787.3001181

Published: 18 June 2016

Abstract

We propose an ISA extension that decouples the data access and register write operations in a load instruction. We describe system and hardware support for decoupled loads. Furthermore, we show how compilers can generate better static instruction schedules by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance with geometric mean speedups of 8.4%.
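To make the idea of hoisting a load's data access above a may-alias store concrete, here is a minimal toy sketch. It is not the paper's ISA or hardware design: the `ToyMachine` class, the `access`/`write` operation names, the staging slots, and the store-forwarding conflict rule are all hypothetical, chosen only to illustrate why the split schedule can stay correct when the store and load addresses turn out to be the same.

```python
class ToyMachine:
    """Toy model (hypothetical): 'access' stages data, 'write' commits it."""

    def __init__(self, memory):
        self.memory = dict(memory)  # address -> value
        self.regs = {}              # register name -> value
        self.staged = {}            # assumed staging slots: tag -> (addr, value)

    def load(self, reg, addr):
        # Conventional load: data access and register write happen together.
        self.regs[reg] = self.memory[addr]

    def store(self, addr, value):
        self.memory[addr] = value
        # Assumed conflict handling: forward the stored value to any staged
        # access of the same address so the later register write is correct.
        for tag, (staged_addr, _) in list(self.staged.items()):
            if staged_addr == addr:
                self.staged[tag] = (addr, value)

    def access(self, tag, addr):
        # First half of a decoupled load: perform the data access early.
        self.staged[tag] = (addr, self.memory[addr])

    def write(self, tag, reg):
        # Second half: write the register at the load's original position.
        _, value = self.staged.pop(tag)
        self.regs[reg] = value


def original_order(p, q):
    """Store to p, then a conventional load from q (p and q may alias)."""
    m = ToyMachine({0: 10, 1: 20})
    m.store(p, 99)
    m.load("r1", q)
    return m.regs["r1"]


def hoisted_order(p, q):
    """Same program with the load's data access hoisted above the store."""
    m = ToyMachine({0: 10, 1: 20})
    m.access("t0", q)      # data access moved above the may-alias store
    m.store(p, 99)
    m.write("t0", "r1")    # register write stays where the load was
    return m.regs["r1"]
```

In this sketch both schedules produce the same register value whether or not `p` and `q` alias, which is the property a static scheduler would rely on when moving the data access earlier.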




Published In

ACM SIGARCH Computer Architecture News, Volume 44, Issue 3 (ISCA '16)

June 2016

730 pages

ISSN: 0163-5964

DOI: 10.1145/3007787

Editor: Doug DeGroot

ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture

June 2016

756 pages

ISBN: 9781467389471

General Chairs: Sang Lyul Min (Seoul National University), Gabriel Loh (AMD Research)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
