research-article

Decoupling loads for nano-instruction set computers

Published: 18 June 2016

Abstract

We propose an ISA extension that decouples the data access and register write operations in a load instruction. We describe system and hardware support for decoupled loads. Furthermore, we show how compilers can generate better static instruction schedules by hoisting a decoupled load's data access above may-alias stores and branches. We find that decoupled loads improve performance, with a geometric mean speedup of 8.4%.
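
As an illustration of the idea in the abstract, the following is a minimal sketch of a load split into a data-access half and a register-write half, with the access half hoisted above a may-alias store. Everything here is hypothetical: the `ToyMachine` class, the `load_access` and `load_write` operation names, and in particular the alias handling (a store updates any pending access to the same address) are assumptions made for this example and are not taken from the paper's hardware design.

```python
class ToyMachine:
    """Toy model of a machine with ordinary and decoupled loads (illustrative only)."""

    def __init__(self, memory):
        self.mem = dict(memory)   # address -> value
        self.regs = {}            # register name -> value
        self.pending = {}         # tag -> [address, value] for in-flight accesses

    def load(self, reg, addr):
        # Ordinary load: data access and register write happen together.
        self.regs[reg] = self.mem[addr]

    def store(self, addr, value):
        self.mem[addr] = value
        # Assumed alias handling: a store to an address with a pending
        # access refreshes the buffered value, so the later register
        # write still observes program-order semantics.
        for entry in self.pending.values():
            if entry[0] == addr:
                entry[1] = value

    def load_access(self, tag, addr):
        # First half of a decoupled load: start the data access early.
        self.pending[tag] = [addr, self.mem[addr]]

    def load_write(self, reg, tag):
        # Second half: write the buffered value into the register.
        _, value = self.pending.pop(tag)
        self.regs[reg] = value


def run_original(store_addr):
    # Original schedule: the store comes first, then the load.
    m = ToyMachine({0x10: 1, 0x20: 2})
    m.store(store_addr, 99)
    m.load("r1", 0x20)
    return m.regs["r1"]


def run_decoupled(store_addr):
    # Hoisted schedule: the data access moves above the may-alias store,
    # while the register write stays at the load's original position.
    m = ToyMachine({0x10: 1, 0x20: 2})
    m.load_access("t0", 0x20)
    m.store(store_addr, 99)
    m.load_write("r1", "t0")
    return m.regs["r1"]


if __name__ == "__main__":
    for store_addr in (0x10, 0x20):   # no alias, then alias
        print(hex(store_addr), run_original(store_addr), run_decoupled(store_addr))
```

In this sketch both schedules produce the same register value whether or not the store aliases the load, which is the property a compiler would need in order to hoist the access without proving the two addresses differ.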

Published In

ACM SIGARCH Computer Architecture News, Volume 44, Issue 3
ISCA'16
June 2016
730 pages
ISSN: 0163-5964
DOI: 10.1145/3007787

ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture
June 2016
756 pages
ISBN: 9781467389471

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2016
Published in SIGARCH Volume 44, Issue 3
