
Continual flow pipelines

Published: 07 October 2004

Abstract

Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects: how to build a processor that provides high single-thread performance, allows several such cores to be placed on the same die for high throughput, and adapts dynamically to future applications. Conventional approaches to high single-thread performance rely on large, complex cores to sustain the large instruction window needed to tolerate memory latency, which makes them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP), a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that achieving the benefits of a large instruction window requires addressing inefficiencies in the management of both the scheduler and the register file, and we propose a unified solution. The non-blocking property of CFP keeps small the key processor structures that affect cycle time and power (scheduler, register file) and die size (second-level cache). The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores on single-thread applications.



    Published In

    ACM SIGOPS Operating Systems Review, Volume 38, Issue 5
    ASPLOS '04
    December 2004
    283 pages
    ISSN: 0163-5980
    DOI: 10.1145/1037949

    ASPLOS XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems
    October 2004
    296 pages
    ISBN: 1581138040
    DOI: 10.1145/1024393
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. CFP
    2. instruction window
    3. latency tolerance
    4. non-blocking


