[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

Hiding the misprediction penalty of a resource-efficient high-performance processor

Published: 30 January 2008 Publication History

Abstract

Misprediction is a major obstacle for increasing speculative out-of-order processors performance. Performance degradation depends on both the number of misprediction events and the recovery time associated with each one of them. In recent years a few checkpoint based microarchitectures have been proposed. In comparison with ROB-based processors, checkpoint processors are scalable and highly resource efficient. Unfortunately, in these proposals the misprediction recovery time is proportional to the instruction queue size.
In this paper we analyze methods to reduce the misprediction recovery time. We propose a new register file management scheme and techniques to selectively flush the instruction queue and the load store queue, and to isolate deeply pipelined execution units. The result is a novel checkpoint processor with Constant misprediction RollBack time (CRB). We further present a streamlined, cost-efficient solution, which saves complexity at the price of slightly lower performance.

References

[1]
Akkary, H., Rajwar, R., and Srinivasan, S. 2003. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proc. of the 36th Annual Int'l Symp. on Microarchitecture. 423--434.
[2]
Akkary, H., Rajwar, R., and Srinivasan, S. 2004. An analysis of a resource efficient checkpoint architecture. ACM Trans. on Architecture and Code Optimization 1, 4 (Dec.), 418--444.
[3]
Aragon, J., Gonzalez, J., Gonzalez, A., and Smith, J. 2002. Dual path instruction processing. In Proc. of the 16th Annual Int'l Conf. on Supercomputing. 220--229.
[4]
Balakrishnan, S., Rajwar, R., Upton, M., and Lai, K. 2005. The impact of performance asymmetry in emerging multicore architectures. In Proc. of the 32nd Annual Int'l Symp. on Computer Architecture. 506--517.
[5]
Burger, D. and Austin, T. 1997. The simplescalar tool set. SIGARCH Computer Architecture News 25, 3 (June), 13--25.
[6]
Canal, R. and Gonzalez, A. 2001. Reducing the complexity of the issue logic. In Proc. of the 15th Annual Int'l Conf. on Supercomputing. 312--320.
[7]
Ceze, L., Strauss, K., Tuck, J., Torrellas, J., and Renau, J. 2006. Cava: Using checkpoint-assisted value prediction to hide l2 misses. ACM Trans. on Architecture and Code Optimization 3, 2 (June), 182--208.
[8]
Chou, Y., Fung, J., and Shen, J. 1999. Reducing branch misprediction penalties via dynamic control independence detection. In Proc. of the 13th Annual Int'l Conf. on Supercomputing. 109--118.
[9]
Chung, J., Chafi, H., Minh, C., McDonald, A., Carlstrom, B., Kozyrakis, C., and Olukotun, K. 2006. The common case transactional behavior of multithreaded programs. In Proc. of the 12th IEEE Int'l Symp. on High-Performance Computer Architecture. 266--277.
[10]
Cristal, A., Martinez, J., and Valero, M. 2003. A case for resource-conscious out-of-order processors. IEEE Computer Architecture Letters 2.
[11]
Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2004. Out-of-order commit processors. In Proc. of the 9th IEEE Int'l Symp. on High-Performance Computer Architecture. 48--59.
[12]
Cristal, A., Santana, O., Valero, M., and Martinez, J. 2004. Toward kilo-instruction processors. ACM Trans. on Architecture and Code Optimization 1, 4 (Dec.), 389--417.
[13]
Davis, J., Laudon, J., and Olukotun, K. 2005. Maximizing CMP throughput with mediocre cores. In Proc. of the 14th Int'l Conf. on Parallel Architectures and Compilation Techniques. 51--62.
[14]
Gandhi, A., Akkary, H., Rajwar, R., Srinivasan, S., and Lai, K. 2005. Scalable load and store processing in latency tolerant processors. In Proc. of the 32nd Annual Int'l Symp. on Computer Architecture. 446--457.
[15]
Gandhi, A., Akkary, H., and Srinivasan, S. 2004. Reducing branch misprediction penalty via selective branch recovery. In Proc. of the 10th IEEE Int'l Symp. on High-Performance Computer Architecture. 254--264.
[16]
Grochowski, E., Ronen, R., Shen, J., and Wang, H. 2004. Best of both latency and throughput. In Proc. of the Int'l Conf. on Computer Design. 236--243.
[17]
Grunwald, D., Klauser, A., Manne, S., and Pleszkun, A. 1998. Confidence estimation for speculation control. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture. 122--131.
[18]
Hammond, L., Wong, V., Chen, M., Carlstrom, B., Davis, J., Hertzberg, B., Prabhu, M., Wijaya, H., Kozyrakis, C., and Olukotun, K. 2004. Transactional memory coherence and consistency. In Proc. of the 31st Annual Int'l Symp. on Computer Architecture. 102--113.
[19]
Heil, T. and Smith, J. 1996. Selective dual path execution. Tech. Rep. ECE, University of Wisconsin-Madison. Nov.
[20]
Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., and Roussel, P. 2001. The microarchitecture of the Pentium 4 processor. Intel Technology Journal 1 (Feb.), 1--12.
[21]
Hwu, W. and Patt, Y. 1987. Checkpoint repair for high-performance out-of-order execution machines. IEEE Transactions on Computers 36, 12 (Dec.), 1496--1514.
[22]
Jacobsen, E., Rotenberg, E., and Smith, J. 1996. Assigning confidence to conditional branch predictions. In Proc. of the 29th Annual Int'l Symp. on Microarchitecture. 142--152.
[23]
Kahle, J. 2005. The Cell processor architecture. In Proc. of the 38th Annual Int'l Symp. on Microarchitecture. 3.
[24]
Kessler, R. 1999. The Alpha 21264 microprocessor. IEEE micro 19, 2 (Apr.), 24--36.
[25]
Kongetira, P., Aingaran, K., and Olukotun, K. 2005. Niagara: A 32-way multithreaded Sparc processor. IEEE micro 25, 2 (Mar.), 21--29.
[26]
Kung, H. and Robinson, J. 1981. On optimistic methods for concurrency control. ACM Transactions on Database Systems 6, 2 (June), 213--226.
[27]
Lee, J. and Smith, A. 1984. Branch prediction strategies and branch target buffer design. IEEE Computer Magazine 17, 1 (Jan.), 6--22.
[28]
Lipasti, M. and Shen, J. 1996. Exceeding the dataflow limit via value prediction. In Proc. of the 29th Annual Int'l Symp. on Microarchitecture. 226--237.
[29]
Manne, S., Klauser, A., and Grunwald, D. 1998. Pipeline gating: Speculation control for energy reduction. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture. 132--141.
[30]
Morad, T., Weiser, U., Kolodny, A., Valero, M., and Ayguad, E. 2005. Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. IEEE Computer Architecture Letters 4.
[31]
Moshovos, A. 2003. Checkpointing alternatives for high performance, power-aware processors. In Proc. of the 2003 Int'l Symp. on Low Power Electronics and Design. 318--321.
[32]
Moshovos, A. and Sohi, G. 1999. Read-after-read memory dependence prediction. In Proc. of the 32st Annual Int'l Symp. on Microarchitecture. 177--185.
[33]
Moudgill, M., Pingali, K., and Vassiliadis, S. 1993. Register renaming and dynamic speculation: an alternative approach. In Proc. of the 26th Annual Int'l Symp. on Microarchitecture. 202--213.
[34]
Mutlu, O., Kim, H., and Patt, Y. 2005. Address-value delta (AVD) prediction: Increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns. In Proc. of the 38th Annual Int'l Symp. on Microarchitecture. 233--244.
[35]
Mutlu, O., Kim, H., Stark, J., and Patt, Y. 2005. On reusing the results of pre-executed instructions in a runahead execution processor. IEEE Computer Architecture Letters 4.
[36]
Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. 2003. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proc. of the 9th IEEE Int'l Symp. on High-Performance Computer Architecture.
[37]
Palacharla, S., Jouppi, N., and Smith, J. 1997. Complexity-effective superscalar processors. In Proc. of the 24th Annual Int'l Symp. on Computer Architecture. 206--218.
[38]
Rajwar, R. and Goodman, J. 2001. Speculative lock elision: Enabling highly concurrent multithreaded execution. In Proc. of the 34th Annual Int'l Symp. on Microarchitecture.
[39]
Rattner, J. 2005. Multi-core to the masses. In Proc. of the 14th Int'l Conf. on Parallel Architectures and Compilation Techniques. 3.
[40]
Sarangi, S., W. Liu, J. T., and Zhou, Y. 2005. Reslice: Selective re-execution of long-retired misspeculated instructions using forward slicing. In Proc. of the 38th Annual Int'l Symp. on Microarchitecture. 257--270.
[41]
Seznec, A. and Michaud, P. 2006. A case for (partially) TAgged GEometric history length branch prediction. Journal of Instruction-Level Parallelism 8.
[42]
Shen, J. and Lipasti, M. 2005. Modern Processor Design, Fundamentals of Superscalar Processors. Mcgraw-Hill.
[43]
Smith, J. and Pleszkun, A. 1988. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers 37, 5 (May), 562--573.
[44]
Spracklen, L. and Abraham, S. 2005. Chip multithreading: Opportunities and challenges. In Proc. of the 11th IEEE Int'l Symp. on High-Performance Computer Architecture. 248--252.
[45]
Srinivasan, S., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M. 2004. Continual flow pipelines. In Proc. of the 11th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems. 107--119.
[46]
Tarjan, D., Thoziyoor, S., and Jouppi, N. 2006. Cacti 4.0. Tech. Rep. HPL-2006-86, HP Laboratories Palo Alto. June.
[47]
Yeager, K. 1996. The MIPS R10000 superscalar microprocessor. IEEE micro 16, 2 (Apr.), 28--40.
[48]
Zhang, Y., Rauchwerger, L., and Torrellas, J. 1999. Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors. In Proc. of the 5th IEEE Int'l Symp. on High-Performance Computer Architecture.
[49]
Zhou, P., Onder, S., and Carr, S. 2005. Fast branch misprediction recovery in out-of-order superscalar processors. In Proc. of the 19th Annual Int'l Conf. on Supercomputing. 41--50.

Cited By

View all
  • (2018)A two-phase recovery mechanismProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205300(107-117)Online publication date: 12-Jun-2018
  • (2011)A New Recovery Mechanism in Superscalar Microprocessors by Recovering Critical MispredictionIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1587/transfun.E94.A.2639E94-A:12(2639-2648)Online publication date: 2011
  • (2009)Checkpoint allocation and releaseACM Transactions on Architecture and Code Optimization10.1145/1582710.15827126:3(1-27)Online publication date: 2-Oct-2009
  • Show More Cited By

Index Terms

  1. Hiding the misprediction penalty of a resource-efficient high-performance processor

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 4, Issue 4
    January 2008
    187 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/1328195
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 January 2008
    Accepted: 01 May 2007
    Revised: 01 December 2006
    Received: 01 August 2006
    Published in TACO Volume 4, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Checkpoints
    2. misprediction
    3. out-of-order execution
    4. rollback
    5. scalable architecture

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)74
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 18 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2018)A two-phase recovery mechanismProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205300(107-117)Online publication date: 12-Jun-2018
    • (2011)A New Recovery Mechanism in Superscalar Microprocessors by Recovering Critical MispredictionIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1587/transfun.E94.A.2639E94-A:12(2639-2648)Online publication date: 2011
    • (2009)Checkpoint allocation and releaseACM Transactions on Architecture and Code Optimization10.1145/1582710.15827126:3(1-27)Online publication date: 2-Oct-2009
    • (2009)CPROBProceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2009.42(159-168)Online publication date: 12-Sep-2009
    • (2009)Reexecution and Selective Reuse in Checkpoint ProcessorsTransactions on High-Performance Embedded Architectures and Compilers II10.1007/978-3-642-00904-4_13(242-268)Online publication date: 22-Apr-2009

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media