More Web Proxy on the site http://driver.im/

research-article

Open access

Hiding the misprediction penalty of a resource-efficient high-performance processor

Authors:

Shlomo WeissAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 4, Issue 4

Article No.: 6, Pages 1 - 32

https://doi.org/10.1145/1328195.1328201

Published: 30 January 2008 Publication History

Abstract

Misprediction is a major obstacle for increasing speculative out-of-order processors performance. Performance degradation depends on both the number of misprediction events and the recovery time associated with each one of them. In recent years a few checkpoint based microarchitectures have been proposed. In comparison with ROB-based processors, checkpoint processors are scalable and highly resource efficient. Unfortunately, in these proposals the misprediction recovery time is proportional to the instruction queue size.

In this paper we analyze methods to reduce the misprediction recovery time. We propose a new register file management scheme and techniques to selectively flush the instruction queue and the load store queue, and to isolate deeply pipelined execution units. The result is a novel checkpoint processor with Constant misprediction RollBack time (CRB). We further present a streamlined, cost-efficient solution, which saves complexity at the price of slightly lower performance.

References

[1]

Akkary, H., Rajwar, R., and Srinivasan, S. 2003. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proc. of the 36th Annual Int'l Symp. on Microarchitecture. 423--434.

Digital Library

[2]

Akkary, H., Rajwar, R., and Srinivasan, S. 2004. An analysis of a resource efficient checkpoint architecture. ACM Trans. on Architecture and Code Optimization 1, 4 (Dec.), 418--444.

Digital Library

[3]

Aragon, J., Gonzalez, J., Gonzalez, A., and Smith, J. 2002. Dual path instruction processing. In Proc. of the 16th Annual Int'l Conf. on Supercomputing. 220--229.

Digital Library

[4]

Balakrishnan, S., Rajwar, R., Upton, M., and Lai, K. 2005. The impact of performance asymmetry in emerging multicore architectures. In Proc. of the 32nd Annual Int'l Symp. on Computer Architecture. 506--517.

Digital Library

[5]

Burger, D. and Austin, T. 1997. The simplescalar tool set. SIGARCH Computer Architecture News 25, 3 (June), 13--25.

Digital Library

[6]

Canal, R. and Gonzalez, A. 2001. Reducing the complexity of the issue logic. In Proc. of the 15th Annual Int'l Conf. on Supercomputing. 312--320.

Digital Library

[7]

Ceze, L., Strauss, K., Tuck, J., Torrellas, J., and Renau, J. 2006. Cava: Using checkpoint-assisted value prediction to hide l2 misses. ACM Trans. on Architecture and Code Optimization 3, 2 (June), 182--208.

Digital Library

[8]

Chou, Y., Fung, J., and Shen, J. 1999. Reducing branch misprediction penalties via dynamic control independence detection. In Proc. of the 13th Annual Int'l Conf. on Supercomputing. 109--118.

Digital Library

[9]

Chung, J., Chafi, H., Minh, C., McDonald, A., Carlstrom, B., Kozyrakis, C., and Olukotun, K. 2006. The common case transactional behavior of multithreaded programs. In Proc. of the 12th IEEE Int'l Symp. on High-Performance Computer Architecture. 266--277.

[10]

Cristal, A., Martinez, J., and Valero, M. 2003. A case for resource-conscious out-of-order processors. IEEE Computer Architecture Letters 2.

Digital Library

[11]

Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2004. Out-of-order commit processors. In Proc. of the 9th IEEE Int'l Symp. on High-Performance Computer Architecture. 48--59.

Digital Library

[12]

Cristal, A., Santana, O., Valero, M., and Martinez, J. 2004. Toward kilo-instruction processors. ACM Trans. on Architecture and Code Optimization 1, 4 (Dec.), 389--417.

Digital Library

[13]

Davis, J., Laudon, J., and Olukotun, K. 2005. Maximizing CMP throughput with mediocre cores. In Proc. of the 14th Int'l Conf. on Parallel Architectures and Compilation Techniques. 51--62.

Digital Library

[14]

Gandhi, A., Akkary, H., Rajwar, R., Srinivasan, S., and Lai, K. 2005. Scalable load and store processing in latency tolerant processors. In Proc. of the 32nd Annual Int'l Symp. on Computer Architecture. 446--457.

Digital Library

[15]

Gandhi, A., Akkary, H., and Srinivasan, S. 2004. Reducing branch misprediction penalty via selective branch recovery. In Proc. of the 10th IEEE Int'l Symp. on High-Performance Computer Architecture. 254--264.

Digital Library

[16]

Grochowski, E., Ronen, R., Shen, J., and Wang, H. 2004. Best of both latency and throughput. In Proc. of the Int'l Conf. on Computer Design. 236--243.

Digital Library

[17]

Grunwald, D., Klauser, A., Manne, S., and Pleszkun, A. 1998. Confidence estimation for speculation control. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture. 122--131.

Digital Library

[18]

Hammond, L., Wong, V., Chen, M., Carlstrom, B., Davis, J., Hertzberg, B., Prabhu, M., Wijaya, H., Kozyrakis, C., and Olukotun, K. 2004. Transactional memory coherence and consistency. In Proc. of the 31st Annual Int'l Symp. on Computer Architecture. 102--113.

Digital Library

[19]

Heil, T. and Smith, J. 1996. Selective dual path execution. Tech. Rep. ECE, University of Wisconsin-Madison. Nov.

[20]

Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., and Roussel, P. 2001. The microarchitecture of the Pentium 4 processor. Intel Technology Journal 1 (Feb.), 1--12.

[21]

Hwu, W. and Patt, Y. 1987. Checkpoint repair for high-performance out-of-order execution machines. IEEE Transactions on Computers 36, 12 (Dec.), 1496--1514.

Digital Library

[22]

Jacobsen, E., Rotenberg, E., and Smith, J. 1996. Assigning confidence to conditional branch predictions. In Proc. of the 29th Annual Int'l Symp. on Microarchitecture. 142--152.

Digital Library

[23]

Kahle, J. 2005. The Cell processor architecture. In Proc. of the 38th Annual Int'l Symp. on Microarchitecture. 3.

Digital Library

[24]

Kessler, R. 1999. The Alpha 21264 microprocessor. IEEE micro 19, 2 (Apr.), 24--36.

Digital Library

[25]

Kongetira, P., Aingaran, K., and Olukotun, K. 2005. Niagara: A 32-way multithreaded Sparc processor. IEEE micro 25, 2 (Mar.), 21--29.

Digital Library

[26]

Kung, H. and Robinson, J. 1981. On optimistic methods for concurrency control. ACM Transactions on Database Systems 6, 2 (June), 213--226.

Digital Library

[27]

Lee, J. and Smith, A. 1984. Branch prediction strategies and branch target buffer design. IEEE Computer Magazine 17, 1 (Jan.), 6--22.

Digital Library

[28]

Lipasti, M. and Shen, J. 1996. Exceeding the dataflow limit via value prediction. In Proc. of the 29th Annual Int'l Symp. on Microarchitecture. 226--237.

Digital Library

[29]

Manne, S., Klauser, A., and Grunwald, D. 1998. Pipeline gating: Speculation control for energy reduction. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture. 132--141.

Digital Library

[30]

Morad, T., Weiser, U., Kolodny, A., Valero, M., and Ayguad, E. 2005. Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. IEEE Computer Architecture Letters 4.

Digital Library

[31]

Moshovos, A. 2003. Checkpointing alternatives for high performance, power-aware processors. In Proc. of the 2003 Int'l Symp. on Low Power Electronics and Design. 318--321.

Digital Library

[32]

Moshovos, A. and Sohi, G. 1999. Read-after-read memory dependence prediction. In Proc. of the 32st Annual Int'l Symp. on Microarchitecture. 177--185.

Digital Library

[33]

Moudgill, M., Pingali, K., and Vassiliadis, S. 1993. Register renaming and dynamic speculation: an alternative approach. In Proc. of the 26th Annual Int'l Symp. on Microarchitecture. 202--213.

Digital Library

[34]

Mutlu, O., Kim, H., and Patt, Y. 2005. Address-value delta (AVD) prediction: Increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns. In Proc. of the 38th Annual Int'l Symp. on Microarchitecture. 233--244.

Digital Library

[35]

Mutlu, O., Kim, H., Stark, J., and Patt, Y. 2005. On reusing the results of pre-executed instructions in a runahead execution processor. IEEE Computer Architecture Letters 4.

Digital Library

[36]

Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. 2003. Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proc. of the 9th IEEE Int'l Symp. on High-Performance Computer Architecture.

Digital Library

[37]

Palacharla, S., Jouppi, N., and Smith, J. 1997. Complexity-effective superscalar processors. In Proc. of the 24th Annual Int'l Symp. on Computer Architecture. 206--218.

Digital Library

[38]

Rajwar, R. and Goodman, J. 2001. Speculative lock elision: Enabling highly concurrent multithreaded execution. In Proc. of the 34th Annual Int'l Symp. on Microarchitecture.

Digital Library

[39]

Rattner, J. 2005. Multi-core to the masses. In Proc. of the 14th Int'l Conf. on Parallel Architectures and Compilation Techniques. 3.

Digital Library

[40]

Sarangi, S., W. Liu, J. T., and Zhou, Y. 2005. Reslice: Selective re-execution of long-retired misspeculated instructions using forward slicing. In Proc. of the 38th Annual Int'l Symp. on Microarchitecture. 257--270.

Digital Library

[41]

Seznec, A. and Michaud, P. 2006. A case for (partially) TAgged GEometric history length branch prediction. Journal of Instruction-Level Parallelism 8.

[42]

Shen, J. and Lipasti, M. 2005. Modern Processor Design, Fundamentals of Superscalar Processors. Mcgraw-Hill.

[43]

Smith, J. and Pleszkun, A. 1988. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers 37, 5 (May), 562--573.

Digital Library

[44]

Spracklen, L. and Abraham, S. 2005. Chip multithreading: Opportunities and challenges. In Proc. of the 11th IEEE Int'l Symp. on High-Performance Computer Architecture. 248--252.

Digital Library

[45]

Srinivasan, S., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M. 2004. Continual flow pipelines. In Proc. of the 11th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems. 107--119.

Digital Library

[46]

Tarjan, D., Thoziyoor, S., and Jouppi, N. 2006. Cacti 4.0. Tech. Rep. HPL-2006-86, HP Laboratories Palo Alto. June.

[47]

Yeager, K. 1996. The MIPS R10000 superscalar microprocessor. IEEE micro 16, 2 (Apr.), 28--40.

Digital Library

[48]

Zhang, Y., Rauchwerger, L., and Torrellas, J. 1999. Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors. In Proc. of the 5th IEEE Int'l Symp. on High-Performance Computer Architecture.

Digital Library

[49]

Zhou, P., Onder, S., and Carr, S. 2005. Fast branch misprediction recovery in out-of-order superscalar processors. In Proc. of the 19th Annual Int'l Conf. on Supercomputing. 41--50.

Digital Library

Cited By

Jin ZÖnder S(2018)A two-phase recovery mechanismProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205300(107-117)Online publication date: 12-Jun-2018
https://dl.acm.org/doi/10.1145/3205289.3205300
YE JWAN YWATANABE T(2011)A New Recovery Mechanism in Superscalar Microprocessors by Recovering Critical MispredictionIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1587/transfun.E94.A.2639E94-A:12(2639-2648)Online publication date: 2011
https://doi.org/10.1587/transfun.E94.A.2639
Golander AWeiss S(2009)Checkpoint allocation and releaseACM Transactions on Architecture and Code Optimization10.1145/1582710.15827126:3(1-27)Online publication date: 2-Oct-2009
https://dl.acm.org/doi/10.1145/1582710.1582712
Show More Cited By

Index Terms

Hiding the misprediction penalty of a resource-efficient high-performance processor
1. Computer systems organization
  1. Architectures

Recommendations

An analysis of a resource efficient checkpoint architecture

Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing large hardware structures typically required to buffer and process such instruction window sizes significantly ...
Checkpoint allocation and release

Out-of-order speculative processors need a bookkeeping method to recover from incorrect speculation. In recent years, several microarchitectures that employ checkpoints have been proposed, either extending the reorder buffer or entirely replacing it. ...
Microarchitecture of the Godson-2 processor

The Godson project is the first attempt to design high performance general-purpose microprocessors in China. This paper introduces the microarchitecture of the Godson-2 processor which is a 64-bit, 4-issue, out-of-order execution RISC processor that ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 4, Issue 4

January 2008

187 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/1328195

Issue’s Table of Contents

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 January 2008

Accepted: 01 May 2007

Revised: 01 December 2006

Received: 01 August 2006

Published in TACO Volume 4, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
762
Total Downloads

Downloads (Last 12 months)74
Downloads (Last 6 weeks)11

Reflects downloads up to 18 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Jin ZÖnder S(2018)A two-phase recovery mechanismProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205300(107-117)Online publication date: 12-Jun-2018
https://dl.acm.org/doi/10.1145/3205289.3205300
YE JWAN YWATANABE T(2011)A New Recovery Mechanism in Superscalar Microprocessors by Recovering Critical MispredictionIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1587/transfun.E94.A.2639E94-A:12(2639-2648)Online publication date: 2011
https://doi.org/10.1587/transfun.E94.A.2639
Golander AWeiss S(2009)Checkpoint allocation and releaseACM Transactions on Architecture and Code Optimization10.1145/1582710.15827126:3(1-27)Online publication date: 2-Oct-2009
https://dl.acm.org/doi/10.1145/1582710.1582712
Hilton AEswaran NRoth A(2009)CPROBProceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT.2009.42(159-168)Online publication date: 12-Sep-2009
https://dl.acm.org/doi/10.1109/PACT.2009.42
Golander AWeiss S(2009)Reexecution and Selective Reuse in Checkpoint ProcessorsTransactions on High-Performance Embedded Architectures and Compilers II10.1007/978-3-642-00904-4_13(242-268)Online publication date: 22-Apr-2009
https://dl.acm.org/doi/10.1007/978-3-642-00904-4_13

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents