DOI: 10.1109/PACT.2005.18

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Published: 17 September 2005

Abstract

Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
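The front/back split described in the abstract can be illustrated with a small toy model. The sketch below is not the paper's design or simulator; it is a minimal functional illustration under several assumptions: the instruction format, the names `run_front`, `run_back`, and `INV`, and the choice to model a cache as a set of addresses are all hypothetical, a missing load is assumed to have filled the cache by the time the back processor reaches it, and timing and branch handling are ignored entirely.

```python
from collections import deque

INV = object()  # hypothetical marker for the "invalid value" of a missing load


def run_front(program, memory, cache):
    """Front core: never blocks on a miss; retires every instruction into a queue."""
    regs = {}
    queue = deque()
    for instr in program:
        op, dst, *src = instr
        if op == "load":
            addr = src[0]
            if addr in cache:
                regs[dst] = memory[addr]
            else:
                cache.add(addr)   # the miss warms the cache for the back core
                regs[dst] = INV   # produce an invalid value instead of stalling
        elif op == "add":
            a, b = (regs.get(r, 0) for r in src)
            # Invalid values propagate to dependent instructions.
            regs[dst] = INV if a is INV or b is INV else a + b
        queue.append(instr)       # retire into the queue for the back core
    return queue, regs


def run_back(queue, memory, cache):
    """Back core: consumes the queue and computes the real results."""
    regs = {}
    misses = 0
    while queue:
        op, dst, *src = queue.popleft()
        if op == "load":
            addr = src[0]
            if addr not in cache:
                misses += 1
                cache.add(addr)
            regs[dst] = memory[addr]
        elif op == "add":
            regs[dst] = regs.get(src[0], 0) + regs.get(src[1], 0)
    return regs, misses


memory = {0x10: 5, 0x20: 7}
cache = {0x10}                    # only address 0x10 is cached initially
program = [
    ("load", "r1", 0x10),         # hits in the cache
    ("load", "r2", 0x20),         # misses: the front core writes INV to r2
    ("add", "r3", "r1", "r2"),    # depends on the miss, so it is INV in the front core
    ("add", "r4", "r1", "r1"),    # independent of the miss, computed normally
]

queue, front_regs = run_front(program, memory, cache)
back_regs, back_misses = run_back(queue, memory, cache)
```

In this toy run the front core leaves `r2` and `r3` invalid but still computes `r4`, while the back core sees no cache misses and produces the complete register state. The instructions held in the queue between the two calls stand in for the part of the instruction window that sits between the two processors.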



Information

Published In

PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
September 2005
350 pages
ISBN: 076952429X

Publisher

IEEE Computer Society, United States

Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions, 26%
