Article

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Author:

Huiyang Zhou

PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques

Pages 231 - 242

https://doi.org/10.1109/PACT.2005.18

Published: 17 September 2005

Abstract

Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
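
The front/queue/back arrangement described in the abstract can be illustrated with a toy timing model. This is a minimal sketch and not the paper's simulator: the instruction format, the latencies (MISS_LATENCY, HIT_LATENCY), the function names, and the blocking single-core baseline are all assumptions made for illustration. It models only the invalid-value handling of cache-missing loads and the resulting cache warm-up, and it ignores branch-misprediction handling, superscalar issue, and queue capacity.

```python
from collections import deque

INVALID = object()      # stand-in for the "invalid value" of a cache-missing load
MISS_LATENCY = 100      # assumed memory latency in cycles (hypothetical)
HIT_LATENCY = 1         # assumed cost of any other instruction (hypothetical)


def run_front(program, memory):
    """Front core: never blocks on a miss and retires every instruction into a queue.

    Returns the queue of (retire_cycle, instruction) pairs and a dict mapping
    each touched address to the cycle at which its cache line becomes ready.
    """
    regs, line_ready, queue = {}, {}, deque()
    cycle = 0
    for instr in program:
        op, dst, src = instr
        cycle += HIT_LATENCY
        if op == "load":
            if src in line_ready and line_ready[src] <= cycle:
                regs[dst] = memory[src]          # hit: the real value is available
            else:
                # Miss: start the cache fill, but do not wait for the data.
                line_ready.setdefault(src, cycle + MISS_LATENCY)
                regs[dst] = INVALID
        else:  # "add": dst = regs[a] + regs[b]; invalid inputs give an invalid result
            a, b = (regs[r] for r in src)
            regs[dst] = INVALID if (a is INVALID or b is INVALID) else a + b
        queue.append((cycle, instr))
    return queue, line_ready


def run_back(queue, line_ready, memory):
    """Back core: re-executes the queued stream and produces the real results."""
    regs, cycle = {}, 0
    while queue:
        retire_cycle, (op, dst, src) = queue.popleft()
        # An instruction cannot be consumed before the front core retired it.
        cycle = max(cycle, retire_cycle) + HIT_LATENCY
        if op == "load":
            cycle = max(cycle, line_ready[src])  # wait only if the fill is still in flight
            regs[dst] = memory[src]
        else:
            a, b = src
            regs[dst] = regs[a] + regs[b]
    return regs, cycle


def run_blocking(program, memory):
    """Assumed baseline: one core that stalls for the full latency on every miss."""
    regs, cached, cycle = {}, set(), 0
    for op, dst, src in program:
        cycle += HIT_LATENCY
        if op == "load":
            if src not in cached:
                cycle += MISS_LATENCY
                cached.add(src)
            regs[dst] = memory[src]
        else:
            a, b = src
            regs[dst] = regs[a] + regs[b]
    return regs, cycle


# Four independent cache-missing loads followed by dependent adds.
memory = {0: 10, 64: 20, 128: 30, 192: 40}
program = [
    ("load", "r1", 0),
    ("load", "r2", 64),
    ("load", "r3", 128),
    ("load", "r4", 192),
    ("add", "r5", ("r1", "r2")),
    ("add", "r6", ("r3", "r4")),
    ("add", "r7", ("r5", "r6")),
]

baseline_regs, baseline_cycles = run_blocking(program, memory)
queue, line_ready = run_front(program, memory)
dual_regs, dual_cycles = run_back(queue, line_ready, memory)
print("blocking core:", baseline_cycles, "cycles")
print("front + back :", dual_cycles, "cycles")
```

In this toy, the front core issues all four misses back to back, so their latencies overlap and the back core waits for roughly one miss instead of four, while still computing the same register values as the baseline. The absolute cycle counts say nothing about the paper's measured results.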

Index Terms

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Hardware validation

Information & Contributors

Information

Published In

PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques

September 2005

350 pages

ISBN: 076952429X

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 29
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 26 Dec 2024

Citations

Cited By

Ham T, Aragón J, Martonosi M (2019). Efficient Data Supply for Parallel Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization 16:2 (1-23). DOI: 10.1145/3310332. Online publication date: 26-Apr-2019.
https://dl.acm.org/doi/10.1145/3310332
Kondguli S, Huang M, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). Bootstrapping. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (687-700). DOI: 10.1145/3297858.3304052. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304052
Kondguli S, Huang M (2018). A Case for a More Effective, Power-Efficient Turbo Boosting. ACM Transactions on Architecture and Code Optimization 15:1 (1-22). DOI: 10.1145/3170433. Online publication date: 22-Mar-2018.
https://dl.acm.org/doi/10.1145/3170433
Rengasamy P, Zhang H, Zhao S, Nachiappan N, Sivasubramaniam A, Kandemir M, Das C, Oskin M, Inoue K (2018). CritICs critiquing criticality in mobile apps. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (867-880). DOI: 10.1109/MICRO.2018.00075. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00075
Jayaraman P, Parthasarathi R (2017). A Survey on Post-Silicon Functional Validation for Multicore Architectures. ACM Computing Surveys 50:4 (1-30). DOI: 10.1145/3107615. Online publication date: 25-Aug-2017.
https://dl.acm.org/doi/10.1145/3107615
Ham T, Aragón J, Martonosi M (2017). Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures. ACM Transactions on Architecture and Code Optimization 14:2 (1-27). DOI: 10.1145/3075620. Online publication date: 28-Jun-2017.
https://dl.acm.org/doi/10.1145/3075620
Hashemi M, Mutlu O, Patt Y, Hsu W, Yang C, Lipasti M, Lee H (2016). Continuous runahead. The 49th Annual IEEE/ACM International Symposium on Microarchitecture (1-12). DOI: 10.5555/3195638.3195712. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195712
Ntafam P, Paire E, Clouard A, Petrot F, Kent K, Yoo S (2016). Simulation driven insertion of data prefetching instructions for early software-on-SoC optimization. Proceedings of the 27th International Symposium on Rapid System Prototyping: Shortening the Path from Specification to Prototype (93-99). DOI: 10.1145/2990299.2990315. Online publication date: 1-Oct-2016.
https://dl.acm.org/doi/10.1145/2990299.2990315
Atta I, Tong X, Srinivasan V, Baldini I, Moshovos A, Prvulovic M (2015). Self-contained, accurate precomputation prefetching. Proceedings of the 48th International Symposium on Microarchitecture (153-165). DOI: 10.1145/2830772.2830816. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830816
Sembrant A, Carlson T, Hagersten E, Black-Shaffer D, Perais A, Seznec A, Michaud P, Prvulovic M (2015). Long term parking (LTP). Proceedings of the 48th International Symposium on Microarchitecture (334-346). DOI: 10.1145/2830772.2830815. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830815