research-article

Branch vanguard: decomposing branch functionality into prediction and resolution instructions

Published: 13 June 2015

Abstract

While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. Because architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver.
We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a geometric-mean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. As floating-point benchmarks contain fewer unbiased but predictable branches, our geometric-mean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.
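
To make the idea of separating a branch's prediction from its resolution more concrete, the following is a minimal functional sketch in Python. It is not the paper's instruction set or transformation; the abstract does not give those details. The names `TwoBitCounter`, `reference`, and `decomposed`, the even/odd branch condition, and the two path computations are all hypothetical, chosen only for illustration. The sketch assumes a "prediction" step that picks a path before the condition is known, speculative work on that path, and a later "resolution" step that checks the real condition and runs correction code on a mismatch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TwoBitCounter:
    """Toy 2-bit saturating counter standing in for a real branch predictor."""
    state: int = 2  # 0-1 predict not-taken, 2-3 predict taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


def reference(x: int) -> int:
    """Conventional form: the branch is evaluated before either path runs."""
    return x * 2 if x % 2 == 0 else x + 1


def decomposed(x: int, predictor: TwoBitCounter) -> Tuple[int, bool]:
    """Hypothetical decomposed form; returns (result, mispredicted)."""
    # Prediction step: choose a path without knowing the condition.
    guess = predictor.predict()

    # Speculative work on the predicted path, placed ahead of the resolution.
    result = x * 2 if guess else x + 1

    # Resolution step: evaluate the real condition and compare.
    actual = (x % 2 == 0)
    mispredicted = guess != actual
    if mispredicted:
        # Correction code: discard the speculative value, redo the right path.
        result = x * 2 if actual else x + 1

    predictor.update(actual)
    return result, mispredicted


if __name__ == "__main__":
    predictor = TwoBitCounter()
    inputs = [2, 4, 6, 8, 7, 10, 12]
    misses = sum(decomposed(x, predictor)[1] for x in inputs)
    print(f"{misses} misprediction(s) over {len(inputs)} inputs")
```

In this toy model the result always matches `reference`, whatever the predictor guesses; only the amount of correction work changes. That mirrors, in a very simplified way, why such a scheme would pay off for branches that are predictable even when they are not biased toward one direction.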

Published In

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN: 0163-5964
DOI: 10.1145/2872887
  • ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
    June 2015
    768 pages
    ISBN: 9781450334020
    DOI: 10.1145/2749469
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 24
  • Downloads (last 6 weeks): 5
Reflects downloads up to 30 Dec 2024

Cited By

  • (2016) PowerChop. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 140-152. DOI: 10.1109/ISCA.2016.22. Online publication date: 18-Jun-2016
  • (2021) NOREBA: a compiler-informed non-speculative out-of-order commit processor. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 182-193. DOI: 10.1145/3445814.3446726. Online publication date: 19-Apr-2021
  • (2021) An Elastic Task Scheduling Scheme on Coarse-Grained Reconfigurable Architectures. IEEE Transactions on Parallel and Distributed Systems 32:12, pp. 3066-3080. DOI: 10.1109/TPDS.2021.3084804. Online publication date: 1-Dec-2021
  • (2018) Architectural support for probabilistic branches. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 108-120. DOI: 10.1109/MICRO.2018.00018. Online publication date: 20-Oct-2018
  • (2016) Decoupling loads for nano-instruction set computers. ACM SIGARCH Computer Architecture News 44:3, pp. 406-417. DOI: 10.1145/3007787.3001181. Online publication date: 18-Jun-2016
  • (2016) PowerChop. ACM SIGARCH Computer Architecture News 44:3, pp. 140-152. DOI: 10.1145/3007787.3001152. Online publication date: 18-Jun-2016
  • (2016) Decoupling loads for nano-instruction set computers. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 406-417. DOI: 10.1109/ISCA.2016.43. Online publication date: 18-Jun-2016
  • (2015) A Graph-Based Program Representation for Analyzing Hardware Specialization Approaches. IEEE Computer Architecture Letters 14:2, pp. 94-98. DOI: 10.1109/LCA.2015.2476801. Online publication date: 1-Jul-2015
