[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach

Published: 16 September 2013 Publication History

Abstract

Balancing Instruction-Level Parallelism (ILP) and register pressure during preallocation instruction scheduling is a fundamentally important problem in code generation and optimization. The problem is known to be NP-complete. Many heuristic techniques have been proposed to solve this problem. However, due to the inherently conflicting requirements of maximizing ILP and minimizing register pressure, heuristic techniques may produce poor schedules in many cases. If such cases occur in hot code, significant performance degradation may result. A few combinatorial optimization approaches have also been proposed, but none of them has been shown to solve large real-world instances within reasonable time. This article presents the first combinatorial algorithm that is efficient enough to optimally solve large instances of this problem (basic blocks with hundreds of instructions) within a few seconds per instance. The proposed algorithm uses branch-and-bound enumeration with a number of powerful pruning techniques to efficiently search the solution space. The search is based on a cost function that incorporates schedule length and register pressure. An implementation of the proposed scheduling algorithm has been integrated into the LLVM Compiler and evaluated using SPEC CPU 2006. On x86-64, with a time limit of 10ms per instruction, it optimally schedules 79% of the hot basic blocks in FP2006. Another 19% of the blocks are not optimally scheduled but are improved in cost relative to LLVM's heuristic. This improves the execution time of some benchmarks by up to 21%, with a geometric-mean improvement of 2.4% across the entire benchmark suite. With the use of precise latency information, the geometric-mean improvement is increased to 2.8%.

References

[1]
Barany, G. 2011. Register reuse scheduling. In Proceedings of the 9th Workshop on Optimizations for DSP and Embedded Systems (ODES'11).
[2]
Barany, G. and Krall, A. 2013. Optimal and heuristic global code motion for minimal spilling. In Proceedings of the International Conference on Compiler Construction.
[3]
Berson, D., Gupta, R., and Soffa, M. 1993. URSA: A unified resource allocator for registers and functional units in VLIW architectures. In Proceedings of the IFIP Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism. 243--254.
[4]
Cooper, K. and Torczon, L. 2004. Engineering a Compiler. Morgan Kaufmann, San Fransisco, CA.
[5]
Faraboschi, P., Fisher, J., and Young, C. 2001. Instruction scheduling for instruction level parallel processors. Proc. IEEE 89, 11, 1638--1659.
[6]
Fog, A. 2012. The micro-architecture of INTEL, AMD and VIA cpus. An optimization guide for assembly programmers and compiler makers. http://www.agner.org/optimize/microarchitecture.pdf.
[7]
Goodman, J. and Hsu, W. 1988. Code scheduling and register allocation in large basic blocks. In Proceedings of the International Conference on Supercomputing.
[8]
Govindarajan, R., Yang, H., Amaral, J., Zhang, C., and Gao, G. 2003. Minimum register instruction sequencing to reduce register spills in out-of-order. IEEE Trans. Comput. 52, 1, 4--20.
[9]
Havanki, W., Banerjia, S., and Conte, T. 1998. Treegion scheduling for wide-issue processors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA'98).
[10]
Kessler, C. 1998. Scheduling expression DAGs for minimal register need. J. Comput. Lang. 24, 1, 33--53.
[11]
Langevin, M. and Cerny, E. 1996. A recursive technique for computing lower-bound performance of schedules. ACM Trans. Des. Autom. Electron. Syst. 1, 4, 443--456.
[12]
Malik, A. 2008. Constraint programming techniques for optimal instruction scheduling. Ph.D. thesis, University of Waterloo. https://cs.uwaterloo.ca/∼vanbeek/Publications/malik.pdf.
[13]
Rim, M. and Jain, R. 1994. Lower-bound performance estimation for the high-level synthesis scheduling problem. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 13, 4, 451--458.
[14]
Shobaki, G. and Wilken, K. 2004. Optimal superblock scheduling using enumeration. In Proceedings of the 37th International Symposium on Microarchitecture.
[15]
Shobaki, G. 2006. Optimal global instruction scheduling using enumeration. Ph.D. dissertation, Department of Computer Science, UC Davis. http://www.cs.ucdavis.edu/research/tech-reports/2006/CSE-2006-19.pdf.
[16]
Shobaki, G., Wilken, K., and Heffernan, M. 2009. Optimal trace scheduling using enumeration. ACM Trans. Archit. Code Optim. 5, 4.
[17]
Touati, S. 2005. Register saturation in instruction-level parallelism. Int. J. Parallel Program. 33, 4, 393--449.
[18]
Weicker, R. and Henning, J. 2007. Subroutine profiling results for the CPU2006 benchmarks. ACM SIGARCH Comput. Archit. News 35, 1, 102--111.
[19]
Winkel, S. 2007. Optimal versus heuristic global code scheduling. In Proceedings of the 40th International Symposium on Microarchitecture.

Cited By

View all
  • (2024)Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic ProgrammingACM Transactions on Design Automation of Electronic Systems10.1145/364313529:5(1-20)Online publication date: 25-Jan-2024
  • (2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
  • (2023)A parallel branch-and-bound algorithm with history-based domination and its application to the sequential ordering problemJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.10.007172(131-143)Online publication date: Feb-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 10, Issue 3
September 2013
310 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2509420
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 September 2013
Accepted: 01 March 2013
Revised: 01 February 2013
Received: 01 February 2012
Published in TACO Volume 10, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Compiler optimizations
  2. NP-complete problems
  3. branch-and-bound enumeration
  4. instruction-level parallelism (ILP)
  5. optimal instruction scheduling
  6. performance optimization
  7. register pressure reduction

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)234
  • Downloads (Last 6 weeks)42
Reflects downloads up to 13 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic ProgrammingACM Transactions on Design Automation of Electronic Systems10.1145/364313529:5(1-20)Online publication date: 25-Jan-2024
  • (2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
  • (2023)A parallel branch-and-bound algorithm with history-based domination and its application to the sequential ordering problemJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.10.007172(131-143)Online publication date: Feb-2023
  • (2022)Register-Pressure-Aware Instruction Scheduling Using Ant Colony OptimizationACM Transactions on Architecture and Code Optimization10.1145/350555819:2(1-23)Online publication date: 31-Jan-2022
  • (2022)Graph transformations for register-pressure-aware instruction schedulingProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517771(41-53)Online publication date: 19-Mar-2022
  • (2022)Loner: utilizing the CPU vector datapath to process scalar integer dataProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517767(205-217)Online publication date: 19-Mar-2022
  • (2022)Formally verified superblock schedulingProceedings of the 11th ACM SIGPLAN International Conference on Certified Programs and Proofs10.1145/3497775.3503679(40-54)Online publication date: 17-Jan-2022
  • (2019)Combinatorial Register Allocation and Instruction SchedulingACM Transactions on Programming Languages and Systems10.1145/333237341:3(1-53)Online publication date: 2-Jul-2019
  • (2019)Exploring an Alternative Cost Function for Combinatorial Register-Pressure-Aware Instruction SchedulingACM Transactions on Architecture and Code Optimization10.1145/330148916:1(1-30)Online publication date: 27-Feb-2019
  • (2019)Survey on Combinatorial Register Allocation and Instruction SchedulingACM Computing Surveys10.1145/320092052:3(1-50)Online publication date: 18-Jun-2019
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media