[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

Mostly static program partitioning of binary executables

Published: 03 July 2009 Publication History

Abstract

We have built a runtime compilation system that takes unmodified sequential binaries and improves their performance on off-the-shelf multiprocessors using dynamic vectorization and loop-level parallelization techniques. Our system, Azure, is purely software based and requires no specific hardware support for speculative thread execution, yet it is able to break even in most cases; that is, the achieved speedup exceeds the cost of runtime monitoring and compilation, often by significant amounts.
Key to this remarkable performance is an offline preprocessing step that extracts a mostly correct control flow graph (CFG) from the binary program ahead of time. This statically obtained CFG is incomplete in that it may be missing some edges corresponding to computed branches. We describe how such additional control flow edges are discovered and handled at runtime, so that an incomplete static analysis never leads to an incorrect optimization result.
The availability of a mostly correct CFG enables us to statically partition a binary executable into single-entry multiple-exit regions and to identify potential parallelization candidates ahead of execution. Program regions that are not candidates for parallelization can thereby be excluded completely from runtime monitoring and dynamic recompilation. Azure's extremely low overhead is a direct consequence of this design.

References

[1]
]]Akkary, H. and Driscoll, M. A. 1998. A dynamic multithreading processor. In Proceedings of the 31st Annual International Symposium on Microarchitecture. ACM Press, 226--236.
[2]
]]Bala, V., Duesterwald, E., and Banerjia, S. 2000. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 1--12.
[3]
]]Balakrishnan, G. and Reps, T. 2004. Analyzing memory accesses in x86 executables. In Proceedings of the Conference on Compiler Construction. Lecture Notes in Computer Science, vol. 2985, Springer Verlag, 5--23.
[4]
]]Baraz, L., Devor, T., Etzion, O., Goldenberg, S., Skaletsky, A., Wang, Y., and Zemach, Y. 2003. IA-32 Execution Layer: A two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems. In Proceedings of the 36th International Symposium on Micro-architecture. IEEE, 191--201.
[5]
]]Buck, B. and Hollingsworth, J. K. 2000. An API for runtime code patching. Int. J. High Perform. Comput. Applic. 14, 4, 317--329.
[6]
]]Byrd, G. T. and Holliday, M. A. 1995. Multithreaded processor architecture. IEEE Spectrum 32, 8, 38--46.
[7]
]]Carlisle, M. C., Rogers, A., Reppy, J. H., and Hendren, L. J. 1994. Early experiences with Olden. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag, 1--20.
[8]
]]Chambers, C. 2002. Staged compilation. In Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation. ACM Press, New York, NY, 1--8.
[9]
]]Chernoff, A., Herdeg, M., Hookway, R., Reeve, C., Rubin, N., Tye, T., Yadavalli, S. B., and Yates, J. 1998. FX!32: A profile-directed binary translator. IEEE Micro 18, 2, 56--64.
[10]
]]Cifuentes, C. and Gough, K. J. 1995. Decompilation of binary programs. Softw. Pract. Exper. 25, 7, 811--829.
[11]
]]Cintra, M. and Llanos, D. R. 2003. Toward efficient and robust software speculative parallelization on multiprocessors. In Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press, New York. 13--24.
[12]
]]de Alba, M. and Kaeli, D. 2001. Runtime predictability of loops. In Proceedings of the 4th Annual IEEE International Workshop on Workload Characterization.
[13]
]]Ebcioğlu, K. and Altman, E. R. 1997. DAISY: Dynamic compilation for 100% architectural compatibility. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 26--37.
[14]
]]Ebcioğlu, K., Fritts, J., Kosonocky, S., Gschwind, M., Altman, E., Kailas, K., and Bright, T. 1998. An eight-issue tree VLIW processor for dynamic binary translation. In Proceedings of the International Conference on Computer Design.
[15]
]]Fisher, J. A. 1981. Trace scheduling: A technique for global micro-code compaction. IEEE Trans. Comput. 30, 7, 478--490.
[16]
]]Grant, B., Philipose, M., Mock, M., Chambers, C., and Eggers, S. J. 1999. An evaluation of staged run time optimizations in DyC. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 293--304.
[17]
]]Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., and Olukotun, K. 2000. The Stanford Hydra CMP. IEEE Micro 20, 2, 71--84.
[18]
]]Hwu, W., Mahlke, S., Chen, W., Chang, P., Warter, N., Bringmann, R., Ouellette, R., Hank, R., Kiyohara, T., Haab, G., Holm, J., and Lavery, D. 1993. The superblock: An effective technique for vliw and superscalar compilation. J. Supercomput. 7, 1--2, 229--248.
[19]
]]Kagan, M., Gochman, S., Orenstien, D., and Lin, D. 1997. MMX micro-architecture of Pentium processors with MMX technology and Pentium II micro-processors. Intel Techn. J. 8.
[20]
]]Kistler, T. and Franz, M. 2001. Continuous program optimization: Design and evaluation. IEEE Trans. Comput. 50, 6, 549--566.
[21]
]]Kistler, T. and Franz, M. 2003. Continuous program optimization: A case study. ACM Trans. Program. Lang. Syst. 25, 4, 500--548.
[22]
]]Klaiber, A. 2000. The technology behind Crusoe processors. White Paper, Transmeta Corp.
[23]
]]Krishnan, V. and Torrellas, J. 1998. Hardware and software support for speculative execution of sequential binaries on a chip-multiprocessor. In Proceedings of the International Conference on Supercomputing. 85--92.
[24]
]]Krishnan, V. and Torrellas, J. 1999. A chip-multiprocessor architecture with speculative multithreading. IEEE Trans. Comput. 48, 9, 866--880.
[25]
]]Krishnan, V. S. 1998. Speculative multithreading architectures. Tech. rep. UIUCDCS-R-98-2048, UIUC.
[26]
]]Larus, J. R. and Ball, T. 1994. Rewriting executable files to measure program behavior. Softw. Pract. Exper. 24, 2, 197--218.
[27]
]]Leung, A. and George, L. 1999. Static single assignment form for machine code. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 204--214.
[28]
]]Lo, J. 1998. Exploiting thread-level parallelism on simultaneous multithreaded processors. Ph.D. thesis, University of Washington.
[29]
]]Nguyen, H. and John, L. K. 1999. Exploiting SIMD parallelism in DSP and multimedia algorithms using the altivec technology. In Proceedings of the International Conference on Supercomputing. 11--20.
[30]
]]Oberman, S., Favor, G., and Weber, F. 1999. AMD 3DNow! technology: Architecture and implementations. IEEE Micro 19, 2 (Mar./Apr.), 37--48.
[31]
]]Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., and Chang, K. 1996. The case for a single-chip multiprocessor. In Proceedings of the Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). 2--11.
[32]
]]Quinones, C. G., Madriles, C., Sanchez, J., Marcuello, P., Gonzalez, A., and Tullsen, D. M. 2005. Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York. 269--279.
[33]
]]Rauchwerger, L. and Padua, D. A. 1999. The LRPD test: Speculative run time parallelization of loops with privatization and reduction parallelization. IEEE Trans. Parall. Distrib. Syst. 10, 2, 160--180.
[34]
]]Rotenberg, E., Jacobson, Q., Sazeides, Y., and Smith, J. 1997. Trace processors. In Proceedings of the International Symposium on Microarchitecture. 138--148.
[35]
]]Skoglund, J. and Felsberg, M. 2005. Fast image processing using SSE2. In Proceedings of the SSBA Symposium on Image Analysis.
[36]
]]Sohi, G. S., Breach, S. E., and Vijaykumar, T. N. 1998. Multi-scalar processors. In 25 Years of ISCA: Retrospectives and Reprints. 521--532.
[37]
]]Srivastava, A. and Eustace, A. 2004. Atom: A system for building customized program analysis tools. SIGPLAN Not. 39, 4, 528--539.
[38]
]]Thakkar, S. T. and Huff, T. 1999. The Internet Streaming SIMD Extensions. Intel Tech. J., 8.
[39]
]]Tsai, J.-Y., Huang, J., Amlo, C., Lilja, D. J., and Yew, P.-C. 1999. The superthreaded processor architecture. IEEE Trans. Comput. 48, 9, 881--902.
[40]
]]Tsai, J.-Y. and Yew, P.-C. 1996. The superthreaded architecture: Thread pipelining with run time data dependence checking and control speculation. In Proceedings of Parallel Architectures and Compilation Techniques. 35--46.
[41]
]]Tullsen, D. M., Eggers, S., and Levy, H. M. 1995. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22th Annual International Symposium on Computer Architecture. 392--403.
[42]
]]Voss, M. J. and Eigenmann, R. 2000. Adapt: Automated de-coupled adaptive program transformation. In Proceedings of the International Conference on Parallel Processing. IEEE Computer Society, Washington, 163.
[43]
]]Voss, M. J. and Eigenmann, R. 2001. High-level adaptive program optimization with ADAPT. ACM SIGPLAN Not. 36, 7, 93--102.
[44]
]]Yardimci, E. 2006. Exploiting parallelism to improve the performance of sequential binary executables. Ph.D. thesis, University of California, Irvine.
[45]
]]Yardimci, E. and Franz, M. 2006. Dynamic parallelization and mapping of binary executables on hierarchical platforms. In Proceedings of the 3rd Conference on Computing Frontiers. ACM Press, New York. 127--138.
[46]
]]Zilles, C. and Sohi, G. 2001. A programmable co-processor for profiling. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture.

Cited By

View all
  • (2023)DASS: Dynamic Adaptive Sub-Target Specialization2023 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)10.1109/SBAC-PADW60351.2023.00016(36-45)Online publication date: 17-Oct-2023
  • (2020)BinRecProceedings of the Fifteenth European Conference on Computer Systems10.1145/3342195.3387550(1-16)Online publication date: 15-Apr-2020
  • (2016)Secure Identification of Actively Executed Code on a Generic Trusted Component2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)10.1109/DSN.2016.45(419-430)Online publication date: Jun-2016
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Programming Languages and Systems
ACM Transactions on Programming Languages and Systems  Volume 31, Issue 5
June 2009
152 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/1538917
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 July 2009
Accepted: 01 November 2008
Revised: 01 December 2007
Received: 01 August 2006
Published in TOPLAS Volume 31, Issue 5

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Continuous compilation and optimization
  2. binary translation
  3. dynamic parallelization

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)64
  • Downloads (Last 6 weeks)6
Reflects downloads up to 31 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)DASS: Dynamic Adaptive Sub-Target Specialization2023 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)10.1109/SBAC-PADW60351.2023.00016(36-45)Online publication date: 17-Oct-2023
  • (2020)BinRecProceedings of the Fifteenth European Conference on Computer Systems10.1145/3342195.3387550(1-16)Online publication date: 15-Apr-2020
  • (2016)Secure Identification of Actively Executed Code on a Generic Trusted Component2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)10.1109/DSN.2016.45(419-430)Online publication date: Jun-2016
  • (2013)JIT technology with C/C++ACM Transactions on Architecture and Code Optimization10.1145/2541228.255531510:4(1-25)Online publication date: 1-Dec-2013

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media