[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

Layout-oblivious compiler optimization for matrix computations

Published: 20 January 2013 Publication History

Abstract

Most scientific computations serve to apply mathematical operations to a set of preconceived data structures, e.g., matrices, vectors, and grids. In this article, we use a number of widely used matrix computations from the LINPACK library to demonstrate that complex internal organizations of data structures can severely degrade the effectiveness of compiler optimizations. We then present a data-layout-oblivious optimization methodology, where by isolating an abstract representation of the computations from complex implementation details of their data, we enable these computations to be much more accurately analyzed and optimized through varying state-of-the-art compiler technologies. We evaluated our approach on an Intel 8-core platform using two source-to-source compiler infrastructures, Pluto and EPOD. Our results show that while the efficiency of a computational kernel differs when using different data layouts, the alternative implementations typically benefit from a common set of optimizations on the operations. Therefore separately optimizing the operations and the data layout of a computation could dramatically enhance the effectiveness of compiler optimizations compared with the conventional approaches of using a unified representation.

References

[1]
Agarwal, R. C., Gustavson, F. G., Joshi, M. V., and Zubair, M. 1995. A scalable parallel block algorithm for band cholesky factorization. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing (PPSC'95). 430--435.
[2]
Allen, R. and Kennedy, K. 2001. Optimizing Compilers for Modern Architectures. Morgan Kaufmann.
[3]
Anderson, B. S., Wasniewski, J., and Gustavson, F. G. 2001. A recursive formulation of cholesky factorization of a matrix in packed storage. IEEE Trans. Math. Softw. 27, 214--244.
[4]
Bilmes, J., Asanovic, K., Chin, C., and Demmel, J. 1997. Optimizing matrix multiply using phipac: A portable, high-performance, ansi c coding methodology. In Proceedings of the International Conference on Supercomputing (ICS'97).
[5]
Bondhugula, U., Hartono, A., Ramanujan, J., and Sadayappan, P. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08).
[6]
Cheatham, T. E., Holloway, J. G. H., and Townley, H. A. 1979. Symbolic evaluation and the analysis of programs. IEEE Trans. Softw. Engin. 5, 4.
[7]
Chen, C., Chame, J., and Hall, M. W. 2005. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In Proceedings of the Annual IEEE/ACM Symposium on Code Generation and Optimization (CGO'05).
[8]
Clarke, L. A. 1976. A program testing system. In Proceedings of the Annual Conference.
[9]
Cui, H., Wang, L., Xue, J., Yang, Y., and Feng, X. 2011a. Automatic library generation for blas3 on gpus. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'11).
[10]
Cui, H., Xue, J., Wang, L., Yang, Y., Feng, X., and Fan, D. 2011b. Extendable pattern-oriented optimization directives. In Proceedings of the Annual IEEE/ACM Symposium on Code Generation and Optimization (CGO'11).
[11]
Ding, X., Wang, K., and Zhang, X. 2011. Ulcc: A user-level facility for optimizing shared cache performance on multicores. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'11).
[12]
Dongarra, J., Croz, J. D., Duff, I., and Hammarling, S. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1--17.
[13]
Elmroth, E., Gustavson, F. G., Jonsson, I., and Kagstrom, B. 2004. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 3--45.
[14]
Grant, B., Mock, M., Philipose, M., Chambers, C., and Eggers, S. J. 2000. DyC: An expressive annotation-directed dynamic compiler for c. Theor. Comput. Sci. 248, 1--2, 147--199.
[15]
Guo, J., Stiles, M., Yi, Q., and Psarris, K. 2011. Enhancing the role of inlining in effective interprocedural parallelization. In Proceedings of the International Conference on Parallel Processing (ICPP'11).
[16]
Gustavson, F. G., Wasniewski, J., and Dongarra, J. J. 2008. Rectangular full packed format for choleskys algorithm: Factorization, solution and inversion. Tech. rep., LAPACK working note 199.
[17]
Guyer, S. Z. and Lin, C. 2005. Broadway: A compiler for exploiting the domain-specific semantics of software libraries. Proc. IEEE 93, 2, 342--357.
[18]
Hu, Z., Cuvillo, J., Zhu, W., and Gao, G. R. 2006. Optimization of dense matrix multiplication on ibm Cyclops-64: Challenges and experiences. In Proceedings of the International European Conference on Parallel Processing (Euro-Par'06).
[19]
Mullhaupt, A. and Riedel, K. 2001. Banded matrix fraction representation of triangular input normal pairs. IEEE Trans. Autom. Control 46, 12.
[20]
Pasareanu, C. S. and Visser, W. 2009. A survey of new trends in symbolic execution for software testing and analysis. Int. J. Softw. Tools Technol. Transfer 11, 4.
[21]
Pingali, K., Nguyen, D., Kulkarni, M., Burtscher, M., Hassaan, M. A., Kaleem, R., Lee, T., Lenharth, A., Manevich, R., M'endez-Lojo, M., Prountzos, D., and Sui, X. 2011. The tao of parallelism in algorithms. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'11).
[22]
Pottenger, W. M. 1995. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. Tech. rep.
[23]
Sitaraman, M., Weide, B. W., Long, T. J., and Odgen, W. F. 2000. A data abstraction alternative to data structure/algorithm modularization. In International Seminar on Generic Programming.
[24]
Volkov, V. and Demmel, J. 2008. Benchmarking gpus to tune dense linear algebra. In Proceedings of the Conference on Supercomputing (SC'08).
[25]
Whaley, R. C., Petitet, A., and Dongarra, J. 2001. Automated empirical optimizations of software and the atlas project. Parallel Comput. 27, 3--35.
[26]
Wolfe, M. J. 1989. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA.
[27]
Wu, P., Moreira, J. E., Midkiff, S. P., Gupta, M., and Padua, D. A. 1999. Semantic inlining: The compiler support for java in technical computing. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing (PPSC'99).
[28]
Xue, J. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers, Boston, MA.
[29]
Xue, J., Huang, Q., and Guo, M. 2005. Enabling loop fusion and tiling for cache performance by fixing fusion-preventing data dependences. In Proceedings of the International Conference on Parallel Processing (ICPP'05). 107--115.
[30]
Yi, Q. 2011. Automated programmable control and parameterization of compiler optimizations. In Proceedings of the Annual IEEE/ACM Symposium on Code Generation and Optimization (CGO'11).
[31]
Yi, Q. 2012. Poet: A scripting language for applying parameterized source-to-source program transformations. Softw. Pract. Exper. 675--706.
[32]
Yotov, K., Li, X., Ren, G., Cibulskis, M., DeJong, G., Garzaran, M., Padua, D. A., Pingali, K., Stodghill, P., and Wu, P. 2003. A comparison of empirical and model-driven optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'03).

Cited By

View all
  • (2022)Symbolic Analysis for Data Plane Programs SpecializationACM Transactions on Architecture and Code Optimization10.1145/355772720:1(1-21)Online publication date: 17-Nov-2022
  • (2019)PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307350(2-16)Online publication date: 16-Feb-2019
  • (2019)Automating the Exchangeability of Shared Data AbstractionsLanguages and Compilers for Parallel Computing10.1007/978-3-030-34627-0_14(185-192)Online publication date: 13-Nov-2019
  • Show More Cited By

Index Terms

  1. Layout-oblivious compiler optimization for matrix computations

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 9, Issue 4
    Special Issue on High-Performance Embedded Architectures and Compilers
    January 2013
    876 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2400682
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 January 2013
    Accepted: 01 October 2012
    Revised: 01 September 2012
    Received: 01 June 2012
    Published in TACO Volume 9, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Compiler
    2. layout
    3. matrix computation
    4. optimization
    5. pattern

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)67
    • Downloads (Last 6 weeks)13
    Reflects downloads up to 25 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Symbolic Analysis for Data Plane Programs SpecializationACM Transactions on Architecture and Code Optimization10.1145/355772720:1(1-21)Online publication date: 17-Nov-2022
    • (2019)PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307350(2-16)Online publication date: 16-Feb-2019
    • (2019)Automating the Exchangeability of Shared Data AbstractionsLanguages and Compilers for Parallel Computing10.1007/978-3-030-34627-0_14(185-192)Online publication date: 13-Nov-2019
    • (2017)Automatic generation of fast BLAS3-GEMM: a portable compiler approachProceedings of the 2017 International Symposium on Code Generation and Optimization10.5555/3049832.3049846(122-133)Online publication date: 4-Feb-2017
    • (2017)Optimization of Triangular and Banded Matrix Operations Using 2d-Packed LayoutsACM Transactions on Architecture and Code Optimization10.1145/316201614:4(1-19)Online publication date: 18-Dec-2017
    • (2017)Transpiler-based architecture for multi-platform web applications2017 IEEE Second Ecuador Technical Chapters Meeting (ETCM)10.1109/ETCM.2017.8247456(1-6)Online publication date: Oct-2017
    • (2017)Automatic generation of fast BLAS3-GEMM: A portable compiler approach2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2017.7863734(122-133)Online publication date: Feb-2017
    • (2014)Dynamic I/O-Aware Scheduling for Batch-Mode Applications on Chip Multiprocessor Systems of Cluster PlatformsJournal of Computer Science and Technology10.1007/s11390-013-1409-229:1(21-37)Online publication date: 10-Jan-2014
    • (2013)AUGEMProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.1145/2503210.2503219(1-12)Online publication date: 17-Nov-2013

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media