More Web Proxy on the site http://driver.im/

research-article

Open access

Layout-oblivious compiler optimization for matrix computations

Authors:

Xiaobing FengAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 9, Issue 4

Article No.: 35, Pages 1 - 20

https://doi.org/10.1145/2400682.2400694

Published: 20 January 2013 Publication History

Abstract

Most scientific computations serve to apply mathematical operations to a set of preconceived data structures, e.g., matrices, vectors, and grids. In this article, we use a number of widely used matrix computations from the LINPACK library to demonstrate that complex internal organizations of data structures can severely degrade the effectiveness of compiler optimizations. We then present a data-layout-oblivious optimization methodology, where by isolating an abstract representation of the computations from complex implementation details of their data, we enable these computations to be much more accurately analyzed and optimized through varying state-of-the-art compiler technologies. We evaluated our approach on an Intel 8-core platform using two source-to-source compiler infrastructures, Pluto and EPOD. Our results show that while the efficiency of a computational kernel differs when using different data layouts, the alternative implementations typically benefit from a common set of optimizations on the operations. Therefore separately optimizing the operations and the data layout of a computation could dramatically enhance the effectiveness of compiler optimizations compared with the conventional approaches of using a unified representation.

References

[1]

Agarwal, R. C., Gustavson, F. G., Joshi, M. V., and Zubair, M. 1995. A scalable parallel block algorithm for band cholesky factorization. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing (PPSC'95). 430--435.

[2]

Allen, R. and Kennedy, K. 2001. Optimizing Compilers for Modern Architectures. Morgan Kaufmann.

Digital Library

[3]

Anderson, B. S., Wasniewski, J., and Gustavson, F. G. 2001. A recursive formulation of cholesky factorization of a matrix in packed storage. IEEE Trans. Math. Softw. 27, 214--244.

Digital Library

[4]

Bilmes, J., Asanovic, K., Chin, C., and Demmel, J. 1997. Optimizing matrix multiply using phipac: A portable, high-performance, ansi c coding methodology. In Proceedings of the International Conference on Supercomputing (ICS'97).

Digital Library

[5]

Bondhugula, U., Hartono, A., Ramanujan, J., and Sadayappan, P. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08).

Digital Library

[6]

Cheatham, T. E., Holloway, J. G. H., and Townley, H. A. 1979. Symbolic evaluation and the analysis of programs. IEEE Trans. Softw. Engin. 5, 4.

Digital Library

[7]

Chen, C., Chame, J., and Hall, M. W. 2005. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In Proceedings of the Annual IEEE/ACM Symposium on Code Generation and Optimization (CGO'05).

Digital Library

[8]

Clarke, L. A. 1976. A program testing system. In Proceedings of the Annual Conference.

Digital Library

[9]

Cui, H., Wang, L., Xue, J., Yang, Y., and Feng, X. 2011a. Automatic library generation for blas3 on gpus. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'11).

Digital Library

[10]

Cui, H., Xue, J., Wang, L., Yang, Y., Feng, X., and Fan, D. 2011b. Extendable pattern-oriented optimization directives. In Proceedings of the Annual IEEE/ACM Symposium on Code Generation and Optimization (CGO'11).

Digital Library

[11]

Ding, X., Wang, K., and Zhang, X. 2011. Ulcc: A user-level facility for optimizing shared cache performance on multicores. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'11).

Digital Library

[12]

Dongarra, J., Croz, J. D., Duff, I., and Hammarling, S. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1--17.

Digital Library

[13]

Elmroth, E., Gustavson, F. G., Jonsson, I., and Kagstrom, B. 2004. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 3--45.

[14]

Grant, B., Mock, M., Philipose, M., Chambers, C., and Eggers, S. J. 2000. DyC: An expressive annotation-directed dynamic compiler for c. Theor. Comput. Sci. 248, 1--2, 147--199.

Digital Library

[15]

Guo, J., Stiles, M., Yi, Q., and Psarris, K. 2011. Enhancing the role of inlining in effective interprocedural parallelization. In Proceedings of the International Conference on Parallel Processing (ICPP'11).

Digital Library

[16]

Gustavson, F. G., Wasniewski, J., and Dongarra, J. J. 2008. Rectangular full packed format for choleskys algorithm: Factorization, solution and inversion. Tech. rep., LAPACK working note 199.

[17]

Guyer, S. Z. and Lin, C. 2005. Broadway: A compiler for exploiting the domain-specific semantics of software libraries. Proc. IEEE 93, 2, 342--357.

[18]

Hu, Z., Cuvillo, J., Zhu, W., and Gao, G. R. 2006. Optimization of dense matrix multiplication on ibm Cyclops-64: Challenges and experiences. In Proceedings of the International European Conference on Parallel Processing (Euro-Par'06).

Digital Library

[19]

Mullhaupt, A. and Riedel, K. 2001. Banded matrix fraction representation of triangular input normal pairs. IEEE Trans. Autom. Control 46, 12.

[20]

Pasareanu, C. S. and Visser, W. 2009. A survey of new trends in symbolic execution for software testing and analysis. Int. J. Softw. Tools Technol. Transfer 11, 4.

Digital Library

[21]

Pingali, K., Nguyen, D., Kulkarni, M., Burtscher, M., Hassaan, M. A., Kaleem, R., Lee, T., Lenharth, A., Manevich, R., M'endez-Lojo, M., Prountzos, D., and Sui, X. 2011. The tao of parallelism in algorithms. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'11).

Digital Library

[22]

Pottenger, W. M. 1995. Induction variable substitution and reduction recognition in the Polaris parallelizing compiler. Tech. rep.

[23]

Sitaraman, M., Weide, B. W., Long, T. J., and Odgen, W. F. 2000. A data abstraction alternative to data structure/algorithm modularization. In International Seminar on Generic Programming.

Digital Library

[24]

Volkov, V. and Demmel, J. 2008. Benchmarking gpus to tune dense linear algebra. In Proceedings of the Conference on Supercomputing (SC'08).

Digital Library

[25]

Whaley, R. C., Petitet, A., and Dongarra, J. 2001. Automated empirical optimizations of software and the atlas project. Parallel Comput. 27, 3--35.

Digital Library

[26]

Wolfe, M. J. 1989. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, MA.

Digital Library

[27]

Wu, P., Moreira, J. E., Midkiff, S. P., Gupta, M., and Padua, D. A. 1999. Semantic inlining: The compiler support for java in technical computing. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing (PPSC'99).

[28]

Xue, J. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers, Boston, MA.

Digital Library

[29]

Xue, J., Huang, Q., and Guo, M. 2005. Enabling loop fusion and tiling for cache performance by fixing fusion-preventing data dependences. In Proceedings of the International Conference on Parallel Processing (ICPP'05). 107--115.

Digital Library

[30]

Yi, Q. 2011. Automated programmable control and parameterization of compiler optimizations. In Proceedings of the Annual IEEE/ACM Symposium on Code Generation and Optimization (CGO'11).

Digital Library

[31]

Yi, Q. 2012. Poet: A scripting language for applying parameterized source-to-source program transformations. Softw. Pract. Exper. 675--706.

Digital Library

[32]

Yotov, K., Li, X., Ren, G., Cibulskis, M., DeJong, G., Garzaran, M., Padua, D. A., Pingali, K., Stodghill, P., and Wu, P. 2003. A comparison of empirical and model-driven optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'03).

Digital Library

Cited By

Luinaud TLanglois JSavaria Y(2022)Symbolic Analysis for Data Plane Programs SpecializationACM Transactions on Architecture and Code Optimization10.1145/355772720:1(1-21)Online publication date: 17-Nov-2022
https://dl.acm.org/doi/10.1145/3557727
Liu YHuang LWu MCui HLv FFeng XXue JAmaral JKulkarni M(2019)PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307350(2-16)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3302516.3307350
Zhang JWang QYi QCui H(2019)Automating the Exchangeability of Shared Data AbstractionsLanguages and Compilers for Parallel Computing10.1007/978-3-030-34627-0_14(185-192)Online publication date: 13-Nov-2019
https://doi.org/10.1007/978-3-030-34627-0_14
Show More Cited By

Index Terms

Layout-oblivious compiler optimization for matrix computations
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

gpucc: an open-source GPGPU compiler
CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization

Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has ...
Layout-oblivious optimization for matrix computations
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Most scientific computations serve to apply mathematical operations to a set of preconceived data structures, e.g., matrices, vectors, and grids. In this paper, we use a number of widely used matrix computations from the LINPACK library to demonstrate ...
The Implementation of a High Performance GPGPU Compiler

In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 9, Issue 4

Special Issue on High-Performance Embedded Architectures and Compilers

January 2013

876 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2400682

Issue’s Table of Contents

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 January 2013

Accepted: 01 October 2012

Revised: 01 September 2012

Received: 01 June 2012

Published in TACO Volume 9, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
571
Total Downloads

Downloads (Last 12 months)67
Downloads (Last 6 weeks)13

Reflects downloads up to 25 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Luinaud TLanglois JSavaria Y(2022)Symbolic Analysis for Data Plane Programs SpecializationACM Transactions on Architecture and Code Optimization10.1145/355772720:1(1-21)Online publication date: 17-Nov-2022
https://dl.acm.org/doi/10.1145/3557727
Liu YHuang LWu MCui HLv FFeng XXue JAmaral JKulkarni M(2019)PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307350(2-16)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3302516.3307350
Zhang JWang QYi QCui H(2019)Automating the Exchangeability of Shared Data AbstractionsLanguages and Compilers for Parallel Computing10.1007/978-3-030-34627-0_14(185-192)Online publication date: 13-Nov-2019
https://doi.org/10.1007/978-3-030-34627-0_14
Su XLiao XXue JReddi VSmith ATang L(2017)Automatic generation of fast BLAS3-GEMM: a portable compiler approachProceedings of the 2017 International Symposium on Code Generation and Optimization10.5555/3049832.3049846(122-133)Online publication date: 4-Feb-2017
https://dl.acm.org/doi/10.5555/3049832.3049846
Baroudi TSeghir RLoechner V(2017)Optimization of Triangular and Banded Matrix Operations Using 2d-Packed LayoutsACM Transactions on Architecture and Code Optimization10.1145/316201614:4(1-19)Online publication date: 18-Dec-2017
https://dl.acm.org/doi/10.1145/3162016
Andres BPerez M(2017)Transpiler-based architecture for multi-platform web applications2017 IEEE Second Ecuador Technical Chapters Meeting (ETCM)10.1109/ETCM.2017.8247456(1-6)Online publication date: Oct-2017
https://doi.org/10.1109/ETCM.2017.8247456
Su XLiao XXue J(2017)Automatic generation of fast BLAS3-GEMM: A portable compiler approach2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2017.7863734(122-133)Online publication date: Feb-2017
https://doi.org/10.1109/CGO.2017.7863734
Lv FCui HWang LLiu LWu CFeng XYew P(2014)Dynamic I/O-Aware Scheduling for Batch-Mode Applications on Chip Multiprocessor Systems of Cluster PlatformsJournal of Computer Science and Technology10.1007/s11390-013-1409-229:1(21-37)Online publication date: 10-Jan-2014
https://doi.org/10.1007/s11390-013-1409-2
Wang QZhang XZhang YYi QGropp WMatsuoka S(2013)AUGEMProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.1145/2503210.2503219(1-12)Online publication date: 17-Nov-2013
https://dl.acm.org/doi/10.1145/2503210.2503219

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents