More Web Proxy on the site http://driver.im/

Article

Automatic generation of fast BLAS3-GEMM: a portable compiler approach

Authors:

Jingling XueAuthors Info & Claims

CGO '17: Proceedings of the 2017 International Symposium on Code Generation and Optimization

Pages 122 - 133

Published: 04 February 2017 Publication History

Abstract

GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in assembly code or generated from C code by general-purpose compilers (guided by architecture-specific directives or auto-tuning). Therefore, either performance or portability suffers.

We present a POrtable Compiler Approach, Poca, implemented in LLVM, to automatically generate and optimize this micro-kernel in an architecture-independent manner, without involving domain experts. The key insight is to leverage a wide range of architecture-specific abstractions already available in LLVM, by first generating a vectorized micro-kernel in the architecture-independent LLVM IR and then improving its performance by applying a series of domain-specific yet architecture-independent optimizations. The optimized micro-kernel drops easily in existing GEMM frameworks such as BLIS and OpenBLAS. Validation focuses on optimizing GEMM in double precision on two architectures. On Intel Sandybridge and AArch64 Cortex-A57, Poca's micro-kernels outperform expert-crafted assembly code by 2.35% and 7.54%, respectively, and both BLIS and OpenBLAS achieve competitive or better performance once their micro-kernels are replaced by Poca's.

References

[1]

R. C. Agarwal, F. G. Gustavson, and M. Zubair. Exploiting functional parallelism of POWER2 to design highperformance numerical algorithms. IBM Journal of Research & Development, 38(5):563–576, 1994.

Digital Library

[2]

G. Belter, J. G. Siek, I. Karlin, and E. R. Jessup. Automatic generation of tiled and parallel linear algebra routines. In IWAPT’ 10, pages 1811–1820.

[3]

J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: A portable, highperformance, ansi c coding methodology. In SC ’97, pages 253–260.

Digital Library

[4]

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In PLDI ’08, pages 101–113.

Digital Library

[5]

G. J. Chaitin. Register allocation and spilling via graph coloring. In CC ’82, pages 98–105.

Digital Library

[6]

H. Cui, L. Wang, J. Xue, Y. Yang, and X. Feng. Automatic library generation for BLAS3 on GPUs. In IPDPS ’11, pages 255–265.

Digital Library

[7]

H. Cui, J. Xue, L. Wang, Y. Yang, X. Feng, and D. Fan. Extendable pattern-oriented optimization directives. ACM Trans. Archit. Code Optim., 9(3):14:1–14:37, Oct. 2012.

Digital Library

[8]

H. Cui, Q. Yi, J. Xue, and X. Feng. Layout-oblivious compiler optimization for matrix computations. ACM Trans. Archit. Code Optim., 9(4):35:1–35:20, Jan. 2013.

Digital Library

[9]

A. E. Eichenberger, P. Wu, and K. O’Brien. Vectorization for SIMD architectures with alignment constraints. In PLDI ’04, pages 82–93.

Digital Library

[10]

C. Eisenbeis, S. Lelait, and B. Marmol. The meeting graph: A new model for loop cyclic register allocation. In PACT ’95, pages 264–267.

Digital Library

[11]

L. Gao, Q. H. Nguyen, L. Li, J. Xue, and T. Ngai. Threadsensitive modulo scheduling for multicore processors. In ICPP ’08, pages 132–140.

Digital Library

[12]

K. Goto and R. a. V. D. Geijn. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34(3):1–25, 2008.

Digital Library

[13]

L. J. Hendren, G. R. Gao, E. R. Altman, and C. Mukerji. A register allocation framework based on hierarchical cyclic interval graphs. In CC ’93, pages 176–191.

Digital Library

[14]

B. K˚agström, P. Ling, and C. van Loan. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw., 24(3):268–302, Sept. 1998.

Digital Library

[15]

V. Kelefouras, A. Kritikakou, and C. Goutis. A matrixmatrix multiplication methodology for single/multi-core architectures using SIMD. The Journal of Supercomputing, 68(3):1418–1440, jan 2014.

Digital Library

[16]

M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. In PLDI ’13, pages 127–138.

Digital Library

[17]

M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In PLDI ’88, pages 318–328.

Digital Library

[18]

M. D. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In ASPLOS ’91, pages 63–74.

Digital Library

[19]

S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In PLDI ’00, pages 145–156.

Digital Library

[20]

D. M. Lavery and W.-M. W. Hwu. Unrolling-based optimizations for modulo scheduling. In MICRO ’95, pages 327–337.

Digital Library

[21]

J. Llosa. Swing modulo scheduling: A lifetime-sensitive approach. In PACT ’96, pages 80–.

Digital Library

[22]

T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-Ort´ı. Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw., 43(2):12:1–12:18, Aug. 2016.

Digital Library

[23]

B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In MICRO ’94, pages 63–74.

Digital Library

[24]

I. Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. In GCC Developers’ Summit, pages 131–142, 2007.

[25]

M. D. Smith, N. Ramsey, and G. Holloway. A generalized algorithm for graph-coloring register allocation. In PLDI ’04, pages 277–288.

Digital Library

[26]

T. M. Smith, R. Van De Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. Anatomy of High-Performance Many-Threaded Matrix Multiplication. In IPDPS ’14, pages 1049–1059.

Digital Library

[27]

D. G. Spampinato and M. Püschel. A basic linear algebra compiler. In CGO ’14, pages 23:23–23:32.

Digital Library

[28]

F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw., 41(3):14:1–14:33, June 2015.

Digital Library

[29]

V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC’08, pages 31:1–31:11.

Digital Library

[30]

L. Wang, J. Xue, and X. Yang. Reuse-aware modulo scheduling for stream processors. In DATE ’10, pages 1112–1117.

Digital Library

[31]

Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs. In SC ’13, pages 25:1–25:12.

Digital Library

[32]

R. C. Whaley and J. J. Dongarra. Automatically Tuned Linear Algebra Software. In SC ’98, pages 1–27.

Digital Library

[33]

J. Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, Boston, 2000.

Digital Library

[34]

Q. Yi. Automated programmable control and parameterization of compiler optimizations. In CGO ’11, pages 97–106.

Digital Library

[35]

Q. Yi, Q. Wang, and H. Cui. Specializing compiler optimizations through programmable composition for dense matrix computations. In MICRO ’14, pages 596–608.

Digital Library

[36]

K. Yotov, X. Li, G. Ren, M. J. S. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2):358–386, Feb 2005.

[37]

X. Zhang, Q. Wang, and Y. Zhang. Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In IPDPS ’12, pages 684–691.

[38]

H. Zhou and J. Xue. Exploiting mixed SIMD parallelism by reducing data reorganization overhead. In CGO ’16, pages 59–69.

Digital Library

[39]

H. Zhou and J. Xue. A compiler approach for exploiting partial simd parallelism. ACM Trans. Archit. Code Optim., 13(1):11:1–11:26, Mar. 2016.

Digital Library

Cited By

Kim JRoss KSedlar EStadler L(2023)AMULET: Adaptive Matrix-Multiplication-Like TasksProceedings of the 19th International Workshop on Data Management on New Hardware10.1145/3592980.3595301(77-81)Online publication date: 18-Jun-2023
https://dl.acm.org/doi/10.1145/3592980.3595301
Spampinato DJelovina DZhuang JYzelman AHenriksen TSinkarovs A(2023)Towards Structured Algebraic ProgrammingProceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming10.1145/3589246.3595373(50-61)Online publication date: 6-Jun-2023
https://dl.acm.org/doi/10.1145/3589246.3595373
Tao XZhu YWang BXu JPang JZhao J(2022)Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway ProcessorProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545031(1-12)Online publication date: 29-Aug-2022
https://dl.acm.org/doi/10.1145/3545008.3545031
Show More Cited By

Automatic generation of fast BLAS3-GEMM: a portable compiler approach
1. Software and its engineering
  1. Software notations and tools

Recommendations

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be ...
An Improved Magma Gemm For Fermi Graphics Processing Units

We present an improved matrixâ matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify ...
A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems
PDP '14: Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

OpenCL is a vendor neutral and portable interface for programming parallel compute devices such as GPUs. Tuning OpenCL implementations of important library functions such as dense general matrix multiply (GEMM) for a particular device is a difficult ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CGO '17: Proceedings of the 2017 International Symposium on Code Generation and Optimization

February 2017

317 pages

ISBN:9781509049318

General Chair:
Vijay Janapa Reddi
University of Texas at Austin, USA
,
Program Chairs:
Aaron Smith
Microsoft Research, UK / University of Edinburgh, UK
,
Lingjia Tang
University of Michigan, USA

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

IEEE Press

Publication History

Published: 04 February 2017

Check for updates

Author Tags

Qualifiers

Article

Conference

CGO '17

Sponsor:

CGO '17: 15th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 4 - 8, 2017

Austin, USA

Acceptance Rates

CGO '17 Paper Acceptance Rate 26 of 116 submissions, 22%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
239
Total Downloads

Downloads (Last 12 months)10
Downloads (Last 6 weeks)1

Reflects downloads up to 25 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Kim JRoss KSedlar EStadler L(2023)AMULET: Adaptive Matrix-Multiplication-Like TasksProceedings of the 19th International Workshop on Data Management on New Hardware10.1145/3592980.3595301(77-81)Online publication date: 18-Jun-2023
https://dl.acm.org/doi/10.1145/3592980.3595301
Spampinato DJelovina DZhuang JYzelman AHenriksen TSinkarovs A(2023)Towards Structured Algebraic ProgrammingProceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming10.1145/3589246.3595373(50-61)Online publication date: 6-Jun-2023
https://dl.acm.org/doi/10.1145/3589246.3595373
Tao XZhu YWang BXu JPang JZhao J(2022)Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway ProcessorProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545031(1-12)Online publication date: 29-Aug-2022
https://dl.acm.org/doi/10.1145/3545008.3545031
Su XLiao XJiang HYang CXue J(2018)SCPACM Transactions on Architecture and Code Optimization10.1145/327465415:4(1-21)Online publication date: 10-Oct-2018
https://dl.acm.org/doi/10.1145/3274654
Gareev RGrosser TKruse M(2018)High-Performance Generalized Tensor OperationsACM Transactions on Architecture and Code Optimization10.1145/323502915:3(1-27)Online publication date: 4-Sep-2018
https://dl.acm.org/doi/10.1145/3235029

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents