[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.5555/3049832.3049846acmconferencesArticle/Chapter ViewAbstractPublication PagescgoConference Proceedingsconference-collections
Article

Automatic generation of fast BLAS3-GEMM: a portable compiler approach

Published: 04 February 2017 Publication History

Abstract

GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in assembly code or generated from C code by general-purpose compilers (guided by architecture-specific directives or auto-tuning). Therefore, either performance or portability suffers.
We present a POrtable Compiler Approach, Poca, implemented in LLVM, to automatically generate and optimize this micro-kernel in an architecture-independent manner, without involving domain experts. The key insight is to leverage a wide range of architecture-specific abstractions already available in LLVM, by first generating a vectorized micro-kernel in the architecture-independent LLVM IR and then improving its performance by applying a series of domain-specific yet architecture-independent optimizations. The optimized micro-kernel drops easily in existing GEMM frameworks such as BLIS and OpenBLAS. Validation focuses on optimizing GEMM in double precision on two architectures. On Intel Sandybridge and AArch64 Cortex-A57, Poca's micro-kernels outperform expert-crafted assembly code by 2.35% and 7.54%, respectively, and both BLIS and OpenBLAS achieve competitive or better performance once their micro-kernels are replaced by Poca's.

References

[1]
R. C. Agarwal, F. G. Gustavson, and M. Zubair. Exploiting functional parallelism of POWER2 to design highperformance numerical algorithms. IBM Journal of Research & Development, 38(5):563–576, 1994.
[2]
G. Belter, J. G. Siek, I. Karlin, and E. R. Jessup. Automatic generation of tiled and parallel linear algebra routines. In IWAPT’ 10, pages 1811–1820.
[3]
J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: A portable, highperformance, ansi c coding methodology. In SC ’97, pages 253–260.
[4]
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In PLDI ’08, pages 101–113.
[5]
G. J. Chaitin. Register allocation and spilling via graph coloring. In CC ’82, pages 98–105.
[6]
H. Cui, L. Wang, J. Xue, Y. Yang, and X. Feng. Automatic library generation for BLAS3 on GPUs. In IPDPS ’11, pages 255–265.
[7]
H. Cui, J. Xue, L. Wang, Y. Yang, X. Feng, and D. Fan. Extendable pattern-oriented optimization directives. ACM Trans. Archit. Code Optim., 9(3):14:1–14:37, Oct. 2012.
[8]
H. Cui, Q. Yi, J. Xue, and X. Feng. Layout-oblivious compiler optimization for matrix computations. ACM Trans. Archit. Code Optim., 9(4):35:1–35:20, Jan. 2013.
[9]
A. E. Eichenberger, P. Wu, and K. O’Brien. Vectorization for SIMD architectures with alignment constraints. In PLDI ’04, pages 82–93.
[10]
C. Eisenbeis, S. Lelait, and B. Marmol. The meeting graph: A new model for loop cyclic register allocation. In PACT ’95, pages 264–267.
[11]
L. Gao, Q. H. Nguyen, L. Li, J. Xue, and T. Ngai. Threadsensitive modulo scheduling for multicore processors. In ICPP ’08, pages 132–140.
[12]
K. Goto and R. a. V. D. Geijn. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34(3):1–25, 2008.
[13]
L. J. Hendren, G. R. Gao, E. R. Altman, and C. Mukerji. A register allocation framework based on hierarchical cyclic interval graphs. In CC ’93, pages 176–191.
[14]
B. K˚agström, P. Ling, and C. van Loan. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw., 24(3):268–302, Sept. 1998.
[15]
V. Kelefouras, A. Kritikakou, and C. Goutis. A matrixmatrix multiplication methodology for single/multi-core architectures using SIMD. The Journal of Supercomputing, 68(3):1418–1440, jan 2014.
[16]
M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan. When polyhedral transformations meet SIMD code generation. In PLDI ’13, pages 127–138.
[17]
M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In PLDI ’88, pages 318–328.
[18]
M. D. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In ASPLOS ’91, pages 63–74.
[19]
S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In PLDI ’00, pages 145–156.
[20]
D. M. Lavery and W.-M. W. Hwu. Unrolling-based optimizations for modulo scheduling. In MICRO ’95, pages 327–337.
[21]
J. Llosa. Swing modulo scheduling: A lifetime-sensitive approach. In PACT ’96, pages 80–.
[22]
T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-Ort´ı. Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw., 43(2):12:1–12:18, Aug. 2016.
[23]
B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In MICRO ’94, pages 63–74.
[24]
I. Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. In GCC Developers’ Summit, pages 131–142, 2007.
[25]
M. D. Smith, N. Ramsey, and G. Holloway. A generalized algorithm for graph-coloring register allocation. In PLDI ’04, pages 277–288.
[26]
T. M. Smith, R. Van De Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. Anatomy of High-Performance Many-Threaded Matrix Multiplication. In IPDPS ’14, pages 1049–1059.
[27]
D. G. Spampinato and M. Püschel. A basic linear algebra compiler. In CGO ’14, pages 23:23–23:32.
[28]
F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw., 41(3):14:1–14:33, June 2015.
[29]
V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC’08, pages 31:1–31:11.
[30]
L. Wang, J. Xue, and X. Yang. Reuse-aware modulo scheduling for stream processors. In DATE ’10, pages 1112–1117.
[31]
Q. Wang, X. Zhang, Y. Zhang, and Q. Yi. AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs. In SC ’13, pages 25:1–25:12.
[32]
R. C. Whaley and J. J. Dongarra. Automatically Tuned Linear Algebra Software. In SC ’98, pages 1–27.
[33]
J. Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, Boston, 2000.
[34]
Q. Yi. Automated programmable control and parameterization of compiler optimizations. In CGO ’11, pages 97–106.
[35]
Q. Yi, Q. Wang, and H. Cui. Specializing compiler optimizations through programmable composition for dense matrix computations. In MICRO ’14, pages 596–608.
[36]
K. Yotov, X. Li, G. Ren, M. J. S. Garzaran, D. Padua, K. Pingali, and P. Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2):358–386, Feb 2005.
[37]
X. Zhang, Q. Wang, and Y. Zhang. Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In IPDPS ’12, pages 684–691.
[38]
H. Zhou and J. Xue. Exploiting mixed SIMD parallelism by reducing data reorganization overhead. In CGO ’16, pages 59–69.
[39]
H. Zhou and J. Xue. A compiler approach for exploiting partial simd parallelism. ACM Trans. Archit. Code Optim., 13(1):11:1–11:26, Mar. 2016.

Cited By

View all
  • (2023)AMULET: Adaptive Matrix-Multiplication-Like TasksProceedings of the 19th International Workshop on Data Management on New Hardware10.1145/3592980.3595301(77-81)Online publication date: 18-Jun-2023
  • (2023)Towards Structured Algebraic ProgrammingProceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming10.1145/3589246.3595373(50-61)Online publication date: 6-Jun-2023
  • (2022)Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway ProcessorProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545031(1-12)Online publication date: 29-Aug-2022
  • Show More Cited By
  1. Automatic generation of fast BLAS3-GEMM: a portable compiler approach

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CGO '17: Proceedings of the 2017 International Symposium on Code Generation and Optimization
    February 2017
    317 pages
    ISBN:9781509049318

    Sponsors

    Publisher

    IEEE Press

    Publication History

    Published: 04 February 2017

    Check for updates

    Author Tags

    1. GEMM
    2. code optimization
    3. dense linear algebra

    Qualifiers

    • Article

    Conference

    CGO '17
    Sponsor:

    Acceptance Rates

    CGO '17 Paper Acceptance Rate 26 of 116 submissions, 22%;
    Overall Acceptance Rate 312 of 1,061 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)10
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 25 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)AMULET: Adaptive Matrix-Multiplication-Like TasksProceedings of the 19th International Workshop on Data Management on New Hardware10.1145/3592980.3595301(77-81)Online publication date: 18-Jun-2023
    • (2023)Towards Structured Algebraic ProgrammingProceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming10.1145/3589246.3595373(50-61)Online publication date: 6-Jun-2023
    • (2022)Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway ProcessorProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545031(1-12)Online publication date: 29-Aug-2022
    • (2018)SCPACM Transactions on Architecture and Code Optimization10.1145/327465415:4(1-21)Online publication date: 10-Oct-2018
    • (2018)High-Performance Generalized Tensor OperationsACM Transactions on Architecture and Code Optimization10.1145/323502915:3(1-27)Online publication date: 4-Sep-2018

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media