Abstract
We present accurate piece-wise time and energy models of high-performance multi-threaded implementations of the general matrix multiplication (GEMM), the triangular system solve with multiple right-hand sides (TRSM), and the symmetric rank-k update (SYRK). These kernel models are then assembled into accurate models of a Cholesky factorization built on top of these Level-3 BLAS operations. Our models account for the cost, in both time and energy, of the floating-point operations performed by the routines as well as the overhead of data movement across the levels of the memory hierarchy. The accuracy of the multi-threaded models is evaluated on an Intel Xeon E5-2620 processor, where the average relative errors for the Cholesky factorization are around 2.4 % for time and 2.9 % for energy.
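To illustrate how per-kernel costs can be assembled into a cost for the whole factorization, the following is a minimal sketch and not the models of the article. It assumes a right-looking blocked Cholesky variant in which each step factorizes a diagonal block, performs a triangular solve on the panel below it, and applies a symmetric rank-k update to the trailing matrix; variants that split the trailing update would add a matrix multiplication term. The function names, the single linear cost term per kernel (instead of a piece-wise model), the word counts, and the coefficient values are all hypothetical. The same routine gives a time or an energy estimate depending on the coefficients supplied.

```python
def kernel_cost(flops, words, flop_cost, word_cost):
    """Hypothetical linear cost: arithmetic term plus data-movement term."""
    return flops * flop_cost + words * word_cost


def cholesky_cost(n, b, flop_cost, word_cost):
    """Sum per-kernel costs over the steps of an assumed right-looking
    blocked Cholesky factorization of an n x n matrix with block size b."""
    total = 0.0
    for k in range(0, n, b):
        nb = min(b, n - k)   # size of the current diagonal block
        m = n - k - nb       # rows remaining below the diagonal block

        # Unblocked factorization of the nb x nb diagonal block (~nb^3/3 flops).
        total += kernel_cost(nb**3 / 3, nb * nb, flop_cost, word_cost)

        # Triangular solve on the m x nb panel (~m * nb^2 flops).
        total += kernel_cost(m * nb * nb, m * nb + nb * nb, flop_cost, word_cost)

        # Symmetric rank-nb update of the m x m trailing matrix (~m^2 * nb flops).
        total += kernel_cost(m * m * nb, m * m + m * nb, flop_cost, word_cost)
    return total


if __name__ == "__main__":
    # Hypothetical coefficients: seconds (or joules) per flop and per word moved.
    time_estimate = cholesky_cost(4096, 256, flop_cost=1e-10, word_cost=1e-9)
    energy_estimate = cholesky_cost(4096, 256, flop_cost=5e-10, word_cost=4e-9)
    print(time_estimate, energy_estimate)
```

With the data-movement coefficient set to zero and a unit cost per flop, the sum reduces to the familiar n^3/3 operation count of the Cholesky factorization, which is a convenient sanity check on the per-kernel flop counts.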
Acknowledgments
This work was supported by the CICYT projects TIN2014-53495-R and CICYT-TIN 2012-32180 of the MINECO and FEDER, the EU FET Project FP7 318793 “EXA2GREEN”, and the FPU program of MECD.
Cite this article
Catalán, S., Igual, F.D., Mayo, R. et al. Time and energy modeling of a high-performance multi-threaded Cholesky factorization. J Supercomput 73, 139–151 (2017). https://doi.org/10.1007/s11227-016-1654-6
DOI: https://doi.org/10.1007/s11227-016-1654-6