Abstract
Practical simulators require high-performance iterative methods and efficient boundary conditions, especially in the field of computational fluid dynamics. In this paper, we propose a novel bit-representation technique to enhance the performance of such simulators. The technique is applied to an iterative kernel implementation that treats various boundary conditions in a stencil computation on a structured grid system. This approach reduces traffic from the main memory to CPU, and effectively utilizes Single Instruction–Multiple Data (SIMD) stream units with cache because of the bit-representation and compression of matrix elements. The proposed implementation also replaces if-branch statements with mask operations using the bit expression. This promotes the optimization of code during compilation and runtime. To evaluate the performance of the proposed implementation, we employ the Red–Black SOR and BiCGstab algorithms. Experimental results show that the proposed approach is up to 3.5 times faster than a naïve implementation on both the Intel and Fujitsu Sparc architectures.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Here, we count 1 flop for multiplication, addition, and subtraction operators, and 8 flops for the division operator.
- 2.
The number of loads depends on the size of the cache line. In this case, there are 3 loads for array p owing to the L3 cache.
References
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)
Willcock, J., Lumsdaine, A.: Accelerating sparse matrix computations via data compression. In: Proceedings of the 20th Annual ICS 2006, pp. 307–316 (2006)
Tang, W.T., et al.: Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes. In: Proceedings of SC 2013, vol. 26, pp. 1–12 (2013)
Van der Vorst, H.A.: Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 13(2), 631–644 (1992)
Yokokawa, M.: Vector-parallel processing of the successive overrelaxation method. Japan Atomic Energy Research Institute JAERI-M Report No. 88–017 (1988) (in Japanese)
Ono, K., Kawashima, Y., Kawanabe, T.: Data centric framework for large-scale high-performance parallel computation. Procedia Comput. Sci. 29, 2336–2350 (2014)
Acknowledgments
We thank the RIKEN Advanced Institute for Computational Science for allowing us to use the K computer to obtain our results. Part of this research was supported by a grant for the “Strategic Program on HPCI Field No. 4: Industrial Innovations” from the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) “Development and Use of Advanced, High-Performance, General-Purpose Supercomputers Project,” and was carried out in partnership with the University of Tokyo.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ono, K., Chiba, S., Inoue, S., Minami, K. (2015). Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science(), vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_17
Download citation
DOI: https://doi.org/10.1007/978-3-319-17353-5_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17352-8
Online ISBN: 978-3-319-17353-5
eBook Packages: Computer ScienceComputer Science (R0)