Abstract
Because the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.
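The trade-off described above, between checkpoint cost and the work lost to failures, can be illustrated with a minimal sketch. It is not the scheduling model developed in the paper. It assumes independent, exponentially distributed node failures and uses Young's classical first-order approximation for the checkpoint interval. All function and parameter names are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a system that fails as soon as any one node fails.

    Assumption (not from the paper): node failures are independent and
    exponentially distributed, so the system MTBF shrinks as 1 / num_nodes.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * M).

    Only a reasonable approximation when the checkpoint cost C is much
    smaller than the system MTBF M.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(interval_hours: float,
                      checkpoint_cost_hours: float,
                      mtbf_hours: float) -> float:
    """First-order fraction of time lost to fault tolerance.

    C / tau is the time spent writing checkpoints; tau / (2 * M) is the
    expected rework after a failure (half an interval lost on average).
    """
    return (checkpoint_cost_hours / interval_hours
            + interval_hours / (2.0 * mtbf_hours))


if __name__ == "__main__":
    # Hypothetical numbers: 100 nodes, 10,000 h MTBF each, 0.02 h per checkpoint.
    mtbf = system_mtbf(10_000.0, 100)
    tau = young_interval(0.02, mtbf)
    print(f"system MTBF: {mtbf} h")
    print(f"checkpoint interval: {tau} h")
    print(f"overhead fraction: {overhead_fraction(tau, 0.02, mtbf)}")
```

Under these assumptions, checkpointing more often than the computed interval wastes time on checkpoint writes, while checkpointing less often wastes time on recomputation after a failure. A model for a GPU system would also have to account for costs this sketch ignores, such as copying device memory to the host.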
Acknowledgments
This work was partially supported by the grants CNS-0834483, EPS-1003897 and TE97/2010.
Cite this article
Laosooksathit, S., Nassar, R., Leangsuksun, C. et al. Reliability-aware performance model for optimal GPU-enabled cluster environment. J Supercomput 68, 1630–1651 (2014). https://doi.org/10.1007/s11227-014-1128-7
DOI: https://doi.org/10.1007/s11227-014-1128-7