Abstract
Reliability has become a growing concern for the Tianhe supercomputer series as the systems scale up. Proactive fault tolerance based on failure prediction is an effective way to improve a system's fault tolerance. Data collection is the basis of failure prediction and strongly affects prediction accuracy, yet current data collection methods for failure prediction obtain only limited data and incur large overhead. This paper presents DDC, a data collection framework for failure prediction in Tianhe supercomputers. DDC adopts a distributed data collection architecture that collects the data related to compute nodes' health comprehensively and efficiently. Tests of DDC running on TH-1A indicate that it has low cost and good scalability.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61272141 and No. 61120106005, and by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA01A301.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Hu, W., Jiang, Y., Liu, G., Dong, W., Cai, G. (2015). DDC: Distributed Data Collection Framework for Failure Prediction in Tianhe Supercomputers. In: Chen, Y., Ienne, P., Ji, Q. (eds) Advanced Parallel Processing Technologies. APPT 2015. Lecture Notes in Computer Science, vol 9231. Springer, Cham. https://doi.org/10.1007/978-3-319-23216-4_2
DOI: https://doi.org/10.1007/978-3-319-23216-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23215-7
Online ISBN: 978-3-319-23216-4
eBook Packages: Computer Science (R0)