Abstract
Fault Tolerant MPI (FT-MPI) [6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The initial implementation of FT-MPI included a robust heavy weight system state recovery algorithm that was designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation although robust, was very conservative and this effected its scalability on both very large clusters as well as on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that are aimed at being both scalable and latency tolerant. Our conclusions shows that the use of both topology aware collective communication and distributed consensus algorithms together produce the best results.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Agbaria, A., Friedman, R.: Starfish: Fault-tolerant dynamic mpi programs on clusters of workstations. In: 8th IEEE International Symposium on High Performance Distributed Computing (1999)
Batchu, R., Neelamegam, J., Cui, Z., Beddhua, M., Skjellum, A., Dandass, Y., Apte, M.: Mpi/ftTM: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In: Proceedings of the 1st IEEE International Symposium of Cluster Computing and the Grid held in Melbourne, Australia (2001)
Beck, M., Dongarra, J.J., Fagg, G.E., Geist, G.A., Gray, P., Kohl, J., Migliardi, M., Moore, K., Moore, T., Papadopoulous, P., Scott, S.L., Sunderam, V.: HARNESS:a next generation distributed virtual machine. Future Generation Computer Systems 15 (1999)
Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fédak, G., Germain, C., Hérault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Néri, V., Selikhov, A.: MPICH-v: Toward a scalable fault tolerant MPI for volatile nodes. In: SuperComputing, Baltimore USA (November 2002)
Burns, G., Daoud, R.: Robust MPI message delivery through guaranteed resources. In: MPI Developers Conference (June 1995)
Fagg, G.E., Bukovsky, A., Dongarra, J.J.: HARNESS and fault tolerant MPI. Parallel Computing 27, 1479–1496 (2001)
Fagg, G.E., Moore, K., Dongarra, J.J.: Scalable networked information processing environment (SNIPE). Future Generation Computing Systems 15, 571–582 (1999)
Graham, R.L., Choi, S.-E., Daniel, D.J., Desai, N.N., Minnich, R.G., Rasmussen, C.E., Risinger, L.D., Sukalski, M.W.: A network-failure-tolerant message-passing system for terascale clusters. In: ICS, New York, USA, June 22-26 (2002)
Louca, S., Neophytou, N., Lachanas, A., Evripidou, P.: Mpi-ft: Portable fault tolerance scheme for MPI. In: Parallel Processing Letters, vol. 10(4), pp. 371–382. World Scientific Publishing Company, Singapore (2000)
Message Passing Interface Forum. MPI: A Message Passing Interface Standard (June 1995), http://www.mpi-forum.org/
Message Passing Interface Forum. MPI-2: Extensions to the Message Passing Interface (July 1997), http://www.mpi-forum.org/
Stellner, G.: Cocheck: Checkpointing and process migration for MPI. In: Proceedings of the 10th International Parallel Processing Symposium (IPPS 1996), Honolulu, Hawaii (1996)
Vadhiyar, S.S., Fagg, G.E., Dongarra, J.J.: Performance modeling for self-adapting collective communications for MPI. In: LACSI Symposium, Eldorado Hotel, Santa Fe, NM, October 15-18. Springer, Heidelberg (2001)
Sankaran, S., Squyres, J.M., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. In: LACSI Symposium, Santa Fe, NM (October 2003)
Gabriel, E., Fagg, G.E., Bosilica, G., Angskun, T., Dongarra, J.J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In: Proceedings 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungry (2004)
Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A., Bhoedjang, R.A.F.: MagPIe: MPI’s collective communication operations for clustered wide area systems. In: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 1999), May 1999, vol. 34(8), pp. 131–140 (1999)
Tanenbaum, A.S., van Steen, M.: Distributed Systems: Principles and Paradigms. Prentice Hall, Englewood Cliffs (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fagg, G.E., Angskun, T., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J.J. (2005). Scalable Fault Tolerant MPI: Extending the Recovery Algorithm. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2005. Lecture Notes in Computer Science, vol 3666. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11557265_13
Download citation
DOI: https://doi.org/10.1007/11557265_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29009-4
Online ISBN: 978-3-540-31943-6
eBook Packages: Computer ScienceComputer Science (R0)