Abstract
There is a growing interest in deploying MPI over multiple, heterogenous and geographically distributed resources for performing very large scale computations. However, increasing the amount of geographical distribution and resources creates problems with interoperability and fault-tolerance. FT-MPI presents an interesting solution for adding fault-tolerance to MPI, but suffers from interoperability limitations and potential single points of failure when crossing multiple administrative domains. We propose to overcome these limitations by adding “pluggability” for one potential single point of failure – the name service used by FT-MPI – and combining FT-MPI with the H2O metacomputing framework.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Kurzyniec, D., Sunderam, V.: Combining FT-MPI with H20: Fault-tolerant MPI across administrative boundaries. In: Proceedings of th HCW 2005-14th Heterogeneous Computing Workshop (2005) (accepted)
Agbaria, A., Friedman, R.: Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In: Eighth IEEE International Symposium on High Performance Distributed Computing, p. 31 (1999)
Bouteiller, A., Cappello, F., Herault, T., Krawezik, G., Lemarinier, P., Magniette, F.: MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In: ACM/IEEE SC 2003 Conference, p. 25 (2003)
Chen, Y., Li, K., Plank, J.S.: CLIP: A checkpointing tool for message-passing parallel programs (1997). Available at http://citeseerist.psu.edu/chen97clip.html
Chin, J., Coveney, P.V.: Towards tractable toolkits for the Grid: a plea for lightweight, usable middleware. Available at http://www.realitygrid.org/lgpaper21.pdf
Elnozahy, E., Zwaenepoel, W.: Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers, Special Issue on Fault-Tolerant Computing 41(5), 526–531 (1992)
Fagg, G., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J.: Process fault-tolerance: Sematics, design and applications for high-performance computing. In: International Journal for High Performance Applications and Supercomputing (2004)
Imamura, T., Tsujita, Y., Koide, H., Takemiya, H.: An architecture of Stampi: MPI library on a cluster of parallel computers. In: 7th European PVM/MPI Users’ Group Meeting, pp. 4–18 (2000)
Karonis, N., Toonen, B., Foster, I.: MPICH-G2: A grid-enabled implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing (JPDC) 63(5), 551–563 (2003)
Keller, R., Krammer, B., Mueller, M.S., Resch, M.M., Gabriel, E.: MPI development tools and applications for the grid. In: Workshop on Grid Applications and Programming Tools (2003)
Kurzyniec, D., Wrzosek, T., Drzewiecki, D., Sunderam, V.: Towards self-organising distributed computing frameworks: The H2O approach. Parallel Processing Letters 13(2), 273–290 (2003)
Louca, S., Neophytou, N., Lachanas, A., Eviripidou, P.: MPI-FT: Portable fault-tolerance scheme for MPI. Parallel Processing Letters 10(4), 371–382 (2000)
Stellner, G.: CoCheck: Checkpointing and process migration for MPI. In: 10th International Parallel Processing Symposium, 526–531 (1996)
Tyrakowski, T., Sunderam, V.S., Migliardi, M.: Distributed Name Service in Harness. In: Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.) ICCS-ComputSci 2001. LNCS, vol. 2073, pp. 345–354. Springer, Heidelberg (2001)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dewolfs, D., Kurzyniec, D., Sunderam, V., Broeckhove, J., Dhaene, T., Fagg, G. (2005). Applicability of Generic Naming Services and Fault-Tolerant Metacomputing with FT-MPI. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2005. Lecture Notes in Computer Science, vol 3666. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11557265_36
Download citation
DOI: https://doi.org/10.1007/11557265_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29009-4
Online ISBN: 978-3-540-31943-6
eBook Packages: Computer ScienceComputer Science (R0)