Challenges and Issues of the Integration of RADIC into Open MPI

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2009)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5759)

Abstract

Parallel machines are growing in complexity and in number of components, which increases the probability of faults. Thus, MPI applications running on these machines may not reach completion. This paper presents RADIC/OMPI, the integration of the RADIC fault tolerance architecture into Open MPI. RADIC/OMPI combines uncoordinated checkpoints with pessimistic receiver-based message logs in a distributed way, without relying on any central or stable element. As a result, it ensures application completion automatically and transparently for users and administrators. We conclude that, for certain applications, RADIC/OMPI provides fault tolerance with acceptable overhead, even in the presence of consecutive faults.
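
The combination of uncoordinated checkpoints and pessimistic receiver-based message logging named in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch of the general technique, not the RADIC/OMPI implementation: `RemoteStore`, `Process`, and `demo` are hypothetical names, the "remote" store is an in-memory object standing in for storage on another node, and the application is assumed to be deterministic given its message order.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class RemoteStore:
    """Hypothetical stand-in for storage held on a different node."""
    checkpoint: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


class Process:
    def __init__(self, store):
        self.store = store
        self.state = {"total": 0, "received": 0}

    def deliver(self, msg):
        # Application logic; assumed deterministic for a given message order.
        self.state["total"] += msg
        self.state["received"] += 1

    def receive(self, msg):
        # Pessimistic, receiver-based: log the message remotely BEFORE
        # delivering it, so a delivered message is never lost with the process.
        self.store.log.append(msg)
        self.deliver(msg)

    def take_checkpoint(self):
        # Uncoordinated: no synchronisation with any other process.
        self.store.checkpoint = copy.deepcopy(self.state)
        # Messages received before the checkpoint are now covered by it.
        self.store.log.clear()

    @classmethod
    def recover(cls, store):
        # Restart from the last checkpoint (if any), then replay the log in order.
        p = cls(store)
        if store.checkpoint:
            p.state = copy.deepcopy(store.checkpoint)
        for msg in store.log:
            p.deliver(msg)
        return p


def demo():
    store = RemoteStore()
    p = Process(store)
    for m in (1, 2, 3):
        p.receive(m)
    p.take_checkpoint()
    for m in (10, 20):
        p.receive(m)
    before = dict(p.state)
    del p  # simulated fault: the process and its local state are gone
    recovered = Process.recover(store)
    return before, recovered.state
```

Calling `demo()` returns the state just before the simulated fault and the state rebuilt from the checkpoint plus the replayed log; in this sketch the two are equal. A real system would also have to handle faults of the node holding the store, which the sketch ignores.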


Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fialho, L., Santos, G., Duarte, A., Rexachs, D., Luque, E. (2009). Challenges and Issues of the Integration of RADIC into Open MPI. In: Ropo, M., Westerholm, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03770-2_14

  • DOI: https://doi.org/10.1007/978-3-642-03770-2_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03769-6

  • Online ISBN: 978-3-642-03770-2

  • eBook Packages: Computer Science, Computer Science (R0)
