DOI: 10.1109/CLUSTER.2015.106
Article

Building a Fault Tolerant Application Using the GASPI Communication Layer

Published: 08 September 2015

Abstract

It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures as the mean time to failure (MTTF) decreases. It is therefore not surprising that much research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover at minimal cost. MPI is not yet very mature in handling failures; the User-Level Failure Mitigation (ULFM) proposal, currently the most promising approach, is still in its prototype phase. In our work we use GASPI, a relatively new communication library based on the PGAS model. It provides the missing features to allow the design of fault-tolerant applications. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how we can build on (existing) clever checkpointing and extend applications to integrate a low-cost fault detection mechanism and, if necessary, recover the application on the fly. The aspects of process management, the restoration of groups, and the recovery mechanism are presented in detail. We use an application based on sparse matrix-vector multiplication to analyze the overhead introduced by such modifications. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of a reasonably acceptable order and shows good scalability.
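
To make the pattern named in the abstract concrete, the following is a minimal single-process sketch of checkpoint-restart combined with pre-allocated spare processes, applied to a repeated sparse matrix-vector product. It is an illustration under stated assumptions, not the authors' implementation: it makes no GASPI calls, the failure is injected by hand rather than detected, and every name in it (`RankTable`, `spmv_csr`, `run`, `fail_at`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RankTable:
    """Maps logical worker ranks to physical process ids, with idle spares in reserve."""
    active: list            # active[logical_rank] -> physical process id
    spares: list            # physical process ids kept idle until needed
    failed: set = field(default_factory=set)

    def replace(self, dead_pid):
        """Substitute a spare for a failed process; the logical rank is preserved."""
        if not self.spares:
            raise RuntimeError("no spare process left")
        logical = self.active.index(dead_pid)
        self.failed.add(dead_pid)
        self.active[logical] = self.spares.pop(0)
        return logical


def spmv_csr(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A x for A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def run(iterations, checkpoint_every, fail_at=None):
    """Repeat x <- A x with periodic checkpoints; roll back once if a failure is injected."""
    # Illustrative data: the 2x2 diagonal matrix diag(2, 3) in CSR form.
    row_ptr, col_idx, values = [0, 1, 2], [0, 1], [2.0, 3.0]
    table = RankTable(active=[0, 1], spares=[2])
    x, step = [1.0, 1.0], 0
    checkpoint = (step, list(x))
    while step < iterations:
        if fail_at is not None and step == fail_at:
            fail_at = None                       # inject the failure only once
            table.replace(dead_pid=1)            # spare 2 takes over logical rank 1
            step, x = checkpoint[0], list(checkpoint[1])   # roll back to last checkpoint
            continue
        x = spmv_csr(row_ptr, col_idx, values, x)
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = (step, list(x))
    return x, table
```

Calling `run(4, 2)` and `run(4, 2, fail_at=3)` should yield the same final vector; in the second case the run rolls back to the checkpoint taken at step 2 and repeats the lost iteration with the spare standing in for the failed process. In a real distributed setting the rollback would involve all surviving processes and the rebuilt group, which this sketch does not model.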

Cited By

  • (2018) Building and utilizing fault tolerance support tools for the GASPI applications. International Journal of High Performance Computing Applications 32:5 (613-626). DOI: 10.1177/1094342016677085. Online publication date: 1-Sep-2018

Published In

CLUSTER '15: Proceedings of the 2015 IEEE International Conference on Cluster Computing
September 2015
876 pages
ISBN: 9781467365987

Publisher

IEEE Computer Society

United States

Author Tags

  1. GASPI
  2. GPI
  3. checkpoint-restart
  4. fault detection
  5. fault recovery
  6. fault tolerance
  7. pre-allocated spare processes

Qualifiers

  • Article
