DOI: 10.1145/2465813
FTXS '13: Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
ACM 2013 Proceedings
Publisher:
  • Association for Computing Machinery, New York, NY, United States
Conference:
HPDC '13: The 22nd International Symposium on High-Performance Parallel and Distributed Computing, New York, New York, USA, 18 June 2013
ISBN:
978-1-4503-1983-6
Published:
18 June 2013
Sponsors:
University of Arizona, SIGARCH

Abstract

It is our great pleasure to welcome you to the 2013 Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2013). For the HPC community, scaling in the number of processing elements has superseded the historical trend of Moore's Law scaling in processor frequencies. This progression from single-core to multi-core and many-core will be further complicated by the community's imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community faces rapid increases in the number, variety, and complexity of components, and must therefore cope with higher aggregate fault rates, greater fault diversity, and greater difficulty in isolating root causes.

Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because increases in component counts, memory sizes, and data paths (including networks) make the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns about this issue, and application users are less confident that they can rely on their computations producing correct answers. Other studies have indicated a growing divergence between failure rates experienced by applications and rates seen by the system hardware and software. At exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which can require up to 30 minutes to restart a parallel execution on the largest systems. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could lengthen further is of great concern. During the Approaching Exascale report at SC11, DOE program managers identified resilience as a black swan -- the most difficult under-addressed issue facing HPC.
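
To make the arithmetic behind this conflict concrete, the back-of-the-envelope sketch below applies Young's approximation for the checkpoint interval, tau ≈ sqrt(2·C·M), to the figures quoted above. The checkpoint write cost C and the simple overhead model are illustrative assumptions, not numbers from these proceedings.

# Back-of-the-envelope sketch (not from these proceedings): Young's
# approximation tau ~ sqrt(2*C*M) for the checkpoint interval, applied to
# the failure-rate and restart figures quoted above.
import math

M = 60 * 60      # mean time between failures: one failure per hour (from the text)
R = 30 * 60      # restart time: 30 minutes (from the text)
C = 15 * 60      # checkpoint write cost: 15 minutes (assumed, for illustration)

tau = math.sqrt(2 * C * M)   # Young's near-optimal checkpoint interval, in seconds

# Rough fraction of wall-clock time lost to resilience: checkpoint overhead
# plus, per failure, the restart time and (on average) half an interval of rework.
overhead = C / tau + (R + tau / 2) / M

print(f"checkpoint interval ~ {tau / 60:.0f} min")
print(f"estimated time lost to resilience ~ {overhead:.0%}")

With these assumed values the estimated loss exceeds 100% of wall-clock time, i.e. the machine would spend more time checkpointing, restarting, and redoing lost work than making forward progress, which is the conflict described above.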

The FTXS 2013 program committee includes leaders in government, academia, and industry from at least seven countries. We hope that you will find the program interesting and thought-provoking, and that the workshop provides a valuable opportunity to share ideas with researchers and practitioners from institutions around the world.

SESSION: Algorithms and applications
abstract
Toward resilient algorithms and applications

Large-scale computing platforms have always dealt with unreliability coming from many sources. In contrast, applications for large-scale systems have generally assumed a fairly simplistic failure model: the computer is a reliable digital machine, with ...

research-article
Fault tolerance using lower fidelity data in adaptive mesh applications

Many high-performance scientific simulation codes use checkpointing for multiple reasons. In addition to providing the flexibility to complete a simulation across multiple job submissions, checkpointing has also provided an adequate recovery mechanism up to the current ...

SESSION: Hardware issues
abstract
Circuits for resilient systems

Timing and functional failures caused by process, voltage and temperature (PVT) variations pose major challenges to achieving energy-efficient performance in multi-core & many-core processor designs in nanoscale CMOS. Radiation-induced soft errors and ...

research-article
Neutron sensitivity and software hardening strategies for matrix multiplication and FFT on graphics processing units

In this paper, we compare the radiation response of GPUs executing matrix multiplication and FFT algorithms. The provided experimental results demonstrate that for both algorithms, in the majority of cases, the output is affected by multiple errors. The ...

SESSION: Injection, detection, and replication
research-article
Using unreliable virtual hardware to inject errors in extreme-scale systems

Fault tolerance is a key challenge for next-generation extreme-scale systems. As systems scale, the Mean Time To Interrupt (MTTI) decreases proportionally. As a result, extreme-scale systems are likely to experience higher rates of failure in the future. ...
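
The proportional drop in MTTI follows from the standard assumption of independent, exponentially distributed node failures: the interrupt rate of N nodes is N times the per-node rate. A minimal sketch follows; the per-node MTTI is an assumed illustrative figure, not a measurement from the paper.

# Minimal sketch: system MTTI under independent, exponentially distributed
# node failures. The per-node MTTI below is an assumed illustrative value.

node_mtti_years = 10   # assumed mean time to interrupt of a single node
for nodes in (1_000, 10_000, 100_000):
    system_mtti_hours = node_mtti_years * 365 * 24 / nodes
    print(f"{nodes:>7} nodes -> system MTTI ~ {system_mtti_hours:.1f} h")

Under this assumed per-node MTTI, the system-level MTTI at the largest node counts falls below an hour, in line with the exascale projection quoted in the front matter.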

research-article
Fault detection in multi-core processors using chaotic maps

Exascale systems built using multi-core processors are expected to experience several component faults during code executions lasting for hours. It is important to detect faults in processor cores so that faulty cores can be removed from scheduler pools,...
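
One way a chaos-based detector can work, sketched below purely for illustration and not necessarily the scheme used in the paper: each core iterates the same chaotic map (here the logistic map) from a shared seed, and because chaotic trajectories amplify tiny perturbations, even a minute state corruption quickly drives a faulty core's trajectory far from the reference, making comparison-based detection cheap. The map parameters and perturbation size are assumptions.

# Illustrative sketch only (not necessarily the paper's scheme): a logistic-map
# probe whose chaotic sensitivity amplifies a single tiny state corruption.

def logistic_trajectory(x, steps, r=3.99, flip_at=None):
    """Iterate x <- r*x*(1-x); optionally perturb the state at step flip_at."""
    for n in range(steps):
        if n == flip_at:
            x += 1e-12   # simulate a minute corruption, e.g. a low-order bit flip
        x = r * x * (1.0 - x)
    return x

seed = 0.123456789
reference = logistic_trajectory(seed, steps=100)              # healthy core
faulty = logistic_trajectory(seed, steps=100, flip_at=50)     # core with one upset

# A divergence far above floating-point noise flags the faulty core.
print(f"divergence after 100 steps: {abs(reference - faulty):.3f}")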

research-article
Replication for send-deterministic MPI HPC applications

Replication has recently gained attention in the context of fault tolerance for large scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent from the underlying library. In this paper, we evaluate the ...

SESSION: Energy and checkpointing
research-article
Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system

Both energy efficiency and system reliability are significant concerns for exascale high-performance computing. In such large HPC systems, applications are required to conduct massive I/O operations to local storage devices (e.g. a NAND flash ...

research-article
When is multi-version checkpointing needed?

The scaling of semiconductor technology and increasing power concerns, combined with system scale, make fault management a growing concern in high-performance computing systems. A greater variety of errors, higher error rates, longer detection intervals, ...
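
When error detection lags error occurrence, the most recent checkpoint may already be contaminated, so recovery needs older versions as well. The sketch below shows the retention arithmetic with assumed interval and latency values; the numbers are illustrative, not taken from the paper.

# Minimal sketch with assumed numbers: why long detection latencies force
# retention of several checkpoint versions rather than only the latest one.
import math

checkpoint_interval_min = 30   # assumed time between checkpoints
detection_latency_min = 90     # assumed worst-case lag before an error is detected

# Every checkpoint written inside the detection window may already contain the
# still-undetected error, so keep enough older versions to roll back past it.
versions_needed = math.ceil(detection_latency_min / checkpoint_interval_min) + 1
print(f"retain at least {versions_needed} checkpoint versions")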

Contributors
  • Los Alamos National Laboratory
  • Sandia National Laboratories, New Mexico
  • Argonne National Laboratory

Acceptance Rates

FTXS '13 Paper Acceptance Rate 7 of 10 submissions, 70%;
Overall Acceptance Rate 16 of 25 submissions, 64%
Year        Submitted   Accepted   Rate
FTXS '15    15          9          60%
FTXS '13    10          7          70%
Overall     25          16         64%