It is our great pleasure to welcome you to the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2013). For the HPC community, scaling in the number of processing elements has superseded the historical trend of processor frequency scaling. This progression from single-core to multi-core and many-core processors will be further complicated by the community's imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community faces rapid increases in the number, variety, and complexity of components, and must therefore cope with higher aggregate fault rates, greater fault diversity, and greater difficulty in isolating root causes.
Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored: growth in component counts, memory sizes, and data paths (including networks) makes the probability of silent data corruption (SDC) non-negligible. The HPC community takes this issue seriously, and application users are losing confidence that their computations will return correct answers. Other studies have indicated a growing divergence between the failure rates experienced by applications and the rates seen by system hardware and software. At exascale, some scenarios project failure rates approaching one failure per hour, which conflicts with the current checkpointing approach to fault tolerance: restarting a parallel execution on the largest systems can take up to 30 minutes. Lastly, stabilization periods for the largest systems are already significant, and the possibility that they could lengthen further is of great concern. In the Approaching Exascale report at SC11, DOE program managers identified resilience as a black swan: the most difficult, under-addressed issue facing HPC.
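The arithmetic behind that conflict is easy to check with the Young/Daly first-order model for the optimal checkpoint interval. The sketch below (plain Python; the 10-minute checkpoint write time is an assumption, while the one-hour MTBF and 30-minute restart come from the figures above) shows that under these parameters a system makes essentially no forward progress.

    import math

    def daly_interval(mtbf_s, ckpt_s):
        # Young/Daly first-order optimum: tau = sqrt(2 * C * MTBF).
        return math.sqrt(2.0 * ckpt_s * mtbf_s)

    def efficiency(mtbf_s, ckpt_s, restart_s):
        # Fraction of wall-clock time doing useful work: pay one
        # checkpoint per interval, plus (on each failure) an expected
        # half interval of lost work and a full restart.
        tau = daly_interval(mtbf_s, ckpt_s)
        useful = (tau - ckpt_s) / tau
        rework = (tau / 2.0 + restart_s) / mtbf_s
        return useful - rework

    # MTBF 1 h, restart 30 min (from the text), checkpoint 10 min (assumed).
    print(efficiency(3600.0, 600.0, 1800.0))  # approx -0.08

A negative value means the machine would spend more time checkpointing, restarting, and recomputing lost work than advancing the application.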
The FTXS 2013 program committee includes leaders from government, academia, and industry in at least seven countries. We hope that you will find this program interesting and thought-provoking, and that the workshop will provide you with a valuable opportunity to share ideas with researchers and practitioners from institutions around the world.
Proceeding Downloads
Toward resilient algorithms and applications
Large-scale computing platforms have always dealt with unreliability coming from many sources. In contrast, applications for large-scale systems have generally assumed a fairly simplistic failure model: the computer is a reliable digital machine, with ...
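As one illustration of what an algorithm-level resilience hook can look like (a minimal sketch under assumed conventions, not the approach advocated in this paper), the Jacobi solver below recomputes its residual each sweep and discards any sweep whose residual jumps implausibly, as a silent error might cause:

    import numpy as np

    def checked_jacobi(A, b, iters=200, tol=1e-8):
        # Jacobi iteration with a cheap algorithm-level sanity check:
        # a sweep whose residual norm jumps sharply (a possible silent
        # error) is discarded and recomputed from the trusted iterate.
        D = np.diag(A)
        R = A - np.diagflat(D)
        x = np.zeros_like(b)
        last_res = np.inf
        for _ in range(iters):
            x_new = (b - R @ x) / D
            res = np.linalg.norm(b - A @ x_new)
            if res > 10.0 * last_res:  # heuristic threshold (assumption)
                continue               # reject the suspect sweep
            x, last_res = x_new, res
            if res < tol:
                break
        return x

The check costs one extra matrix-vector product per sweep and assumes a diagonally dominant system, so that the residual normally decreases monotonically.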
Fault tolerance using lower fidelity data in adaptive mesh applications
Many high-performance scientific simulation codes use checkpointing for multiple reasons. In addition to providing the flexibility to complete a simulation across multiple job submissions, it has also served as an adequate recovery mechanism up to the current ...
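One way to read "lower fidelity data" is to checkpoint a down-sampled copy of a field and rebuild a full-resolution restart state by interpolation; the sketch below illustrates the idea on a 1-D array (an illustration of the general trade-off, not this paper's scheme):

    import numpy as np

    def cheap_checkpoint(field, stride=4):
        # Store every stride-th sample: ~1/stride of the full I/O volume.
        return field[::stride]

    def restore(coarse, n, stride=4):
        # Rebuild a full-resolution field by linear interpolation.
        xs = np.arange(0, n, stride)
        return np.interp(np.arange(n), xs, coarse)

    field = np.sin(np.linspace(0, 2 * np.pi, 1024))
    ckpt = cheap_checkpoint(field)
    approx = restore(ckpt, field.size)
    print(np.max(np.abs(field - approx)))  # maximum reconstruction error

Writing one sample in four cuts checkpoint I/O volume to a quarter, at the cost of a bounded interpolation error in the restarted state.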
Circuits for resilient systems
Timing and functional failures caused by process, voltage, and temperature (PVT) variations pose major challenges to achieving energy-efficient performance in multi-core and many-core processor designs in nanoscale CMOS. Radiation-induced soft errors and ...
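Circuit-level hardening does not reduce to a code listing, but the majority-voting principle behind many resilient designs can be shown in a few lines; the triple-modular-redundancy voter below is a software analog offered purely for intuition, not a description of the circuits discussed here:

    def tmr(f, *args):
        # Triple modular redundancy: run the computation three times
        # and majority-vote; a single corrupted result is out-voted.
        a, b, c = f(*args), f(*args), f(*args)
        if a == b or a == c:
            return a
        if b == c:
            return b
        raise RuntimeError("no majority: double fault or nondeterminism")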
Neutron sensitivity and software hardening strategies for matrix multiplication and FFT on graphics processing units
In this paper, we compare the radiation response of GPUs executing matrix multiplication and FFT algorithms. The experimental results demonstrate that, for both algorithms, the output is affected by multiple errors in the majority of cases. The ...
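A widely used software-hardening strategy for matrix multiplication in this literature is algorithm-based fault tolerance (ABFT), which appends checksum rows and columns so that corrupted outputs can be detected; a minimal numpy sketch follows (the generic Huang-Abraham scheme, not necessarily the exact strategy evaluated in the paper):

    import numpy as np

    def abft_matmul(A, B, tol=1e-6):
        # Extend A with a column-checksum row and B with a row-checksum
        # column, multiply, then verify the checksums of the product.
        Ac = np.vstack([A, A.sum(axis=0)])
        Br = np.hstack([B, B.sum(axis=1, keepdims=True)])
        C = Ac @ Br
        body = C[:-1, :-1]
        row_ok = np.allclose(C[-1, :-1], body.sum(axis=0), atol=tol)
        col_ok = np.allclose(C[:-1, -1], body.sum(axis=1), atol=tol)
        if not (row_ok and col_ok):
            raise RuntimeError("checksum mismatch: corrupted result")
        return body

    A, B = np.random.rand(64, 64), np.random.rand(64, 64)
    C = abft_matmul(A, B)

Because multiple errors are common, as the paper observes, a production version would compare the row and column checksums element-wise to locate (and possibly correct) faulty entries rather than only raising an error.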
Using unreliable virtual hardware to inject errors in extreme-scale systems
Fault tolerance is a key challenge for next-generation extreme-scale systems. As systems scale, the Mean Time To Interrupt (MTTI) decreases proportionally. As a result, extreme-scale systems are likely to experience higher rates of failure in the future. ...
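Virtualized injectors corrupt state beneath the operating system; the in-process sketch below conveys the basic mechanism by flipping one random bit in a buffer (a toy stand-in, not the virtual-hardware approach of the paper):

    import random
    import numpy as np

    def inject_bit_flip(buf, rng=random):
        # Flip one random bit in a numpy array, viewed as raw bytes,
        # crudely emulating a memory soft error.
        raw = buf.view(np.uint8)
        i = rng.randrange(raw.size)
        raw[i] ^= np.uint8(1 << rng.randrange(8))

    x = np.ones(16)
    inject_bit_flip(x)
    print(np.where(x != 1.0))  # locate the corrupted element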
Fault detection in multi-core processors using chaotic maps
Exascale systems built using multi-core processors are expected to experience several component faults during code executions lasting for hours. It is important to detect faults in processor cores so that faulty cores can be removed from scheduler pools,...
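The property such detectors exploit is that a chaotic map amplifies any arithmetic perturbation exponentially, so even a tiny computational fault becomes macroscopic after a few dozen iterations; a sketch using the logistic map (illustrative, with an artificially injected error) follows:

    def logistic_trajectory(x0, steps, fault_at=None):
        # Iterate the logistic map x <- 4x(1-x); its sensitivity to
        # tiny perturbations makes a small arithmetic error grow
        # exponentially, which is what chaotic-map detectors exploit.
        x = x0
        for i in range(steps):
            x = 4.0 * x * (1.0 - x)
            if i == fault_at:
                x += 1e-12  # injected arithmetic error (illustration)
        return x

    ref = logistic_trajectory(0.3, 100)
    bad = logistic_trajectory(0.3, 100, fault_at=10)
    print(abs(ref - bad) > 1e-3)  # the tiny fault is now macroscopic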
Replication for send-deterministic MPI HPC applications
Replication has recently gained attention in the context of fault tolerance for large-scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent of the underlying library. In this paper, we evaluate the ...
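Send-determinism is what makes replica comparison tractable: a correct run emits the same message sequence every time, so two replicas can be checked message by message. The MPI-free toy below simulates that comparison (an illustration of the property, not the paper's protocol):

    def run_replica(work, faulty=False):
        # A send-deterministic code emits the same sequence of sends
        # on every correct run; here "sends" are appended to a list.
        sends = []
        acc = 0
        for v in work:
            acc += v * v
            sends.append(acc)
        if faulty:
            sends[3] += 1  # simulated corruption in one replica
        return sends

    work = list(range(10))
    primary, mirror = run_replica(work), run_replica(work, faulty=True)
    diverged = next(i for i, (a, b) in enumerate(zip(primary, mirror))
                    if a != b)
    print("first divergent send:", diverged)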
Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system
Both energy efficiency and system reliability are significant concerns for exascale high-performance computing. In such large HPC systems, applications are required to conduct massive I/O operations to local storage devices (e.g., a NAND flash ...
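A back-of-envelope model makes the optimization target concrete: checkpoint energy is roughly the device's active power during the write plus idle power for the rest of the interval. All parameter names and numbers below are assumptions for illustration:

    def checkpoint_energy_j(size_gb, bw_gb_s, active_w, idle_w, window_s):
        # Energy (joules) for one checkpoint window: the flash device
        # draws active_w while the checkpoint streams out and idle_w
        # for the remainder of the window.
        write_s = size_gb / bw_gb_s
        return active_w * write_s + idle_w * (window_s - write_s)

    # e.g. a 64 GB checkpoint, 1 GB/s device, 8 W active, 0.5 W idle,
    # one checkpoint per hour (all assumed figures):
    print(checkpoint_energy_j(64, 1.0, 8.0, 0.5, 3600))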
When is multi-version checkpointing needed?
Semiconductor technology scaling and increasing power concerns, combined with system scale, make fault management a growing concern in high-performance computing systems. A greater variety of errors, higher error rates, longer detection intervals, ...
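The question matters because a long detection interval means the most recent checkpoint may already contain the latent error, so several versions must be retained; the sketch below keeps a small ring of checkpoints and restores the newest one predating the estimated corruption time (an illustrative policy, not the paper's analysis):

    from collections import deque

    class MultiVersionStore:
        # Keep the last k checkpoints; on a latent error detected late,
        # restore the newest version taken before the error occurred.
        def __init__(self, k):
            self.versions = deque(maxlen=k)

        def save(self, t, state):
            self.versions.append((t, state))

        def restore_before(self, t_error):
            for t, state in reversed(self.versions):
                if t <= t_error:
                    return t, state
            raise RuntimeError("no sufficiently old checkpoint retained")

    store = MultiVersionStore(k=4)
    for t in range(0, 50, 10):
        store.save(t, {"step": t})
    # Corruption is detected at t=47 but estimated to date from t=25:
    print(store.restore_before(25))  # -> (20, {'step': 20})

Roughly, with detection latency L and checkpoint interval T, at least ceil(L/T) + 1 versions must be kept to guarantee a clean rollback point.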