Transient errors in computer systems can cause abnormal behavior and degrade system reliability, data integrity and availability. This is especially true in a space environment where transient errors are a major cause of concern. Fault avoidance techniques such as radiation hardening and shielding have been major approaches to obtaining the required reliability. Recently, unhardened Commercial Off-The-Shelf (COTS) components have been investigated for space applications because of their higher density, faster clock rate, lower power consumption and lower price.
Since COTS components are not radiation hardened, and it is desirable to avoid shielding, Software-Implemented Hardware Fault Tolerance (SIHFT) has been proposed to increase the data integrity and availability of COTS systems. This dissertation presents three new SIHFT techniques for error detection: Control Flow Checking by Software Signatures (CFCSS), Error Detection by Duplicated Instructions (EDDI), and Error Detection by Diverse Data and Duplicated Instructions (ED4I).
Previously studied software techniques are either inadequate or require assistance from special hardware, but CFCSS, EDDI and ED4I are pure software methods. In CFCSS, signatures are embedded into the program during compilation and compared with run-time signatures during execution. In EDDI, instructions are duplicated at compile time and scheduled by exploiting Instruction-Level Parallelism (ILP) to reduce performance overhead. CFCSS and EDDI detect transient errors but not permanent faults. In ED4I, by contrast, a program is compiled into a new program that operates on diverse data, so that it can also detect a permanent fault.
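The signature comparison described for CFCSS can be illustrated with a minimal sketch. It assumes a hypothetical three-block program in which every block has exactly one legal predecessor; the block names, signature values, and the `run` helper are invented for illustration and are not taken from the dissertation. Blocks with several predecessors need additional handling that is omitted here.

```python
# Hypothetical basic blocks, each with a unique compile-time signature.
SIGNATURES = {"B1": 0b0001, "B2": 0b0010, "B3": 0b0100}

# Hypothetical control-flow graph: block -> its single legal predecessor.
PREDECESSOR = {"B2": "B1", "B3": "B2"}

# Computed once, "at compile time": difference between a block's signature
# and its predecessor's signature.
DIFF = {blk: SIGNATURES[pred] ^ SIGNATURES[blk]
        for blk, pred in PREDECESSOR.items()}


def run(path):
    """Follow `path` (a list of block names starting at the entry block).

    Returns True if every transition passes the signature check,
    False as soon as an illegal transition is detected.
    """
    g = SIGNATURES[path[0]]           # run-time signature register
    for blk in path[1:]:
        g ^= DIFF[blk]                # update on entry to the block
        if g != SIGNATURES[blk]:      # compare with the embedded signature
            return False              # control-flow error detected
    return True


print(run(["B1", "B2", "B3"]))  # legal path
print(run(["B1", "B3"]))        # erroneous jump that skips B2
```

In this sketch the legal path keeps the run-time signature equal to each block's embedded signature, while the jump from B1 directly to B3 produces a mismatch and is reported.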
Our fault injection experiment, which simulates bit flips in memory, shows that EDDI provides over 98% fault coverage without any extra hardware. Because of instruction duplication, code size overhead is approximately 100%, but by exploiting ILP, we reduce the performance overhead to 61% on average. In the control-flow checking experiment, which simulates branching faults, CFCSS provides 97% fault coverage. In addition, when we duplicate programs or instructions, we can use ED4I to enhance data integrity in the system.
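The duplicate-and-compare idea behind EDDI, together with a single bit flip as the fault model, can be sketched at source level as follows. This is only an illustration under stated assumptions: the real technique duplicates instructions at compile time, whereas here the duplication is written by hand, and `flip_bit`, `checked_sum`, and the `fault_at` / `fault_bit` parameters are hypothetical names.

```python
def flip_bit(value, bit):
    """Model a transient error by inverting one bit of an integer."""
    return value ^ (1 << bit)


def checked_sum(values, fault_at=None, fault_bit=0):
    """Sum `values` twice, in a master and a shadow variable, and compare.

    If `fault_at` is an index into `values`, a bit flip is injected into
    the master copy after that element has been added.
    """
    total = 0        # master copy
    total_dup = 0    # shadow copy produced by the duplicated computation
    for i, v in enumerate(values):
        total += v
        total_dup += v               # duplicated operation
        if i == fault_at:
            total = flip_bit(total, fault_bit)  # injected fault, master only
    if total != total_dup:           # comparison before the result is used
        raise RuntimeError("mismatch between master and shadow copy")
    return total


print(checked_sum([1, 2, 3]))        # fault-free run
try:
    checked_sum([1, 2, 3], fault_at=1, fault_bit=4)
except RuntimeError as err:
    print("detected:", err)
```

Because the injected flip corrupts only one of the two copies, the final comparison exposes it. A fault that affected both copies identically would escape this check.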
Furthermore, for space experiments, we implemented EDDI and CFCSS in sort and FFT programs running on the ARGOS satellite. During a 136-day period, our techniques detected 198 out of 203 errors, corresponding to 98% error detection coverage.