Abstract
Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is a complex area of mathematics, its practical implementation is comparatively simple. However, compliance with the theory is easily lost on the way to an actual implementation, which ultimately jeopardizes the intended fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from maths to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. Applying these measures allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.
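The basic idea of an AN code can be illustrated in a few lines: a value N is multiplied by a constant key A, so every valid code word is a multiple of A, and a non-zero remainder modulo A signals corruption. The following is only a minimal sketch under stated assumptions, not the Combined Redundancy implementation: the function names and the 32-bit word width are hypothetical, and A = 58659 is simply one of the constants listed in the notes of this article.

```python
A = 58659        # constant key; taken from the list of constants in the notes
WORD_BITS = 32   # assumed width of an encoded machine word (illustrative only)


def encode(n: int) -> int:
    """Encode a plain value n as the code word n * A."""
    return n * A


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Check the code word and recover the plain value."""
    if not is_valid(code_word):
        raise ValueError("corrupted code word")
    return code_word // A


def add_encoded(x: int, y: int) -> int:
    """Addition works directly on code words: x*A + y*A == (x + y)*A."""
    return x + y


if __name__ == "__main__":
    total = add_encoded(encode(20), encode(22))
    print("decoded sum:", decode(total))

    # Flip each bit of a code word in turn and count the detected corruptions.
    word = encode(42)
    detected = sum(not is_valid(word ^ (1 << k)) for k in range(WORD_BITS))
    print("single-bit flips detected:", detected, "of", WORD_BITS)
```

Because A is odd and greater than one, a single bit flip changes the code word by a power of two, which is never a multiple of A, so every single-bit flip is caught by the check. This sketch makes no claim about multi-bit errors; how well those are detected depends on the choice of A.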
Notes
Named by the integers \(A\) (constant key) and \(N\) (value).
Super \(A\)s: 58,659, 59,665, 63,157, 63,859, and 63,877.
Based on a brute-force search. Code available at http://www4.cs.fau.de/Research/CoRed
This is by definition the case for RISC systems. For a CISC architecture, like IA32, this has to be ensured explicitly.
The signatures were chosen by the methods discussed in Sect. 3.2 and have a pairwise minimal Hamming distance of six.
Five input parameter sets for the four equality sets, and one for signal_due().
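The pairwise minimal Hamming distance mentioned in the note on signatures can be checked mechanically. The following is a minimal sketch with hypothetical function names; the example values are made up for illustration and are not the signatures used in the article.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def min_pairwise_distance(values) -> int:
    """Smallest Hamming distance between any two distinct entries."""
    return min(hamming_distance(a, b) for a, b in combinations(values, 2))


if __name__ == "__main__":
    # Illustrative values only: three 12-bit patterns.
    example_signatures = [0x000, 0x03F, 0xFC0]
    print("minimal pairwise distance:", min_pairwise_distance(example_signatures))
```

A candidate set of signatures would be accepted only if this minimum reaches the required distance.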
Acknowledgments
This work was partly supported by the Bavarian Ministry of State for Economics, Traffic, and Technology under the (EU EFRE funds) Grant No. 0704/883 25 and the German Research Foundation (DFG) priority program SPP 1500 under grant no. LO 1719/1-2 and SP 968/5-2. Implementation and further experimental results: http://www4.cs.fau.de/Research/CoRed.
Cite this article
Hoffmann, M., Ulbrich, P., Dietrich, C. et al. Experiences with software-based soft-error mitigation using AN codes. Software Qual J 24, 87–113 (2016). https://doi.org/10.1007/s11219-014-9260-4
DOI: https://doi.org/10.1007/s11219-014-9260-4