Adaptive Fault Tolerance for Scalable Cluster Computing in Space

Authors:

Andrew A. Shapiro,

Paul L. Springer,

Hans P. Zima

International Journal of High Performance Computing Applications, Volume 23, Issue 3

Pages 227 - 241

https://doi.org/10.1177/1094342009106190

Published: 01 August 2009

Abstract

Future missions of deep-space exploration face the challenge of building more capable autonomous spacecraft and planetary rovers. Given the communication latencies and bandwidth limitations of such missions, increased autonomy becomes mandatory, along with enhanced on-board computational capabilities for operation in deep space or in time-critical situations. This will result in dramatic changes in the way missions are conducted and supported by on-board computing systems. Specifically, the traditional approach of relying exclusively on radiation-hardened hardware and modular redundancy will not be able to deliver the required computational power. As a consequence, such systems are expected to include high-capability, low-power components based on emerging commercial-off-the-shelf (COTS) multi-core technology. In this paper we describe the design of a generic framework for introspection that supports runtime monitoring and analysis of program execution as well as feedback-oriented recovery from faults. Our focus is on providing flexible software fault tolerance matched to the requirements and properties of applications by exploiting knowledge that is either contained in an application knowledge base, provided by users, or automatically derived from specifications. A prototype implementation is currently in progress at the Jet Propulsion Laboratory, California Institute of Technology, targeting a cluster of Cell Broadband Engines.

Cited By

Chen, J., Ebnenasir, A. and Kulkarni, S. 2014. The Complexity of Adding Multitolerance. ACM Transactions on Autonomous and Adaptive Systems 9(3):1-33. Online publication date: 7-Oct-2014.
https://dl.acm.org/doi/10.1145/2629664

James, M., Springer, P. and Zima, H. 2010. Adaptive fault tolerance for many-core based space-borne computing. In Proceedings of the 16th International Euro-Par Conference on Parallel Processing: Part II, pp. 260-274. Online publication date: 31-Aug-2010.
https://dl.acm.org/doi/10.5555/1885276.1885304

Published In

International Journal of High Performance Computing Applications Volume 23, Issue 3

August 2009

105 pages

ISSN: 1094-3420

Publisher

Sage Publications, Inc.

United States
