Adaptive Fault Tolerance for Scalable Cluster Computing in Space

Published: 01 August 2009

Abstract

Future deep-space exploration missions face the challenge of building more capable autonomous spacecraft and planetary rovers. Given the communication latencies and bandwidth limitations of such missions, increased autonomy becomes mandatory, along with enhanced on-board computational capabilities for operation in deep space or in time-critical situations. This will dramatically change the way missions are conducted and supported by on-board computing systems. Specifically, the traditional approach of relying exclusively on radiation-hardened hardware and modular redundancy will not be able to deliver the required computational power. As a consequence, such systems are expected to include high-capability, low-power components based on emerging commercial-off-the-shelf (COTS) multi-core technology. In this paper we describe the design of a generic framework for introspection that supports runtime monitoring and analysis of program execution as well as feedback-oriented recovery from faults. Our focus is on providing flexible software fault tolerance matched to the requirements and properties of applications by exploiting knowledge that is contained in an application knowledge base, provided by users, or automatically derived from specifications. A prototype implementation is currently in progress at the Jet Propulsion Laboratory, California Institute of Technology, targeting a cluster of Cell Broadband Engines.
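The monitoring, analysis, and recovery loop described in the abstract can be illustrated with a minimal sketch. This is not the framework from the paper: the names `Assertion`, `Monitor`, and `run_step`, the checkpoint-and-retry policy, and the injected fault are all assumptions made for illustration. The sketch only shows the general idea of checking application-specific assertions at run time and feeding the result back into a recovery action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Assertion:
    """A named predicate over application state (hypothetical knowledge-base entry)."""
    name: str
    holds: Callable[[State], bool]


@dataclass
class Monitor:
    """Checks assertions after each step; on violation, retries from a checkpoint."""
    assertions: List[Assertion]
    max_retries: int = 3
    log: List[str] = field(default_factory=list)

    def run_step(self, step: Callable[[State], State], state: State) -> State:
        checkpoint = dict(state)  # snapshot taken before the step runs
        for attempt in range(1, self.max_retries + 1):
            # Each attempt works on a fresh copy, so a faulty result is discarded.
            candidate = step(dict(checkpoint))
            violated = [a.name for a in self.assertions if not a.holds(candidate)]
            if not violated:
                return candidate
            self.log.append(f"attempt {attempt}: violated {violated}")
        raise RuntimeError("step failed after retries; escalate to a higher-level handler")


def make_flaky_step() -> Callable[[State], State]:
    """Build a step whose first call simulates a transient fault (assumed fault model)."""
    calls = {"n": 0}

    def step(state: State) -> State:
        calls["n"] += 1
        state["sum"] = state["sum"] + state["x"]
        if calls["n"] == 1:
            state["sum"] = -1.0  # injected corruption, standing in for a bit flip
        return state

    return step


monitor = Monitor([Assertion("sum_nonnegative", lambda s: s["sum"] >= 0.0)])
result = monitor.run_step(make_flaky_step(), {"x": 2.0, "sum": 1.0})
print(result, monitor.log)
```

In this sketch the assertion plays the role of application knowledge, and the retry from a checkpoint is one possible recovery action; a real system would choose among recovery strategies based on the kind of fault detected.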

Cited By

  • (2014) The Complexity of Adding Multitolerance. ACM Transactions on Autonomous and Adaptive Systems 9:3 (1-33). DOI: 10.1145/2629664. Online publication date: 7-Oct-2014
  • (2010) Adaptive fault tolerance for many-core based space-borne computing. Proceedings of the 16th International Euro-Par Conference on Parallel Processing: Part II (260-274). DOI: 10.5555/1885276.1885304. Online publication date: 31-Aug-2010

Published In

International Journal of High Performance Computing Applications, Volume 23, Issue 3
August 2009
105 pages

Publisher

Sage Publications, Inc.

United States

Author Tags

  1. fault tolerance
  2. introspection
  3. space-borne computing
  4. verification and validation

Qualifiers

  • Article
