DOI: 10.1109/ISCA.2005.21

Design and Evaluation of Hybrid Fault-Detection Systems

Published: 01 May 2005

Abstract

As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
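
The abstract names the Mean Work To Failure (MWTF) metric but does not give its formula. The sketch below is a minimal illustration, assuming MWTF is the amount of work completed divided by the number of errors encountered, which can be written as the reciprocal of (raw error rate × architectural vulnerability factor × execution time per unit of work). The function name, parameter names, and all numbers are hypothetical and are not taken from the paper's evaluation.

```python
def mean_work_to_failure(raw_error_rate, avf, exec_time_per_unit_work):
    """Return MWTF in units of work per failure.

    Assumed definition (not stated in the abstract):
        MWTF = 1 / (raw_error_rate * avf * exec_time_per_unit_work)

    raw_error_rate: raw faults per unit time (hypothetical units)
    avf: fraction of raw faults that become visible errors, in (0, 1]
    exec_time_per_unit_work: time to finish one unit of work, e.g. one
        benchmark run, in the same time unit as raw_error_rate
    """
    if raw_error_rate <= 0 or exec_time_per_unit_work <= 0:
        raise ValueError("rate and execution time must be positive")
    if not 0.0 < avf <= 1.0:
        raise ValueError("avf must be in (0, 1]")
    return 1.0 / (raw_error_rate * avf * exec_time_per_unit_work)


# Hypothetical comparison: a protected system that cuts the vulnerable
# fraction by 10x but takes 1.5x as long to finish the same work.
baseline = mean_work_to_failure(1e-6, 0.20, 1.0)
protected = mean_work_to_failure(1e-6, 0.02, 1.5)
improvement = protected / baseline
```

Because the denominator uses time per unit of work rather than an instruction count, a technique that inserts extra instructions is charged for its slowdown instead of being credited for executing more instructions, which is the property the abstract highlights.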



Published In

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture
June 2005
541 pages
ISBN: 076952270X
  • ACM SIGARCH Computer Architecture News, Volume 33, Issue 2
    ISCA 2005
    May 2005
    531 pages
    ISSN: 0163-5964
    DOI: 10.1145/1080695

Publisher

IEEE Computer Society

United States

Conference

ISCA05

Acceptance Rates

ISCA '05 Paper Acceptance Rate 45 of 194 submissions, 23%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

Cited By

  • (2024) Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability. ACM Computing Surveys 56:11 (1-76). DOI: 10.1145/3663672. Online publication date: 28-Jun-2024
  • (2024) Assessing the Impact of Compiler Optimizations on GPUs Reliability. ACM Transactions on Architecture and Code Optimization 21:2 (1-22). DOI: 10.1145/3638249. Online publication date: 12-Jan-2024
  • (2024) Can GPU performance increase faster than the code error rate? The Journal of Supercomputing 80:12 (16918-16946). DOI: 10.1007/s11227-024-06119-4. Online publication date: 1-Aug-2024
  • (2018) Adaptive and polymorphic VLIW processor to optimize fault tolerance, energy consumption, and performance. Proceedings of the 15th ACM International Conference on Computing Frontiers (54-61). DOI: 10.1145/3203217.3203238. Online publication date: 8-May-2018
  • (2017) Error-Efficient Computing Systems. Foundations and Trends in Electronic Design Automation 11:4 (362-461). DOI: 10.1561/1000000049. Online publication date: 18-Dec-2017
  • (2016) An Accurate Cross-Layer Approach for Online Architectural Vulnerability Estimation. ACM Transactions on Architecture and Code Optimization 13:3 (1-27). DOI: 10.1145/2975588. Online publication date: 17-Sep-2016
  • (2016) Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR. ACM Transactions on Embedded Computing Systems 16:2 (1-26). DOI: 10.1145/2930667. Online publication date: 19-Dec-2016
  • (2016) nZDC. Proceedings of the 53rd Annual Design Automation Conference (1-6). DOI: 10.1145/2897937.2898054. Online publication date: 5-Jun-2016
  • (2016) A Case for Acoustic Wave Detectors for Soft-Errors. IEEE Transactions on Computers 65:1 (5-18). DOI: 10.1109/TC.2015.2419652. Online publication date: 1-Jan-2016
  • (2015) Clover. ACM SIGPLAN Notices 50:5 (1-10). DOI: 10.1145/2808704.2754959. Online publication date: 4-Jun-2015
