DOI: 10.1145/2968455.2968508
Research article

COMET: communication-optimised multi-threaded error-detection technique

Published: 01 October 2016

Abstract

Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against these, current error detection techniques use different types of redundancy at the hardware or the software level. A consequence of these additional protection mechanisms is that these systems tend to become slower. In particular, software error-detection techniques degrade performance considerably, limiting their uptake.
This paper focuses on software redundant multi-threading error detection, a compiler-based technique that makes use of redundant cores within a multi-core system to perform error checking. Implementations of this scheme feature two threads that execute almost the same code: the main thread runs the original code and the checker thread executes code to verify the correctness of the original. The main thread communicates the values that require checking to the checker thread to use in its comparisons.
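The main/checker arrangement described above can be illustrated with a minimal sketch. It is not the paper's compiler-generated code: the function `run_redundant`, the `fault_at` parameter, and the use of Python's `queue.Queue` in place of an inter-core queue are all assumptions made for illustration.

```python
import queue
import threading


def run_redundant(inputs, compute, fault_at=None):
    """Run `compute` on a main thread and re-check each result on a checker thread.

    Returns the indices at which the checker observed a mismatch.
    `fault_at` (hypothetical) selects one index whose result is corrupted
    to imitate a transient error in the main thread.
    """
    channel = queue.Queue()  # stands in for the inter-core communication queue
    mismatches = []

    def main_thread():
        for i, x in enumerate(inputs):
            result = compute(x)
            if i == fault_at:
                result ^= 1  # flip the lowest bit to simulate a soft error
            channel.put((i, x, result))  # one message per value that needs checking
        channel.put(None)  # end-of-stream marker

    def checker_thread():
        while (msg := channel.get()) is not None:
            i, x, result = msg
            if compute(x) != result:  # redundant execution followed by comparison
                mismatches.append(i)

    threads = [
        threading.Thread(target=main_thread),
        threading.Thread(target=checker_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

Calling `run_redundant(range(8), f)` with an integer-valued `f` reports no mismatches, while passing `fault_at=3` makes the checker flag index 3. Note that every checked value costs one queue operation on each side, which is the fine-grained communication pattern the abstract goes on to discuss.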
We identify a major performance bottleneck in existing schemes: poorly performing inter-core communication and the generated code associated with it. Our study shows this is a major performance impediment within existing techniques since the two threads require extremely fine-grained communication, on the order of every few instructions. We alleviate this bottleneck with a series of code generation optimisations at the compiler level. We propose COMET (Communication-Optimised Multi-threaded Error-detection Technique), which improves performance across the NAS parallel benchmarks by 31.4% (on average) compared to the state-of-the-art, without affecting fault-coverage.
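To make the cost of fine-grained communication concrete, the sketch below counts how many synchronised queue messages are needed when values are sent one at a time versus in batches. Batching is used here only as a generic, well-known way to reduce communication frequency; the abstract does not say which optimisations COMET applies, so this should not be read as COMET's method. The function `check_with_batches` and its `batch_size` parameter are hypothetical.

```python
import queue
import threading


def check_with_batches(values, compute, batch_size):
    """Send (input, result) pairs to a checker thread in groups of `batch_size`.

    Returns (mismatch_count, message_count), where message_count is the
    number of synchronised queue transfers the checker received.
    """
    channel = queue.Queue()
    stats = {"mismatches": 0, "messages": 0}

    def checker():
        while (batch := channel.get()) is not None:
            stats["messages"] += 1
            for x, result in batch:
                if compute(x) != result:  # re-execute and compare
                    stats["mismatches"] += 1

    t = threading.Thread(target=checker)
    t.start()

    batch = []
    for x in values:
        batch.append((x, compute(x)))
        if len(batch) == batch_size:
            channel.put(batch)  # one synchronised transfer per full batch
            batch = []
    if batch:
        channel.put(batch)  # flush the final partial batch
    channel.put(None)  # end-of-stream marker
    t.join()
    return stats["mismatches"], stats["messages"]
```

With 100 values, a batch size of 1 needs 100 messages, whereas a batch size of 16 needs 7 (six full batches plus one partial batch), and the same comparisons are performed in both cases.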


Information

Published In

CASES '16: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems
October 2016
187 pages
ISBN: 9781450344821
DOI: 10.1145/2968455
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. code generation
  2. communication optimisations
  3. error detection
  4. soft errors

Qualifiers

  • Research-article

Conference

ESWEEK'16: Twelfth Embedded System Week
October 1 - 7, 2016
Pittsburgh, Pennsylvania

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 13 Dec 2024

Cited By

  • (2024) Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability. ACM Computing Surveys 56:11 (1-76). DOI: 10.1145/3663672. Online publication date: 6-May-2024
  • (2023) Bare-Metal Redundant Multi-Threading on Multicore SoCs Under Neutron Irradiation. IEEE Transactions on Nuclear Science 70:8 (1643-1651). DOI: 10.1109/TNS.2023.3247129. Online publication date: Aug-2023
  • (2022) Survey of Software-Implemented Soft Error Protection. Electronics 11:3 (456). DOI: 10.3390/electronics11030456. Online publication date: 3-Feb-2022
  • (2022) Hybrid Lockstep Technique for Soft Error Mitigation. IEEE Transactions on Nuclear Science 69:7 (1574-1581). DOI: 10.1109/TNS.2022.3149867. Online publication date: Jul-2022
  • (2022) SoK: A Survey on Redundant Execution Technology. 2021 International Conference on Advanced Computing and Endogenous Security (1-14). DOI: 10.1109/IEEECONF52377.2022.10013333. Online publication date: 21-Apr-2022
  • (2021) Turnpike: Lightweight Soft Error Resilience for In-Order Cores. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (654-666). DOI: 10.1145/3466752.3480042. Online publication date: 18-Oct-2021
  • (2021) BROFY: Towards Essential Integrity Protection for Microservices. 2021 40th International Symposium on Reliable Distributed Systems (SRDS) (154-163). DOI: 10.1109/SRDS53918.2021.00024. Online publication date: Sep-2021
  • (2021) FERNANDO: A Software Transient Fault Tolerance Approach for Embedded Systems Based on Redundant Multi-Threading. IEEE Access 9 (67154-67166). DOI: 10.1109/ACCESS.2021.3077190. Online publication date: 2021
  • (2021) Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03892-4. Online publication date: 9-Jun-2021
  • (2021) Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03804-6. Online publication date: 10-May-2021
