The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors

Abstract

Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to the most diverse types of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Properties are proven for the number of tests/monitoring messages required, latency for event detection, as well as completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.

Acknowledgements

This work was partially supported by the Brazilian Research Council (CNPq, Conselho Nacional de Desenvolvimento Científico e Tecnológico), Grant 308959/2020-5, and by FAPESP/MCTIC/CGI, Grant 2021/06923-0.

Author information

Authors and Affiliations

Federal University of Paraná, Curitiba, Brazil
Elias P. Duarte Jr.
Western Paraná State University (UNIOESTE), Cascavel, Brazil
Luiz A. Rodrigues
Technological Federal University of Paraná (UTFPR), Toledo, Brazil
Edson T. Camargo
Federal University of Santa Maria (UFSM), Santa Maria, Brazil
Rogério C. Turchetti

Corresponding author

Correspondence to Elias P. Duarte Jr.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest regarding this manuscript and its publication in Computing. The research/paper is fully compliant with all ethical standards. Elias P. Duarte Jr. is an Associate Editor of the Computing journal.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Duarte, E.P., Rodrigues, L.A., Camargo, E.T. et al. The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing 105, 2821–2845 (2023). https://doi.org/10.1007/s00607-023-01211-8

Received: 20 October 2022
Accepted: 07 August 2023
Published: 18 August 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s00607-023-01211-8