Abstract
Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to the most diverse types of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Properties are proven for the number of tests/monitoring messages required, latency for event detection, as well as completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.
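To make the scalability argument concrete, the following minimal sketch counts monitoring relations per round when no process is faulty. It compares the all-monitor-all strategy with a logical ring in which each process monitors only its successor. The function names and the ring layout are assumptions chosen for illustration; this is not the vRing or vCube algorithm from the paper.

```python
def all_monitor_all(n: int) -> list[tuple[int, int]]:
    """Every process monitors every other process: n * (n - 1) pairs."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


def ring(n: int) -> list[tuple[int, int]]:
    """Hypothetical ring layout: process i monitors only (i + 1) mod n."""
    if n < 2:
        return []  # a single process has nobody to monitor
    return [(i, (i + 1) % n) for i in range(n)]


if __name__ == "__main__":
    # Print the number of monitoring pairs for a few system sizes.
    for n in (4, 16, 64):
        print(n, len(all_monitor_all(n)), len(ring(n)))
```

The quadratic growth of the first count against the linear growth of the second is the motivation for sparse monitoring topologies. The usual price is that information about an event must be relayed between processes, which affects detection latency.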
Acknowledgements
This work was partially supported by the Brazilian Research Council (CNPq, Conselho Nacional de Desenvolvimento Científico e Tecnológico), Grant 308959/2020-5, and by FAPESP/MCTIC/CGI, Grant 2021/06923-0.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding this manuscript and its publication in Computing. The research and the paper are fully compliant with all ethical standards. Elias P. Duarte Jr. is an Associate Editor of the Computing journal.
About this article
Cite this article
Duarte, E.P., Rodrigues, L.A., Camargo, E.T. et al. The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing 105, 2821–2845 (2023). https://doi.org/10.1007/s00607-023-01211-8
DOI: https://doi.org/10.1007/s00607-023-01211-8
Keywords
- Distributed systems
- Fault tolerance
- System-level diagnosis
- Failure detection
- Fault management
- Fault monitoring