
The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors

  • Regular Paper
  • Computing

Abstract

Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to the most diverse types of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Properties are proven for the number of tests/monitoring messages required, latency for event detection, as well as completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.
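To make the scalability contrast in the abstract concrete, the following minimal sketch counts the tests issued in one monitoring round under two strategies: all-monitor-all, and a simple ring in which each correct process tests its successors until it finds a correct one. The ring rule is an illustrative assumption and is not claimed to be the exact vRing or vCube algorithm from the paper; the function names `all_monitor_all`, `ring`, and `count_tests` are hypothetical.

```python
from typing import Dict, List, Set


def all_monitor_all(n: int, crashed: Set[int]) -> Dict[int, List[int]]:
    """Each correct process tests every other process once per round."""
    return {
        p: [q for q in range(n) if q != p]
        for p in range(n)
        if p not in crashed
    }


def ring(n: int, crashed: Set[int]) -> Dict[int, List[int]]:
    """Each correct process tests ring successors until it finds a correct one.

    Assumption for illustration only: this is a generic ring rule,
    not necessarily the rule used by vRing.
    """
    tests: Dict[int, List[int]] = {}
    for p in range(n):
        if p in crashed:
            continue  # crashed processes execute no tests
        targets: List[int] = []
        for step in range(1, n):
            q = (p + step) % n
            targets.append(q)
            if q not in crashed:
                break  # stop at the first correct successor
        tests[p] = targets
    return tests


def count_tests(assignment: Dict[int, List[int]]) -> int:
    """Total number of tests executed in one round."""
    return sum(len(targets) for targets in assignment.values())


if __name__ == "__main__":
    n = 8
    for crashed in (set(), {3}):
        print(
            sorted(crashed),
            count_tests(all_monitor_all(n, crashed)),
            count_tests(ring(n, crashed)),
        )
```

With 8 processes and no crashes, all-monitor-all issues 56 tests per round and the ring issues 8. With process 3 crashed, the counts are 49 and 8, because process 2 tests both 3 and 4. The quadratic versus linear growth is the reason a sparser testing assignment scales better, at the cost of needing extra rounds or forwarded information for every process to learn about every event.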



Acknowledgements

This work was partially supported by the Brazilian Research Council (CNPq, Conselho Nacional de Desenvolvimento Científico e Tecnológico) under Grant 308959/2020-5 and by FAPESP/MCTIC/CGI under Grant 2021/06923-0.

Author information

Corresponding author

Correspondence to Elias P. Duarte Jr.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest regarding this manuscript and its publication in Computing. The research and the paper fully comply with all ethical standards. Elias P. Duarte Jr. is an Associate Editor of the Computing journal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Duarte, E.P., Rodrigues, L.A., Camargo, E.T. et al. The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing 105, 2821–2845 (2023). https://doi.org/10.1007/s00607-023-01211-8


  • DOI: https://doi.org/10.1007/s00607-023-01211-8
