Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2210.02847 (cs)

[Submitted on 6 Oct 2022]

Title: A Distributed System-level Diagnosis Model for the Implementation of Unreliable Failure Detectors

Authors: Elias P. Duarte Jr., Luiz A. Rodrigues, Edson T. Camargo, Rogerio Turchetti

Abstract: Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the past decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to a wide variety of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Results are presented for the number of tests/monitoring messages required, the latency of event detection, and completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2210.02847 [cs.DC]
	(or arXiv:2210.02847v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2210.02847

Submission history

From: Elias Procopio Duarte Jr.
[v1] Thu, 6 Oct 2022 12:04:35 UTC (735 KB)
