Detecting failures in distributed systems with the Falcon spy network

JB Leners, H Wu, WL Hung, MK Aguilera, M Walfish

Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011 • dl.acm.org

A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems' unavailability when a crash occurs. This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.

Cited by 155
