research-article

Detecting failures in distributed systems with the Falcon spy network

Authors:

Joshua B. Leners,

Marcos K. Aguilera,

Michael Walfish

SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles

Pages 279 - 294

https://doi.org/10.1145/2043556.2043583

Published: 23 October 2011

Abstract

A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems' unavailability when a crash occurs.

This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
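The layered idea in the abstract can be illustrated with a small sketch. This is not Falcon's implementation: the `Report` states, the `Spy` and `Detector` classes, and the layer names used in the usage note are all hypothetical, chosen only to show how a detector might consult spies from the smallest component outward and report DOWN only after a crash is confirmed or a component has been killed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Report(Enum):
    UP = "up"                    # target confirmed alive
    DOWN = "down"                # target confirmed crashed
    SUSPECT = "suspect"          # spy answers but cannot confirm either way
    UNREACHABLE = "unreachable"  # the spy itself did not answer


@dataclass
class Spy:
    """Hypothetical spy for one layer; every name here is an assumption."""
    layer: str
    probe: Callable[[], Report]  # how the target looks from this layer
    kill: Callable[[], None]     # stops the smallest component this layer controls


@dataclass
class Detector:
    spies: List[Spy]  # ordered from the smallest component outward

    def check(self) -> Report:
        for spy in self.spies:
            report = spy.probe()
            if report in (Report.UP, Report.DOWN):
                return report  # a definite answer ends the search
            if report is Report.SUSPECT:
                # Kill first, then report: DOWN is true by the time it is returned.
                spy.kill()
                return Report.DOWN
            # UNREACHABLE: escalate to the spy of the enclosing layer.
        # No spy answered, so no verdict is given (DOWN is never guessed).
        return Report.UNREACHABLE
```

For instance, with an unreachable "application" spy and an "os" spy that returns `Report.SUSPECT`, `check()` calls the "os" spy's `kill` and then returns `Report.DOWN`. The application spy's `kill` is never invoked, which mirrors the stated aim of killing only the smallest needed component.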

Index Terms

Detecting failures in distributed systems with the Falcon spy network
1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
    2. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Information & Contributors

Information

Published In

SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles

October 2011

417 pages

ISBN:9781450309776

DOI:10.1145/2043556

General Chair: Ted Wobber (MSR Silicon Valley)
Program Chair: Peter Druschel (MPI-SWS)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

INESC: Systems and Computer Engineering Institute
SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2011

Qualifiers

Research-article

Conference

SOSP '11

Sponsor:

INESC
SIGOPS

SOSP '11: ACM SIGOPS 23rd Symposium on Operating Systems Principles

October 23 - 26, 2011

Cascais, Portugal

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%

Bibliometrics

Total Citations: 71
Total Downloads: 993
Downloads (Last 12 months): 34
Downloads (Last 6 weeks): 7

Reflects downloads up to 12 Dec 2024

Citations

Cited By

Zhu Z, Ni N, Huang Y, Sun Y, Jia Z, Kim N, Witchel E (2024). Lupin: Tolerating Partial Failures in a CXL Pod. Proceedings of the 2nd Workshop on Disruptive Memory Systems (41-50). DOI: 10.1145/3698783.3699377. Online publication date: 3-Nov-2024
https://dl.acm.org/doi/10.1145/3698783.3699377
Xiao J, Li Q, Zhao D, Zuo X, Tang W, Jiang Y (2024). Themis: A passive-active hybrid framework with in-network intelligence for lightweight failure localization. Computer Networks 255 (110836). DOI: 10.1016/j.comnet.2024.110836. Online publication date: Dec-2024
https://doi.org/10.1016/j.comnet.2024.110836
Lu R, Xu E, Zhang Y, Zhu F, Zhu Z, Wang M, Zhu Z, Xue G, Shu J, Li M, Wu J, Naor D, Goel A (2023). PERSEUS. Proceedings of the 21st USENIX Conference on File and Storage Technologies (49-63). DOI: 10.5555/3585938.3585942. Online publication date: 21-Feb-2023
https://dl.acm.org/doi/10.5555/3585938.3585942
Lu R, Xu E, Zhang Y, Zhu F, Zhu Z, Wang M, Zhu Z, Xue G, Shu J, Li M, Wu J (2023). From Missteps to Milestones: A Journey to Practical Fail-Slow Detection. ACM Transactions on Storage 19:4 (1-28). DOI: 10.1145/3617690. Online publication date: 1-Nov-2023
https://dl.acm.org/doi/10.1145/3617690
Stein G, Rodrigues L, Duarte Jr. E, Arantes L (2023). Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems. Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing (40-49). DOI: 10.1145/3615366.3615420. Online publication date: 16-Oct-2023
https://dl.acm.org/doi/10.1145/3615366.3615420
Tennage P, Basescu C, Kokoris-Kogias L, Syta E, Jovanovic P, Estrada-Galinanes V, Ford B, Druschel P, Kaufmann A, Mace J, Flinn J, Seltzer M (2023). QuePaxa: Escaping the tyranny of timeouts in consensus. Proceedings of the 29th Symposium on Operating Systems Principles (281-297). DOI: 10.1145/3600006.3613150. Online publication date: 23-Oct-2023
https://dl.acm.org/doi/10.1145/3600006.3613150
Aguilera M, Ben-David N, Guerraoui R, Murat A, Xygkis A, Zablotchi I, Aamodt T, Jerger N, Swift M (2023). uBFT: Microsecond-Scale BFT using Disaggregated Memory. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (862-877). DOI: 10.1145/3575693.3575732. Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575732
Qi W, Zhao J (2023). Machine learning applied to failure detection. Sixth International Conference on Computer Information Science and Application Technology (CISAT 2023) (210). DOI: 10.1117/12.3004136. Online publication date: 11-Oct-2023
https://doi.org/10.1117/12.3004136
Man Y, Li S, Xia W, Li Y, Yu B, Long Y, Pan Y (2023). Detective-Dee: A Non-Intrusive In Situ Anomaly Detection and Fault Localization Framework. 2023 42nd International Symposium on Reliable Distributed Systems (SRDS) (243-253). DOI: 10.1109/SRDS60354.2023.00032. Online publication date: 25-Sep-2023
https://doi.org/10.1109/SRDS60354.2023.00032
Dong J, Xin R, Berger H, Marin O (2023). A Production Suite for Failure Detectors. 2023 International Conference on Intelligent Computing and Next Generation Networks (ICNGN) (1-6). DOI: 10.1109/ICNGN59831.2023.10396751. Online publication date: 17-Nov-2023
https://doi.org/10.1109/ICNGN59831.2023.10396751
