DOI: 10.1145/2043556.2043583
Research article

Detecting failures in distributed systems with the Falcon spy network

Published: 23 October 2011

Abstract

A common way for a distributed system to tolerate crashes is to explicitly detect them and then recover from them. Interestingly, detection can take much longer than recovery, as a result of many advances in recovery techniques, making failure detection the dominant factor in these systems' unavailability when a crash occurs.
This paper presents the design, implementation, and evaluation of Falcon, a failure detector with several features. First, Falcon's common-case detection time is sub-second, which keeps unavailability low. Second, Falcon is reliable: it never reports a process as down when it is actually up. Third, Falcon sometimes kills to achieve reliable detection but aims to kill the smallest needed component. Falcon achieves these features by coordinating a network of spies, each monitoring a layer of the system. Falcon's main cost is a small amount of platform-specific logic. Falcon is thus the first failure detector that is fast, reliable, and viable. As such, it could change the way that a class of distributed systems is built.
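
As a rough illustration of the layered idea in the abstract, the sketch below models each spy as an object with a probe and a kill action, ordered from the smallest component (the application) to the largest. It is not Falcon's actual protocol: the names `Status`, `Spy`, and `detect`, the three-valued probe result, and the escalation rule are all assumptions made for illustration. The one property it tries to mirror is that DOWN is reported only when a spy has confirmed the crash or when a kill has made it true, and that the kill targets the smallest layer that could not be confirmed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"  # hypothetical: the spy did not answer in time


@dataclass
class Spy:
    layer: str                    # e.g. "app", "os", "vmm" (illustrative names)
    probe: Callable[[], Status]   # layer-specific check
    kill: Callable[[], None]      # forcibly stops this layer and all above it


def detect(spies: List[Spy]) -> Tuple[Status, Optional[str]]:
    """Return (status of the monitored process, layer killed or None).

    `spies` is ordered from the smallest layer (index 0, the process
    itself) to the largest. Assumption: the outermost kill always works.
    """
    for i, spy in enumerate(spies):
        status = spy.probe()
        if status is Status.UNKNOWN:
            continue  # this layer is silent; ask the layer beneath it
        if i == 0 or status is Status.DOWN:
            # Either the process's own spy answered, or a layer it depends
            # on is confirmed down. No kill is needed in either case.
            return status, None
        # Layer i is alive but everything smaller is silent: kill only the
        # layer directly above it, then report DOWN with certainty.
        victim = spies[i - 1]
        victim.kill()
        return Status.DOWN, victim.layer
    # No spy answered at all: fall back to killing the largest layer.
    spies[-1].kill()
    return Status.DOWN, spies[-1].layer


if __name__ == "__main__":
    killed: List[str] = []
    spies = [
        Spy("app", lambda: Status.UNKNOWN, lambda: killed.append("app")),
        Spy("os", lambda: Status.UP, lambda: killed.append("os")),
        Spy("vmm", lambda: Status.UP, lambda: killed.append("vmm")),
    ]
    print(detect(spies), killed)
```

In this toy model, a silent application spy combined with a responsive OS spy leads to killing only the application, rather than rebooting the machine. A real detector would also need timeouts, concurrency, and a way to trust that a kill actually took effect, none of which is modeled here.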



Published In

SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
October 2011
417 pages
ISBN: 9781450309776
DOI: 10.1145/2043556
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. STONITH
  2. failure detectors
  3. high availability
  4. layer-specific monitors
  5. layer-specific probes
  6. reliable detection

Qualifiers

  • Research-article

Conference

SOSP '11

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%

Article Metrics

  • Downloads (last 12 months): 34
  • Downloads (last 6 weeks): 7

Reflects downloads up to 12 Dec 2024.

Cited By

  • (2024) Lupin: Tolerating Partial Failures in a CXL Pod. Proceedings of the 2nd Workshop on Disruptive Memory Systems, DOI: 10.1145/3698783.3699377, (41-50). Online publication date: 3-Nov-2024.
  • (2024) Themis: A passive-active hybrid framework with in-network intelligence for lightweight failure localization. Computer Networks 255, DOI: 10.1016/j.comnet.2024.110836, (110836). Online publication date: Dec-2024.
  • (2023) PERSEUS. Proceedings of the 21st USENIX Conference on File and Storage Technologies, DOI: 10.5555/3585938.3585942, (49-63). Online publication date: 21-Feb-2023.
  • (2023) From Missteps to Milestones: A Journey to Practical Fail-Slow Detection. ACM Transactions on Storage 19:4, DOI: 10.1145/3617690, (1-28). Online publication date: 1-Nov-2023.
  • (2023) Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems. Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing, DOI: 10.1145/3615366.3615420, (40-49). Online publication date: 16-Oct-2023.
  • (2023) QuePaxa: Escaping the tyranny of timeouts in consensus. Proceedings of the 29th Symposium on Operating Systems Principles, DOI: 10.1145/3600006.3613150, (281-297). Online publication date: 23-Oct-2023.
  • (2023) uBFT: Microsecond-Scale BFT using Disaggregated Memory. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, DOI: 10.1145/3575693.3575732, (862-877). Online publication date: 27-Jan-2023.
  • (2023) Machine learning applied to failure detection. Sixth International Conference on Computer Information Science and Application Technology (CISAT 2023), DOI: 10.1117/12.3004136, (210). Online publication date: 11-Oct-2023.
  • (2023) Detective-Dee: A Non-Intrusive In Situ Anomaly Detection and Fault Localization Framework. 2023 42nd International Symposium on Reliable Distributed Systems (SRDS), DOI: 10.1109/SRDS60354.2023.00032, (243-253). Online publication date: 25-Sep-2023.
  • (2023) A Production Suite for Failure Detectors. 2023 International Conference on Intelligent Computing and Next Generation Networks (ICNGN), DOI: 10.1109/ICNGN59831.2023.10396751, (1-6). Online publication date: 17-Nov-2023.
