DOI: 10.1145/2018436.2018477
research-article

Understanding network failures in data centers: measurement, analysis, and implications

Published: 15 August 2011

Abstract

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices and links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

    Published In

    SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference
    August 2011
    502 pages
    ISBN: 9781450307970
    DOI: 10.1145/2018436

    ACM SIGCOMM Computer Communication Review, Volume 41, Issue 4
    SIGCOMM '11
    August 2011
    480 pages
    ISSN: 0146-4833
    DOI: 10.1145/2043164

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data centers
    2. network reliability

    Qualifiers

    • Research-article

    Conference

    SIGCOMM '11: ACM SIGCOMM 2011 Conference
    August 15-19, 2011
    Toronto, Ontario, Canada

    Acceptance Rates

    SIGCOMM '11 Paper Acceptance Rate: 32 of 223 submissions, 14%
    Overall Acceptance Rate: 462 of 3,389 submissions, 14%
