Understanding network failures in data centers: measurement, analysis, and implications

Authors:

Nachiappan Nagappan

SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference

Pages 350 - 361

https://doi.org/10.1145/2018436.2018477

Published: 15 August 2011

Abstract

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Supplementary Material

MP4 File (sigcomm_11_1.mp4)


Index Terms

Understanding network failures in data centers: measurement, analysis, and implications
1. Networks
  1. Network services
    1. Network management

Published In

SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference

August 2011

502 pages

ISBN:9781450307970

DOI:10.1145/2018436

General Chairs: Srinivasan Keshav (University of Waterloo, Canada), Jörg Liebeherr (University of Toronto, Canada)

Program Chairs: John Byers (Boston University, USA), Jeffrey Mogul (HP Labs, USA)

ACM SIGCOMM Computer Communication Review Volume 41, Issue 4
SIGCOMM '11
August 2011
480 pages
ISSN:0146-4833
DOI:10.1145/2043164

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 August 2011

Qualifiers

Research-article

Conference

SIGCOMM '11

Sponsor:

SIGCOMM

SIGCOMM '11: ACM SIGCOMM 2011 Conference

August 15 - 19, 2011

Toronto, Ontario, Canada

Acceptance Rates

SIGCOMM '11 Paper Acceptance Rate 32 of 223 submissions, 14%;

Overall Acceptance Rate 462 of 3,389 submissions, 14%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 676

Total Downloads: 5,394

Downloads (Last 12 months): 797

Downloads (Last 6 weeks): 96

Reflects downloads up to 13 Dec 2024

Citations

Cited By

Ma Y, Guo Y, Yang R, Luo H (2025). FRRL: A reinforcement learning approach for link failure recovery in a hybrid SDN. Journal of Network and Computer Applications 234 (104054). Online publication date: Feb-2025. https://doi.org/10.1016/j.jnca.2024.104054

AlQiam A, Yao Y, Wang Z, Ahuja S, Zhang Y, Rao S, Ribeiro B, Tawarmalani M, Sekar V, Yu M, Seneviratne A, Veitch D (2024). Transferable Neural WAN TE for Changing Topologies. Proceedings of the ACM SIGCOMM 2024 Conference (86-102). Online publication date: 4-Aug-2024. https://dl.acm.org/doi/10.1145/3651890.3672237

Adraa M, Assi C, Almekhlafi M, Khabbaz M, Pelekhaty V, Frankel M (2024). Comprehensive Performance and Robustness Analysis of Expander-Based Data Centers. IEEE Transactions on Network and Service Management 21:1 (670-683). Online publication date: Feb-2024. https://doi.org/10.1109/TNSM.2023.3306971

Liu C, Aggarwal V, Lan T, Geng N, Yang Y, Xu M, Li Q (2024). FERN: Leveraging Graph Attention Networks for Failure Evaluation and Robust Network Design. IEEE/ACM Transactions on Networking 32:2 (1003-1018). Online publication date: Apr-2024. https://doi.org/10.1109/TNET.2023.3311678

Tang J, Fu L, Long F, Wang X, Chen G, Zhou C (2024). Which Link Matters? Maintaining Connectivity of Uncertain Networks Under Adversarial Attack. IEEE Transactions on Mobile Computing 23:3 (2039-2053). Online publication date: Mar-2024. https://doi.org/10.1109/TMC.2023.3248629

Chintapalli V, Killi B, Partani R, Tamma B, Murthy C (2024). Energy- and Reliability-Aware Provisioning of Parallelized Service Function Chains With Delay Guarantees. IEEE Transactions on Green Communications and Networking 8:1 (205-223). Online publication date: Mar-2024. https://doi.org/10.1109/TGCN.2023.3317927

Diao X, Gu H, Wei W, Jiang G, Li B (2024). Deep Reinforcement Learning Based Dynamic Flowlet Switching for DCN. IEEE Transactions on Cloud Computing 12:2 (580-593). Online publication date: Apr-2024. https://doi.org/10.1109/TCC.2024.3382132

Zeng Y, Qu Z, Guo S, Ye B, Zhang J, Li J, Tang B (2024). SafeDRL: Dynamic Microservice Provisioning With Reliability and Latency Guarantees in Edge Environments. IEEE Transactions on Computers 73:1 (235-248). Online publication date: Jan-2024. https://doi.org/10.1109/TC.2023.3329194

Adam N, Ali M, Naeem F, Ghazy A, Kaddoum G (2024). State-of-the-Art Security Schemes for the Internet of Underwater Things: A Holistic Survey. IEEE Open Journal of the Communications Society 5 (6561-6592). Online publication date: 2024. https://doi.org/10.1109/OJCOMS.2024.3474290

Dam K, Ndonda G, Legay A, Sadre R (2024). Avoiding "Hot Potato" Problems in Internet Service Providers. NOMS 2024-2024 IEEE Network Operations and Management Symposium (1-6). Online publication date: 6-May-2024. https://doi.org/10.1109/NOMS59830.2024.10575322