
Packet-Level Telemetry in Large Datacenter Networks

Published: 17 August 2015

Abstract

Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. In this task, datacenter networks (DCNs) present unique challenges with their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets inside a large volume of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, no tool today can achieve both the specificity and scale required for this task.
We present Everflow, a packet-level network telemetry system for large DCNs. Everflow traces specific packets by implementing a powerful packet filter on top of the "match and mirror" functionality of commodity switches. It shuffles captured packets to multiple analysis servers using load balancers built on switch ASICs, and it sends "guided probes" to test or confirm potential faults. We present experiments that demonstrate Everflow's scalability, and share experiences of troubleshooting network faults gathered from running it for over 6 months in Microsoft's DCNs.
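
The "match and mirror" plus shuffling idea can be illustrated with a small software sketch. This is not Everflow's implementation, which runs on switch ASICs; it is a toy model under stated assumptions. The `Packet` fields, the TCP-control-flag rule, and the SHA-256 flow hash are all hypothetical choices made for illustration and do not come from the abstract. The sketch only shows the two properties the abstract describes: just the packets that match a filter rule are copied out, and the copies are spread across several analysis servers, here in a way that keeps packets of one flow together.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# TCP control-bit values from the TCP header definition.
TCP_FIN, TCP_SYN, TCP_RST = 0x01, 0x02, 0x04


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; field names are assumptions."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int          # IP protocol number (6 = TCP)
    tcp_flags: int = 0


# A match rule is any predicate over a packet.
Rule = Callable[[Packet], bool]


def is_tcp_control(pkt: Packet) -> bool:
    """Illustrative rule (an assumption): match TCP SYN, FIN, or RST packets."""
    return pkt.proto == 6 and bool(pkt.tcp_flags & (TCP_SYN | TCP_FIN | TCP_RST))


def flow_key(pkt: Packet) -> bytes:
    """Serialize the 5-tuple so all packets of a flow hash identically."""
    return f"{pkt.src_ip}|{pkt.dst_ip}|{pkt.src_port}|{pkt.dst_port}|{pkt.proto}".encode()


def pick_analyzer(pkt: Packet, num_analyzers: int) -> int:
    """Map a packet to an analyzer index via a stable hash of its flow key."""
    digest = hashlib.sha256(flow_key(pkt)).digest()
    return int.from_bytes(digest[:8], "big") % num_analyzers


def match_and_mirror(packets: Iterable[Packet],
                     rules: List[Rule],
                     num_analyzers: int) -> Dict[int, List[Packet]]:
    """Copy packets that match any rule into per-analyzer shards."""
    shards: Dict[int, List[Packet]] = {i: [] for i in range(num_analyzers)}
    for pkt in packets:
        if any(rule(pkt) for rule in rules):
            shards[pick_analyzer(pkt, num_analyzers)].append(pkt)
    return shards
```

For example, calling `match_and_mirror(packets, [is_tcp_control], 4)` on a trace would leave ordinary data packets unmirrored and place the SYN and FIN of one connection in the same shard, so a single analysis server sees both ends of that flow's lifetime.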

    Published In

    ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4
    SIGCOMM'15
    October 2015
    659 pages
    ISSN:0146-4833
    DOI:10.1145/2829988
      SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
      August 2015
      684 pages
      ISBN:9781450335423
      DOI:10.1145/2785956
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. datacenter network
    2. failure detection
    3. probe

    Qualifiers

    • Research-article
