DOI: 10.1145/3646547.3689028
short-paper
Open access

Understanding Incast Bursts in Modern Datacenters

Published: 04 November 2024

Abstract

In datacenters, common incast traffic patterns are challenging because they violate the basic premise of bandwidth stability on which TCP congestion control convergence is built, overwhelming shallow switch buffers and causing packet losses and high latency. To understand why these challenges remain despite decades of research on datacenter congestion control, we conduct an in-depth investigation into high-degree incasts both in production workloads at Meta and in simulation. In addition to characterizing the bursty nature of these incasts and their impacts on the network, our findings demonstrate the shortcomings of widely deployed window-based congestion control techniques used to address incast problems. Furthermore, we find that hosts associated with a specific application or service exhibit similar and predictable incast traffic properties across hours, pointing the way toward solutions that predict and prevent incast bursts, instead of reacting to them.
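
To make the buffer-overflow mechanism concrete, the following is a minimal sketch of a fluid-style queue model, not the authors' measurement or simulation methodology. It assumes every sender starts at the same instant and transmits at the full link rate into a single switch port that drains at that same rate. The function name `simulate_incast` and all parameter values (burst size, link speed, buffer size) are hypothetical and chosen only for illustration.

```python
def simulate_incast(num_senders, burst_bytes, link_gbps, buffer_bytes):
    """Toy fluid model of an incast burst hitting one switch output port.

    Assumptions (illustrative only): all senders start together, each sends
    at the full link rate, time advances in 1-microsecond steps, and there
    is no congestion control or retransmission.
    """
    # 1 Gbps = 125 bytes per microsecond, so this stays in integer arithmetic.
    bytes_per_us = link_gbps * 125
    remaining = [burst_bytes] * num_senders
    queue = 0      # bytes currently buffered at the bottleneck port
    peak = 0       # largest queue occupancy observed
    dropped = 0    # bytes discarded because the buffer was full
    elapsed_us = 0

    while queue > 0 or any(r > 0 for r in remaining):
        # Each sender injects up to one link-rate step of data.
        for i, r in enumerate(remaining):
            sent = min(r, bytes_per_us)
            remaining[i] = r - sent
            queue += sent
        # The port forwards one step of data, then tail-drops any excess.
        queue -= min(queue, bytes_per_us)
        if queue > buffer_bytes:
            dropped += queue - buffer_bytes
            queue = buffer_bytes
        peak = max(peak, queue)
        elapsed_us += 1

    return {"peak_queue_bytes": peak,
            "dropped_bytes": dropped,
            "duration_us": elapsed_us}


if __name__ == "__main__":
    # Hypothetical scenario: 100 senders, 15 kB each, 10 Gbps link, 1 MB buffer.
    for n in (1, 10, 100):
        print(n, simulate_incast(n, 15_000, 10, 1_000_000))
```

In this toy model the arrival rate grows with the number of senders while the drain rate stays fixed, so a high-degree burst can fill the buffer within a few microseconds, before any round-trip-based feedback could reach the senders.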

Published In

IMC '24: Proceedings of the 2024 ACM on Internet Measurement Conference
November 2024
812 pages
ISBN:9798400705922
DOI:10.1145/3646547
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. congestion control
  2. datacenter networks
  3. incast
  4. microbursts
  5. tcp

Qualifiers

  • Short-paper

Funding Sources

  • NSF

Conference

IMC '24: ACM Internet Measurement Conference
November 4 - 6, 2024
Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 277 of 1,083 submissions, 26%
