DOI: 10.1145/3603269.3610865
poster

Poster: Chameleon: Automatic and Adaptive Tuning for DCQCN Parameters in RDMA Networks

Published: 01 September 2023

Abstract

Datacenter Quantized Congestion Notification (DCQCN) [12] is the default congestion control algorithm for Mellanox RDMA (Remote Direct Memory Access) NICs [2] in RoCEv2 (RDMA over Converged Ethernet v2) networks, and these NICs are among the most widely used in leading industry companies [4, 5, 7, 9]. DCQCN works in three steps: switches mark packets with ECN (Explicit Congestion Notification) when the queue length exceeds the ECN thresholds; receivers respond to ECN-marked packets with CNPs (Congestion Notification Packets); and senders reduce their transmission rate when they receive CNPs. DCQCN has 10+ parameters across NICs and switches, including Alpha Update, Rate Increase & Decrease, Notification Point, and ECN thresholds [3], and these parameters have a non-negligible impact on network performance. Our experiments also verify that the network performance of common AI (Artificial Intelligence) training workloads in RoCEv2 networks (e.g., all-to-all collective communication) is greatly influenced by the DCQCN parameter settings (§3). Therefore, when deploying applications in practice, the DCQCN parameters need to be carefully tested and tuned to improve network performance.
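
The three roles described in the abstract can be sketched roughly as follows. This is a minimal illustration based on our reading of the original DCQCN paper [12], not the NIC firmware or the poster's method: the class and function names are hypothetical, only the fast-recovery stage of rate increase is modeled, and the default values (a 50 µs CNP interval and g = 1/256) are illustrative assumptions rather than recommended settings.

```python
from dataclasses import dataclass


def ecn_mark_probability(queue_kb: float, k_min: float, k_max: float,
                         p_max: float) -> float:
    """Switch side: RED-style marking between two ECN thresholds (assumed model)."""
    if queue_kb <= k_min:
        return 0.0                      # below the low threshold: never mark
    if queue_kb >= k_max:
        return 1.0                      # above the high threshold: always mark
    # Between the thresholds the probability grows linearly up to p_max.
    return p_max * (queue_kb - k_min) / (k_max - k_min)


@dataclass
class Receiver:
    """Receiver side: at most one CNP per flow every cnp_interval_us."""
    cnp_interval_us: float = 50.0       # illustrative value
    last_cnp_us: float = float("-inf")

    def on_packet(self, now_us: float, ecn_marked: bool) -> bool:
        """Return True if a CNP should be sent back to the sender."""
        if ecn_marked and now_us - self.last_cnp_us >= self.cnp_interval_us:
            self.last_cnp_us = now_us
            return True
        return False


@dataclass
class Sender:
    """Sender side: multiplicative decrease on CNP, simplified recovery."""
    rate_gbps: float                    # current sending rate
    target_gbps: float                  # rate to recover towards
    alpha: float = 1.0                  # congestion estimate in [0, 1]
    g: float = 1.0 / 256                # alpha update gain (illustrative)

    def on_cnp(self) -> None:
        # Remember the pre-cut rate, cut the rate, then raise alpha.
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        # No CNP arrived during the alpha update period: decay alpha.
        self.alpha *= 1 - self.g

    def on_increase_timer(self) -> None:
        # Fast recovery only: move halfway back towards the target rate.
        self.rate_gbps = (self.target_gbps + self.rate_gbps) / 2
```

Even in this reduced form, the knobs named in the abstract are visible: the ECN thresholds (`k_min`, `k_max`, `p_max`) live on the switch, the CNP interval belongs to the notification point, and `g` together with the timer periods governs alpha update and rate increase and decrease at the sender.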

References

[1] High Precision Congestion Control. 2020. https://github.com/alibaba-edu/High-Precision-Congestion-Control
[2] DCQCN CC Algorithm. 2022. https://enterprise-support.nvidia.com/s/article/DCQCN-CC-algorithm
[3] DCQCN Parameters. 2023. https://enterprise-support.nvidia.com/s/article/dcqcn-parameters
[4] Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, et al. 2023. Empowering Azure Storage with RDMA. In NSDI.
[5] Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, et al. 2021. When Cloud Storage Meets RDMA. In NSDI.
[6] Yixiao Gao, Yuchen Yang, Tian Chen, Jiaqi Zheng, Bing Mao, and Guihai Chen. 2018. DCQCN+: Taming large-scale incast congestion in RDMA over Ethernet networks. In ICNP.
[7] Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In OSDI.
[8] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, et al. 2019. HPCC: High precision congestion control. In SIGCOMM.
[9] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, et al. 2022. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In ISCA.
[10] Kai Wang, Fang Dong, Dian Shen, Chengtian Zhang, Jinghui Zhang, and Junzhou Luo. 2021. Towards tunable RDMA parameter selection at runtime for datacenter applications. In CSCWD.
[11] Siyu Yan, Xiaoliang Wang, Xiaolong Zheng, Yinben Xia, Derui Liu, and Weishan Deng. 2021. ACC: Automatic ECN tuning for high-speed datacenter networks. In SIGCOMM.
[12] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. In SIGCOMM.

Cited By

  • (2024) Necessary and Sufficient Condition for Triggering ECN Before PFC in Shared Memory Switches. IEEE Networking Letters 6:2 (119-123). DOI: 10.1109/LNET.2024.3382955. Online publication date: Jun-2024
  • (2024) BBQ: Dynamic-Buffer-Driven Automatic ECN Tunning in Datacenter. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS) (1-6). DOI: 10.1109/IWQoS61813.2024.10682919. Online publication date: 19-Jun-2024
  • (2024) A Hybrid Solution to Provide End-to-End Flow Control and Congestion Management in High-Performance Interconnection Networks. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (8-17). DOI: 10.1109/CCGrid59990.2024.00011. Online publication date: 6-May-2024

    Published In

    ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
    September 2023
    1217 pages
    ISBN:9798400702365
    DOI:10.1145/3603269
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. remote direct memory access
    2. congestion control

    Qualifiers

    • Poster

    Conference

    ACM SIGCOMM '23: ACM SIGCOMM 2023 Conference
    September 10, 2023
    New York, NY, USA

    Acceptance Rates

    Overall Acceptance Rate 462 of 3,389 submissions, 14%

    Article Metrics

    • Downloads (Last 12 months): 233
    • Downloads (Last 6 weeks): 20
    Reflects downloads up to 13 Dec 2024
