Using COER, we rapidly designed and implemented eight proactive CC protocols (SRP, SMSRP, BFRP, pHost, CRSP, ExpressPass, Homa, and PCRP) on the RNIC, based on an accurate model in the in-house emulator. For comparison, the baselines were Base, IB [52], HPCC, and DCQCN. Throughout the tests, we used only the default parameters provided by these protocols. Since the emulator can load real MPI programs, we adopted GPCNeT [21] as the main test load. GPCNeT is a lightweight MPI program designed to test the performance of CC protocols in high-performance computing systems. In addition, we used a test program of our own design, which generated random node pair and random ring traffic. The latency statistics in the test results measure the time a message takes from entering the sender's NIC to being written to the memory of the receiver's NIC. This span includes the waiting time of the message at the sender and receiver due to CC scheduling, the time required for memory read and write operations, and the time required for packet fragmentation and assembly. This method ensures fair statistics across all protocols, including Base. The program completion time in the test results includes, for all WQEs, the time from generation in memory to the NIC, plus the network latency. The latency reflects a CC protocol's ability to control network congestion and to schedule messages, whereas the program completion time reflects the cost the CC protocol pays to control network congestion.
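For concreteness, a minimal Python sketch of this latency statistic (the field names are hypothetical, not taken from the emulator's code):

```python
from dataclasses import dataclass

@dataclass
class MessageRecord:
    # Hypothetical timestamps (in cycles) recorded by the emulator.
    enter_sender_nic: int       # message enters the sender's NIC
    written_to_receiver: int    # last byte written to receiver memory

def message_latency(rec: MessageRecord) -> int:
    # This span implicitly covers CC scheduling waits, memory
    # read/write time, and packet fragmentation/assembly.
    return rec.written_to_receiver - rec.enter_sender_nic

def average_latency(records: list[MessageRecord]) -> float:
    return sum(message_latency(r) for r in records) / len(records)

print(average_latency([MessageRecord(0, 120), MessageRecord(10, 90)]))  # 100.0
```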
5.2.2 Performance and Analysis.
We tested the performance of these protocols using GPCNeT in the k-ary n-tree fat-tree topology. The y-axis in Figure 14 represents the improvement in the average latency of different types of WQEs compared with Base. The proactive CC protocols designed with COER outperform IB, DCQCN, and HPCC. Reactive protocols degrade performance at scale because they lag in detecting and reacting to congestion and have difficulty adjusting the rate. These results also show that the COER-based schemes retain the characteristics of the original CC protocols. Next, we analyze the results in Figure 14.
SRP, SMSRP, and BFRP adopt reservations, but SM WQEs do not require them. These protocols ensure high performance for long reserved flows (RDMAW WQEs). However, the reservation overhead is intolerable for short flows (IMMRDMAW WQEs), resulting in poor performance. SMSRP and BFRP send all flows speculatively and reserve only when needed, as sketched below. While this approach performs well in small scenarios, it can fail in large, heavily loaded ones because the proportion of short flows increases, as does the probability of speculative sending failure. BFRP's limit on the number of reserved flows has little effect given the limited processing units in RNICs.
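A minimal sketch of this speculative-then-reserve decision (names are illustrative; the bounded buffer stands in for the actual loss/NACK failure mechanism):

```python
class ToyReceiver:
    """Toy receiver with a bounded speculative buffer; this models only
    the decision structure, not the real protocol logic."""
    def __init__(self, slots: int):
        self.free = slots

    def accept_speculative(self) -> bool:
        if self.free > 0:
            self.free -= 1
            return True
        return False              # speculation fails under heavy load

def send_flow(rx: ToyReceiver) -> str:
    # SMSRP/BFRP-style policy: try speculatively first, reserve on failure.
    if rx.accept_speculative():
        return "speculative"
    return "reserved"             # fall back to an explicit reservation

rx = ToyReceiver(slots=4)
print([send_flow(rx) for _ in range(8)])
# First 4 sends succeed speculatively; the rest pay the reservation
# overhead, mirroring the failure mode under load described above.
```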
pHost uses token-based packet-level reservation and provides a free token at the beginning of each flow to allow speculation. Unlike the other protocols, pHost uses a source-downgrade mechanism to prevent unresponsive sources from receiving tokens and wasting destination bandwidth. It prioritizes fairness over the distinction between short and long flows. The results in Figure 14 show that pHost performs well at large scale, maintaining high receive bandwidth even under heavy loads.
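A toy model of this token issuance (our own sketch with illustrative names; the real pHost scheduler is more involved):

```python
from collections import deque

class TokenScheduler:
    """pHost-style token issuance, simplified: each new flow gets one
    free token; afterwards, tokens go round-robin over responsive
    sources, and unresponsive sources are downgraded."""
    def __init__(self):
        self.eligible = deque()
        self.downgraded = set()

    def register_flow(self, src: str) -> str:
        self.eligible.append(src)
        return "free_token"            # permits one speculative packet

    def issue_token(self):
        for _ in range(len(self.eligible)):
            src = self.eligible[0]
            self.eligible.rotate(-1)   # fairness: no short/long-flow bias
            if src not in self.downgraded:
                return src
        return None

    def downgrade(self, src: str):
        self.downgraded.add(src)       # stop wasting destination bandwidth

sched = TokenScheduler()
for s in ("A", "B", "C"):
    sched.register_flow(s)
sched.downgrade("B")
print([sched.issue_token() for _ in range(4)])  # ['A', 'C', 'A', 'C']
```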
CRSP uses a credit-based reservation method with a default credit size equal to twice the maximum network flow size, and speculation is not allowed. CRSP must therefore set a credit value large enough to cover the transmission of long flows. However, in high-performance networks with large variation in flow sizes, this effectively leaves short flows unconstrained: a credit value sized for the largest flows is meaningless for short flows, while ultra-long flows are rare.
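A back-of-the-envelope illustration of why such a credit budget never binds for short flows (the flow sizes and the simple budget check are assumptions, not CRSP's exact mechanism):

```python
MAX_FLOW = 1 << 20            # assumed maximum network flow size: 1 MiB
CREDIT = 2 * MAX_FLOW         # CRSP default: twice the maximum flow size

def must_wait(flow_bytes: int, outstanding_bytes: int) -> bool:
    """Illustrative credit check: a sender waits only if the new flow
    would exceed the credit budget."""
    return outstanding_bytes + flow_bytes > CREDIT

# A 4 KiB short flow passes even with a full 1 MiB flow outstanding:
print(must_wait(4 << 10, 1 << 20))        # False: short flows unconstrained
# Only near-maximal combinations ever hit the budget:
print(must_wait(1 << 20, (1 << 20) + 1))  # True, but such flows are rare
```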
Our implementation of ExpressPass focused on the Network Interface module and used the ExpressPass protocol for all WQEs, which degrades SM WQE performance due to the additional reservation time overhead. ExpressPass achieves CC by dropping credits at routers on the bottleneck link and adjusting the credit rate at the destination. ExpressPass does not support speculation and guarantees fairness. However, because it restricts path diversity, drops credits somewhat arbitrarily at routers, and frequently starts and stops end-to-end credit transmission, its performance is not stable.
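A simplified model of this credit-drop-plus-feedback loop (our sketch; the real ExpressPass controller is more elaborate, and the numbers are arbitrary). Note how the rate oscillates, consistent with the instability observed above:

```python
def credit_drop(arriving: int, budget: int) -> tuple[int, int]:
    """Toy router behavior: credits beyond the bottleneck's per-interval
    budget are dropped, capping the data later sent on the reverse path."""
    forwarded = min(arriving, budget)
    return forwarded, arriving - forwarded

def adjust_rate(rate: float, loss: float, max_rate: float = 1.0) -> float:
    """Simplified destination-side feedback: back off in proportion to
    credit loss, probe upward when no credits were dropped."""
    if loss == 0.0:
        return min(max_rate, rate * 1.5)
    return rate * (1.0 - loss)

rate = 1.0
for _ in range(3):
    sent = int(100 * rate)
    forwarded, dropped = credit_drop(sent, budget=60)
    rate = adjust_rate(rate, dropped / sent)
    print(sent, forwarded, round(rate, 2))   # 100/60/0.6, 60/60/0.9, 90/60/0.6
```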
PCRP and Homa differ in reservation granularity: PCRP reserves the packet chain that can be sent within one RTT, whereas Homa reserves one packet at a time. Both prioritize short flows during scheduling and use multiple VCs in routers to support packet priorities, leading to lower average flow latency, albeit at the expense of long flows. Their overall performance depends on the number and size of long and short flows in the load. Compared with Base, their program completion times are higher due to inaccurate reservations caused by RTT fluctuations. Specifically, PCRP computes its packet chain with reference to the base RTT, so the reservation granularity is smaller than the actual RTT allows, wasting network bandwidth. Homa, in turn, requires a grant for every packet sent, and network bandwidth is wasted whenever RTT fluctuations cause the grant to lag.
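For concreteness, a back-of-the-envelope calculation of PCRP's chain size (all link parameters here are assumptions for illustration):

```python
import math

LINK_GBPS = 100       # assumed link rate
BASE_RTT_US = 2.0     # assumed base round-trip time
PKT_BYTES = 1024      # assumed packet size

# PCRP-style chain: packets sendable within one base RTT
# (the bandwidth-delay product).
bdp = LINK_GBPS * 1e9 / 8 * BASE_RTT_US * 1e-6   # 25,000 bytes in flight
chain = math.ceil(bdp / PKT_BYTES)
print(chain)          # 25 packets per reservation

# If the actual RTT inflates to 3 us, each round trip still reserves
# only 2 us worth of packets, so roughly a third of the in-flight
# window goes unreserved -- the bandwidth wastage described above.
# Homa loses bandwidth analogously when per-packet grants arrive late.
```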
The poor performance of DCQCN and IB is mainly caused by head-of-line (HoL) blocking. DCQCN and IB are implemented in the Network Interface module, so when a message's sending rate is throttled, the next, non-congested message cannot be injected into the network. Simply placing the CC protocol and the RDMA protocol at different levels therefore degrades performance. HPCC is implemented in the Send Arbitration module, where a per-flow multi-queue is added to avoid HoL blocking (a sketch follows below). However, HPCC over-controls network congestion, so its program completion time is large. We used the default parameters of each CC protocol at all scales, so some protocols exhibit sharp performance jitter at small scales (Figure 14(a)).
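A minimal Python sketch of the per-flow multi-queue idea (names are ours, not HPCC's; only the arbitration structure is modeled):

```python
from collections import defaultdict, deque

class SendArbiter:
    """Toy per-flow multi-queue: a throttled flow parks in its own queue,
    so messages of non-congested flows can still be injected -- avoiding
    the HoL blocking seen when CC sits in the Network Interface module."""
    def __init__(self):
        self.queues = defaultdict(deque)   # one queue per flow
        self.throttled = set()             # flows paused by CC

    def enqueue(self, flow_id: str, msg: str):
        self.queues[flow_id].append(msg)

    def next_message(self):
        for flow_id, q in self.queues.items():
            if q and flow_id not in self.throttled:
                return q.popleft()
        return None                        # every eligible queue is empty

arb = SendArbiter()
arb.enqueue("congested", "m1")
arb.enqueue("clean", "m2")
arb.throttled.add("congested")
print(arb.next_message())  # "m2": the clean flow is not blocked behind m1
```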
Subsequently, we tested the latency performance of these protocols using random node pair and random ring traffic. Random node pair traffic creates random source–destination node pairs for communication. Random ring traffic randomly arranges all nodes into a ring, in which each node communicates only with its neighbors. Messages are sent in ascending order of size during the test, and the random node pair or random ring processes are executed multiple times. The vertical axes of Figures 15(a) and 15(b) represent the average time to complete multiple point-to-point or ring communications at a given message size, and Figure 15(c) shows the total time for the two test programs to complete messages of all sizes.
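For reference, one plausible reading of the two traffic generators (our own sketch; the actual test program may differ in details):

```python
import random

def random_pairs(n_nodes: int) -> list[tuple[int, int]]:
    """Random node-pair traffic: shuffle the nodes and pair them off so
    each source-destination pair communicates independently."""
    nodes = list(range(n_nodes))
    random.shuffle(nodes)
    return [(nodes[i], nodes[i + 1]) for i in range(0, n_nodes - 1, 2)]

def random_ring(n_nodes: int) -> list[tuple[int, int]]:
    """Random ring traffic: shuffle all nodes into a ring; each node
    sends only to its successor, i.e., its neighbor on the ring."""
    ring = list(range(n_nodes))
    random.shuffle(ring)
    return [(ring[i], ring[(i + 1) % n_nodes]) for i in range(n_nodes)]

random.seed(0)
print(random_pairs(6))  # e.g. [(4, 1), (3, 0), (5, 2)]
print(random_ring(4))   # e.g. [(2, 0), (0, 3), (3, 1), (1, 2)]
```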
PCRP and Homa have the lowest latency and total completion time under random node pair traffic, as shown in Figures 15(a) and 15(c). This differs from the GPCNeT results because random node pair traffic is highly bursty with no endpoint congestion, which makes SMSRP and BFRP ineffective, while PCRP and Homa benefit from multiple VCs and priorities within the network. ExpressPass performs well in latency thanks to its handling of in-network congestion but has the longest total completion time due to its larger reservation overhead. The other protocols behave much as they do under GPCNeT.
Under random ring traffic, PCRP, Homa, SMSRP, and BFRP show significant latency reductions compared with the other protocols (Figure 15(b)) because the random ring communication pattern amplifies reservation overhead: each node must receive a message from its upstream node before it can send to its downstream node, which magnifies the advantage of these protocols' half-connection scheme. However, the total completion times of PCRP and Homa increase rather than decrease, unlike those of SMSRP and BFRP, for the same reason as in the GPCNeT test. The performance of the other protocols is consistent with expectations and is not discussed further.
Finally, we tested these protocols on common topologies from real systems and increased the test scale to 1,000 nodes. The test traffic consisted of random node pairs and included only RDMAW messages, with message sizes ranging from 8 B to 256 KB. Figure 16(a) shows the results on a real-life fat-tree (3;16,8,8;1,8,8;1,2,2) using the Dmodk routing algorithm, and Figure 16(b) shows the results on a dragonfly (8,4) using the UGAL routing algorithm. The performance of each protocol is generally consistent with the previous discussion. Proactive CC protocols generally achieve better latency than reactive CC protocols, with PCRP showing the largest latency improvement. However, some proactive CC protocols with aggressive CC strategies incur longer program completion times. Notably, in the dragonfly topology, all CC protocols reduced program completion time, demonstrating that CC generally helps performance there.
The above experiments show that the CC offloading schemes implemented through COER are accurate and consistent with the protocols' original characteristics. Each scheme can be decomposed into components, facilitating the rapid design of an RNIC offloading implementation. COER can thus effectively integrate CC protocols with the RDMA protocol on RNICs while preserving their original behavior.
In addition, these experiments indicate that proactive CC protocols are more compatible with RDMA networks and exhibit better performance. Among the eight tested proactive CC protocols, PCRP and Homa consistently maintained the lowest latency, suggesting that a more complex yet refined mechanism is effective. pHost is the only CC protocol whose performance improves as the scale increases, underscoring the importance of fairness in CC within large-scale networks. Therefore, what RDMA networks require is a complex, refined, and fairness-guaranteed proactive CC protocol.