Using COER, we rapidly designed and implemented eight proactive CC protocols (SRP, SMSRP, BFRP, pHost, CRSP, ExpressPass, Homa, and PCRP) on the RNIC, based on an accurate model in the in-house emulator. For comparison, the baselines were Base, IB [52], HPCC, and DCQCN. Throughout the tests, we used only the default parameters provided by these protocols. Since the emulator can load real MPI programs, we adopted GPCNeT [21] as the main test load. GPCNeT is a lightweight MPI program designed to test the performance of CC protocols in high-performance computing systems. In addition, we used a test program of our own design, which generated random node pair and random ring traffic. The latency statistics in the test results measure the time a message takes from entering the sender's NIC to being written to the memory of the receiver's NIC. This span includes the waiting time of the message at the sender and receiver due to CC scheduling, the time required for memory read and write operations, and the time required for packet fragmentation and assembly. This method ensures fair statistics across all protocols, including Base. The program completion time in the test results includes, for all WQEs, the time from generation in memory to the NIC, plus the network latency. The latency reflects a CC protocol's ability to control network congestion and to schedule messages, whereas the program completion time reflects the cost the CC protocol pays to control network congestion.
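For concreteness, a minimal Python sketch of this latency statistic (the field names are hypothetical, not taken from the emulator's code):

```python
from dataclasses import dataclass

@dataclass
class MessageRecord:
    # Hypothetical timestamps (in cycles) recorded by the emulator.
    enter_sender_nic: int       # message enters the sender's NIC
    written_to_receiver: int    # last byte written to receiver memory

def message_latency(rec: MessageRecord) -> int:
    # This span implicitly covers CC scheduling waits, memory
    # read/write time, and packet fragmentation/assembly.
    return rec.written_to_receiver - rec.enter_sender_nic

def average_latency(records: list[MessageRecord]) -> float:
    return sum(message_latency(r) for r in records) / len(records)

print(average_latency([MessageRecord(0, 120), MessageRecord(10, 90)]))  # 100.0
```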
5.2.2 Performance and Analysis.
We tested the performance of these protocols using GPCNeT in the k-ary n-tree fat-tree topology. The y-axis in Figure 14 represents the improvement in the average latency of different types of WQEs compared with Base. The proactive CC protocols designed with COER outperform IB, DCQCN, and HPCC. Reactive protocols degrade performance at scale because they lag in detecting and reacting to congestion and have difficulty adjusting the rate. These results also show that the COER-based schemes retain the characteristics of the original CC protocols. Next, we analyze the results in Figure 14.
SRP, SMSRP, and BFRP adopt reservations, but SM WQEs do not require them. These protocols ensure high performance for long reserved flows (RDMAW WQEs). However, the reservation overhead is intolerable for short flows (IMMRDMAW WQEs), resulting in poor performance. SMSRP and BFRP send all flows speculatively and reserve only when needed, as sketched below. While this approach performs well in small scenarios, it can fail in large, heavily loaded ones because the proportion of short flows increases, as does the probability of speculative sending failure. BFRP's limit on the number of reserved flows has little effect given the limited processing units in RNICs.
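A minimal sketch of this speculative-then-reserve decision (names are illustrative; the bounded buffer stands in for the actual loss/NACK failure mechanism):

```python
class ToyReceiver:
    """Toy receiver with a bounded speculative buffer; this models only
    the decision structure, not the real protocol logic."""
    def __init__(self, slots: int):
        self.free = slots

    def accept_speculative(self) -> bool:
        if self.free > 0:
            self.free -= 1
            return True
        return False              # speculation fails under heavy load

def send_flow(rx: ToyReceiver) -> str:
    # SMSRP/BFRP-style policy: try speculatively first, reserve on failure.
    if rx.accept_speculative():
        return "speculative"
    return "reserved"             # fall back to an explicit reservation

rx = ToyReceiver(slots=4)
print([send_flow(rx) for _ in range(8)])
# First 4 sends succeed speculatively; the rest pay the reservation
# overhead, mirroring the failure mode under load described above.
```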
pHost uses token-based packet-level reservation and provides a free token at the beginning of each flow to allow speculation. Unlike the other protocols, pHost uses a source-downgrade mechanism to prevent unresponsive sources from receiving tokens and wasting destination bandwidth. It prioritizes fairness over the distinction between short and long flows. The results in Figure 14 show that pHost performs well at large scale, maintaining high receive bandwidth even under heavy loads.
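A toy model of this token issuance (our own sketch with illustrative names; the real pHost scheduler is more involved):

```python
from collections import deque

class TokenScheduler:
    """pHost-style token issuance, simplified: each new flow gets one
    free token; afterwards, tokens go round-robin over responsive
    sources, and unresponsive sources are downgraded."""
    def __init__(self):
        self.eligible = deque()
        self.downgraded = set()

    def register_flow(self, src: str) -> str:
        self.eligible.append(src)
        return "free_token"            # permits one speculative packet

    def issue_token(self):
        for _ in range(len(self.eligible)):
            src = self.eligible[0]
            self.eligible.rotate(-1)   # fairness: no short/long-flow bias
            if src not in self.downgraded:
                return src
        return None

    def downgrade(self, src: str):
        self.downgraded.add(src)       # stop wasting destination bandwidth

sched = TokenScheduler()
for s in ("A", "B", "C"):
    sched.register_flow(s)
sched.downgrade("B")
print([sched.issue_token() for _ in range(4)])  # ['A', 'C', 'A', 'C']
```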
CRSP uses a credit-based reservation method with a default credit size equal to twice the maximum network flow size, and speculation is not allowed. CRSP must therefore set a credit value large enough to cover the transmission of long flows. However, in high-performance networks with large variation in flow sizes, this effectively leaves short flows unconstrained: a credit value sized for the largest flows is meaningless for short flows, while ultra-long flows are rare.
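A back-of-the-envelope illustration of why such a credit budget never binds for short flows (the flow sizes and the simple budget check are assumptions, not CRSP's exact mechanism):

```python
MAX_FLOW = 1 << 20            # assumed maximum network flow size: 1 MiB
CREDIT = 2 * MAX_FLOW         # CRSP default: twice the maximum flow size

def must_wait(flow_bytes: int, outstanding_bytes: int) -> bool:
    """Illustrative credit check: a sender waits only if the new flow
    would exceed the credit budget."""
    return outstanding_bytes + flow_bytes > CREDIT

# A 4 KiB short flow passes even with a full 1 MiB flow outstanding:
print(must_wait(4 << 10, 1 << 20))        # False: short flows unconstrained
# Only near-maximal combinations ever hit the budget:
print(must_wait(1 << 20, (1 << 20) + 1))  # True, but such flows are rare
```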
Our implementation of ExpressPass focused on the Network Interface module and used the ExpressPass protocol for all WQEs, which degrades SM WQE performance due to the additional reservation time overhead. ExpressPass achieves CC by dropping credits at routers on the bottleneck link and adjusting the credit rate at the destination. ExpressPass does not support speculation and guarantees fairness. However, because it restricts path diversity, drops credits somewhat arbitrarily at routers, and frequently starts and stops end-to-end credit transmission, its performance is not stable.
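A simplified model of this credit-drop-plus-feedback loop (our sketch; the real ExpressPass controller is more elaborate, and the numbers are arbitrary). Note how the rate oscillates, consistent with the instability observed above:

```python
def credit_drop(arriving: int, budget: int) -> tuple[int, int]:
    """Toy router behavior: credits beyond the bottleneck's per-interval
    budget are dropped, capping the data later sent on the reverse path."""
    forwarded = min(arriving, budget)
    return forwarded, arriving - forwarded

def adjust_rate(rate: float, loss: float, max_rate: float = 1.0) -> float:
    """Simplified destination-side feedback: back off in proportion to
    credit loss, probe upward when no credits were dropped."""
    if loss == 0.0:
        return min(max_rate, rate * 1.5)
    return rate * (1.0 - loss)

rate = 1.0
for _ in range(3):
    sent = int(100 * rate)
    forwarded, dropped = credit_drop(sent, budget=60)
    rate = adjust_rate(rate, dropped / sent)
    print(sent, forwarded, round(rate, 2))   # 100/60/0.6, 60/60/0.9, 90/60/0.6
```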
PCRP and Homa differ in reservation granularity: PCRP reserves the packet chain that can be sent within one RTT, whereas Homa reserves one packet at a time. Both prioritize short flows during scheduling and use multiple VCs in routers to support packet priorities, leading to lower average flow latency, albeit at the expense of long flows. Their overall performance depends on the number and size of long and short flows in the load. Compared with Base, their program completion times are higher due to inaccurate reservations caused by RTT fluctuations. Specifically, PCRP computes its packet chain with reference to the base RTT, so the reservation granularity is smaller than the actual RTT allows, wasting network bandwidth. Homa, in turn, requires a grant for every packet sent, and network bandwidth is wasted whenever RTT fluctuations cause the grant to lag.
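For concreteness, a back-of-the-envelope calculation of PCRP's chain size (all link parameters here are assumptions for illustration):

```python
import math

LINK_GBPS = 100       # assumed link rate
BASE_RTT_US = 2.0     # assumed base round-trip time
PKT_BYTES = 1024      # assumed packet size

# PCRP-style chain: packets sendable within one base RTT
# (the bandwidth-delay product).
bdp = LINK_GBPS * 1e9 / 8 * BASE_RTT_US * 1e-6   # 25,000 bytes in flight
chain = math.ceil(bdp / PKT_BYTES)
print(chain)          # 25 packets per reservation

# If the actual RTT inflates to 3 us, each round trip still reserves
# only 2 us worth of packets, so roughly a third of the in-flight
# window goes unreserved -- the bandwidth wastage described above.
# Homa loses bandwidth analogously when per-packet grants arrive late.
```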
The poor performance of DCQCN and IB is mainly caused by head-of-line (HoL) blocking. DCQCN and IB are implemented in the Network Interface module, so when a message's sending rate is throttled, the next, non-congested message cannot be injected into the network. Simply placing the CC protocol and the RDMA protocol at different levels therefore degrades performance. HPCC is implemented in the Send Arbitration module, where a per-flow multi-queue is added to avoid HoL blocking (a sketch follows below). However, HPCC over-controls network congestion, so its program completion time is large. We used the default parameters of each CC protocol at all scales, so some protocols exhibit sharp performance jitter at small scales (Figure 14(a)).
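A minimal Python sketch of the per-flow multi-queue idea (names are ours, not HPCC's; only the arbitration structure is modeled):

```python
from collections import defaultdict, deque

class SendArbiter:
    """Toy per-flow multi-queue: a throttled flow parks in its own queue,
    so messages of non-congested flows can still be injected -- avoiding
    the HoL blocking seen when CC sits in the Network Interface module."""
    def __init__(self):
        self.queues = defaultdict(deque)   # one queue per flow
        self.throttled = set()             # flows paused by CC

    def enqueue(self, flow_id: str, msg: str):
        self.queues[flow_id].append(msg)

    def next_message(self):
        for flow_id, q in self.queues.items():
            if q and flow_id not in self.throttled:
                return q.popleft()
        return None                        # every eligible queue is empty

arb = SendArbiter()
arb.enqueue("congested", "m1")
arb.enqueue("clean", "m2")
arb.throttled.add("congested")
print(arb.next_message())  # "m2": the clean flow is not blocked behind m1
```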
Subsequently, we tested the latency performance of these protocols using random node pair and random ring traffic. Random node pair traffic creates random source–destination node pairs for communication. Random ring traffic randomly arranges all nodes into a ring, in which each node communicates only with its neighbors. Messages are sent in ascending order of size during the test, and the random node pair or random ring processes are executed multiple times. The vertical axes of Figures 15(a) and 15(b) represent the average time to complete multiple point-to-point or ring communications at a given message size, and Figure 15(c) shows the total time for the two test programs to complete messages of all sizes.
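For reference, one plausible reading of the two traffic generators (our own sketch; the actual test program may differ in details):

```python
import random

def random_pairs(n_nodes: int) -> list[tuple[int, int]]:
    """Random node-pair traffic: shuffle the nodes and pair them off so
    each source-destination pair communicates independently."""
    nodes = list(range(n_nodes))
    random.shuffle(nodes)
    return [(nodes[i], nodes[i + 1]) for i in range(0, n_nodes - 1, 2)]

def random_ring(n_nodes: int) -> list[tuple[int, int]]:
    """Random ring traffic: shuffle all nodes into a ring; each node
    sends only to its successor, i.e., its neighbor on the ring."""
    ring = list(range(n_nodes))
    random.shuffle(ring)
    return [(ring[i], ring[(i + 1) % n_nodes]) for i in range(n_nodes)]

random.seed(0)
print(random_pairs(6))  # e.g. [(4, 1), (3, 0), (5, 2)]
print(random_ring(4))   # e.g. [(2, 0), (0, 3), (3, 1), (1, 2)]
```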
PCRP and Homa have the lowest latency and total completion time under random node pair traffic, as shown in Figures 15(a) and 15(c). This differs from the GPCNeT results because random node pair traffic is highly bursty with no endpoint congestion, which makes SMSRP and BFRP ineffective, while PCRP and Homa benefit from multiple VCs and priorities within the network. ExpressPass performs well in latency thanks to its handling of in-network congestion but has the longest total completion time due to its larger reservation overhead. The other protocols behave much as they do under GPCNeT.
Under random ring traffic, PCRP, Homa, SMSRP, and BFRP show significant latency reductions compared with the other protocols (Figure 15(b)) because the random ring communication pattern amplifies reservation overhead: each node must receive a message from its upstream node before it can send to its downstream node, which magnifies the advantage of these protocols' half-connection scheme. However, the total completion times of PCRP and Homa increase rather than decrease, unlike those of SMSRP and BFRP, for the same reason as in the GPCNeT test. The performance of the other protocols is consistent with expectations and is not discussed further.
Finally, we tested these protocols on common topologies from real systems and increased the test scale to 1,000 nodes. The test traffic consisted of random node pairs and included only RDMAW messages, with message sizes ranging from 8 B to 256 KB. Figure 16(a) shows the results on a real-life fat-tree (3;16,8,8;1,8,8;1,2,2) using the Dmodk routing algorithm, and Figure 16(b) shows the results on a dragonfly (8,4) using the UGAL routing algorithm. The performance of each protocol is generally consistent with the previous discussion. Proactive CC protocols generally achieve better latency than reactive CC protocols, with PCRP showing the largest latency improvement. However, some proactive CC protocols with aggressive CC strategies incur longer program completion times. Notably, in the dragonfly topology, all CC protocols reduced program completion time, demonstrating that CC generally helps performance there.
The above experiments show that the CC offloading schemes implemented through COER are accurate and consistent with the protocols' original characteristics. Each scheme can be decomposed into components, facilitating the rapid design of an RNIC offloading implementation. COER can thus effectively integrate CC protocols with the RDMA protocol on RNICs while preserving their original behavior.
In addition, these experiments indicate that proactive CC protocols are more compatible with RDMA networks and exhibit better performance. Among the eight tested proactive CC protocols, PCRP and Homa consistently maintained the lowest latency, suggesting that a more complex yet refined mechanism is effective. pHost is the only CC protocol whose performance improves as the scale increases, underscoring the importance of fairness in CC within large-scale networks. Therefore, what RDMA networks require is a complex, refined, and fairness-guaranteed proactive CC protocol.