CN117812027B - RDMA (remote direct memory access) acceleration multicast method, device, equipment and storage medium - Google Patents
- Publication number
- CN117812027B CN117812027B CN202410224302.0A CN202410224302A CN117812027B CN 117812027 B CN117812027 B CN 117812027B CN 202410224302 A CN202410224302 A CN 202410224302A CN 117812027 B CN117812027 B CN 117812027B
- Authority
- CN
- China
- Legal status
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/20—Support for services
- H04L49/201—Multicast operation; Broadcast operation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1863—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast comprising mechanisms for improved reliability, e.g. status reports
- H04L12/1868—Measures taken after transmission, e.g. acknowledgments
- H04L12/1872—Measures taken after transmission, e.g. acknowledgments avoiding ACK or NACK implosion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/30—Peripheral units, e.g. input or output ports
- H04L49/3009—Header conversion, routing tables or routing tags
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/65—Re-configuration of fast packet switches
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/70—Virtual switches
Abstract
The invention relates to an RDMA accelerated multicast method, device, equipment and storage medium, belonging to the technical field of communication. The sender transmits only a single copy of the data, and intermediate nodes replicate it and forward it along optimal paths; therefore there is no bandwidth waste, and communication distance and delay are minimized. The invention integrates the fourth layer into the fabric and thereby realizes virtual one-to-many connections: a single connected QP at the sending end can communicate simultaneously with multiple QPs on multiple receiving ends, so the invention can fully exploit the advanced functions of the connected transport service. Besides more effective bandwidth utilization, the invention is superior to application-layer multicast, with shorter communication distances and lower forwarding delays, because there is no additional WQE and CQE processing and intermediate nodes do not participate in forwarding data at the application layer.
Description
Technical Field
The invention belongs to the technical field of communication, and particularly relates to an RDMA (remote direct memory access) accelerated multicast method, device, equipment and storage medium.
Background
Data center applications place increasingly stringent demands on network communication, such as sustained high concurrency, ultra-low latency (microsecond scale), and low CPU overhead. To meet these demands, remote direct memory access (RDMA) has become a widely used networking technology, with link rates reaching 40 Gbps. Many technology companies have incorporated RDMA into their production data centers. These data centers carry various network-intensive applications, such as deep learning, cloud storage, and graph exploration, which benefit greatly from the underlying RDMA communication.
However, RDMA data transfer supports only one-to-one reliable connections (Reliable Connection, RC), which leaves applications mismatched to group communication patterns (e.g., one-to-many). Multicast is common in data center applications; in particular, it ranks among the top two communication patterns in high-performance computing clusters, and in benchmark high-performance computing applications more than 90% of the traffic is multicast.
To meet the needs of applications, one straightforward solution is to provide multicast primitives. However, although some multicast solutions exist, none of them simultaneously achieves two performance requirements: 1) optimally forwarding multicast traffic; 2) fully unleashing the excellent functions of RDMA.
On the one hand, some mainstream RDMA multicast frameworks choose to build private multicast protocols on top of RDMA's one-to-one RC transport (also known as application-layer multicast). They can thus effectively utilize the outstanding functions of RDMA. However, application-layer multicast does not achieve optimal traffic forwarding, resulting in low bandwidth utilization and communication bottlenecks. Although many multicast algorithms have been proposed to alleviate these bottlenecks (e.g., rings and binomial trees), they inevitably incur longer communication distances and higher delays, because data is forwarded by intermediate nodes.
On the other hand, the IB multicast defined in the InfiniBand (IB) standard supports only UD (Unreliable Datagram) transport. Applications employing IB multicast therefore cannot take advantage of high-performance RDMA one-sided WRITE, extended message sizes, or hardware-supported reliability. In addition to performance requirements, multicast solutions should also meet the practical deployment requirements of a production environment.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an RDMA accelerated multicast method, device, equipment and storage medium that realize optimal multicast forwarding and effective utilization of RDMA advanced functions while meeting deployment requirements.
In order to solve the problems, the invention adopts the following technical scheme: RDMA accelerated multicast method, including
Control plane multicast group registration: an RDMA UD-mode protocol is taken as an envelope; the initialization node in the multicast group collects the layer-4 state of the other nodes, fits those states into the envelope packet format, and sends them back to the other nodes, both for establishing the multicast forwarding table and for confirming multicast membership;
Data plane one-to-many data forwarding: the multicast members whose identity has been confirmed act as receiving ends; the multicast source completes a single data transmission over the existing RC; a receiving end acting as an intermediate forwarding node replicates the multicast data according to the multicast forwarding table and forwards the copies to multiple other receiving ends along the optimal path; the connection-related state in the routing table is replaced at specific forwarding nodes to match different QPs on different receiving ends, thereby realizing a virtual one-to-many connection;
Data plane many-to-one feedback aggregation: after receiving multicast data, a receiving end generates ACK/NACK/CNP packets according to the existing RC logic and forwards the generated feedback to a forwarding node; when a forwarding node has multiple feedback streams as input, it first performs feedback aggregation and then forwards the aggregated feedback to the next-hop forwarding node.
Further, in the control plane multicast group registration process, the multicast forwarding table is extended:
A multicast group IP address serves as the index key, which indexes a group-level state and a port-level state. The group-level state contains statistical information about the multicast group. The port-level state is formatted as an array of at most n entries, where n is the number of multicast member ports, and the array contains only ports included in the multicast tree. Each entry is assigned one of two types, connected or forwarded, where connected indicates that the port is directly connected to a receiving-end node, and forwarded indicates that the next hop is another receiving end.
The destination IP is set to the unique multicast group IP, and any non-conflicting value is assigned to the destination QPN;
after the QPs are established, the multicast members exchange QP information and register the forwarding routing table into user space.
Further, in the data plane one-to-many data forwarding process, when a forwarding node receives a data packet sent by the multicast source, it uses the destination IP address in the packet to index the associated multicast forwarding table, then iterates over all entries in the multicast forwarding table and performs the following operations:
If the entry type is direct data forwarding: creating a data packet copy and forwarding through the port;
If the entry type is used to establish a multicast connection: creating a copy of the data packet, modifying its connection-related state, and forwarding through the port.
Further, in the data plane one-to-many data forwarding process, MR information is updated for each WRITE request:
Before submitting the actual WRITE request, the host application issues an additional WRITE message containing the MR states of the different receiving ends. After this additional WRITE message is identified, the receiving ends' MR information is updated into the multicast forwarding table; when the actual WRITE request is submitted, the MR information in the updated multicast forwarding table replaces the original MR information.
Further, in the data plane many-to-one feedback aggregation process, many-to-one ACK aggregation and NACK filtering are performed:
the multicast source receives an ACK only when all receiving ends have received the corresponding packet;
when any receiving end loses the packet, the multicast source receives a NACK.
Further, in the data plane many-to-one feedback aggregation process, when a forwarding node receives an ACK/NACK packet, it finds the associated multicast forwarding table through the packet's destination IP and processes ACK and NACK packets as follows;
processing an ACK packet: the ACK state comprises group-level data and port-level data; if the PSN of the incoming ACK is larger than that of the old ACK, the port's ACK PSN is updated, and if the triggering condition is met, a GENERATE NEW ACK/NACK function generates an aggregate ACK containing the minimum ack_psn recorded for the multicast group;
processing a NACK packet: when a loss occurs, the receiving end generates a NACK packet to inform the multicast source; the NACK packet contains the receiving end's expected PSN, and each NACK cumulatively acknowledges all packets whose PSN is smaller than that expected PSN.
An RDMA (remote direct memory access) accelerated multicast device, comprising
the initialization node unit, used for collecting the layer-4 states of the other nodes, fitting them into an envelope packet format, and sending it to the other nodes, wherein the envelope is an RDMA UD-mode protocol;
the multicast source is used for completing one-time data transmission through the existing RC;
the receiving end is used for receiving the multicast data, generating an ACK/NACK/CNP packet according to the existing RC logic, and forwarding the generated feedback to the intermediate node;
the intermediate node, used for receiving the envelope packet format, establishing the multicast forwarding table and confirming multicast membership; for replicating multicast data and forwarding the copies to multiple receiving ends along the optimal path, with specific forwarding nodes replacing the connection-related state in the routing table to match different QPs on different receiving ends, thereby realizing a virtual one-to-many connection; and for first performing feedback aggregation when multiple feedback streams are input, then forwarding the aggregated feedback to the next-hop intermediate node.
RDMA accelerated multicast device, including
One or more processors;
A storage device storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the processors to implement the method described above.
A storage medium storing a computer program which, when executed by a processor, implements the method described above.
The beneficial effects of the invention are as follows:
The key points of the invention are as follows: 1. performing optimal multicast forwarding in an in-fabric, distributed manner; 2. repurposing native RDMA RC logic, through careful coordination of the multicast routing table, to achieve efficient multicast transmission.
However, there is a key challenge: how to achieve integration between optimal multicast forwarding and the existing RDMA RC logic. Specifically, there are two main compatibility issues:
First, the connectionless UD logic in RDMA cannot find the relevant queue pair (QP) when receiving a conventionally forwarded multicast packet. In conventional multicast forwarding, the switch does not alter the layer-4 header of the packet: it either copies the entire packet or modifies only the layer-3 (IP) header for layer-3 multicast routing. Thus, all packets carry the same layer-4 header, which matches at most one connection. A network card that finds no match will discard these packets, because it cannot locate the associated QP and queue pair context (Queue Pair Context, QPC) from the unmatched layer-4 header.
Second, even if the receiving end can accept the packet and find the relevant QP, a second incompatibility that prevents RC utilization is the existing reliability logic. Standard reliability logic is designed for a single feedback stream (including ACK, NACK, CNP, etc.) from a single receiver. Multiple feedback streams from multiple receivers may therefore confuse the RC and interfere with its loss detection and retransmission mechanisms.
The present invention provides a fabric-supported multicast protocol that differs substantially from traditional layer-3 multicast in its structural approach and systematically integrates its design components to address these two incompatibility issues.
For data plane one-to-many data forwarding, the present invention retains the optimal multicast forwarding employed by existing in-fabric multicast solutions. The sender therefore needs to send only one copy of the data; a routing table is maintained via RDMA's UD mode, and the receiving ends make multiple copies according to this routing table and forward them to multiple receiving ends along the optimal path. As a result, there is no bandwidth waste, and communication distance and delay are minimized.
Second, unlike previous work that supports only layer-3 routing, the present invention integrates the fourth layer into the fabric, further enabling virtual one-to-many connections. A single connected QP at the sender can thus communicate with multiple QPs on multiple receivers simultaneously, enabling the present invention to fully exploit the advanced functions of the connected transport service, namely one-sided WRITE operations and extended message sizes. In addition to more efficient bandwidth utilization, the present invention is also superior to application-layer multicast, with shorter communication distances and lower forwarding delays, because there is no additional WQE and CQE processing and intermediate nodes are not involved in forwarding data at the application layer.
Drawings
FIG. 1 is a flow diagram of an RDMA accelerated multicast method of the present invention;
Fig. 2 is a schematic diagram of multicast communication according to the present invention;
Fig. 3 is a schematic diagram of the invention for expanding a multicast forwarding table;
fig. 4 is a schematic diagram illustrating the process of turning an initial node into a multicast node during multicast registration according to the present invention;
FIG. 5 is a schematic diagram of a one-to-many data forwarding process according to the present invention;
FIG. 6 is a schematic diagram illustrating the processing of a packet by a forwarding node according to the present invention;
FIG. 7 is a schematic diagram of the processing of many-to-one data aggregation of the present invention;
FIG. 8 is a schematic diagram of the processing of the ACK/NACK acknowledgement message generation algorithm in the data aggregation process of the present invention;
FIG. 9 is a schematic diagram of the process of NACK filtering in accordance with the present invention;
FIG. 10 is a schematic diagram of a multicast software call flow in accordance with the present invention;
FIG. 11 is a schematic diagram of the application message mechanism of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention focuses on realizing: 1. optimal multicast forwarding; 2. effective utilization of RDMA advanced functions; and 3. meeting deployment requirements. To this end, the key ideas of the invention are: 1. performing optimal multicast forwarding in an in-fabric, distributed manner; 2. repurposing native RDMA RC logic, through careful coordination of the multicast routing table, to achieve efficient multicast transmission.
However, there are some key challenges: how to achieve integration between optimal multicast forwarding and existing RDMA RC logic. Specifically, there are two major compatibility issues.
First, the connectionless UD logic in RDMA cannot find the relevant queue pair (QP) when receiving a conventionally forwarded multicast packet. In conventional multicast forwarding, the switch does not alter the layer-4 header of the packet: it either copies the entire packet or modifies only the layer-3 (IP) header for layer-3 multicast routing. Thus, all packets carry the same layer-4 header, which matches at most one connection. A network card that finds no match will discard these packets, because it cannot locate the associated QP and queue pair context (Queue Pair Context, QPC) from the unmatched layer-4 header.
Second, even if the receiving end can accept the packet and find the relevant QP, a second incompatibility that prevents RC utilization is the existing RC reliability logic. Standard reliability logic is designed for a single feedback stream (including ACK, NACK, CNP, etc.) from a single receiver; multiple feedback streams from multiple receivers may therefore confuse the RC and interfere with its loss detection and retransmission mechanisms.
To solve the above two compatibility problems, a fabric-supported multicast protocol is provided, which differs substantially from traditional layer-3 multicast in its structural approach.
Specifically, as shown in fig. 1, fig. 1 shows the multicast flow of the present invention, which is composed of three main phases: control plane multicast group registration, data plane one-to-many data forwarding and data plane many-to-one feedback aggregation.
For control plane multicast group registration, the invention carefully extends the traditional multicast forwarding table structure by integrating layer-4 state. The protocol of RDMA UD mode, one of the three common RDMA transport modes, serves as an envelope for registering the multicast forwarding routing table. The initialization node in the multicast group collects the layer-4 states of the other nodes, fits them into the envelope packet format, and sends it to the other nodes, both for establishing the multicast forwarding table and for confirming multicast membership. Each involved node replies with an ACK to the initialization node to confirm its participation. The multicast group has multiple nodes, comprising initialization nodes and multicast nodes: a multicast node is any node after the multicast routing forwarding table has been built, and an initialization node is a node before the multicast forwarding table has been built.
For data plane one-to-many data forwarding, the invention retains optimal multicast forwarding and realizes virtual one-to-many connections. The multicast members whose identity has been confirmed act as receiving ends; the multicast source completes a single data transmission over the existing RC; each receiving end judges, according to the multicast routing forwarding table, whether the multicast data needs to be replicated and forwarded, and the replicated multicast data is forwarded to multiple receiving ends along the optimal path. The connection-related state in the routing table is replaced at specific forwarding nodes to match different QPs on different receiving ends, thereby implementing a virtual one-to-many connection.
Taking the multicast communication shown in fig. 2 as an example: S is the multicast source, i.e., the data sending end; R1, R2 and R3 are receiving ends; L1, L2, L3, L4, S1, S3 and C2 are forwarding nodes, which are generally receiving ends serving as intermediate nodes. The multicast source S transmits the data only once over the existing RC connection. The forwarding intermediate nodes (L1, S1, C2, S3) in the multicast tree then replicate the data and forward it to the receiving ends along the optimal path. In addition, some specific forwarding nodes replace connection-related state in the routing table to match different QPs on different receiving ends, thereby implementing virtual one-to-many connections. For example, L1, S1, C2 and S3 replicate and forward the packet to the specific ports identified in the forwarding table, while L2, L3 and L4 replace some of the connection-related state in the packet to match the QPs in R1, R2 and R3.
Data plane many-to-one feedback aggregation: after receiving multicast data, a receiving end generates ACK/NACK/CNP packets according to the existing RC logic and forwards the generated feedback to the multicast source or an intermediate node; when an intermediate node has only one feedback input, it forwards the feedback packet directly to the next-hop intermediate node; when it has multiple feedback inputs, it first performs feedback aggregation and then forwards the aggregated feedback to the next-hop intermediate node.
As shown in fig. 2, after receiving the data packets, the receiving ends R1, R2 and R3 generate normal ACK/NACK/CNP packets according to the existing RC logic. This feedback is then forwarded toward the sender. Taking ACK as an example, L2, L3, L4 and C2 simply forward ACK packets to the next-hop intermediate node, since each has only one ACK stream as input. S3 and S1 perform ACK aggregation, because each has multiple ACK streams as input. Before forwarding the aggregate ACK, L1 changes the connection-related state in the ACK header to match the QP of the multicast source S.
The aggregated ACK and NACK packets enable the transmitting end to transmit subsequent new data packets or to retransmit lost data packets. The filtered CNP is used to adjust the sending rate of the sender.
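As an illustration of the aggregation rule described above, the following sketch (the function name `aggregate_feedback` and the tuple encoding are assumptions, not part of the patent) shows how a forwarding node could combine multiple feedback streams into one:

```python
# Illustrative sketch of many-to-one feedback aggregation: the multicast
# source sees an ACK only when every receiver has acknowledged, and a NACK
# as soon as any receiver reports a loss. Names and encodings are assumptions.

def aggregate_feedback(feedbacks):
    """feedbacks: list of ("ACK", psn) or ("NACK", expected_psn) tuples,
    one per input feedback stream arriving at a forwarding node."""
    nacks = [psn for kind, psn in feedbacks if kind == "NACK"]
    if nacks:
        # Forward the smallest expected PSN: each NACK cumulatively
        # acknowledges all packets whose PSN is below its expected PSN.
        return ("NACK", min(nacks))
    # All streams acknowledged: the aggregate ACK carries the minimum
    # ack_psn recorded across the group's ports.
    return ("ACK", min(psn for _, psn in feedbacks))
```

For instance, aggregating `[("ACK", 7), ("ACK", 5)]` yields `("ACK", 5)`, while `[("ACK", 7), ("NACK", 3)]` yields `("NACK", 3)`, matching the rule that the source receives an ACK only when every receiver has the packet.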
It can be seen that the data plane of the present invention performs one-to-many data forwarding, preserving the optimal multicast forwarding employed by previous in-fabric multicast solutions. The sender thus needs to send only one copy of the data, while the RC-mode traffic is replicated according to the multicast routing forwarding table established via the UD mode and forwarded to multiple receivers along the optimal path. Therefore, there is no bandwidth waste, and communication distance and delay are minimized.
In addition, unlike previous work that supports only layer-3 routing, the present invention further enables virtual one-to-many connections by collecting the layer-4 state of the other nodes and integrating layer 4 into the fabric. A single connected QP at the sender can thus communicate with multiple QPs on multiple receivers simultaneously, so the present invention can fully exploit the advanced functions of the connected transport service, namely one-sided WRITE operations and extended message sizes. In addition to more efficient bandwidth utilization, the present invention also achieves shorter communication distances and lower forwarding delays than application-layer multicast, because there is no additional WQE and CQE processing and the intermediate nodes forward data in-fabric.
In the process of control plane multicast group registration, the multicast forwarding table is expanded:
A multicast group IP address serves as the index key, which indexes a group-level state and a port-level state. The group-level state contains statistical information about the multicast group; the port-level state is formatted as an array of at most n entries, where n is the number of multicast member ports, and the array contains only ports included in the multicast tree. Each entry is assigned one of two types, connected or forwarded, where connected indicates that the port is directly connected to a receiving-end node, and forwarded indicates that the next hop is another receiving end.
The invention carefully extends the traditional multicast forwarding table structure by integrating layer-4 state. The extended table is shown in fig. 3; its index key is the multicast group IP address (GroupIP for short). Many multicast groups may exist simultaneously, each with a unique GroupIP.
The GroupIP indexes two types of state: group-level state and port-level state. The group-level state contains statistics of the multicast group, which are used for many-to-one feedback aggregation. The port-level state is formatted as an array of at most n entries (n being the number of multicast member ports), containing only ports included in the multicast tree. The entry for port i represents the state associated with port i. Each entry is assigned one of two types, connected or forwarded: connected indicates that the port is directly connected to a receiving-end node, and forwarded indicates that the next hop is another receiving end. A connected entry contains the layer-3 and layer-4 states of the connected receiving end as well as the ACK/NACK state; a forwarded entry contains only the ACK/NACK state.
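Under the assumption that the described table maps naturally onto nested records (all class and field names below are illustrative, not taken from the patent), the extended forwarding table might be modeled as:

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical model of the extended multicast forwarding table described
# above. Field and class names are illustrative assumptions.

@dataclass
class GroupState:
    """Group-level statistics used for many-to-one feedback aggregation."""
    min_ack_psn: int = 0   # smallest ACK PSN recorded across member ports
    member_ports: int = 0  # number of ports expected to acknowledge

@dataclass
class PortEntry:
    """Per-port state; at most n entries (n = member ports), and only
    ports that lie on the multicast tree appear."""
    kind: str              # "connected" (directly attached receiver) or
                           # "forwarded" (next hop is another receiver)
    ack_psn: int = 0       # per-port ACK/NACK state
    dst_ip: str = ""       # L3 state, connected entries only
    dst_qpn: int = 0       # L4 state, connected entries only

@dataclass
class MulticastTable:
    group: GroupState = field(default_factory=GroupState)
    ports: Dict[int, PortEntry] = field(default_factory=dict)

# The whole structure is indexed by the multicast group IP (GroupIP).
tables: Dict[str, MulticastTable] = {}

t = MulticastTable(group=GroupState(member_ports=2))
t.ports[1] = PortEntry(kind="connected", dst_ip="10.0.0.2", dst_qpn=0x11)
t.ports[2] = PortEntry(kind="forwarded")
tables["239.1.1.1"] = t
```

Note how the connected entry carries the receiver's L3/L4 state while the forwarded entry carries only ACK/NACK state, mirroring the two entry types above.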
The specific memory space used by a multicast table depends on the number of ports involved in the multicast group, at most n. It can be calculated that 1K multicast groups require at most 0.92 MB of memory when each group contains the maximum number of entries. Furthermore, the design goal of the present invention is not to compress the state maintained by the switch, but to provide a generic multicast protocol with the salient RDMA features; many existing approaches can be used to extend the multicast forwarding table to support more groups.
Establishing QPs: each multicast member follows a procedure similar to common unicast to establish an RC QP, but assigns it a virtual destination. Specifically, the destination IP is set to the unique GroupIP, and the destination QPN may be assigned any non-conflicting value (e.g., 0x1). The IB network card provides a programming interface for the application to specify the destination IP and QPN without modifying the IB network card state. After the QPs are established, the multicast members exchange their QP information and register the forwarding table into user space. Once registration is complete, the sender may begin sending multicast packets.
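A minimal sketch of the virtual-destination setup above; `create_rc_qp` and `connect` are hypothetical stand-ins for the network card's programming interface, not real ibverbs calls:

```python
# Sketch of establishing an RC QP against a virtual multicast destination,
# per the procedure above. create_rc_qp/connect are hypothetical stand-ins
# for the IB card's programming interface, not real ibverbs functions.

def setup_virtual_connection(create_rc_qp, connect, group_ip: str,
                             dst_qpn: int = 0x1):
    """Create an RC QP whose destination is the GroupIP rather than a peer.

    The destination QPN may be any non-conflicting value (e.g. 0x1)."""
    qp = create_rc_qp()
    connect(qp, dst_ip=group_ip, dst_qpn=dst_qpn)
    return qp

# Usage with dummy callbacks standing in for the real interface:
calls = []
qp = setup_virtual_connection(lambda: "qp0",
                              lambda qp, dst_ip, dst_qpn:
                                  calls.append((qp, dst_ip, dst_qpn)),
                              "239.1.1.1")
```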
In the control-plane multicast group registration process, the initial nodes are converted into multicast nodes:
Each initial node sends out multicast registration information via RDMA UD and receives multicast registration information from the other initial nodes; upon first receiving such information, it immediately starts a thread to build a forwarding table and sends the forwarding table to the other nodes.
As shown in fig. 4, all four nodes A, B, C and D are initial nodes and send multicast registration information outward. Node A first receives the reply message of node B (message 13 in fig. 4) and builds a forwarding table, then synchronizes it with the other nodes; node B receives the reply message of node C (message 18 in fig. 4) and builds a forwarding table; node C receives the reply message of node D (message 22 in fig. 4) and builds a forwarding table; node D acts directly as a final receiving end and does not need to forward, because forwarding tables have already been established for all the reply messages it received.
In the data-plane one-to-many data forwarding process, when a forwarding node receives a data packet sent by the multicast source, it uses the destination IP address in the data packet to index the associated multicast forwarding table, then iterates over all entries in the table and performs the following operations:
If the entry type is direct data forwarding: creating a data packet copy and forwarding through the port;
If the entry type is used to establish a multicast connection: creating a copy of the data packet, modifying its connection-related state, and forwarding through the port.
The intermediate nodes of the multicast structure are responsible for forwarding data through the structure and modifying the packets to match different QPs. The forwarding node processes a data packet following the procedure shown in fig. 5: upon receipt of a packet (p), the multicast protocol uses the destination IP address (p.dest_ip) to index the associated multicast forwarding table (T). The forwarding node then iterates over all entries in T (one entry per port) and performs the following: (1) if the type is direct data forwarding: create a packet copy and forward it through the port; (2) if the type is used to establish the multicast connection: create a packet copy, modify its connection-related state, and forward it through the port.
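The per-entry iteration above can be sketched as follows. The dict-based packet and table layout is an illustrative assumption, not the switch's actual implementation; 'connected' entries stand in for the "establish multicast connection" type and 'forwarded' entries for direct data forwarding.

```python
def forward(packet, table):
    """Iterate all forwarding-table entries for packet p: one copy per port,
    with connection-related state rewritten for directly connected receivers."""
    copies = []
    for port, entry in table.items():
        copy = dict(packet)                      # create a packet copy
        if entry['type'] == 'connected':
            # modify connection-related state to match this receiver's QP
            copy['dest_ip'] = entry['recv_ip']
            copy['dest_qpn'] = entry['recv_qpn']
        copies.append((port, copy))              # forward through this port
    return copies

table = {
    1: {'type': 'connected', 'recv_ip': '10.0.0.2', 'recv_qpn': 0x11},
    2: {'type': 'forwarded'},
}
out = forward({'dest_ip': '239.1.1.1', 'psn': 7}, table)
```

Each incoming data packet thus fans out into one copy per multicast-tree port, with only the directly connected ports requiring header rewrites.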
Taking the path S1 to L2 to R1 of the multicast structure in fig. 2 as an example of the header changes: first, the destination IP and QPN are modified to match the QP identification of R1. In addition, the present invention changes the source IP from the sender's IP to GroupIP. As a result, when R1 generates feedback, the destination IP of the feedback is the source IP of the packet, i.e., GroupIP; thus the feedback packet can also index the associated forwarding table through its destination IP. Finally, the present invention replaces the destination MAC address so that the MAC layer of the receiving end does not discard the data packet.
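The three header changes on the S1-to-L2-to-R1 path can be sketched as below. The GroupIP value, field names, and receiver record are illustrative assumptions.

```python
GROUP_IP = '239.1.1.1'   # illustrative GroupIP for this sketch

def rewrite_header(pkt, receiver):
    """Header rewrite applied at L2 for the copy headed to R1 (fig. 2)."""
    pkt = dict(pkt)
    pkt['dest_ip'] = receiver['ip']      # (1) match R1's QP identification
    pkt['dest_qpn'] = receiver['qpn']
    pkt['src_ip'] = GROUP_IP             # (2) R1's feedback will target GroupIP
    pkt['dest_mac'] = receiver['mac']    # (3) keep R1's MAC layer from dropping it
    return pkt

r1 = {'ip': '10.0.0.3', 'qpn': 0x22, 'mac': 'aa:bb:cc:00:00:03'}
pkt = rewrite_header({'src_ip': '10.0.0.1', 'dest_ip': GROUP_IP, 'psn': 0}, r1)
```

Setting the source IP to GroupIP is what lets the returning ACK/NACK stream index the same forwarding table on its way back through the structure.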
The present invention supports one-to-many WRITE. The present invention maintains connections between one sender and multiple receivers, which is sufficient for the SEND/RECEIVE RDMA primitives, but WRITE requires more support. WRITE allows a node to write to memory on a remote node. The MR information to be written (including the remote QP and q_key) is indicated in the first packet of the WRITE request. The responder's network card checks the MR information and executes the request only if it is correct; otherwise the packet is discarded. To enable one-to-many WRITE, the present invention needs to modify the MR state in the WRITE request headers for the different receiving ends. As shown in fig. 6, the present invention further extends the routing information table to support querying MR information, and designs a polling-and-notification message queuing mechanism for efficient one-to-many writing.
In addition to maintaining MR information for the different receivers, the MR information must also be updated for each WRITE request, since the MR varies across WRITE requests. Thus, during data-plane one-to-many data forwarding, the MR information is updated per WRITE request: before submitting the actual WRITE request, the host application sends an additional WRITE message containing the MR states of the different receiving ends; after this additional message is identified, the receivers' MR information is updated in the multicast forwarding table, and when the actual WRITE request is submitted, the updated MR information in the table replaces the original MR information. This on-request update scheme introduces minimal additional bandwidth overhead as long as the additional WRITE messages are small compared to the total amount of data transmitted.
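The on-request MR update can be sketched as below: the additional WRITE message carries per-receiver MR state, which the node stores; copies of the actual WRITE request are then patched with each receiver's MR info. The packet kinds, field names, and rkey/address pairs are illustrative assumptions.

```python
def handle(pkt, table):
    """Store MR state from the additional WRITE message, then patch each copy
    of the actual WRITE request with the corresponding receiver's MR info."""
    if pkt['kind'] == 'mr_update':
        table['mr'] = pkt['mr']                     # {receiver: (rkey, remote_addr)}
        return []                                   # nothing forwarded yet
    copies = []
    for rcv, (rkey, addr) in table['mr'].items():   # one patched copy per receiver
        c = dict(pkt)
        c['rkey'], c['remote_addr'] = rkey, addr    # replace the original MR info
        copies.append((rcv, c))
    return copies

table = {}
handle({'kind': 'mr_update', 'mr': {'R1': (0xA, 0x1000), 'R2': (0xB, 0x2000)}}, table)
writes = handle({'kind': 'write', 'payload': b'data'}, table)
```

The extra message costs one small packet per WRITE request, which is why the scheme's overhead stays negligible relative to bulk data.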
Through the extended one-to-many data forwarding, the present invention can optimally forward data copies to multiple receiving ends, and the QPs at the different receiving ends can accept the data packets. However, the existing RC reliability logic does not allow the sender to receive acknowledgment information from multiple receivers. In particular, the reliability logic in currently prevailing RDMA network cards is designed to receive a single feedback stream from a single receiver; thus, multiple feedback streams from multiple receivers may compromise reliability.
In the data-plane many-to-one feedback aggregation process, many-to-one ACK aggregation and NACK filtering are performed:
the multicast source receives the ACK only when all receiving ends receive the corresponding packets;
when any one of the receiving ends loses the packet, the multicast source receives NACK.
The feedback contains various types of datagrams, such as ACKs, NACKs, and notification packets for congestion control (e.g., CNPs). The present invention processes the ACK and NACK feedback as follows: many-to-one ACK aggregation and NACK filtering are performed within the structure so as to send a unicast-like ACK/NACK stream to the sender. As a result, the transmitting end can correctly interpret the ACKs and continue data transmission; moreover, when a loss occurs, the sender can accurately detect and retransmit the lost datagram.
The basic principle of ACK aggregation/NACK filtering is: (1) The multicast source should receive ACK only when all the receiving ends receive the corresponding packet; (2) When any receiving end loses a packet, the multicast source should receive a NACK. Furthermore, the aggregation needs to take into account processing rules of the RDMA protocol, such as go-back-N retransmission and ACK merging.
In the data-plane many-to-one feedback aggregation process, when the switch receives an ACK/NACK packet, it finds the associated multicast forwarding table through the destination IP of the packet, and processes ACK packets and NACK packets as follows:
The process of handling an ACK packet is: the ACK-related state comprises group-level data and port-level data; if the PSN of the incoming ACK is greater than the old ack_psn, the ack_psn of that port is updated, and if the trigger condition is met, the GENERATE NEW ACK/NACK function is invoked to generate an aggregate ACK, which contains the minimum ack_psn recorded for the multicast group;
The process of handling a NACK packet is: when a loss occurs, the receiving end generates a NACK packet to inform the multicast source; the NACK packet contains the expected PSN of the receiving end, and each NACK acknowledges all packets whose PSN is smaller than the expected PSN.
The present invention maintains the ACK/NACK-related information in an extended multicast forwarding table, as shown in fig. 7. Upon receiving an ACK/NACK packet, the forwarding node finds the associated forwarding table through the destination IP (i.e., GroupIP) of the packet. The working logic for the loss-free case is presented first, followed by the processing of NACK packets.
Processing ACKs: the ACK-related state includes: 1. group-level data, including the PSN of the last aggregated ACK (last_ack_psn) and the port through which ACKs should be transmitted (ack_out_port); 2. port-level data, including the maximum acknowledged PSN (ack_psn) for each port. The forwarding node processes the ACK packet, updates the relevant state in the forwarding table, and generates an aggregate ACK packet, as shown in fig. 7.
Upon receipt of an ACK packet, if the PSN of the incoming ACK is greater than the old ack_psn, the ack_psn of that port is updated first. Then, if the trigger condition is satisfied (p.psn greater than last_ack_psn, as shown in fig. 7), the ACK/NACK generation algorithm is invoked (as shown in fig. 8) to generate an aggregate ACK. The aggregate ACK contains the minimum ack_psn recorded for the multicast group, found by iterating over all multicast forwarding table entries. As a result, each aggregate ACK forwarded by the multicast protocol confirms that all downstream receivers have received the corresponding data packet.
The port with the smallest ack_psn is recorded as min_port. ACK generation is triggered each time an ACK with a larger PSN is received from min_port. Thus, not every ACK packet triggers generation, and the number of ACKs received by the source decreases.
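The aggregation logic of figs. 7-8 can be sketched as follows. This is a simplified reading in which an aggregate ACK is emitted only when the minimum ack_psn across all entries advances past last_ack_psn; field names mirror the state described above, but the exact algorithm in the patent's figures may differ.

```python
def process_ack(table, port, psn):
    """Update the port's ack_psn; when the trigger condition holds, emit one
    aggregate ACK carrying the minimum ack_psn over all table entries."""
    if psn > table['ack_psn'][port]:
        table['ack_psn'][port] = psn
    if psn > table['last_ack_psn']:               # trigger condition (fig. 7)
        min_psn = min(table['ack_psn'].values())  # iterate all entries
        if min_psn > table['last_ack_psn']:       # min_port actually advanced
            table['last_ack_psn'] = min_psn
            return {'type': 'ack', 'psn': min_psn}
    return None                                   # withhold: not all receivers caught up

table = {'last_ack_psn': 0, 'ack_psn': {1: 0, 2: 0}}
a1 = process_ack(table, 1, 5)   # only one receiver has acked: nothing forwarded
a2 = process_ack(table, 2, 3)   # both have acked up to 3: aggregate ACK emitted
```

The example shows the many-to-one property: the source sees an ACK only once every downstream port has acknowledged the packet.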
Processing NACKs: when a loss occurs, the receiving end generates a NACK packet to notify the multicast transmitting end. The NACK packet (p) contains the expected PSN (p.psn) of the receiving end. Each NACK acknowledges all packets with a PSN smaller than the expected PSN. This rule must be handled carefully when generating NACKs.
Taking fig. 9 as an example: there are two receiving ends and one transmitting end, and the transmitting end generates two data copies and forwards them to the two receiving ends. Two packets are lost, p1 to R1 and p2 to R2, with two associated NACK packets, nack_p1 and nack_p2. The structure should forward nack_p1 first, because it contains the minimum expected PSN. If nack_p2 were forwarded to the sender first, the loss of p1 to R1 would be masked, since the sender would assume that all packets before the expected PSN of nack_p2 (i.e., p2) had been received. The transmitting end would therefore not retransmit p1, and reliability would be compromised.
Therefore, a NACK packet should be forwarded only when all the receiving ends have acknowledged all packets whose PSN is smaller than its expected PSN. This decision is implemented in the ACK/NACK generation algorithm. If the condition is not satisfied, the forwarding node continues to wait, during which new ACK/NACK packets continue to arrive. If a new NACK is received and its PSN is not greater than the recorded t.nack.epsn, then t.nack.epsn is updated and the NACK generation condition is rechecked, as shown in fig. 9.
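The NACK filtering decision can be sketched as below, replaying the fig. 9 scenario: the node records the smallest expected PSN seen (t.nack.epsn) and forwards a NACK only once every port has acknowledged all packets below it. The sentinel value and field names are illustrative assumptions.

```python
BIG = 1 << 62   # sentinel meaning "no pending NACK"

def process_nack(table, epsn):
    """Forward a NACK only when every receiver has acknowledged all packets
    with PSN below the smallest expected PSN; otherwise withhold it so one
    receiver's loss cannot be masked by another's NACK."""
    if epsn <= table['nack_epsn']:                # keep the minimum expected PSN
        table['nack_epsn'] = epsn
    e = table['nack_epsn']
    # generation condition: all ports acked every packet with PSN < e
    if all(psn >= e - 1 for psn in table['ack_psn'].values()):
        table['nack_epsn'] = BIG                  # clear the pending NACK
        return {'type': 'nack', 'epsn': e}
    return None                                   # keep waiting for more feedback

# fig. 9: R1 lost p1 (acked up to p0), R2 lost p2 (acked up to p1)
table = {'nack_epsn': BIG, 'ack_psn': {'R1': 0, 'R2': 1}}
n1 = process_nack(table, 2)   # nack_p2 arrives first: withheld
n2 = process_nack(table, 1)   # nack_p1 carries the minimum expected PSN: forwarded
```

In the replay, forwarding nack_p2 immediately would have masked R1's loss of p1; the condition lets only nack_p1 through, so the sender retransmits from p1.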
The present invention provides a fully functional multicast protocol prototype that implements the in-structure logic, together with a set of software APIs exposed to applications. The multicast protocol prototype is based on the IB network card; as shown in fig. 10, the core protocol logic is developed in user mode on top of the rdmacm and ibverbs function libraries.
IB network card based API. The present invention implements the multicast registration, packet replication, routing table generation and update, and feedback aggregation logic at the application layer. A test environment is built with IB network cards and four servers (each machine a Mellanox/7800 model), each server equipped with an IB network card. Data (ACK) packets are replicated (aggregated) by a replicator (ACK aggregator). In case of queue contention, generated packets are pushed into the queue system to wait for the multiplexer to schedule them. Finally, the replicated data (aggregated ACK) packets are sent back to the commodity switch. During processing, the multicast forwarding tables are accessed as needed.
Software API. The present invention can provide multicast protocol support to various communication libraries and middleware, as shown in fig. 10. Specifically, the invention adds a new implementation of RDMA IB verbs and designs a multicast software package for multicast QP creation and data transmission; a multicast process can establish a QP for multicast. Multicast members exchange their QP information and begin handshaking. Once the multicast group is successfully established, the multicast software package ultimately invokes the RDMA primitives defined in the open-source libibverbs to transfer the data. The software modifications on the end hosts are transparent to upper-layer applications and require no RNIC or RDMA driver modifications.
The RDMA acceleration multicast device of the present invention comprises:
The initialization node unit, used for collecting the layer-4 states of other nodes, fitting the layer-4 states into an envelope data packet format, and sending the envelope data packets to the other nodes, wherein the envelope is a protocol of RDMA UD mode;
the multicast source is used for completing one-time data transmission through the existing RC;
the receiving end is used for receiving the multicast data, generating an ACK/NACK/CNP packet according to the existing RC logic, and forwarding the generated feedback to the intermediate node;
The intermediate node, used for receiving the envelope data packet format, establishing a multicast forwarding table and confirming multicast membership, copying multicast data, and forwarding the copied multicast data to a plurality of receiving ends through an optimal path; the specific forwarding node replaces the connection-related state in the routing table to match different QPs in different receiving ends so as to realize a virtual one-to-many connection; and when a plurality of feedback streams are input, feedback aggregation is executed first, and then the aggregated feedback is forwarded to the next-hop intermediate node.
The RDMA accelerated multicast equipment comprises:
One or more processors;
A storage device storing one or more programs;
when the program is executed by the processor, the processor implements the RDMA accelerated multicast method described above.
The storage medium of the present invention stores a computer program that is executed by a processor to perform the RDMA accelerated multicast method described above.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
- 1. RDMA accelerated multicast method, characterized by comprising:
Control plane multicast group registration: taking an RDMA UD mode protocol as an envelope; the initialization node in the multicast group collects the layer-4 state of other nodes, fits the layer-4 state of other nodes into an envelope data packet format, and sends the envelope data packet format back to other nodes for establishing a multicast forwarding table and for confirming multicast membership;
Data plane one-to-many data forwarding: the multicast member after the identity confirmation is used as a receiving end, the multicast source finishes one-time data transmission through the existing RC, the receiving end is used as an intermediate forwarding node, the multicast data is copied according to the multicast forwarding table, and the copied multicast data is forwarded to a plurality of other receiving ends through an optimal path; the connection related state in the routing table is replaced by the specific forwarding node to match different QPs in different receiving ends, so that virtual one-to-many connection is realized;
Data plane many-to-one feedback aggregation: after receiving multicast data, a receiving end generates an ACK/NACK/CNP packet according to the existing RC logic and forwards the generated feedback to a forwarding node; when the forwarding node has a plurality of feedback stream inputs, feedback aggregation is executed first, and then the aggregated feedback is forwarded to the next-hop forwarding node.
- 2. The RDMA-accelerated multicast method of claim 1, wherein the multicast forwarding table is extended during control plane multicast group registration:
The multicast group IP address is used as an index key, and a group level state and a port level state are indexed by the index key; the group level state contains statistical information of the multicast group; the port level state is formatted into an array with at most n entries, n being the port number of multicast members, and the array only contains ports contained in the multicast tree; each entry is assigned one of two types, connected or forwarded, where connected means that the port is directly connected to the receiving end node, forwarded means that the next hop is the other receiving end;
The destination IP is set as the unique multicast group IP, and the destination QPN is assigned any non-conflicting value;
After the QPs are established, the multicast members exchange QP information and register the forwarding routing table to the user state.
- 3. The RDMA-accelerated multicast method according to claim 2, wherein, in the data plane one-to-many data forwarding process, when a forwarding node receives a data packet sent by a multicast source, the forwarding node indexes an associated multicast forwarding table using a destination IP address in the data packet, then iterates all entries in the multicast forwarding table, and performs the following operations:
If the entry type is direct data forwarding: creating a data packet copy and forwarding through the port;
If the entry type is used to establish a multicast connection: creating a copy of the data packet, modifying its connection-related state, and forwarding through the port.
- 4. The RDMA-accelerated multicast method of claim 1, wherein in the data plane one-to-many data forwarding process, MR information is updated for each WRITE request:
Before submitting an actual WRITE request, the host application program sends an additional WRITE message, wherein the additional WRITE message contains the MR states of different receiving ends; after identifying the additional WRITE message, the MR information of the receiving ends is updated in the multicast forwarding table, and when the actual WRITE request is submitted, the updated MR information in the multicast forwarding table replaces the original MR information.
- 5. The RDMA-accelerated multicast method of claim 1, wherein in a data plane many-to-one feedback aggregation process, many-to-one ACK aggregation and NACK filtering are performed:
The multicast source receives the ACK only when all receiving ends receive the corresponding packets;
When any one of the receiving ends loses the packet, the multicast source receives NACK.
- 6. The RDMA-accelerated multicast method according to claim 1, wherein in the data plane many-to-one feedback aggregation process, when a forwarding node receives an ACK/NACK packet, an associated multicast forwarding table is found through the destination IP of the ACK/NACK packet, and the ACK packet and the NACK packet are processed;
The process of processing the ACK data packet comprises the following steps: the ACK-related state comprises group-level data and port-level data; if the PSN of the incoming ACK is greater than the old ack_psn, the ack_psn of that port is updated, and if the trigger condition is met, the GENERATE NEW ACK/NACK function is adopted to generate an aggregate ACK, wherein the aggregate ACK comprises the minimum ack_psn of the multicast record;
The process of processing the NACK data packet is: when a loss occurs, the receiving end generates a NACK packet to inform the multicast source; the NACK packet contains the expected PSN of the receiving end, and each NACK acknowledges all packets having a PSN smaller than the expected PSN.
- 7. RDMA accelerated multicast device, characterized in that it comprises:
The initialization node unit, used for collecting the layer-4 states of other nodes, fitting the layer-4 states into an envelope data packet format, and sending the envelope data packets to the other nodes, wherein the envelope is a protocol of RDMA UD mode;
The multicast source, used for completing one-time data transmission through the existing RC;
The receiving end, used for receiving the multicast data, generating an ACK/NACK/CNP packet according to the existing RC logic, and forwarding the generated feedback to the intermediate node;
The intermediate node, used for receiving the envelope data packet format, establishing a multicast forwarding table and confirming multicast membership, copying multicast data, and forwarding the copied multicast data to a plurality of receiving ends through an optimal path, the specific forwarding node replacing the connection-related state in the routing table to match different QPs in different receiving ends so as to realize a virtual one-to-many connection; and, when a plurality of feedback streams are input, executing feedback aggregation first, and then forwarding the aggregated feedback to the next-hop intermediate node.
- 8. RDMA accelerated multicast device, characterized by comprising:
One or more processors;
A storage device storing one or more programs;
The processor implements the method of claim 1 when the program is executed by the processor.
- 9. A storage medium storing a computer program which, when executed by a processor, implements the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224302.0A CN117812027B (en) | 2024-02-29 | 2024-02-29 | RDMA (remote direct memory access) acceleration multicast method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224302.0A CN117812027B (en) | 2024-02-29 | 2024-02-29 | RDMA (remote direct memory access) acceleration multicast method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117812027A CN117812027A (en) | 2024-04-02 |
CN117812027B true CN117812027B (en) | 2024-04-30 |
Family
ID=90434899
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410224302.0A Active CN117812027B (en) | 2024-02-29 | 2024-02-29 | RDMA (remote direct memory access) acceleration multicast method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117812027B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101242342A (en) * | 2007-02-05 | 2008-08-13 | 华为技术有限公司 | Multicast method and multicast route method |
CN114077507A (en) * | 2021-11-23 | 2022-02-22 | 南京大学 | Zero-copy serialization method applied to remote procedure call system |
CN117478503A (en) * | 2022-07-21 | 2024-01-30 | 华为技术有限公司 | Multicast configuration method and device |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101242342A (en) * | 2007-02-05 | 2008-08-13 | 华为技术有限公司 | Multicast method and multicast route method |
CN114077507A (en) * | 2021-11-23 | 2022-02-22 | 南京大学 | Zero-copy serialization method applied to remote procedure call system |
CN117478503A (en) * | 2022-07-21 | 2024-01-30 | 华为技术有限公司 | Multicast configuration method and device |
Also Published As
Publication number | Publication date |
---|---|
CN117812027A (en) | 2024-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1323264B1 (en) | Mechanism for completing messages in memory | |
US20220214934A1 (en) | System and method for facilitating hybrid message matching in a network interface controller (nic) | |
US10148581B2 (en) | End-to-end enhanced reliable datagram transport | |
US10521283B2 (en) | In-node aggregation and disaggregation of MPI alltoall and alltoallv collectives | |
US6865160B1 (en) | Broadcast tree determination in load balancing switch protocols | |
US10791054B2 (en) | Flow control and congestion management for acceleration components configured to accelerate a service | |
US6456597B1 (en) | Discovery of unknown MAC addresses using load balancing switch protocols | |
US7283476B2 (en) | Identity negotiation switch protocols | |
US10419329B2 (en) | Switch-based reliable multicast service | |
US11277350B2 (en) | Communication of a large message using multiple network interface controllers | |
WO2000072421A1 (en) | Reliable multi-unicast | |
US20210218808A1 (en) | Small Message Aggregation | |
EP3563535B1 (en) | Transmission of messages by acceleration components configured to accelerate a service | |
US6621829B1 (en) | Method and apparatus for the prioritization of control plane traffic in a router | |
Hoefler et al. | Data center ethernet and remote direct memory access: Issues at hyperscale | |
US20060133376A1 (en) | Multicast transmission protocol for fabric services | |
CN114024910B (en) | Extremely low-delay reliable communication system and method for financial transaction system | |
CN117812027B (en) | RDMA (remote direct memory access) acceleration multicast method, device, equipment and storage medium | |
CN117354253A (en) | Network congestion notification method, device and storage medium | |
Yu et al. | Scalable, High-performance NIC-based All-to-all Broadcast over Myrinet/GM | |
Li et al. | Gleam: An rdma-accelerated multicast protocol for datacenter networks | |
WO2024125098A1 (en) | Data transmission method and apparatus, and device and computer-readable storage medium | |
WO2024046151A1 (en) | Data stream processing method and related device | |
WO2023047567A1 (en) | Intermediate device, communication method, and program | |
Li et al. | Host-driven In-Network Aggregation on RDMA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |