
CN118093499B - Data transmission method, device, equipment and storage medium for remote memory access - Google Patents

Data transmission method, device, equipment and storage medium for remote memory access Download PDF

Info

Publication number
CN118093499B
CN118093499B (application number CN202410166037.5A)
Authority
CN
China
Prior art keywords
message
host
data
cache
transmitted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410166037.5A
Other languages
Chinese (zh)
Other versions
CN118093499A (en)
Inventor
张世明
杜剑峰
余鑫才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beigemeis Shenzhen Technology Co ltd
Original Assignee
Beigemeis Shenzhen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beigemeis Shenzhen Technology Co ltd filed Critical Beigemeis Shenzhen Technology Co ltd
Priority to CN202410166037.5A priority Critical patent/CN118093499B/en
Publication of CN118093499A publication Critical patent/CN118093499A/en
Application granted granted Critical
Publication of CN118093499B publication Critical patent/CN118093499B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/90Buffering arrangements
    • H04L49/9047Buffering arrangements including multiple buffers, e.g. buffer pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1081Address translation for peripheral access to main memory, e.g. direct memory access [DMA]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17331Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/30Peripheral units, e.g. input or output ports
    • H04L49/3036Shared queuing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/90Buffering arrangements
    • H04L49/9063Intermediate storage in different physical parts of a node or terminal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1012Design facilitation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/15Use in a specific computing environment
    • G06F2212/154Networked environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer And Data Communications (AREA)

Abstract


The present invention discloses a data transmission method, device, equipment and storage medium for remote memory access. The method includes: after determining that the message type of a message to be transmitted is a small message, communicating through the RDMA Send & Recv primitives, generating a corresponding work queue element and placing it in the send queue, and, once the small-message buffer area is confirmed to have remaining cache blocks or free cache, fetching the message to be transmitted from the main-memory address pointed to by the work queue element and sending it; after determining that the message type is a large message, communicating through the RDMA Read/Write primitives, confirming that the required capacity of the message is less than or equal to the remaining storage space, allocating capacity in the large-message buffer area accordingly, and splitting the message into multiple data packets for transmission; and, when the buffer area in the last-level cache module receives data, notifying the application through a cache mapping shared with the application so that the application processes the data.

Description

Data transmission method, device, equipment and storage medium for remote memory access
Technical Field
The present invention relates to the field of data transmission technologies, and in particular, to a method, an apparatus, a device, and a storage medium for data transmission with remote memory access.
Background
Data center applications increasingly rely on high-speed networks and require network communication with high throughput, low latency, and low CPU overhead to support large-scale communication. In particular, bandwidth-intensive applications such as distributed machine learning and cloud storage require more than 100 Gbps of network bandwidth between servers, while online services such as database online analytical processing require low latency to minimize query response time. At the same time, most applications want the network stack to impose low CPU overhead so that as many CPU cores as possible are reserved for computation.
Remote Direct Memory Access (RDMA) has become a popular high-speed network technology, mainly thanks to the high throughput, low latency, and low CPU overhead delivered by architectural innovations such as kernel bypass and transport offloading. However, despite this high performance and low CPU overhead, commercial RDMA Network Interface Cards (RNICs) still face host bandwidth contention, which manifests as follows. When the RNIC performs a Direct Memory Access (DMA) operation to store a received message in main memory, it contends for main-memory bandwidth with other computational flows on the CPU (e.g., in-memory data analysis, data replication, and garbage collection). Under this contention, the RNIC cannot obtain enough main-memory bandwidth to store received data packets in main memory in time, so the on-chip RNIC buffer fills with unprocessed packets and eventually overflows, forcing received but unprocessed packets to be dropped. The resulting data loss in turn triggers congestion control, reducing network throughput and significantly increasing latency.
The host bandwidth contention problem is further exacerbated as high-speed networks grow in speed and deployment scale. How to shield RDMA network communication from the impact of host bandwidth contention is therefore an urgent issue for RDMA network applications.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data transmission method, apparatus, device and storage medium for remote memory access.
A data transmission method for remote memory access, the method comprising:
S1. The second host needs to send a message to be transmitted to the first host.
S2. The second host determines the message type of the message to be transmitted, where the message type is either a large message or a small message.
S3. After the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.
S4. The first host checks whether the small-message buffer area has remaining cache blocks or free cache.
S5. If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host.
S6. If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, and S11 is executed.
S7. After the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.
S8. The first host checks whether the required capacity of the message to be transmitted fits in the remaining storage space.
S9. If the required capacity of the message to be transmitted is larger than the remaining storage space, the first host rejects the communication request of the second host.
S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, the second host splits the message to be transmitted into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives in the large-message buffer area of the first host.
S11. After the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data.
S12. If the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified; the application then fetches the data from main memory and processes it.
S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
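The admission logic of steps S1–S10 can be summarized in a minimal sketch. This is an illustrative model, not the patented implementation: all names (`Receiver`, `transmit`, the constants) are hypothetical, and it only mimics the bookkeeping of the small-message block path and the large-message capacity path.

```python
# Hypothetical sketch of the small/large message dispatch in S1-S10.
DATA_THRESHOLD = 4 * 1024    # 4 KB boundary between small and large messages
PACKET_LIMIT = 256 * 1024    # large messages are split into <=256 KB packets

class Receiver:
    """Models the first host's reserved last-level-cache buffer bookkeeping."""

    def __init__(self, small_blocks: int, large_capacity: int):
        self.small_blocks = small_blocks      # free 4 KB blocks for small messages
        self.large_capacity = large_capacity  # free bytes for large messages

    def accept_small(self) -> bool:
        # S4/S5: reject when no cache block or free cache remains
        if self.small_blocks == 0:
            return False
        self.small_blocks -= 1                # S6: consume one pre-posted block
        return True

    def accept_large(self, size: int) -> bool:
        # S8/S9: reject when the required capacity exceeds remaining space
        if size > self.large_capacity:
            return False
        self.large_capacity -= size           # S10: allocate capacity
        return True

def transmit(receiver: Receiver, message: bytes) -> bool:
    if len(message) <= DATA_THRESHOLD:        # S2: classify by the 4 KB threshold
        return receiver.accept_small()        # S3-S6: Send & Recv path
    if not receiver.accept_large(len(message)):
        return False
    # S10: segment into packets no larger than 256 KB (Read/Write path)
    packets = [message[i:i + PACKET_LIMIT]
               for i in range(0, len(message), PACKET_LIMIT)]
    return all(len(p) <= PACKET_LIMIT for p in packets)
```

Note that rejection here simply returns `False`; in the method itself the first host rejects the second host's communication request and the sender may retry later.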
In a specific embodiment, before S1 the method further includes:
the first host establishes RDMA communication with the second host and initializes the communication resources, including the queue pair numbers and data cache addresses of both parties;
the initialization of communication resources is accomplished through socket communication or an RDMA connection manager.
In the method, determining the message type of the message to be transmitted specifically includes: comparing the size of the message to be transmitted with a data threshold and determining the message type from the comparison result;
if the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message;
if the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message.
In a specific embodiment, the data threshold is set to 4 KB, and each work queue element in the shared receive queue points to a 4 KB cache block in the last-level cache module.
In a specific embodiment, the last-level cache module allocates a cache space for data reception under RDMA communication, and the small-message buffer area and the large-message buffer area share this cache space; cache space size = network bandwidth × data residence time threshold.
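The sizing formula above is easy to evaluate. A small sketch, using the 100 Gbps link speed and 200-microsecond residence threshold mentioned elsewhere in the text as illustrative inputs (the function name is hypothetical):

```python
def reserved_cache_bytes(bandwidth_gbps: float, dwell_us: float) -> float:
    """Cache space size = network bandwidth x data residence time threshold."""
    bytes_per_sec = bandwidth_gbps * 1e9 / 8   # convert Gbit/s to bytes/s
    return bytes_per_sec * dwell_us * 1e-6     # bytes arriving during one dwell period

# With 100 Gbps links and a 200 us threshold, the reserved LLC pool
# only needs to hold roughly 2.5 MB - well within last-level cache capacity.
size = reserved_cache_bytes(100, 200)
```

This shows why a small isolated LLC pool suffices: the pool only has to absorb the data that can arrive within one residence-time window.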
In a specific embodiment, the first host determines the required capacity of the message to be transmitted as follows: required capacity of the message to be transmitted = amount of data to be sent; the data residence time threshold is set to 200 microseconds.
In a specific embodiment, the second host splits the message to be transmitted into multiple data packets for transmission, specifically: the message to be transmitted is split into data packets of no more than 256 KB each for transmission.
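The segmentation step above can be sketched as a one-line slicing routine. This is an illustrative fragment (the function name is hypothetical), showing only the size-capped split and that reassembling the packets recovers the original message:

```python
PACKET_LIMIT = 256 * 1024  # the embodiment caps each packet at 256 KB

def segment(message: bytes, limit: int = PACKET_LIMIT) -> list[bytes]:
    """Split a large message into consecutive packets of at most `limit` bytes."""
    return [message[i:i + limit] for i in range(0, len(message), limit)]
```

For example, a 600 KB message becomes three packets (256 KB, 256 KB, 88 KB), each small enough to land in the pre-allocated large-message buffer area as it arrives.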
A data transmission device for remote memory access, the device comprising:
a request module, used to generate a communication request when the second host needs to send a message to be transmitted to the first host;
a message type determination module, used to determine the message type of the message to be transmitted, where the message type is either a large message or a small message;
a small-message transmission module: after the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, places it in the shared receive queue, and checks whether the small-message buffer area has remaining cache blocks or free cache; if it has neither, the communication request from the second host to the first host is rejected; if it has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host, and the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue;
a large-message transmission module: after the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives; the first host determines the required capacity of the message to be transmitted and checks whether it fits in the remaining storage space; if the required capacity is larger than the remaining storage space, the communication request from the second host to the first host is rejected; if it is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity and the second host splits the message into multiple data packets for transmission;
a transmission processing module: after the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data; if the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified, after which the application fetches the data from main memory and processes it; if the residence time is less than or equal to the threshold, processing is determined to be complete.
An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
S1. The second host needs to send a message to be transmitted to the first host.
S2. The second host determines the message type of the message to be transmitted, where the message type is either a large message or a small message.
S3. After the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.
S4. The first host checks whether the small-message buffer area has remaining cache blocks or free cache.
S5. If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host.
S6. If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, and S11 is executed.
S7. After the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.
S8. The first host checks whether the required capacity of the message to be transmitted fits in the remaining storage space.
S9. If the required capacity of the message to be transmitted is larger than the remaining storage space, the first host rejects the communication request of the second host.
S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, the second host splits the message to be transmitted into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives in the large-message buffer area of the first host.
S11. After the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data.
S12. If the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified; the application then fetches the data from main memory and processes it.
S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
S1. The second host needs to send a message to be transmitted to the first host.
S2. The second host determines the message type of the message to be transmitted, where the message type is either a large message or a small message.
S3. After the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.
S4. The first host checks whether the small-message buffer area has remaining cache blocks or free cache.
S5. If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host.
S6. If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, and S11 is executed.
S7. After the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.
S8. The first host checks whether the required capacity of the message to be transmitted fits in the remaining storage space.
S9. If the required capacity of the message to be transmitted is larger than the remaining storage space, the first host rejects the communication request of the second host.
S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, the second host splits the message to be transmitted into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives in the large-message buffer area of the first host.
S11. After the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data.
S12. If the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified; the application then fetches the data from main memory and processes it.
S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
The embodiments of the invention have the following beneficial effects:
The invention applies different RDMA operations and control methods to small and large messages. For small messages, RDMA Send/Recv is used and the shared receive queue mechanism aggregates the receive queues; for large messages, RDMA Read/Write is used and capacity is allocated according to the required size. The two message classes share the cache space of a common data receiving buffer pool, and monopolization of cache resources is avoided by setting a data usage threshold for each message class.
Drawings
To more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Wherein:
FIG. 1 is a flow chart of a method of data transfer for remote memory access in one embodiment;
FIG. 2 is a flow chart of a small message receiving process in a data transmission method of remote memory access in one embodiment;
FIG. 3 is a flow chart of a large message receiving process in a data transmission method of remote memory access in one embodiment;
fig. 4 is a block diagram of an electronic device in one embodiment.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The basic idea of the invention for solving the main-memory bandwidth contention problem is to avoid excessive use of slow main memory and instead use the CPU cache directly, i.e., to reserve a small isolated buffer pool in the Last-Level Cache (LLC). From a bandwidth perspective the CPU cache is more than sufficient for the RNIC, but only a very limited cache capacity can be used. If the data the RNIC receives is small enough and its lifecycle short enough, the CPU cache alone suffices; for data that is large or long-lived, a better solution is needed.
Specifically, the RNIC data receiving module removes main memory from the receiver's data path and reserves a small isolated buffer pool in the LLC for receiving message data from the network card. To improve the reuse of the reserved LLC, different RDMA operations and control methods are applied to small and large messages: small messages use RDMA Send/Recv operations and aggregate receive queues through the shared receive queue mechanism; large messages use RDMA Read/Write operations and allocate buffers according to message size and the data residence time threshold. Small and large messages share the cache pool, and monopolization of cache resources is avoided through the user-set data residence time threshold and the proportion of resident data in the cache space that must be processed.
In addition, to improve data processing efficiency and cache reuse, the invention manages the cache with a SLAB (Sequential Locality Allocation Buffer) algorithm to avoid internal fragmentation, uses multithreaded parallel data processing to improve processing efficiency, and uses pipelining to reclaim the cache at finer granularity. To maintain high network performance, when an application occupies the cache beyond a threshold, the data in the cache pool is copied to main memory so that the cache is released quickly.
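The overstay-offload policy described above (copy cache-pool data to main memory once it exceeds the residence threshold) can be sketched as follows. This is a toy model under stated assumptions: `CachePool`, its dictionaries, and the sweep loop are all illustrative stand-ins, not the patented mechanism.

```python
DWELL_THRESHOLD_S = 200e-6  # 200 us residence threshold from the embodiment

class CachePool:
    """Toy model of the reserved LLC pool with dwell-time-based offload."""

    def __init__(self):
        self.llc = {}          # message id -> (data, arrival timestamp)
        self.main_memory = {}  # stand-in for the main-memory receive buffer

    def receive(self, msg_id: int, data: bytes, now: float):
        """A message lands in the LLC pool and its arrival time is recorded."""
        self.llc[msg_id] = (data, now)

    def sweep(self, now: float):
        """Offload any message that has resided longer than the threshold,
        freeing its cache space quickly (the application then reads it
        from main memory instead)."""
        for msg_id, (data, arrived) in list(self.llc.items()):
            if now - arrived > DWELL_THRESHOLD_S:
                self.main_memory[msg_id] = data   # copy to main memory ...
                del self.llc[msg_id]              # ... and release the cache
```

The design choice this illustrates: fast consumers are served entirely from the LLC, while slow consumers fall back to main memory, so a stalled application cannot monopolize the small reserved pool.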
The invention can significantly improve communication throughput and reduce network latency and main-memory bandwidth occupation.
Experimental results show that the invention increases communication throughput to 2.11 times the baseline, reduces average network latency by 64.2%, and reduces main-memory bandwidth occupation by 96.4%.
In one embodiment, a method for data transfer for remote memory access is provided.
As shown in fig. 1, the data transmission method for remote memory access specifically includes the following steps:
s1, a second host needs to send a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
specifically, the size of the message to be transmitted is compared with a data threshold, and the message type is determined according to the comparison result:
if the size of the message to be transmitted is smaller than or equal to the data threshold, the message type is determined to be a small message;
if the size of the message to be transmitted is larger than the data threshold, the message type is determined to be a large message.
Illustratively, the data threshold is set to 4KB; each work queue element in the shared receive queue points to a 4KB-sized cache block in the last-level cache.
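As a minimal sketch of the classification rule above (Python is used purely for illustration; the 4KB threshold comes from the example, while the function name is an assumption):

```python
DATA_THRESHOLD = 4 * 1024  # 4 KB threshold from the example above

def classify_message(size_bytes: int) -> str:
    """Classify a message as 'small' or 'large' by comparing its size
    to the data threshold. Small messages are carried by RDMA SEND/RECV
    (step S3); large ones by RDMA READ/WRITE (step S7)."""
    return "small" if size_bytes <= DATA_THRESHOLD else "large"

print(classify_message(4096))   # boundary case: exactly 4 KB is still "small"
print(classify_message(65536))  # a 64 KB message is "large"
```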
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates a corresponding work queue element and puts it into a send queue; the first host takes a cache block from the small message buffer area in the final buffer module, generates a corresponding work queue element, and puts it into the shared receive queue;
specifically, a buffer space used for data reception under RDMA communication is allocated in the final buffer module, and the small message buffer area and the large message buffer area share this buffer space; the buffer size of the final buffer module = network bandwidth × data residence time threshold.
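The sizing rule above is an application of Little's law (buffered bytes = arrival rate × residence time). A hedged worked example, using the 200Gbps bandwidth and 200-microsecond residence threshold figures given later in this description:

```python
def reserved_llc_bytes(bandwidth_gbps: float, dwell_time_us: float) -> float:
    """Little's law: bytes resident in the pool = arrival rate * residence time."""
    bytes_per_second = bandwidth_gbps * 1e9 / 8  # convert Gbps to bytes/s
    return bytes_per_second * dwell_time_us * 1e-6

# A 200 Gbps link with a 200-microsecond residence threshold
size = reserved_llc_bytes(200, 200)
print(f"{size / 1e6:.0f} MB")  # 5 MB, matching the example in the text
```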
The first host takes out a buffer block from the small message buffer area in the final buffer module and generates corresponding work queue elements to be put into the shared receiving queue. The second host generates a corresponding work queue element request to place in the send queue, wherein the data threshold that distinguishes message types is set to 4KB. Each work queue element in the shared receive queue points to a 4KB sized cache block in the last level cache block.
The buffer memory in the final buffer memory module is determined according to the actual CPU structure. If the last level of cache of the CPU is an L3 cache, the cache is distributed in the L3 cache; if the last level of cache of the CPU is an L4 cache, the cache is allocated in the L4 cache.
The buffer space allocated in the final buffer module is dedicated to data reception under RDMA communication and cannot be used for other purposes, i.e., the buffer space is isolated. The small message buffer area and the large message buffer area share one block of buffer space, but each has a minimum occupancy threshold: when one side falls below its set threshold, the other side is refused further occupation of the buffer space. The size of the minimum occupancy threshold is determined by the request frequencies of small and large messages in actual traffic.
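A simplified sketch of the shared pool with per-class minimum occupancy guarantees described above; the accounting scheme, sizes, and names are assumptions for illustration, not the exact patented mechanism:

```python
class SharedCachePool:
    """Small and large messages share one pool, but each class keeps a
    minimum guaranteed share that the other class may not consume."""

    def __init__(self, total: int, min_small: int, min_large: int):
        self.total = total
        self.min_reserved = {"small": min_small, "large": min_large}
        self.used = {"small": 0, "large": 0}

    def try_allocate(self, msg_class: str, size: int) -> bool:
        other = "large" if msg_class == "small" else "small"
        free = self.total - self.used["small"] - self.used["large"]
        # Refuse if the allocation would eat into the other class's
        # still-unused guaranteed minimum share of the pool.
        guaranteed = max(0, self.min_reserved[other] - self.used[other])
        if size > free - guaranteed:
            return False
        self.used[msg_class] += size
        return True

pool = SharedCachePool(total=5 * 2**20, min_small=2**20, min_large=2**20)
print(pool.try_allocate("large", 4 * 2**20))  # leaves the small-message minimum: accepted
print(pool.try_allocate("large", 2**20))      # would invade the small minimum: refused
```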
The buffer space of the small message buffer area and the large message buffer area in the final buffer module is managed with the object-based SLAB algorithm, at a managed-object granularity of 4KB. The small message buffer area is pre-allocated a fixed number of cache blocks, the number of which is related to the set space threshold and the actual small-message request frequency. The SLAB (Sequential Locality Allocation Buffer) algorithm is an efficient memory-management algorithm used to improve the efficiency of memory allocation and release and to reduce memory fragmentation. The algorithm divides memory into cache blocks and selects a suitable block for allocation according to the size of the object, thereby avoiding frequent memory allocation and release operations and improving memory utilization.
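The SLAB-style management at 4KB object granularity can be sketched as a fixed-size block free list (a deliberate simplification; a real slab allocator also maintains multiple size classes and per-CPU caches):

```python
BLOCK = 4 * 1024  # 4 KB object granularity, as in the description

class SlabCache:
    """Fixed-size block allocator: pre-allocates N same-sized blocks,
    hands them out from a free list, and recycles them on release.
    Because every block has the same size, no internal fragmentation
    arises from mismatched allocation sizes."""

    def __init__(self, n_blocks: int):
        self.free_list = list(range(n_blocks))  # block indices stand in for memory

    def alloc(self):
        return self.free_list.pop() if self.free_list else None

    def free(self, block_id: int):
        self.free_list.append(block_id)

slab = SlabCache(n_blocks=4)
blocks = [slab.alloc() for _ in range(4)]
print(slab.alloc())              # pool exhausted: None (the request would be refused)
slab.free(blocks[0])
print(slab.alloc() is not None)  # a recycled block is immediately reusable
```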
For RDMA Send/Recv primitive communication, before data transmission, the receiver checks whether there is an idle buffer block in the small message buffer, and if so, generates a corresponding work queue element and notifies the opposite end to Send data.
In this primitive's communication, after each data packet arrives at the corresponding area of the buffer, the cache monitor immediately notifies the application through the cache mapping shared with the application to process the data; after the application finishes processing, it notifies the cache monitor through the shared cache mapping to release and reclaim the used cache for further use. If the data processing time exceeds the maximum residence time, the cache monitor copies the data into main memory, notifies the application, and releases and reclaims the cache occupied by the data for further use.
When the application processes data, multithreading and pipeline technology are used to process the data arriving in the cache. Data received by the data receiving cache pool in the LLC is managed at 4KB granularity using the SLAB algorithm; that is, each incoming data packet is divided into small packets of at most 4KB and then fed into the data processing pipeline. The pipeline comprises three stages (data storage, data processing, and data release), processed in parallel by multiple threads; the corresponding cache space is released as soon as one granule of data passes through all three stages, without waiting for the entire message to be processed by the application.
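The three-stage pipeline (data storage, data processing, data release) with 4KB granularity and per-granule cache release can be sketched as follows; queue-connected threads stand in for the stages, and all names are illustrative assumptions:

```python
import queue
import threading

BLOCK = 4 * 1024  # pipeline granularity from the description

def split(message: bytes):
    """Cut an incoming packet into chunks of at most 4 KB."""
    return [message[i:i + BLOCK] for i in range(0, len(message), BLOCK)]

def run_pipeline(message: bytes) -> int:
    """Three stages (store -> process -> release) connected by queues,
    each running in its own thread. A chunk's cache block is 'released'
    as soon as that chunk finishes, not when the whole message does.
    Returns the number of chunks released."""
    q1, q2 = queue.Queue(), queue.Queue()
    released = []

    def store():
        for chunk in split(message):
            q1.put(chunk)       # stage 1: data lands in its cache block
        q1.put(None)            # end-of-message sentinel

    def process():
        while (chunk := q1.get()) is not None:
            q2.put(len(chunk))  # stage 2: application-level processing
        q2.put(None)

    def release():
        while (n := q2.get()) is not None:
            released.append(n)  # stage 3: block returned to the pool at once

    threads = [threading.Thread(target=f) for f in (store, process, release)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(released)

print(run_pipeline(b"x" * (10 * 1024)))  # a 10 KB packet yields 3 chunks (4+4+2 KB)
```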
S4: the first host judges whether residual cache blocks or idle caches exist in the small message cache region;
S5: if the small message buffer area does not have the residual buffer blocks and the idle buffer, the first host refuses the communication request of the second host;
S6: if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue, and S11 is executed;
Specifically, the data address translation and permission protection functions are provided by the memory translation table (MTT, Memory Translation Table) and the memory protection table (MPT, Memory Protection Table) respectively; the two tables are stored in the network card storage medium, and huge-page, physical-segment, and traffic-locality techniques are adopted to reduce the table sizes.
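A toy illustration of how an MTT/MPT pair provides translation and protection; the page size, table layout, and addresses here are all hypothetical (the real tables are compressed and reside in the network card storage medium):

```python
PAGE = 4 * 1024  # assumed page granularity for this sketch

MTT = {0x1000 // PAGE: 0x9000}             # virtual page -> physical base
MPT = {0x1000 // PAGE: {"read", "write"}}  # per-page access rights

def translate(vaddr: int, op: str) -> int:
    """Translate a virtual address and enforce permissions, as the
    MTT/MPT pair does for incoming RDMA accesses (simplified)."""
    page, offset = vaddr // PAGE, vaddr % PAGE
    if op not in MPT.get(page, set()):
        raise PermissionError(f"{op} not permitted at {hex(vaddr)}")
    return MTT[page] + offset

print(hex(translate(0x1234, "read")))  # 0x1234 falls in the mapped page -> 0x9234
```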
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
Specifically, the required capacity of a message to be transmitted=the amount of data to be transmitted×the data dwell time threshold, which is set to 200 microseconds.
For communication of RDMA Read/Write primitives, the receiver may calculate the buffer capacity required to receive the data prior to the data transfer. And if the residual buffer space of the large message buffer area is larger than or equal to the required capacity, allocating the buffer space and notifying the opposite terminal to send data. Before the message to be sent is sent, the message is segmented into data packets for transmission according to the preset maximum transmission unit size.
In this primitive's communication, after each data packet arrives at the corresponding area of the buffer, the cache monitor immediately notifies the application through the cache mapping shared with the application to process the data; after the application finishes processing, it notifies the cache monitor through the shared cache mapping to release and reclaim the used cache for further use. If the data processing time exceeds the maximum residence time, the cache monitor copies the data into main memory, notifies the application, and releases and reclaims the cache occupied by the data for further use.
When the application processes data, multithreading and pipeline technology are used to process the data arriving in the cache. Data received by the data receiving cache pool in the LLC is managed at 4KB granularity using the SLAB algorithm; that is, each incoming data packet is divided into small packets of at most 4KB and then fed into the data processing pipeline. The pipeline comprises three stages (data storage, data processing, and data release), processed in parallel by multiple threads; the corresponding cache space is released as soon as one granule of data passes through all three stages, without waiting for the entire message to be processed by the application.
S8, the first host judges whether the required capacity of the message to be transmitted meets the residual storage space or not;
S9, if the required capacity of the message to be transmitted is larger than the residual storage space, the first host refuses the communication request of the second host;
S10, if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, the second host segments the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet reaches the large message buffer area of the first host;
Specifically, the message to be sent is split into a plurality of data packets with the size not larger than 256KB for sending.
The first host allocates the required capacity from the large message buffer area, and the second host segments the message to be sent into a plurality of data packets to be sent, wherein each data packet is not more than 256KB.
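The 256KB segmentation of step S10 can be sketched as follows (the function name is an assumption; the 256KB cap comes from the description):

```python
MAX_PACKET = 256 * 1024  # per-packet size cap from the description

def segment(message_len: int) -> list[int]:
    """Split a large message into packet sizes of at most 256 KB (S10)."""
    sizes, left = [], message_len
    while left > 0:
        sizes.append(min(left, MAX_PACKET))
        left -= sizes[-1]
    return sizes

print(segment(600 * 1024))  # a 600 KB message -> packets of 256 KB, 256 KB, 88 KB
```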
The next step is performed immediately on the large message buffer of each data packet reaching the first host without waiting for the complete transmission of the message.
S11, after the buffer area in the final buffer module receives data, the buffer area informs the application program to process the data through the shared buffer mapping with the application program;
specifically, the data is pipelined through multithreading and pipelining.
The pipeline processing of data by multithreading and pipeline technology specifically means that data is segmented at 4KB granularity in the final buffer module, and the segmented data passes through three stages of data storage, data processing, and data release, processed in parallel by multiple threads to form a data processing pipeline. The cache for one granule of data is released and reclaimed by the cache monitoring module immediately after the three stages, and reclamation of the cache space does not wait for the entire message to be processed by the application.
S12, if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, unloading the data of the message into a main storage receiving buffer area, informing the application program of related information, and then taking out the data from the main storage and processing the data by the application program;
Specifically, for the unloaded data, the buffer monitoring module recovers and releases the corresponding buffer space.
S13, if the residence time of the data in the final cache module is smaller than or equal to the data residence time threshold value, determining that the processing is completed.
Specifically, if the residence time of the data in the final buffer module is less than or equal to the data residence time threshold, data leaving the pipeline is considered processed without waiting for a processing result, and the cache monitoring module reclaims and releases the corresponding cache space.
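The dwell-time decision of steps S12/S13 can be sketched as follows (the function and its arguments are illustrative assumptions; in the invention this decision is made by the cache monitoring module):

```python
DWELL_THRESHOLD_US = 200  # residence-time threshold from the example

def on_block_check(residence_us: float, block: bytes, main_memory: list) -> str:
    """S12/S13 in miniature: data that overstays the threshold is
    offloaded to the main-memory receive buffer (and its LLC block is
    then reclaimed); otherwise in-cache processing is deemed complete."""
    if residence_us > DWELL_THRESHOLD_US:
        main_memory.append(block)  # copy out, then notify the application
        return "offloaded"
    return "done"

mm = []
print(on_block_check(350.0, b"stale", mm))  # over threshold -> offloaded
print(on_block_check(50.0, b"fresh", mm))   # under threshold -> done
print(len(mm))                              # only the stale block was copied out
```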
Further, before S1, the method further includes:
The first host establishes RDMA communication with the second host, and initializes communication resources including queue pair numbers and data cache addresses of the two parties;
the initializing communication resources is accomplished through a socket communication or RDMA connection manager.
Specifically, the initialized communication resource includes information such as queue pair numbers and data cache addresses of both parties. The data of both sides are sent out from the main memory, and the final stage buffer memory receives the data.
Illustratively, assuming that the second host needs to transfer information to the first host, initialization of the communication resources may be accomplished through a socket communication or RDMA connection manager.
The queue pair context is stored in the network card storage medium, the queue pair scheduler is responsible for scheduling ready queue pairs, and the work queue element mover is responsible for batchwise fetching work queue elements of the ready queue pairs from the main memory each time.
In order to improve the statistical multiplexing level of LLC, the invention adopts different RDMA operation and control methods for small message and large message respectively. For small messages, RDMA SEND/RECV is selected and the shared receive queue mechanism is used to aggregate the receive queues; for large messages, RDMA READ/WRITE is used, required caches are allocated according to the product of the message size and the data residence time threshold, the RDMA READ/WRITE and the RDMA READ/WRITE share the cache space of a data receiving cache pool, and the use threshold of the two types of messages is set to avoid the exclusive use of cache resources.
Compared with the Data Direct I/O (DDIO, Data Direct Input Output) technology, in which the cache is not isolated, the invention improves hit throughput to 1.6 times on average, improves network throughput by 5% on average, and reduces main-memory bandwidth occupancy by 5% on average. Meanwhile, the buffer used for data reception is sized according to Little's law to resolve the mismatch between the small reserved LLC and the inefficient reservation of RDMA queue pairs: only a small block of cache is needed to support high bandwidth requirements.
For example: for a bandwidth of 200Gbps and a data residence time threshold of 200 microseconds, the invention needs only 5MB of buffer space to support the requirement.
For better understanding of specific details and features of the method for remote memory access data transmission according to the present invention when receiving small messages, the method is described in detail with reference to fig. 2, and includes steps 1 to 6:
Step 1: preparing communication resources of both parties, including constructing a queue pair, acquiring information such as queue pair numbers, data cache addresses and the like of the opposite ends, and establishing connection by both parties by using a reliable connector; after the establishment is completed, the second host applies for a data sending request to the first host, informs about to Send a small message, and the second host sends data by adopting an RDMA Send primitive.
Step 2: the first host receives the data sending request of the second host and, judging that the request will send a small message, adopts the RDMA Recv primitive to receive the data. The application in the first host checks whether the small message buffer area in the LLC has a free cache block or free cache; if not, the application refuses to generate the corresponding work queue element receive request; if a free cache block or free cache is present, a corresponding receive cache block is allocated, and the application generates a corresponding work queue element pointing to the cache block and stores it in the shared receive queue to await extraction by the network card program.
Step 3: after the first host is ready to receive the cache block and work queue element of the data, the second host is notified that the data can be sent. After receiving the message, the second host replies an Acknowledgement (ACK), and the first host polls the completion queue to acknowledge the receipt of the message.
Step 4: the second host places the corresponding sending request into a sending queue in the form of a work queue element, wherein the work queue element points to data information to be sent in the main memory, and the data information comprises an address, a length and the like. When the program on the network card obtains the work queue element from the main memory and executes the work queue element, the data to be sent is transmitted to the network card receiving queue of the first host according to the information stored in the work queue element, the first host takes out the corresponding work queue element from the shared receiving queue, and the data is transmitted from the receiving queue to the buffer block distributed in the small message buffer area in advance.
Step 5: upon arrival of the transferred data at the corresponding cache block in the last level cache, the application processes the data immediately through multithreading and pipelining.
Step 6: immediately after the data leaves the pipeline, the application notifies the cache via the shared cache map to monitor the reclaimed cache block for further use.
For a better understanding of the specific details and features of the present invention in a remote memory access data transfer method when receiving large messages, the details are described in conjunction with fig. 3, including RDMA Read and RDMA WRITE:
the RDMA Read comprises the steps 1 to 5:
Step 1: preparing the communication resources of both parties, including constructing a queue pair, acquiring the queue pair number, the data address and other information of the opposite end, and informing the second host of the need of executing RDMA Read operation by the first host, so as to prepare the Read data. The second host places the data to be read in the registered area of the main memory, informs the first host that the data to be read is ready, and informs the stored main memory information and authority.
Step 2: the first host checks whether the large message buffer in the final buffer has enough capacity to receive the data to be Read, and if so, generates a corresponding Read work queue element request to put in the send queue. When the program on the network card accesses the work queue element from the host and executes the work queue element, the first host sends a Read request to the second host, and the request contains information such as a target host address, a data size, access rights and the like of data to be Read.
Step 3: after receiving the RDMA Read request sent by the first host, the second host acquires data from the corresponding area in the main memory according to the information in the request, divides the data into data packets according to the set packet size, sends the data packets to a network card receiving queue of the first host, and when each data packet arrives at the network card of the first host, the first host replies to confirm that the ACK represents that the data is received.
Step 4: when the network card of the first host delivers the transmitted data packets from the receive queue (RX) to the corresponding buffer region of the last-level cache at a granularity of 4KB, the application processes the data immediately through multithreading and pipeline technology as soon as it arrives.
Step 5: immediately after the data leaves the pipeline, the application notifies the cache via the shared cache map to monitor the reclaimed cache for further use.
The RDMA WRITE includes steps 6 to 11:
step 6: preparing the communication resources of both parties, including constructing a queue pair and acquiring the information such as the queue pair number, the data address and the like of the opposite end.
Step 7: the second host sends out a Write request, wherein the request contains relevant information of data to be written.
Step 8: after the first host receives the request, it checks, according to the provided information, whether the large message buffer area of the final buffer in the first host has enough capacity to receive the data to be written; after finding sufficient cache, the first host replies with an acknowledgement ACK to notify the second host to write the data. The reply contains the cache address, access rights, and other information for the data to be written.
Step 9: the second host prepares the data to be written from the host and places the data in the registered host area. The network card of the second host computer cuts the data to be written into according to the set packet size and then sends the cut data to the network card receiving queue of the first host computer.
Step 10: when the network card of the first host sends the transmitted data packet from the receiving queue to the corresponding buffer area of the final buffer with the granularity of 4KB, the application program immediately processes the data through the multithreading and pipeline technology as soon as the data arrives.
Step 11: immediately after the data leaves the pipeline, the application notifies the cache via the shared cache map to monitor the reclaimed cache for further use.
In one embodiment, there is provided a telecommunications device, the device comprising:
the request module is used for generating a communication request when the second host needs to send a message to be transmitted to the first host;
the message type determining module is used for determining the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
The small message transmission module is used for determining that the message type of the message to be transmitted is small message, and the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates corresponding work queue elements and puts the work queue elements into a sending queue; the first host computer takes out a buffer block from a small message buffer area in a final buffer module, generates corresponding work queue elements, puts the work queue elements into a shared receiving queue, and judges whether residual buffer blocks exist or idle buffer exists in the small message buffer area; if the small message buffer area does not have the residual buffer blocks and the idle buffer, rejecting the communication request from the second host to the first host; if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue;
The large message transmission module is used for determining that the message type of the message to be transmitted is large, the second host and the first host communicate through RDMA Read/Write primitives, the first host determines the required capacity of the message to be transmitted, and judges whether the required capacity of the message to be transmitted meets the residual storage space or not; if the required capacity of the message to be transmitted is larger than the residual storage space, rejecting the communication request from the second host to the first host; if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, and the second host segments the message to be transmitted into a plurality of data packets for transmission;
The transmission processing module is used for notifying the application program to process the data through the shared cache mapping with the application program after the buffer area in the final buffer module receives the data; unloading the data of the message to a main storage receiving buffer area and informing the application program of related information if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, and then taking the data from the main storage and processing the data by the application program; if the residence time of the data in the last cache block is less than the data residence time threshold, the process is determined to be complete.
In one embodiment, an electronic device is presented comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
s1, a second host needs to send a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates corresponding work queue elements and puts the work queue elements into a sending queue; the first host takes out a buffer block from a small message buffer area in the final buffer module and generates corresponding work queue elements to be put into a shared receiving queue;
s4: the first host judges whether residual cache blocks or idle caches exist in the small message cache region;
S5: if the small message buffer area does not have the residual buffer blocks and the idle buffer, the first host refuses the communication request of the second host;
S6: if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue, and S11 is executed;
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
S8, the first host judges whether the required capacity of the message to be transmitted meets the residual storage space or not;
S9, if the required capacity of the message to be transmitted is larger than the residual storage space, the first host refuses the communication request of the second host;
S10, if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, the second host segments the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet reaches the large message buffer area of the first host;
S11, after the buffer area in the final buffer module receives data, the buffer area informs the application program to process the data through the shared buffer mapping with the application program;
S12, if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, unloading the data of the message into a main storage receiving buffer area, informing the application program of related information, and then taking out the data from the main storage and processing the data by the application program;
s13, if the residence time of the data in the final cache module is smaller than or equal to the data residence time threshold value, determining that the processing is completed.
As shown in fig. 4, the electronic device 500 includes a processor 501, and the processor 501 may execute a computer program in a readable storage medium 503 or a computer program loaded from a storage unit 508 into the readable storage medium 503 to implement the functions set forth in the present invention.
The PCIe bus 505 in the device 500 is connected to the internal bus 504, and each component in the device 500 interacts with these two buses. The components connected to the PCIe bus include an input unit 506 such as a keyboard or mouse; an output unit 507 such as a display or speaker; a storage unit 508 such as a magnetic disk or optical disk; and a communication unit 509 comprising a network card, a network card storage area, and a processing unit. The network card storage medium stores a computer program, and the processing unit on the network card implements the embodiments of the present invention by running it. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet. The communication unit 509 may be implemented in computer hardware, firmware, software, and/or combinations thereof, such as digital electronic circuitry, integrated circuits, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
s1, a second host needs to send a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates corresponding work queue elements and puts the work queue elements into a sending queue; the first host takes out a buffer block from a small message buffer area in the final buffer module and generates corresponding work queue elements to be put into a shared receiving queue;
s4: the first host judges whether residual cache blocks or idle caches exist in the small message cache region;
S5: if the small message buffer area does not have the residual buffer blocks and the idle buffer, the first host refuses the communication request of the second host;
S6: if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue, and S11 is executed;
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
S8, the first host judges whether the required capacity of the message to be transmitted meets the residual storage space or not;
S9, if the required capacity of the message to be transmitted is larger than the residual storage space, the first host refuses the communication request of the second host;
S10, if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, the second host segments the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet reaches the large message buffer area of the first host;
S11, after the buffer area in the final buffer module receives data, the buffer area informs the application program to process the data through the shared buffer mapping with the application program;
S12, if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, unloading the data of the message into a main storage receiving buffer area, informing the application program of related information, and then taking out the data from the main storage and processing the data by the application program;
s13, if the residence time of the data in the final cache module is smaller than or equal to the data residence time threshold value, determining that the processing is completed.
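The admission logic of steps S2 through S10 can be sketched as a small simulation. This is a hypothetical Python model for illustration only, not part of the patent: the class, method names, and the 4 KB threshold and buffer sizes are assumptions (the 4 KB value follows claim 3), and the real mechanism operates on RDMA work queue elements rather than plain function calls.

```python
# Hypothetical model of the receive-side admission logic (steps S2-S10).
# All names and sizes are illustrative assumptions, not the patent's API.

DATA_THRESHOLD = 4 * 1024  # small/large boundary; claim 3 sets it to 4 KB


class LastLevelCache:
    def __init__(self, small_blocks, large_capacity):
        self.small_blocks = small_blocks  # free 4 KB cache blocks (small area)
        self.large_free = large_capacity  # free bytes in the large message area

    def admit(self, msg_size):
        """Return the chosen transfer path, or None if the request is rejected."""
        if msg_size <= DATA_THRESHOLD:     # S2: small message -> Send & Recv path
            if self.small_blocks == 0:     # S4/S5: no remaining block, no free cache
                return None                # reject the communication request
            self.small_blocks -= 1         # S6: allocate one receive cache block
            return "send_recv"
        # S7-S10: large message -> Read/Write path, check remaining capacity
        if msg_size > self.large_free:     # S9: required capacity too large
            return None
        self.large_free -= msg_size        # S10: allocate the required capacity
        return "read_write"


llc = LastLevelCache(small_blocks=2, large_capacity=1 * 1024 * 1024)
print(llc.admit(1024))             # small message, a block is available
print(llc.admit(64 * 1024))        # large message within remaining capacity
print(llc.admit(2 * 1024 * 1024))  # large message exceeding capacity: rejected
```

Running the example admits the first two messages over the Send & Recv and Read/Write paths respectively and rejects the third for lack of large-area capacity, mirroring the reject conditions of S5 and S9.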
The computer-readable storage medium includes a receiving storage medium, a readable storage medium, or a network card storage medium. The receiving storage medium stores data sent by the network card. The readable storage medium and the network card storage medium store computer programs that implement the above data transmission method based on remote direct memory access.
In particular, the receiving storage medium may be any entity or recording medium capable of receiving data sent by a network card, including Static Random Access Memory (SRAM), embedded Dynamic Random Access Memory (eDRAM), and the like. The readable storage medium may be any entity or recording medium capable of carrying the computer program instructions, including a USB flash disk, a removable hard disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), and the like. The network card storage medium may be any entity or recording medium capable of carrying the computer program instructions and integrated with a network card, including a Field Programmable Gate Array (FPGA), Static Random Access Memory (SRAM), Flash Memory, Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application; their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be determined by the appended claims.

Claims (9)

1. A data transmission method for remote memory access, applied to a remote communication apparatus, wherein the remote communication apparatus manages a last-level cache module; a cache area is configured in the last-level cache module, the cache area comprising a small message cache area and a large message cache area, the small message cache area and the large message cache area sharing the cache space of the cache area, and the small message cache area being divided into a preset number of cache blocks of a preset data threshold size for management; the method comprising:

S1. The second host needs to send a message to be transmitted to the first host;

S2. The second host determines the message type of the message to be transmitted, the message type including large messages and small messages; the determining of the message type specifically comprises: comparing the size of the message to be transmitted with the data threshold and determining the message type according to the comparison result; if the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message; if the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message; after the message type is determined to be a small message, step S3 is executed; after the message type is determined to be a large message, step S7 is executed;

S3. The second host communicates with the first host through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue, the work queue element pointing to the data to be sent in main memory, including its address and length; the first host generates a corresponding work queue element and places it in the shared receive queue, each work queue element in the shared receive queue pointing to a 4 KB cache block in the last-level cache module;

S4. The first host determines whether any remaining cache blocks or free cache space exist in the small message cache area;

S5. If neither remaining cache blocks nor free cache space exist in the small message cache area, the first host rejects the communication request of the second host;

S6. If remaining cache blocks or free cache space exist in the small message cache area, a receive cache block is allocated; the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host, and the first host transfers the message to be transmitted into the prepared cache space at the address pointed to by the work queue element in the shared receive queue; then S11 is executed;

S7. The second host communicates with the first host through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;

S8. The first host determines whether the remaining storage space satisfies the required capacity of the message to be transmitted, the remaining storage space being the free cache space in the large message cache area;

S9. If the required capacity of the message to be transmitted is greater than the remaining storage space, the first host rejects the communication request of the second host;

S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large message cache area according to the required capacity, and the second host splits the message to be sent into multiple data packets for transmission; S11 is executed as soon as each data packet arrives in the large message cache area of the first host;

S11. After the cache area in the last-level cache module receives data, the cache area notifies the application program to process the data through the shared cache mapping with the application program;

S12. If the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded to the receive buffer in main memory and the application program is informed of the related information; the application program subsequently fetches the data from main memory and processes it;

S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, it is determined that processing is completed.

2. The data transmission method for remote memory access according to claim 1, wherein before S1 the method further comprises: the first host establishing RDMA communication with the second host and initializing communication resources, including the queue pair numbers and data cache addresses of both parties; the initializing of communication resources is completed through socket communication or an RDMA connection manager.

3. The data transmission method for remote memory access according to claim 1, wherein the data threshold is set to 4 KB, and each work queue element in the shared receive queue points to a 4 KB cache block in the last-level cache module.

4. The data transmission method for remote memory access according to claim 1, wherein cache space used for data reception under RDMA communication is allocated in the last-level cache module, and the small message cache area and the large message cache area share this cache space; the size of the cache space allocated by the last-level cache module = network bandwidth × data residence time threshold × proportion of resident data that needs to be processed in the cache space.

5. The data transmission method for remote memory access according to claim 1, wherein the first host determining the required capacity of the message to be transmitted specifically comprises: required capacity of the message to be transmitted = amount of data to be sent × data residence time threshold, the data residence time threshold being set to 200 microseconds.

6. The data transmission method for remote memory access according to claim 1, wherein the second host splitting the message to be sent into multiple data packets for transmission specifically comprises: splitting the message to be sent into several data packets each no larger than 256 KB.

7. A remote communication apparatus, wherein the apparatus manages a last-level cache module; a cache area is configured in the last-level cache module, the cache area comprising a small message cache area and a large message cache area, the small message cache area and the large message cache area sharing the cache space of the cache area, and the small message cache area being divided into a preset number of cache blocks of a preset data threshold size for management; the apparatus comprising:

a request module, configured to generate a communication request when the second host needs to send a message to be transmitted to the first host;

a message type determination module, configured to determine the message type of the message to be transmitted, the message type including large messages and small messages; the determining of the message type specifically comprises: comparing the size of the message to be transmitted with the data threshold and determining the message type according to the comparison result; if the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message; if the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message;

a small message transmission module, configured to respond after the message type of the message to be transmitted is determined to be a small message, including: the second host communicating with the first host through the RDMA Send & Recv primitives; the second host generating a corresponding work queue element and placing it in the send queue, the work queue element pointing to the data to be sent in main memory, including its address and length; the first host generating a corresponding work queue element and placing it in the shared receive queue, each work queue element in the shared receive queue pointing to a 4 KB cache block in the last-level cache module; determining whether any remaining cache blocks or free cache space exist in the small message cache area; if neither remaining cache blocks nor free cache space exist in the small message cache area, rejecting the communication request from the second host to the first host; if remaining cache blocks or free cache space exist in the small message cache area, allocating a receive cache block; the second host fetching the message to be transmitted from the main-memory address pointed to by the work queue element and sending it to the first host, and the first host transferring the message to be transmitted into the prepared cache space at the address pointed to by the work queue element in the shared receive queue;

a large message transmission module, configured to respond after the message type of the message to be transmitted is determined to be a large message, including: the second host communicating with the first host through the RDMA Read/Write primitives; the first host determining the required capacity of the message to be transmitted and determining whether the remaining storage space satisfies the required capacity, the remaining storage space being the free cache space in the large message cache area; if the required capacity of the message to be transmitted is greater than the remaining storage space, rejecting the communication request from the second host to the first host; if the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocating capacity in the large message cache area according to the required capacity, and the second host splitting the message to be sent into multiple data packets for transmission;

a transmission processing module, configured to: after the cache area in the last-level cache module receives data, notify the application program through the shared cache mapping with the application program to process the data; if the residence time of the data in the last-level cache module is greater than the data residence time threshold, offload the data of the message to the receive buffer in main memory and inform the application program of the related information, the application program subsequently fetching the data from main memory and processing it; if the residence time of the data in the last-level cache module is less than the data residence time threshold, determine that processing is completed.

8. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.
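The sizing rule in claim 4 (allocated cache space = network bandwidth × data residence time threshold × proportion of resident data processed in the cache) can be evaluated numerically. This is an illustrative sketch, not part of the claims: the function name and the example parameter values (a 100 Gb/s link, the 200 microsecond threshold from claim 5, and a 50% resident fraction) are assumptions.

```python
# Illustrative computation of the cache sizing rule in claim 4.
# Function name and example parameters are assumptions for demonstration.

def llc_rx_cache_size(bandwidth_bps, residence_threshold_s, resident_fraction):
    # claim 4: cache space = bandwidth * residence threshold * resident fraction
    # bandwidth is given in bits per second, so divide by 8 to obtain bytes
    return bandwidth_bps / 8 * residence_threshold_s * resident_fraction


# 100 Gb/s link, 200 us residence threshold, half the data handled in-cache:
size = llc_rx_cache_size(100e9, 200e-6, 0.5)
print(size)  # roughly 1.25e6 bytes, i.e. about 1.25 MB of last-level cache
```

The result suggests why the scheme is feasible: even at 100 Gb/s, a sub-millisecond residence threshold keeps the reserved last-level-cache footprint in the low-megabyte range.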
CN202410166037.5A 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access Active CN118093499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410166037.5A CN118093499B (en) 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access


Publications (2)

Publication Number Publication Date
CN118093499A CN118093499A (en) 2024-05-28
CN118093499B true CN118093499B (en) 2024-11-19

Family

ID=91162757


Country Status (1)

Country Link
CN (1) CN118093499B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679688A (en) * 2013-12-02 2015-06-03 华为技术有限公司 Data access method, device and system
CN116471242A (en) * 2023-05-23 2023-07-21 江苏华创微系统有限公司 RDMA-based transmitting end, RDMA-based receiving end, data transmission system and data transmission method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102546612B (en) * 2011-12-23 2015-07-08 华中科技大学 Remote procedure call implementation method based on remote direct memory access (RDMA) protocol in user mode
US9842083B2 (en) * 2015-05-18 2017-12-12 Red Hat Israel, Ltd. Using completion queues for RDMA event detection
CN109491809A (en) * 2018-11-12 2019-03-19 西安微电子技术研究所 A kind of communication means reducing high-speed bus delay
CN115858160B (en) * 2022-12-07 2023-12-05 江苏为是科技有限公司 Remote direct memory access virtualized resource allocation method and device and storage medium
CN116501549A (en) * 2023-05-06 2023-07-28 上海英方软件股份有限公司 Data caching method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN118093499A (en) 2024-05-28

Similar Documents

Publication Publication Date Title
US20240171507A1 (en) System and method for facilitating efficient utilization of an output buffer in a network interface controller (nic)
CN103763173B (en) Data transmission method and calculate node
WO2020019743A1 (en) Traffic control method and device
CN113485822A (en) Memory management method, system, client, server and storage medium
CN108965148B (en) Processor and message processing method
CN113891396B (en) Data packet processing method and device, computer equipment and storage medium
CN115964319A (en) Data processing method for remote direct memory access and related product
CN114201421A (en) A data stream processing method, storage control node and readable storage medium
CN111857992B (en) Method and device for allocating linear resources in Radosgw module
WO2017032152A1 (en) Method for writing data into storage device and storage device
US20240348686A1 (en) Remote Data Access Method and Apparatus
WO2022017475A1 (en) Data access method and related device
WO2022143774A1 (en) Data access method and related device
CN115509644B (en) Computing power unloading method, device, electronic device and storage medium
CN111756586B (en) A priority queue-based fair bandwidth allocation method, switch and readable storage medium in a data center network
CN109951540B (en) Data acquisition method and device based on content timeliness and electronic equipment
CN118093499B (en) Data transmission method, device, equipment and storage medium for remote memory access
CN114500403A (en) Data processing method and device and computer readable storage medium
CN114691382A (en) RDMA-based communication method, node, system and medium
CN115412502B (en) Network port expansion and message rapid equalization processing method
CN114995748A (en) Request processing method and device
CN115905042A (en) Data processing method and related equipment
CN115174484A (en) RDMA (remote direct memory Access) -based data transmission method, device, equipment and storage medium
US11188394B2 (en) Technologies for synchronizing triggered operations
CN113438274A (en) Data transmission method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant