
CN118093499B - Data transmission method, device, equipment and storage medium for remote memory access - Google Patents

Data transmission method, device, equipment and storage medium for remote memory access Download PDF

Info

Publication number
CN118093499B
CN118093499B (application number CN202410166037.5A)
Authority
CN
China
Prior art keywords
message
host
data
cache
transmitted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410166037.5A
Other languages
Chinese (zh)
Other versions
CN118093499A (en)
Inventor
张世明
杜剑峰
余鑫才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beigemeis Shenzhen Technology Co ltd
Original Assignee
Beigemeis Shenzhen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beigemeis Shenzhen Technology Co ltd filed Critical Beigemeis Shenzhen Technology Co ltd
Priority to CN202410166037.5A priority Critical patent/CN118093499B/en
Publication of CN118093499A publication Critical patent/CN118093499A/en
Application granted granted Critical
Publication of CN118093499B publication Critical patent/CN118093499B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/90Buffering arrangements
    • H04L49/9047Buffering arrangements including multiple buffers, e.g. buffer pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1081Address translation for peripheral access to main memory, e.g. direct memory access [DMA]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17331Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/30Peripheral units, e.g. input or output ports
    • H04L49/3036Shared queuing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/90Buffering arrangements
    • H04L49/9063Intermediate storage in different physical parts of a node or terminal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1012Design facilitation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/15Use in a specific computing environment
    • G06F2212/154Networked environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer And Data Communications (AREA)

Abstract


The present invention discloses a data transmission method, device, equipment and storage medium for remote memory access. The method includes: after determining that the message type of a message to be transmitted is a small message, communicating through the RDMA Send & Recv primitives, generating a corresponding work queue element and placing it in the send queue, and, once the small-message buffer area is confirmed to have remaining cache blocks or free cache, fetching the message to be transmitted from the main-memory address pointed to by the work queue element and sending it; after determining that the message type is a large message, communicating through the RDMA Read/Write primitives, confirming that the required capacity of the message is less than or equal to the remaining storage space, allocating capacity in the large-message buffer area accordingly, and splitting the message into multiple data packets for transmission; and, when the buffer area in the last-level cache module receives data, notifying the application through a cache mapping shared with the application so that the application processes the data.

Description

Data transmission method, device, equipment and storage medium for remote memory access
Technical Field
The present invention relates to the field of data transmission technologies, and in particular, to a method, an apparatus, a device, and a storage medium for data transmission with remote memory access.
Background
Data center applications increasingly rely on high-speed networks and require network communication with high throughput, low latency, and low CPU overhead to support large-scale communication. In particular, bandwidth-intensive applications such as distributed machine learning and cloud storage require more than 100 Gbps of network bandwidth between servers, while online services such as database online analytical processing require low latency to minimize query response time. At the same time, most applications want the network stack to impose low CPU overhead so that as many CPU cores as possible are reserved for computation.
Remote Direct Memory Access (RDMA) has become a popular high-speed network technology, mainly thanks to the high throughput, low latency, and low CPU overhead delivered by architectural innovations such as kernel bypass and transport offloading. However, despite this high performance and low CPU overhead, commercial RDMA Network Interface Cards (RNICs) still face host bandwidth contention, which manifests as follows. When the RNIC performs a Direct Memory Access (DMA) operation to store a received message in main memory, it contends for main-memory bandwidth with other computational flows on the CPU (e.g., in-memory data analysis, data replication, and garbage collection). Under this contention, the RNIC cannot obtain enough main-memory bandwidth to store received data packets in main memory in time, so the on-chip RNIC buffer fills with unprocessed packets and eventually overflows, forcing received but unprocessed packets to be dropped. The resulting data loss in turn triggers congestion control, reducing network throughput and significantly increasing latency.
The host bandwidth contention problem is further exacerbated as high-speed networks grow in speed and deployment scale. How to shield RDMA network communication from the impact of host bandwidth contention is therefore an urgent issue for RDMA network applications.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data transmission method, apparatus, device and storage medium for remote memory access.
A data transmission method for remote memory access, the method comprising:
S1. The second host needs to send a message to be transmitted to the first host.
S2. The second host determines the message type of the message to be transmitted, where the message type is either a large message or a small message.
S3. After the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.
S4. The first host checks whether the small-message buffer area has remaining cache blocks or free cache.
S5. If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host.
S6. If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, and S11 is executed.
S7. After the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.
S8. The first host checks whether the required capacity of the message to be transmitted fits in the remaining storage space.
S9. If the required capacity of the message to be transmitted is larger than the remaining storage space, the first host rejects the communication request of the second host.
S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, the second host splits the message to be transmitted into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives in the large-message buffer area of the first host.
S11. After the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data.
S12. If the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified; the application then fetches the data from main memory and processes it.
S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
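The admission logic of steps S1–S10 can be summarized in a minimal sketch. This is an illustrative model, not the patented implementation: all names (`Receiver`, `transmit`, the constants) are hypothetical, and it only mimics the bookkeeping of the small-message block path and the large-message capacity path.

```python
# Hypothetical sketch of the small/large message dispatch in S1-S10.
DATA_THRESHOLD = 4 * 1024    # 4 KB boundary between small and large messages
PACKET_LIMIT = 256 * 1024    # large messages are split into <=256 KB packets

class Receiver:
    """Models the first host's reserved last-level-cache buffer bookkeeping."""

    def __init__(self, small_blocks: int, large_capacity: int):
        self.small_blocks = small_blocks      # free 4 KB blocks for small messages
        self.large_capacity = large_capacity  # free bytes for large messages

    def accept_small(self) -> bool:
        # S4/S5: reject when no cache block or free cache remains
        if self.small_blocks == 0:
            return False
        self.small_blocks -= 1                # S6: consume one pre-posted block
        return True

    def accept_large(self, size: int) -> bool:
        # S8/S9: reject when the required capacity exceeds remaining space
        if size > self.large_capacity:
            return False
        self.large_capacity -= size           # S10: allocate capacity
        return True

def transmit(receiver: Receiver, message: bytes) -> bool:
    if len(message) <= DATA_THRESHOLD:        # S2: classify by the 4 KB threshold
        return receiver.accept_small()        # S3-S6: Send & Recv path
    if not receiver.accept_large(len(message)):
        return False
    # S10: segment into packets no larger than 256 KB (Read/Write path)
    packets = [message[i:i + PACKET_LIMIT]
               for i in range(0, len(message), PACKET_LIMIT)]
    return all(len(p) <= PACKET_LIMIT for p in packets)
```

Note that rejection here simply returns `False`; in the method itself the first host rejects the second host's communication request and the sender may retry later.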
In a specific embodiment, before S1 the method further includes:
the first host establishes RDMA communication with the second host and initializes the communication resources, including the queue pair numbers and data cache addresses of both parties;
the initialization of communication resources is accomplished through socket communication or an RDMA connection manager.
In the method, determining the message type of the message to be transmitted specifically includes: comparing the size of the message to be transmitted with a data threshold and determining the message type from the comparison result;
if the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message;
if the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message.
In a specific embodiment, the data threshold is set to 4 KB, and each work queue element in the shared receive queue points to a 4 KB cache block in the last-level cache module.
In a specific embodiment, the last-level cache module allocates a cache space for data reception under RDMA communication, and the small-message buffer area and the large-message buffer area share this cache space; cache space size = network bandwidth × data residence time threshold.
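The sizing formula above is easy to evaluate. A small sketch, using the 100 Gbps link speed and 200-microsecond residence threshold mentioned elsewhere in the text as illustrative inputs (the function name is hypothetical):

```python
def reserved_cache_bytes(bandwidth_gbps: float, dwell_us: float) -> float:
    """Cache space size = network bandwidth x data residence time threshold."""
    bytes_per_sec = bandwidth_gbps * 1e9 / 8   # convert Gbit/s to bytes/s
    return bytes_per_sec * dwell_us * 1e-6     # bytes arriving during one dwell period

# With 100 Gbps links and a 200 us threshold, the reserved LLC pool
# only needs to hold roughly 2.5 MB - well within last-level cache capacity.
size = reserved_cache_bytes(100, 200)
```

This shows why a small isolated LLC pool suffices: the pool only has to absorb the data that can arrive within one residence-time window.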
In a specific embodiment, the first host determines the required capacity of the message to be transmitted as follows: required capacity of the message to be transmitted = amount of data to be sent; the data residence time threshold is set to 200 microseconds.
In a specific embodiment, the second host splits the message to be transmitted into multiple data packets for transmission, specifically: the message to be transmitted is split into data packets of no more than 256 KB each for transmission.
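The segmentation step above can be sketched as a one-line slicing routine. This is an illustrative fragment (the function name is hypothetical), showing only the size-capped split and that reassembling the packets recovers the original message:

```python
PACKET_LIMIT = 256 * 1024  # the embodiment caps each packet at 256 KB

def segment(message: bytes, limit: int = PACKET_LIMIT) -> list[bytes]:
    """Split a large message into consecutive packets of at most `limit` bytes."""
    return [message[i:i + limit] for i in range(0, len(message), limit)]
```

For example, a 600 KB message becomes three packets (256 KB, 256 KB, 88 KB), each small enough to land in the pre-allocated large-message buffer area as it arrives.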
A data transmission device for remote memory access, the device comprising:
a request module, used to generate a communication request when the second host needs to send a message to be transmitted to the first host;
a message type determination module, used to determine the message type of the message to be transmitted, where the message type is either a large message or a small message;
a small-message transmission module: after the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, places it in the shared receive queue, and checks whether the small-message buffer area has remaining cache blocks or free cache; if it has neither, the communication request from the second host to the first host is rejected; if it has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host, and the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue;
a large-message transmission module: after the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives; the first host determines the required capacity of the message to be transmitted and checks whether it fits in the remaining storage space; if the required capacity is larger than the remaining storage space, the communication request from the second host to the first host is rejected; if it is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity and the second host splits the message into multiple data packets for transmission;
a transmission processing module: after the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data; if the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified, after which the application fetches the data from main memory and processes it; if the residence time is less than or equal to the threshold, processing is determined to be complete.
An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
S1. The second host needs to send a message to be transmitted to the first host.
S2. The second host determines the message type of the message to be transmitted, where the message type is either a large message or a small message.
S3. After the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.
S4. The first host checks whether the small-message buffer area has remaining cache blocks or free cache.
S5. If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host.
S6. If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, and S11 is executed.
S7. After the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.
S8. The first host checks whether the required capacity of the message to be transmitted fits in the remaining storage space.
S9. If the required capacity of the message to be transmitted is larger than the remaining storage space, the first host rejects the communication request of the second host.
S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, the second host splits the message to be transmitted into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives in the large-message buffer area of the first host.
S11. After the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data.
S12. If the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified; the application then fetches the data from main memory and processes it.
S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
S1. The second host needs to send a message to be transmitted to the first host.
S2. The second host determines the message type of the message to be transmitted, where the message type is either a large message or a small message.
S3. After the message type is determined to be a small message, the second host and the first host communicate through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue; the first host takes a cache block from the small-message buffer area in the last-level cache module, generates a corresponding work queue element, and places it in the shared receive queue.
S4. The first host checks whether the small-message buffer area has remaining cache blocks or free cache.
S5. If the small-message buffer area has neither remaining cache blocks nor free cache, the first host rejects the communication request of the second host.
S6. If the small-message buffer area has remaining cache blocks or free cache, the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host; the first host writes the message into the prepared cache space at the address pointed to by the work queue element in the shared receive queue, and S11 is executed.
S7. After the message type is determined to be a large message, the second host and the first host communicate through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted.
S8. The first host checks whether the required capacity of the message to be transmitted fits in the remaining storage space.
S9. If the required capacity of the message to be transmitted is larger than the remaining storage space, the first host rejects the communication request of the second host.
S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large-message buffer area according to the required capacity, the second host splits the message to be transmitted into multiple data packets for transmission, and S11 is executed as soon as each data packet arrives in the large-message buffer area of the first host.
S11. After the buffer area in the last-level cache module receives data, it notifies the application through the cache mapping shared with the application so that the application processes the data.
S12. If the residence time of the data in the last-level cache module exceeds the data residence time threshold, the message data is offloaded to the main-memory receive buffer and the application is notified; the application then fetches the data from main memory and processes it.
S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, processing is determined to be complete.
The embodiments of the invention have the following beneficial effects:
The invention applies different RDMA operations and control methods to small and large messages. For small messages, RDMA Send/Recv is used and the shared receive queue mechanism aggregates the receive queues; for large messages, RDMA Read/Write is used and capacity is allocated according to the required size. The two message classes share the cache space of a common data receiving buffer pool, and monopolization of cache resources is avoided by setting a data usage threshold for each message class.
Drawings
To more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Wherein:
FIG. 1 is a flow chart of a method of data transfer for remote memory access in one embodiment;
FIG. 2 is a flow chart of a small message receiving process in a data transmission method of remote memory access in one embodiment;
FIG. 3 is a flow chart of a large message receiving process in a data transmission method of remote memory access in one embodiment;
fig. 4 is a block diagram of an electronic device in one embodiment.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The basic idea of the invention for solving the main-memory bandwidth contention problem is to avoid excessive use of slow main memory and instead use the CPU cache directly, i.e., to reserve a small isolated buffer pool in the Last-Level Cache (LLC). From a bandwidth perspective the CPU cache is more than sufficient for the RNIC, but only a very limited cache capacity can be used. If the data the RNIC receives is small enough and its lifecycle short enough, the CPU cache alone suffices; for data that is large or long-lived, a better solution is needed.
Specifically, the RNIC data receiving module removes main memory from the receiver's data path and reserves a small isolated buffer pool in the LLC for receiving message data from the network card. To improve the reuse of the reserved LLC, different RDMA operations and control methods are applied to small and large messages: small messages use RDMA Send/Recv operations and aggregate receive queues through the shared receive queue mechanism; large messages use RDMA Read/Write operations and allocate buffers according to message size and the data residence time threshold. Small and large messages share the cache pool, and monopolization of cache resources is avoided through the user-set data residence time threshold and the proportion of resident data in the cache space that must be processed.
In addition, to improve data processing efficiency and cache reuse, the invention manages the cache with a SLAB (Sequential Locality Allocation Buffer) algorithm to avoid internal fragmentation, uses multithreaded parallel data processing to improve processing efficiency, and uses pipelining to reclaim the cache at finer granularity. To maintain high network performance, when an application occupies the cache beyond a threshold, the data in the cache pool is copied to main memory so that the cache is released quickly.
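The overstay-offload policy described above (copy cache-pool data to main memory once it exceeds the residence threshold) can be sketched as follows. This is a toy model under stated assumptions: `CachePool`, its dictionaries, and the sweep loop are all illustrative stand-ins, not the patented mechanism.

```python
DWELL_THRESHOLD_S = 200e-6  # 200 us residence threshold from the embodiment

class CachePool:
    """Toy model of the reserved LLC pool with dwell-time-based offload."""

    def __init__(self):
        self.llc = {}          # message id -> (data, arrival timestamp)
        self.main_memory = {}  # stand-in for the main-memory receive buffer

    def receive(self, msg_id: int, data: bytes, now: float):
        """A message lands in the LLC pool and its arrival time is recorded."""
        self.llc[msg_id] = (data, now)

    def sweep(self, now: float):
        """Offload any message that has resided longer than the threshold,
        freeing its cache space quickly (the application then reads it
        from main memory instead)."""
        for msg_id, (data, arrived) in list(self.llc.items()):
            if now - arrived > DWELL_THRESHOLD_S:
                self.main_memory[msg_id] = data   # copy to main memory ...
                del self.llc[msg_id]              # ... and release the cache
```

The design choice this illustrates: fast consumers are served entirely from the LLC, while slow consumers fall back to main memory, so a stalled application cannot monopolize the small reserved pool.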
The invention can significantly improve communication throughput and reduce network latency and main-memory bandwidth occupation.
Experimental results show that the invention increases communication throughput to 2.11 times the baseline, reduces average network latency by 64.2%, and reduces main-memory bandwidth occupation by 96.4%.
In one embodiment, a method for data transfer for remote memory access is provided.
As shown in fig. 1, the data transmission method for remote memory access specifically includes the following steps:
s1, a second host needs to send a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
specifically, the size of the message to be transmitted is compared with a data threshold, and the message type is determined according to the comparison result:
if the size of the message to be transmitted is smaller than or equal to the data threshold, the message type is determined to be a small message;
if the size of the message to be transmitted is larger than the data threshold, the message type is determined to be a large message.
Illustratively, the data threshold is set to 4KB; each work queue element in the shared receive queue points to a 4KB-sized cache block in the last-level cache.
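As a minimal sketch of the classification rule above (Python is used purely for illustration; the 4KB threshold comes from the example, while the function name is an assumption):

```python
DATA_THRESHOLD = 4 * 1024  # 4 KB threshold from the example above

def classify_message(size_bytes: int) -> str:
    """Classify a message as 'small' or 'large' by comparing its size
    to the data threshold. Small messages are carried by RDMA SEND/RECV
    (step S3); large ones by RDMA READ/WRITE (step S7)."""
    return "small" if size_bytes <= DATA_THRESHOLD else "large"

print(classify_message(4096))   # boundary case: exactly 4 KB is still "small"
print(classify_message(65536))  # a 64 KB message is "large"
```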
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates a corresponding work queue element and puts it into a send queue; the first host takes a cache block from the small message buffer area in the final buffer module, generates a corresponding work queue element, and puts it into the shared receive queue;
specifically, a buffer space used for data reception under RDMA communication is allocated in the final buffer module, and the small message buffer area and the large message buffer area share this buffer space; the buffer size of the final buffer module = network bandwidth × data residence time threshold.
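The sizing rule above is an application of Little's law (buffered bytes = arrival rate × residence time). A hedged worked example, using the 200Gbps bandwidth and 200-microsecond residence threshold figures given later in this description:

```python
def reserved_llc_bytes(bandwidth_gbps: float, dwell_time_us: float) -> float:
    """Little's law: bytes resident in the pool = arrival rate * residence time."""
    bytes_per_second = bandwidth_gbps * 1e9 / 8  # convert Gbps to bytes/s
    return bytes_per_second * dwell_time_us * 1e-6

# A 200 Gbps link with a 200-microsecond residence threshold
size = reserved_llc_bytes(200, 200)
print(f"{size / 1e6:.0f} MB")  # 5 MB, matching the example in the text
```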
The first host takes out a buffer block from the small message buffer area in the final buffer module and generates corresponding work queue elements to be put into the shared receiving queue. The second host generates a corresponding work queue element request to place in the send queue, wherein the data threshold that distinguishes message types is set to 4KB. Each work queue element in the shared receive queue points to a 4KB sized cache block in the last level cache block.
The buffer memory in the final buffer memory module is determined according to the actual CPU structure. If the last level of cache of the CPU is an L3 cache, the cache is distributed in the L3 cache; if the last level of cache of the CPU is an L4 cache, the cache is allocated in the L4 cache.
The buffer space allocated in the final buffer module is dedicated to data reception under RDMA communication and cannot be used for other purposes, i.e., the buffer space is isolated. The small message buffer area and the large message buffer area share one block of buffer space, but each has a minimum occupancy threshold: when one side falls below its set threshold, the other side is refused further occupation of the buffer space. The size of the minimum occupancy threshold is determined by the request frequencies of small and large messages in actual traffic.
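A simplified sketch of the shared pool with per-class minimum occupancy guarantees described above; the accounting scheme, sizes, and names are assumptions for illustration, not the exact patented mechanism:

```python
class SharedCachePool:
    """Small and large messages share one pool, but each class keeps a
    minimum guaranteed share that the other class may not consume."""

    def __init__(self, total: int, min_small: int, min_large: int):
        self.total = total
        self.min_reserved = {"small": min_small, "large": min_large}
        self.used = {"small": 0, "large": 0}

    def try_allocate(self, msg_class: str, size: int) -> bool:
        other = "large" if msg_class == "small" else "small"
        free = self.total - self.used["small"] - self.used["large"]
        # Refuse if the allocation would eat into the other class's
        # still-unused guaranteed minimum share of the pool.
        guaranteed = max(0, self.min_reserved[other] - self.used[other])
        if size > free - guaranteed:
            return False
        self.used[msg_class] += size
        return True

pool = SharedCachePool(total=5 * 2**20, min_small=2**20, min_large=2**20)
print(pool.try_allocate("large", 4 * 2**20))  # leaves the small-message minimum: accepted
print(pool.try_allocate("large", 2**20))      # would invade the small minimum: refused
```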
The buffer space of the small message buffer area and the large message buffer area in the final buffer module is managed with the object-based SLAB algorithm, at a managed-object granularity of 4KB. The small message buffer area is pre-allocated a fixed number of cache blocks, the number of which is related to the set space threshold and the actual small-message request frequency. The SLAB (Sequential Locality Allocation Buffer) algorithm is an efficient memory-management algorithm used to improve the efficiency of memory allocation and release and to reduce memory fragmentation. The algorithm divides memory into cache blocks and selects a suitable block for allocation according to the size of the object, thereby avoiding frequent memory allocation and release operations and improving memory utilization.
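The SLAB-style management at 4KB object granularity can be sketched as a fixed-size block free list (a deliberate simplification; a real slab allocator also maintains multiple size classes and per-CPU caches):

```python
BLOCK = 4 * 1024  # 4 KB object granularity, as in the description

class SlabCache:
    """Fixed-size block allocator: pre-allocates N same-sized blocks,
    hands them out from a free list, and recycles them on release.
    Because every block has the same size, no internal fragmentation
    arises from mismatched allocation sizes."""

    def __init__(self, n_blocks: int):
        self.free_list = list(range(n_blocks))  # block indices stand in for memory

    def alloc(self):
        return self.free_list.pop() if self.free_list else None

    def free(self, block_id: int):
        self.free_list.append(block_id)

slab = SlabCache(n_blocks=4)
blocks = [slab.alloc() for _ in range(4)]
print(slab.alloc())              # pool exhausted: None (the request would be refused)
slab.free(blocks[0])
print(slab.alloc() is not None)  # a recycled block is immediately reusable
```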
For RDMA Send/Recv primitive communication, before data transmission, the receiver checks whether there is an idle buffer block in the small message buffer, and if so, generates a corresponding work queue element and notifies the opposite end to Send data.
In this primitive's communication, after each data packet arrives at the corresponding area of the buffer, the cache monitor immediately notifies the application through the cache mapping shared with the application to process the data; after the application finishes processing, it notifies the cache monitor through the shared cache mapping to release and reclaim the used cache for further use. If the data processing time exceeds the maximum residence time, the cache monitor copies the data into main memory, notifies the application, and releases and reclaims the cache occupied by the data for further use.
When the application processes data, multithreading and pipeline technology are used to process the data arriving in the cache. Data received by the data receiving cache pool in the LLC is managed at 4KB granularity using the SLAB algorithm; that is, each incoming data packet is divided into small packets of at most 4KB and then fed into the data processing pipeline. The pipeline comprises three stages (data storage, data processing, and data release), processed in parallel by multiple threads; the corresponding cache space is released as soon as one granule of data passes through all three stages, without waiting for the entire message to be processed by the application.
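The three-stage pipeline (data storage, data processing, data release) with 4KB granularity and per-granule cache release can be sketched as follows; queue-connected threads stand in for the stages, and all names are illustrative assumptions:

```python
import queue
import threading

BLOCK = 4 * 1024  # pipeline granularity from the description

def split(message: bytes):
    """Cut an incoming packet into chunks of at most 4 KB."""
    return [message[i:i + BLOCK] for i in range(0, len(message), BLOCK)]

def run_pipeline(message: bytes) -> int:
    """Three stages (store -> process -> release) connected by queues,
    each running in its own thread. A chunk's cache block is 'released'
    as soon as that chunk finishes, not when the whole message does.
    Returns the number of chunks released."""
    q1, q2 = queue.Queue(), queue.Queue()
    released = []

    def store():
        for chunk in split(message):
            q1.put(chunk)       # stage 1: data lands in its cache block
        q1.put(None)            # end-of-message sentinel

    def process():
        while (chunk := q1.get()) is not None:
            q2.put(len(chunk))  # stage 2: application-level processing
        q2.put(None)

    def release():
        while (n := q2.get()) is not None:
            released.append(n)  # stage 3: block returned to the pool at once

    threads = [threading.Thread(target=f) for f in (store, process, release)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(released)

print(run_pipeline(b"x" * (10 * 1024)))  # a 10 KB packet yields 3 chunks (4+4+2 KB)
```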
S4: the first host judges whether residual cache blocks or idle caches exist in the small message cache region;
S5: if the small message buffer area does not have the residual buffer blocks and the idle buffer, the first host refuses the communication request of the second host;
S6: if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue, and S11 is executed;
Specifically, the data address translation and permission protection functions are provided by the memory translation table (MTT, Memory Translation Table) and the memory protection table (MPT, Memory Protection Table) respectively; the two tables are stored in the network card storage medium, and huge-page, physical-segment, and traffic-locality techniques are adopted to reduce the table sizes.
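A toy illustration of how an MTT/MPT pair provides translation and protection; the page size, table layout, and addresses here are all hypothetical (the real tables are compressed and reside in the network card storage medium):

```python
PAGE = 4 * 1024  # assumed page granularity for this sketch

MTT = {0x1000 // PAGE: 0x9000}             # virtual page -> physical base
MPT = {0x1000 // PAGE: {"read", "write"}}  # per-page access rights

def translate(vaddr: int, op: str) -> int:
    """Translate a virtual address and enforce permissions, as the
    MTT/MPT pair does for incoming RDMA accesses (simplified)."""
    page, offset = vaddr // PAGE, vaddr % PAGE
    if op not in MPT.get(page, set()):
        raise PermissionError(f"{op} not permitted at {hex(vaddr)}")
    return MTT[page] + offset

print(hex(translate(0x1234, "read")))  # 0x1234 falls in the mapped page -> 0x9234
```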
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
Specifically, the required capacity of a message to be transmitted=the amount of data to be transmitted×the data dwell time threshold, which is set to 200 microseconds.
For communication of RDMA Read/Write primitives, the receiver may calculate the buffer capacity required to receive the data prior to the data transfer. And if the residual buffer space of the large message buffer area is larger than or equal to the required capacity, allocating the buffer space and notifying the opposite terminal to send data. Before the message to be sent is sent, the message is segmented into data packets for transmission according to the preset maximum transmission unit size.
In this primitive's communication, after each data packet arrives at the corresponding area of the buffer, the cache monitor immediately notifies the application through the cache mapping shared with the application to process the data; after the application finishes processing, it notifies the cache monitor through the shared cache mapping to release and reclaim the used cache for further use. If the data processing time exceeds the maximum residence time, the cache monitor copies the data into main memory, notifies the application, and releases and reclaims the cache occupied by the data for further use.
When the application processes data, multithreading and pipeline technology are used to process the data arriving in the cache. Data received by the data receiving cache pool in the LLC is managed at 4KB granularity using the SLAB algorithm; that is, each incoming data packet is divided into small packets of at most 4KB and then fed into the data processing pipeline. The pipeline comprises three stages (data storage, data processing, and data release), processed in parallel by multiple threads; the corresponding cache space is released as soon as one granule of data passes through all three stages, without waiting for the entire message to be processed by the application.
S8, the first host judges whether the required capacity of the message to be transmitted meets the residual storage space or not;
S9, if the required capacity of the message to be transmitted is larger than the residual storage space, the first host refuses the communication request of the second host;
S10, if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, the second host segments the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet reaches the large message buffer area of the first host;
Specifically, the message to be sent is split into a plurality of data packets with the size not larger than 256KB for sending.
The first host allocates the required capacity from the large message buffer area, and the second host segments the message to be sent into a plurality of data packets to be sent, wherein each data packet is not more than 256KB.
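The 256KB segmentation of step S10 can be sketched as follows (the function name is an assumption; the 256KB cap comes from the description):

```python
MAX_PACKET = 256 * 1024  # per-packet size cap from the description

def segment(message_len: int) -> list[int]:
    """Split a large message into packet sizes of at most 256 KB (S10)."""
    sizes, left = [], message_len
    while left > 0:
        sizes.append(min(left, MAX_PACKET))
        left -= sizes[-1]
    return sizes

print(segment(600 * 1024))  # a 600 KB message -> packets of 256 KB, 256 KB, 88 KB
```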
The next step is performed immediately on the large message buffer of each data packet reaching the first host without waiting for the complete transmission of the message.
S11, after the buffer area in the final buffer module receives data, the buffer area informs the application program to process the data through the shared buffer mapping with the application program;
specifically, the data is pipelined through multithreading and pipelining.
The pipeline processing of data by multithreading and pipeline technology specifically means that data is segmented at 4KB granularity in the final buffer module, and the segmented data passes through three stages of data storage, data processing, and data release, processed in parallel by multiple threads to form a data processing pipeline. The cache for one granule of data is released and reclaimed by the cache monitoring module immediately after the three stages, and reclamation of the cache space does not wait for the entire message to be processed by the application.
S12, if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, unloading the data of the message into a main storage receiving buffer area, informing the application program of related information, and then taking out the data from the main storage and processing the data by the application program;
Specifically, for the unloaded data, the buffer monitoring module recovers and releases the corresponding buffer space.
S13, if the residence time of the data in the final cache module is smaller than or equal to the data residence time threshold value, determining that the processing is completed.
Specifically, if the residence time of the data in the final buffer module is less than or equal to the data residence time threshold, data leaving the pipeline is considered processed without waiting for a processing result, and the cache monitoring module reclaims and releases the corresponding cache space.
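The dwell-time decision of steps S12/S13 can be sketched as follows (the function and its arguments are illustrative assumptions; in the invention this decision is made by the cache monitoring module):

```python
DWELL_THRESHOLD_US = 200  # residence-time threshold from the example

def on_block_check(residence_us: float, block: bytes, main_memory: list) -> str:
    """S12/S13 in miniature: data that overstays the threshold is
    offloaded to the main-memory receive buffer (and its LLC block is
    then reclaimed); otherwise in-cache processing is deemed complete."""
    if residence_us > DWELL_THRESHOLD_US:
        main_memory.append(block)  # copy out, then notify the application
        return "offloaded"
    return "done"

mm = []
print(on_block_check(350.0, b"stale", mm))  # over threshold -> offloaded
print(on_block_check(50.0, b"fresh", mm))   # under threshold -> done
print(len(mm))                              # only the stale block was copied out
```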
Further, before S1, the method further includes:
The first host establishes RDMA communication with the second host, and initializes communication resources including queue pair numbers and data cache addresses of the two parties;
the initializing communication resources is accomplished through a socket communication or RDMA connection manager.
Specifically, the initialized communication resource includes information such as queue pair numbers and data cache addresses of both parties. The data of both sides are sent out from the main memory, and the final stage buffer memory receives the data.
Illustratively, assuming that the second host needs to transfer information to the first host, initialization of the communication resources may be accomplished through a socket communication or RDMA connection manager.
The queue pair context is stored in the network card storage medium, the queue pair scheduler is responsible for scheduling ready queue pairs, and the work queue element mover is responsible for batchwise fetching work queue elements of the ready queue pairs from the main memory each time.
In order to improve the statistical multiplexing level of LLC, the invention adopts different RDMA operation and control methods for small message and large message respectively. For small messages, RDMA SEND/RECV is selected and the shared receive queue mechanism is used to aggregate the receive queues; for large messages, RDMA READ/WRITE is used, required caches are allocated according to the product of the message size and the data residence time threshold, the RDMA READ/WRITE and the RDMA READ/WRITE share the cache space of a data receiving cache pool, and the use threshold of the two types of messages is set to avoid the exclusive use of cache resources.
Compared with the Data Direct I/O (DDIO, Data Direct Input Output) technology, in which the cache is not isolated, the invention improves hit throughput to 1.6 times on average, improves network throughput by 5% on average, and reduces main-memory bandwidth occupancy by 5% on average. Meanwhile, the buffer used for data reception is sized according to Little's law to resolve the mismatch between the small reserved LLC and the inefficient reservation of RDMA queue pairs: only a small block of cache is needed to support high bandwidth requirements.
For example: for a bandwidth of 200Gbps and a data residence time threshold of 200 microseconds, the invention needs only 5MB of buffer space to support the requirement.
For better understanding of specific details and features of the method for remote memory access data transmission according to the present invention when receiving small messages, the method is described in detail with reference to fig. 2, and includes steps 1 to 6:
Step 1: preparing communication resources of both parties, including constructing a queue pair, acquiring information such as queue pair numbers, data cache addresses and the like of the opposite ends, and establishing connection by both parties by using a reliable connector; after the establishment is completed, the second host applies for a data sending request to the first host, informs about to Send a small message, and the second host sends data by adopting an RDMA Send primitive.
Step 2: the first host receives the data sending request of the second host and, judging that the request will send a small message, adopts the RDMA Recv primitive to receive the data. The application in the first host checks whether the small message buffer area in the LLC has a free cache block or free cache; if not, the application refuses to generate the corresponding work queue element receive request; if a free cache block or free cache is present, a corresponding receive cache block is allocated, and the application generates a corresponding work queue element pointing to the cache block and stores it in the shared receive queue to await extraction by the network card program.
Step 3: after the first host is ready to receive the cache block and work queue element of the data, the second host is notified that the data can be sent. After receiving the message, the second host replies an Acknowledgement (ACK), and the first host polls the completion queue to acknowledge the receipt of the message.
Step 4: the second host places the corresponding sending request into a sending queue in the form of a work queue element, wherein the work queue element points to data information to be sent in the main memory, and the data information comprises an address, a length and the like. When the program on the network card obtains the work queue element from the main memory and executes the work queue element, the data to be sent is transmitted to the network card receiving queue of the first host according to the information stored in the work queue element, the first host takes out the corresponding work queue element from the shared receiving queue, and the data is transmitted from the receiving queue to the buffer block distributed in the small message buffer area in advance.
Step 5: upon arrival of the transferred data at the corresponding cache block in the last level cache, the application processes the data immediately through multithreading and pipelining.
Step 6: immediately after the data leaves the pipeline, the application notifies the cache via the shared cache map to monitor the reclaimed cache block for further use.
For a better understanding of the specific details and features of the present invention in a remote memory access data transfer method when receiving large messages, the details are described in conjunction with fig. 3, including RDMA Read and RDMA WRITE:
the RDMA Read comprises the steps 1 to 5:
Step 1: preparing the communication resources of both parties, including constructing a queue pair, acquiring the queue pair number, the data address and other information of the opposite end, and informing the second host of the need of executing RDMA Read operation by the first host, so as to prepare the Read data. The second host places the data to be read in the registered area of the main memory, informs the first host that the data to be read is ready, and informs the stored main memory information and authority.
Step 2: the first host checks whether the large message buffer in the final buffer has enough capacity to receive the data to be Read, and if so, generates a corresponding Read work queue element request to put in the send queue. When the program on the network card accesses the work queue element from the host and executes the work queue element, the first host sends a Read request to the second host, and the request contains information such as a target host address, a data size, access rights and the like of data to be Read.
Step 3: after receiving the RDMA Read request sent by the first host, the second host acquires data from the corresponding area in the main memory according to the information in the request, divides the data into data packets according to the set packet size, sends the data packets to a network card receiving queue of the first host, and when each data packet arrives at the network card of the first host, the first host replies to confirm that the ACK represents that the data is received.
Step 4: when the network card of the first host delivers the transmitted data packets from the receive queue (RX) to the corresponding buffer region of the last-level cache at a granularity of 4KB, the application processes the data immediately through multithreading and pipeline technology as soon as it arrives.
Step 5: immediately after the data leaves the pipeline, the application notifies the cache via the shared cache map to monitor the reclaimed cache for further use.
The RDMA WRITE includes steps 6 to 11:
step 6: preparing the communication resources of both parties, including constructing a queue pair and acquiring the information such as the queue pair number, the data address and the like of the opposite end.
Step 7: the second host sends out a Write request, wherein the request contains relevant information of data to be written.
Step 8: after the first host receives the request, it checks, according to the provided information, whether the large message buffer area of the final buffer in the first host has enough capacity to receive the data to be written; after finding sufficient cache, the first host replies with an acknowledgement ACK to notify the second host to write the data. The reply contains the cache address, access rights, and other information for the data to be written.
Step 9: the second host prepares the data to be written from the host and places the data in the registered host area. The network card of the second host computer cuts the data to be written into according to the set packet size and then sends the cut data to the network card receiving queue of the first host computer.
Step 10: when the network card of the first host sends the transmitted data packet from the receiving queue to the corresponding buffer area of the final buffer with the granularity of 4KB, the application program immediately processes the data through the multithreading and pipeline technology as soon as the data arrives.
Step 11: immediately after the data leaves the pipeline, the application notifies the cache via the shared cache map to monitor the reclaimed cache for further use.
In one embodiment, there is provided a telecommunications device, the device comprising:
the request module is used for generating a communication request when the second host needs to send a message to be transmitted to the first host;
the message type determining module is used for determining the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
The small message transmission module is used for determining that the message type of the message to be transmitted is small message, and the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates corresponding work queue elements and puts the work queue elements into a sending queue; the first host computer takes out a buffer block from a small message buffer area in a final buffer module, generates corresponding work queue elements, puts the work queue elements into a shared receiving queue, and judges whether residual buffer blocks exist or idle buffer exists in the small message buffer area; if the small message buffer area does not have the residual buffer blocks and the idle buffer, rejecting the communication request from the second host to the first host; if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue;
The large message transmission module is used for determining that the message type of the message to be transmitted is large, the second host and the first host communicate through RDMA Read/Write primitives, the first host determines the required capacity of the message to be transmitted, and judges whether the required capacity of the message to be transmitted meets the residual storage space or not; if the required capacity of the message to be transmitted is larger than the residual storage space, rejecting the communication request from the second host to the first host; if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, and the second host segments the message to be transmitted into a plurality of data packets for transmission;
The transmission processing module is used for notifying the application program to process the data through the shared cache mapping with the application program after the buffer area in the final buffer module receives the data; unloading the data of the message to a main storage receiving buffer area and informing the application program of related information if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, and then taking the data from the main storage and processing the data by the application program; if the residence time of the data in the last cache block is less than the data residence time threshold, the process is determined to be complete.
In one embodiment, an electronic device is presented comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
s1, a second host needs to send a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates corresponding work queue elements and puts the work queue elements into a sending queue; the first host takes out a buffer block from a small message buffer area in the final buffer module and generates corresponding work queue elements to be put into a shared receiving queue;
s4: the first host judges whether residual cache blocks or idle caches exist in the small message cache region;
S5: if the small message buffer area does not have the residual buffer blocks and the idle buffer, the first host refuses the communication request of the second host;
S6: if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue, and S11 is executed;
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
S8, the first host judges whether the required capacity of the message to be transmitted meets the residual storage space or not;
S9, if the required capacity of the message to be transmitted is larger than the residual storage space, the first host refuses the communication request of the second host;
S10, if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, the second host segments the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet reaches the large message buffer area of the first host;
S11, after the buffer area in the final buffer module receives data, the buffer area informs the application program to process the data through the shared buffer mapping with the application program;
S12, if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, unloading the data of the message into a main storage receiving buffer area, informing the application program of related information, and then taking out the data from the main storage and processing the data by the application program;
s13, if the residence time of the data in the final cache module is smaller than or equal to the data residence time threshold value, determining that the processing is completed.
As shown in fig. 4, the electronic device 500 includes a processor 501, and the processor 501 may execute a computer program in a readable storage medium 503 or a computer program loaded from a storage unit 508 into the readable storage medium 503 to implement the functions set forth in the present invention.
The PCIe bus 505 in the device 500 is connected to the internal bus 504, and each component in the device 500 interacts with these two buses. The components connected to the PCIe bus include an input unit 506 such as a keyboard or mouse; an output unit 507 such as a display or speaker; a storage unit 508 such as a magnetic disk or optical disk; and a communication unit 509 comprising a network card, a network card storage area, and a processing unit. The network card storage medium stores a computer program, and the processing unit on the network card implements the embodiments of the present invention by running it. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet. The communication unit 509 may be implemented in computer hardware, firmware, software, and/or combinations thereof, such as digital electronic circuitry, integrated circuits, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
s1, a second host needs to send a message to be transmitted to a first host;
S2, the second host determines the message type of the message to be transmitted, wherein the message type includes a large message and a small message;
S3, after the message type of the message to be transmitted is determined to be a small message, the second host and the first host communicate through RDMA Send & Recv primitives; the second host generates corresponding work queue elements and puts the work queue elements into a sending queue; the first host takes out a buffer block from a small message buffer area in the final buffer module and generates corresponding work queue elements to be put into a shared receiving queue;
s4: the first host judges whether residual cache blocks or idle caches exist in the small message cache region;
S5: if the small message buffer area does not have the residual buffer blocks and the idle buffer, the first host refuses the communication request of the second host;
S6: if the small message buffer area has residual buffer blocks or idle buffer, the second host takes out the message to be transmitted according to the main storage message storage address pointed by the work queue element and sends the message to the first host, and the first host transmits the message to be transmitted to the prepared buffer space according to the address pointed by the work queue element in the shared receiving queue, and S11 is executed;
S7, after the message type of the message to be transmitted is determined to be a large message, the second host and the first host communicate through RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;
S8, the first host judges whether the required capacity of the message to be transmitted meets the residual storage space or not;
S9, if the required capacity of the message to be transmitted is larger than the residual storage space, the first host refuses the communication request of the second host;
S10, if the required capacity of the message to be transmitted is smaller than or equal to the remaining storage space, the first host allocates capacity in a large message buffer area according to the required capacity, the second host segments the message to be transmitted into a plurality of data packets for transmission, and S11 is executed immediately after each data packet reaches the large message buffer area of the first host;
S11, after the buffer area in the final buffer module receives data, the buffer area informs the application program to process the data through the shared buffer mapping with the application program;
S12, if the residence time of the data in the final-stage buffer module is larger than the data residence time threshold value, unloading the data of the message into a main storage receiving buffer area, informing the application program of related information, and then taking out the data from the main storage and processing the data by the application program;
s13, if the residence time of the data in the final cache module is smaller than or equal to the data residence time threshold value, determining that the processing is completed.
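The admission logic of steps S2 through S10 can be sketched as a small simulation. This is a hypothetical Python model for illustration only, not part of the patent: the class, method names, and the 4 KB threshold and buffer sizes are assumptions (the 4 KB value follows claim 3), and the real mechanism operates on RDMA work queue elements rather than plain function calls.

```python
# Hypothetical model of the receive-side admission logic (steps S2-S10).
# All names and sizes are illustrative assumptions, not the patent's API.

DATA_THRESHOLD = 4 * 1024  # small/large boundary; claim 3 sets it to 4 KB


class LastLevelCache:
    def __init__(self, small_blocks, large_capacity):
        self.small_blocks = small_blocks  # free 4 KB cache blocks (small area)
        self.large_free = large_capacity  # free bytes in the large message area

    def admit(self, msg_size):
        """Return the chosen transfer path, or None if the request is rejected."""
        if msg_size <= DATA_THRESHOLD:     # S2: small message -> Send & Recv path
            if self.small_blocks == 0:     # S4/S5: no remaining block, no free cache
                return None                # reject the communication request
            self.small_blocks -= 1         # S6: allocate one receive cache block
            return "send_recv"
        # S7-S10: large message -> Read/Write path, check remaining capacity
        if msg_size > self.large_free:     # S9: required capacity too large
            return None
        self.large_free -= msg_size        # S10: allocate the required capacity
        return "read_write"


llc = LastLevelCache(small_blocks=2, large_capacity=1 * 1024 * 1024)
print(llc.admit(1024))             # small message, a block is available
print(llc.admit(64 * 1024))        # large message within remaining capacity
print(llc.admit(2 * 1024 * 1024))  # large message exceeding capacity: rejected
```

Running the example admits the first two messages over the Send & Recv and Read/Write paths respectively and rejects the third for lack of large-area capacity, mirroring the reject conditions of S5 and S9.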
The computer-readable storage medium includes a receiving storage medium, a readable storage medium, or a network card storage medium. The receiving storage medium stores data sent by the network card. The readable storage medium and the network card storage medium store computer programs that implement the above data transmission method based on remote direct memory access.
In particular, the receiving storage medium may be any entity or recording medium capable of receiving data sent by a network card, including Static Random Access Memory (SRAM), embedded Dynamic Random Access Memory (eDRAM), and the like. The readable storage medium may be any entity or recording medium capable of carrying the computer program instructions, including a USB flash disk, a removable hard disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), and the like. The network card storage medium may be any entity or recording medium capable of carrying the computer program instructions and integrated with a network card, including a Field Programmable Gate Array (FPGA), Static Random Access Memory (SRAM), Flash Memory, Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application; their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the present application shall be determined by the appended claims.

Claims (9)

1. A data transmission method for remote memory access, applied to a remote communication apparatus, wherein the remote communication apparatus manages a last-level cache module; a cache area is configured in the last-level cache module, the cache area comprising a small message cache area and a large message cache area, the small message cache area and the large message cache area sharing the cache space of the cache area, and the small message cache area being divided into a preset number of cache blocks of a preset data threshold size for management; the method comprising:

S1. The second host needs to send a message to be transmitted to the first host;

S2. The second host determines the message type of the message to be transmitted, the message type including large messages and small messages; the determining of the message type specifically comprises: comparing the size of the message to be transmitted with the data threshold and determining the message type according to the comparison result; if the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message; if the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message; after the message type is determined to be a small message, step S3 is executed; after the message type is determined to be a large message, step S7 is executed;

S3. The second host communicates with the first host through the RDMA Send & Recv primitives; the second host generates a corresponding work queue element and places it in the send queue, the work queue element pointing to the data to be sent in main memory, including its address and length; the first host generates a corresponding work queue element and places it in the shared receive queue, each work queue element in the shared receive queue pointing to a 4 KB cache block in the last-level cache module;

S4. The first host determines whether any remaining cache blocks or free cache space exist in the small message cache area;

S5. If neither remaining cache blocks nor free cache space exist in the small message cache area, the first host rejects the communication request of the second host;

S6. If remaining cache blocks or free cache space exist in the small message cache area, a receive cache block is allocated; the second host fetches the message to be transmitted from the main-memory address pointed to by the work queue element and sends it to the first host, and the first host transfers the message to be transmitted into the prepared cache space at the address pointed to by the work queue element in the shared receive queue; then S11 is executed;

S7. The second host communicates with the first host through the RDMA Read/Write primitives, and the first host determines the required capacity of the message to be transmitted;

S8. The first host determines whether the remaining storage space satisfies the required capacity of the message to be transmitted, the remaining storage space being the free cache space in the large message cache area;

S9. If the required capacity of the message to be transmitted is greater than the remaining storage space, the first host rejects the communication request of the second host;

S10. If the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocates capacity in the large message cache area according to the required capacity, and the second host splits the message to be sent into multiple data packets for transmission; S11 is executed as soon as each data packet arrives in the large message cache area of the first host;

S11. After the cache area in the last-level cache module receives data, the cache area notifies the application program to process the data through the shared cache mapping with the application program;

S12. If the residence time of the data in the last-level cache module is greater than the data residence time threshold, the data of the message is offloaded to the receive buffer in main memory and the application program is informed of the related information; the application program subsequently fetches the data from main memory and processes it;

S13. If the residence time of the data in the last-level cache module is less than or equal to the data residence time threshold, it is determined that processing is completed.

2. The data transmission method for remote memory access according to claim 1, wherein before S1 the method further comprises: the first host establishing RDMA communication with the second host and initializing communication resources, including the queue pair numbers and data cache addresses of both parties; the initializing of communication resources is completed through socket communication or an RDMA connection manager.

3. The data transmission method for remote memory access according to claim 1, wherein the data threshold is set to 4 KB, and each work queue element in the shared receive queue points to a 4 KB cache block in the last-level cache module.

4. The data transmission method for remote memory access according to claim 1, wherein cache space used for data reception under RDMA communication is allocated in the last-level cache module, and the small message cache area and the large message cache area share this cache space; the size of the cache space allocated by the last-level cache module = network bandwidth × data residence time threshold × proportion of resident data that needs to be processed in the cache space.

5. The data transmission method for remote memory access according to claim 1, wherein the first host determining the required capacity of the message to be transmitted specifically comprises: required capacity of the message to be transmitted = amount of data to be sent × data residence time threshold, the data residence time threshold being set to 200 microseconds.

6. The data transmission method for remote memory access according to claim 1, wherein the second host splitting the message to be sent into multiple data packets for transmission specifically comprises: splitting the message to be sent into several data packets each no larger than 256 KB.

7. A remote communication apparatus, wherein the apparatus manages a last-level cache module; a cache area is configured in the last-level cache module, the cache area comprising a small message cache area and a large message cache area, the small message cache area and the large message cache area sharing the cache space of the cache area, and the small message cache area being divided into a preset number of cache blocks of a preset data threshold size for management; the apparatus comprising:

a request module, configured to generate a communication request when the second host needs to send a message to be transmitted to the first host;

a message type determination module, configured to determine the message type of the message to be transmitted, the message type including large messages and small messages; the determining of the message type specifically comprises: comparing the size of the message to be transmitted with the data threshold and determining the message type according to the comparison result; if the size of the message to be transmitted is less than or equal to the data threshold, the message type is determined to be a small message; if the size of the message to be transmitted is greater than the data threshold, the message type is determined to be a large message;

a small message transmission module, configured to respond after the message type of the message to be transmitted is determined to be a small message, including: the second host communicating with the first host through the RDMA Send & Recv primitives; the second host generating a corresponding work queue element and placing it in the send queue, the work queue element pointing to the data to be sent in main memory, including its address and length; the first host generating a corresponding work queue element and placing it in the shared receive queue, each work queue element in the shared receive queue pointing to a 4 KB cache block in the last-level cache module; determining whether any remaining cache blocks or free cache space exist in the small message cache area; if neither remaining cache blocks nor free cache space exist in the small message cache area, rejecting the communication request from the second host to the first host; if remaining cache blocks or free cache space exist in the small message cache area, allocating a receive cache block; the second host fetching the message to be transmitted from the main-memory address pointed to by the work queue element and sending it to the first host, and the first host transferring the message to be transmitted into the prepared cache space at the address pointed to by the work queue element in the shared receive queue;

a large message transmission module, configured to respond after the message type of the message to be transmitted is determined to be a large message, including: the second host communicating with the first host through the RDMA Read/Write primitives; the first host determining the required capacity of the message to be transmitted and determining whether the remaining storage space satisfies the required capacity, the remaining storage space being the free cache space in the large message cache area; if the required capacity of the message to be transmitted is greater than the remaining storage space, rejecting the communication request from the second host to the first host; if the required capacity of the message to be transmitted is less than or equal to the remaining storage space, the first host allocating capacity in the large message cache area according to the required capacity, and the second host splitting the message to be sent into multiple data packets for transmission;

a transmission processing module, configured to: after the cache area in the last-level cache module receives data, notify the application program through the shared cache mapping with the application program to process the data; if the residence time of the data in the last-level cache module is greater than the data residence time threshold, offload the data of the message to the receive buffer in main memory and inform the application program of the related information, the application program subsequently fetching the data from main memory and processing it; if the residence time of the data in the last-level cache module is less than the data residence time threshold, determine that processing is completed.

8. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.
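The sizing rule in claim 4 (allocated cache space = network bandwidth × data residence time threshold × proportion of resident data processed in the cache) can be evaluated numerically. This is an illustrative sketch, not part of the claims: the function name and the example parameter values (a 100 Gb/s link, the 200 microsecond threshold from claim 5, and a 50% resident fraction) are assumptions.

```python
# Illustrative computation of the cache sizing rule in claim 4.
# Function name and example parameters are assumptions for demonstration.

def llc_rx_cache_size(bandwidth_bps, residence_threshold_s, resident_fraction):
    # claim 4: cache space = bandwidth * residence threshold * resident fraction
    # bandwidth is given in bits per second, so divide by 8 to obtain bytes
    return bandwidth_bps / 8 * residence_threshold_s * resident_fraction


# 100 Gb/s link, 200 us residence threshold, half the data handled in-cache:
size = llc_rx_cache_size(100e9, 200e-6, 0.5)
print(size)  # roughly 1.25e6 bytes, i.e. about 1.25 MB of last-level cache
```

The result suggests why the scheme is feasible: even at 100 Gb/s, a sub-millisecond residence threshold keeps the reserved last-level-cache footprint in the low-megabyte range.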
CN202410166037.5A 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access Active CN118093499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410166037.5A CN118093499B (en) 2024-02-06 2024-02-06 Data transmission method, device, equipment and storage medium for remote memory access


Publications (2)

Publication Number Publication Date
CN118093499A CN118093499A (en) 2024-05-28
CN118093499B true CN118093499B (en) 2024-11-19

Family

ID=91162757


Country Status (1)

Country Link
CN (1) CN118093499B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679688A (en) * 2013-12-02 2015-06-03 华为技术有限公司 Data access method, device and system
CN116471242A (en) * 2023-05-23 2023-07-21 江苏华创微系统有限公司 RDMA-based transmitting end, RDMA-based receiving end, data transmission system and data transmission method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102546612B (en) * 2011-12-23 2015-07-08 华中科技大学 Remote procedure call implementation method based on remote direct memory access (RDMA) protocol in user mode
US9842083B2 (en) * 2015-05-18 2017-12-12 Red Hat Israel, Ltd. Using completion queues for RDMA event detection
CN109491809A (en) * 2018-11-12 2019-03-19 西安微电子技术研究所 A kind of communication means reducing high-speed bus delay
CN115858160B (en) * 2022-12-07 2023-12-05 江苏为是科技有限公司 Remote direct memory access virtualized resource allocation method and device and storage medium
CN116501549A (en) * 2023-05-06 2023-07-28 上海英方软件股份有限公司 Data caching method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN118093499A (en) 2024-05-28

Similar Documents

Publication Publication Date Title
US20240171507A1 (en) System and method for facilitating efficient utilization of an output buffer in a network interface controller (nic)
CN103763173B (en) Data transmission method and calculate node
WO2020019743A1 (en) Traffic control method and device
CN113485822A (en) Memory management method, system, client, server and storage medium
CN108965148B (en) Processor and message processing method
CN113891396B (en) Data packet processing method and device, computer equipment and storage medium
CN115964319A (en) Data processing method for remote direct memory access and related product
CN114201421A (en) A data stream processing method, storage control node and readable storage medium
CN111857992B (en) Method and device for allocating linear resources in Radosgw module
WO2017032152A1 (en) Method for writing data into storage device and storage device
US20240348686A1 (en) Remote Data Access Method and Apparatus
WO2022017475A1 (en) Data access method and related device
WO2022143774A1 (en) Data access method and related device
CN115509644B (en) Computing power unloading method, device, electronic device and storage medium
CN111756586B (en) A priority queue-based fair bandwidth allocation method, switch and readable storage medium in a data center network
CN109951540B (en) Data acquisition method and device based on content timeliness and electronic equipment
CN118093499B (en) Data transmission method, device, equipment and storage medium for remote memory access
CN114500403A (en) Data processing method and device and computer readable storage medium
CN114691382A (en) RDMA-based communication method, node, system and medium
CN115412502B (en) Network port expansion and message rapid equalization processing method
CN114995748A (en) Request processing method and device
CN115905042A (en) Data processing method and related equipment
CN115174484A (en) RDMA (remote direct memory Access) -based data transmission method, device, equipment and storage medium
US11188394B2 (en) Technologies for synchronizing triggered operations
CN113438274A (en) Data transmission method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant