
CN114490524A - High-performance distributed key value storage method based on master-slave copy data decoupling - Google Patents

High-performance distributed key value storage method based on master-slave copy data decoupling

Info

Publication number
CN114490524A
Authority
CN
China
Prior art keywords
key
value data
data
log
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210064750.XA
Other languages
Chinese (zh)
Other versions
CN114490524B (en)
Inventor
李永坤
张强
许胤龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202210064750.XA
Publication of CN114490524A
Application granted
Publication of CN114490524B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices
    • G06F16/134: Distributed indices
    • G06F16/137: Hash-based
    • G06F16/18: File system types
    • G06F16/1805: Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815: Journaling file systems
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a high-performance distributed key-value storage method based on master-slave replica data decoupling, applied to a storage cluster comprising multiple nodes. When any node receives key-value data for the i-th time, it performs a decoupling operation and a differentiated storage operation on the i-th key-value data, so that the data is written either into a log-structured merge tree or into a two-tier log architecture for storage and management; an adjustable-ordering operation is performed on the key-value data in the two-tier log architecture, and a parallel data recovery operation is performed on the key-value data of a failed node. The invention improves the overall performance of the distributed key-value system and provides a mechanism for dynamically adjusting read and write performance according to the performance requirements of the application, thereby solving the key problem that the unified multi-replica data management scheme in existing distributed key-value storage systems aggravates the system's read and write amplification.

Description

A high-performance distributed key-value storage method based on master-slave replica data decoupling

Technical Field

The invention belongs to the technical field of computer storage systems. It specifically relates to building a differentiated replica data storage mechanism in distributed key-value storage systems (distributed KV stores) based on master-slave replica data decoupling, and, on this basis, designing a replica data management scheme with a dynamically adjustable degree of ordering, thereby realizing a distributed key-value storage method with high performance and tunable read/write performance.

Background Art

To meet the storage and access demands of massive unstructured data, and to overcome the shortcomings of traditional relational databases and file storage in scalability and performance, key-value (KV) storage systems offer a good solution: for example, single-node key-value stores such as Google's LevelDB and Facebook's RocksDB, as well as distributed key-value stores derived to support large-scale data storage, such as PingCAP's TiKV, Amazon's DynamoDB, and the Apache Foundation's Cassandra. These mainstream key-value storage systems all use a log-structured merge tree (LSM-tree) to manage key-value pairs. This architecture organizes on-disk data into multiple sorted levels; data is sorted in memory and then written by appending. The advantage is that random writes are converted into sequential writes, yielding efficient write performance; the drawback is that data in adjacent levels must be merged frequently, causing severe read and write amplification.
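The append-only, sorted-level idea described above can be illustrated with a toy sketch (not the implementation of any particular system; class and parameter names are assumptions for illustration):

```python
class TinyLSM:
    """A toy LSM-style store: writes go to an in-memory table; when it
    fills, the sorted contents are appended as an immutable on-disk run.
    Illustrative only; real systems (LevelDB, RocksDB) add write-ahead
    logs, leveled compaction, bloom filters, etc."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []               # each run is sorted; newest run last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # random writes absorbed in memory
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # sort in memory, then "write sequentially" as one immutable run
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):   # newest first; scanning many
            for k, v in run:              # runs is the read amplification
                if k == key:              # the text refers to
                    return v
        return None
```

Without compaction, `get` may inspect every run, which is why real LSM-trees merge adjacent levels despite the write amplification that merging causes.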

On the other hand, in order to guarantee high data reliability and provide fault tolerance, the multi-replica fault-tolerance mechanism is widely used in distributed key-value storage systems: each key-value data item is replicated into multiple copies stored on multiple nodes, so that when one node loses its data, the data can be recovered from the other nodes. However, the existing multi-replica management schemes of distributed key-value storage systems do not take into account the read and write amplification inherent in the LSM-tree architecture itself, and simply use a single LSM-tree on each node to store all replica data uniformly.

As a result, under the multi-replica mechanism the amount of data stored in the LSM-tree grows multiplicatively, and different replicas interfere with each other during read and write operations, further aggravating the LSM-tree's read and write amplification.

Summary of the Invention

The present invention aims to address the above shortcomings of the prior art by proposing a high-performance distributed key-value storage method based on master-slave replica data decoupling, in order to improve the overall performance of the distributed key-value system and to provide a mechanism for dynamically adjusting read and write performance according to the performance requirements of the application, thereby solving the key problem that the unified multi-replica data management scheme in existing distributed key-value storage systems aggravates the system's read and write amplification.

To achieve the above objective, the present invention adopts the following technical scheme:

The high-performance distributed key-value storage method based on master-slave replica data decoupling of the present invention is characterized in that it is applied to a storage cluster comprising several nodes; when any node receives key-value data for the i-th time, it performs a decoupling operation and a differentiated storage operation on the i-th key-value data.

The decoupling operation includes:

Hash the key in the received i-th key-value data to obtain the i-th hash value; then, from the i-th hash value and the consistent hash ring, compute the node number to which the key in the i-th key-value data maps. If the computed node number equals the number of the node that received the i-th key-value data, identify the i-th key-value data as a master replica; otherwise, identify it as a slave replica.
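The decoupling step above can be sketched as follows. This is a minimal sketch under assumptions: the ring layout, node names, and the use of MD5 as the key hash are illustrative choices, not specified by the patent.

```python
import hashlib
from bisect import bisect_right

# Hypothetical consistent hash ring: sorted (position, node_id) points.
RING = sorted([(12, "N0"), (40, "N1"), (77, "N2")])
RING_SIZE = 100

def ring_position(key: str) -> int:
    """Map a key to a position on the ring via a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

def owner_node(key: str) -> str:
    """The first ring point clockwise from the key's position owns it."""
    pos = ring_position(key)
    idx = bisect_right([p for p, _ in RING], pos) % len(RING)
    return RING[idx][1]

def classify_replica(key: str, local_node: str) -> str:
    """Identify the received key-value pair as a master or slave replica:
    master iff the key maps back to the node that received it."""
    return "master" if owner_node(key) == local_node else "slave"
```

Because classification needs only one extra hash and a ring lookup, it runs independently on each node with no cross-node coordination.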

The differentiated storage operation includes:

If the i-th key-value data is a master replica, it is written into the node's log-structured merge tree for storage and management;

If the i-th key-value data is a slave replica, it is written into a two-tier log architecture for storage and management.

The high-performance distributed key-value storage method of the present invention is further characterized in that the write procedure of the two-tier log architecture comprises the following steps:

(i) When the two-tier log architecture receives the i-th key-value data, check whether i > Nmax holds. If so, execute step (ii); otherwise, write the i-th key-value data into the first-tier global log;

(ii) For the Nmax key-value data items in the first-tier global log, a background thread computes the node number to which the key of each item maps, establishes a corresponding local log for each distinct node number, and sets up several range groups within each local log;

(iii) Split the Nmax key-value data items:

A. Define a variable j and initialize j = 1;

B. For the j-th key-value data item, the background thread computes that its key maps to node number L and therefore maps the item to the L-th local log; it determines which range group in the L-th local log the key belongs to, and writes the j-th item into that range group;

C. After assigning j + 1 to j, check whether j > Nmax holds. If so, all Nmax key-value data items have been written into different range groups of different local logs, and the items written into each range group are saved in one file, forming the sorted segment corresponding to that range group; otherwise, return to step B.

The adjustable-ordering operation is performed on the key-value data in the two-tier log architecture as follows:

If the high-performance distributed key-value storage method is applied in a read-intensive storage system, merge operations between the sorted segments within a range group are increased: when the number of sorted segments in a range group exceeds the threshold Δ1, all sorted segments in that range group are merged into one sorted segment, keeping the number of sorted segments in the range group within the threshold Δ1;

If the high-performance distributed key-value storage method is applied in a write-intensive storage system, merge operations between the sorted segments within a range group are reduced: when the number of sorted segments in a range group exceeds Δ2, all sorted segments in that range group are merged into one sorted segment, keeping the number of sorted segments in the range group within the threshold Δ2; Δ1 < Δ2.

The parallel data recovery operation is performed on the key-value data of a failed node as follows:

(i) Use two threads to read data in parallel from the log-structured merge tree and the two-tier log on each node; from the key-value data read, compute a hash value for each item and store it in the vertices of a Merkle tree, thereby constructing a Merkle tree on each node;

(ii) Compare the Merkle tree built on the failed node with the Merkle trees built on the other nodes in turn, checking whether the values of vertices at the same positions are equal. If they are equal, the key-value data corresponding to that vertex has not been lost and the procedure ends; otherwise, the corresponding key-value data has been lost;

(iii) On a non-failed node, use two threads to read in parallel, from the log-structured merge tree and the two-tier log, the key-value data corresponding to the vertices where data was lost, as recovery data, and send it to the failed node;

(iv) After receiving the recovery data, the failed node uses two threads to write the recovery data in parallel into the log-structured merge tree and the two-tier log.
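Steps (i) and (ii) above can be sketched with a deliberately minimal two-level Merkle tree (leaf hashes plus a single root); the patent does not specify the tree layout, so the structure and function names here are illustrative assumptions.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_merkle(kv_pairs):
    """Step (i) sketch: hash each key-value item (sorted by key) into a
    leaf, then hash the concatenated leaves into a root."""
    leaves = [h(f"{k}={v}".encode()) for k, v in sorted(kv_pairs)]
    root = h("".join(leaves).encode())
    return {"root": root, "leaves": leaves}

def find_lost(failed_tree, healthy_tree):
    """Step (ii) sketch: equal roots mean nothing was lost; otherwise the
    mismatching leaf positions identify the lost key-value data."""
    if failed_tree["root"] == healthy_tree["root"]:
        return []
    return [i for i, (a, b) in
            enumerate(zip(failed_tree["leaves"], healthy_tree["leaves"]))
            if a != b]
```

Comparing roots first means the common case (no loss) costs a single hash comparison, and only a genuine mismatch triggers the leaf-by-leaf scan.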

Compared with the prior art, the beneficial technical effects of the present invention are as follows:

1. Lightweight master-slave replica data decoupling: the present invention uses hash computation combined with a consistent hash ring to decouple master and slave replica data, making the decoupling operation lightweight and low-overhead, without affecting the data distribution mechanism or the consistency protocol of the storage system's upper layers.

2. Differentiated storage of master and slave replica data: the present invention stores the decoupled master replicas in a log-structured merge tree and the decoupled slave replicas in a two-tier log architecture, thereby effectively avoiding mutual interference between master and slave replica data during read and write operations and overcoming the key problem that the unified multi-replica management scheme in existing distributed key-value storage systems aggravates the system's read and write amplification; moreover, the two-tier log design also improves the write and read performance of slave replicas.

3. Adjustable ordering of the two-tier log: the present invention designs a mechanism for adjusting the degree of ordering of data storage, which can dynamically tune how ordered the two-tier log architecture is, biasing the two-tier log toward either writes or reads; the degree of ordering can thus be adjusted to the application's performance requirements, improving system performance across different scenarios.

4. Parallel data recovery: when recovering lost data, the present invention uses multiple threads to read and write data from the log-structured merge tree and the two-tier log in parallel, accelerating data recovery operations and shortening recovery time.

In summary, by proposing a lightweight master-slave replica data decoupling operation, differentiated storage of master and slave replica data, an adjustable-ordering mechanism, and parallel data recovery, the present invention effectively solves the problem that unified multi-replica management in distributed key-value systems aggravates read and write amplification, greatly improves the overall read/write performance and data recovery performance of the distributed key-value system, and at the same time guarantees high system reliability.

Brief Description of the Drawings

FIG. 1 is an overall architecture diagram of the high-performance distributed key-value storage method based on master-slave replica data decoupling of the present invention;

FIG. 2 is a schematic diagram of the decoupling flow of master-slave replica data in the present invention;

FIG. 3 is a schematic diagram of the two-tier log architecture of the present invention;

FIG. 4 is a schematic diagram of range-based data grouping within a local log in the present invention;

FIG. 5 is a schematic diagram of the adjustable degree of ordering of data storage in the present invention;

FIG. 6 is a schematic diagram of parallel data recovery in the present invention.

Detailed Description of the Embodiments

In this embodiment, as shown in FIG. 1, a high-performance distributed key-value storage method based on master-slave replica data decoupling comprises decoupling of master and slave replica data, differentiated storage of master and slave replica data, adjustment of the degree of storage ordering, and parallel data recovery. The present invention uses hash computation combined with a consistent hash ring to decouple master and slave replica data, so that the decoupling operation is lightweight and low-overhead and does not affect the upper-layer data distribution mechanism or consistency protocol. In addition, through the two-tier log architecture, slave replica data is first appended in batches to the first-tier global log, and a background thread then splits it into multiple local logs in the second tier, guaranteeing efficient slave replica write performance as well as good slave replica read and recovery performance. Furthermore, the present invention designs an adjustable-ordering technique for data storage that can bias the two-tier log toward writes or toward reads, so users can tune the degree of ordering of the two-tier log according to different performance requirements and thereby improve the overall performance of the key-value storage system in different scenarios. Finally, when recovering lost data, the present invention uses multiple threads to read and write data from the log-structured merge tree and the two-tier log in parallel, accelerating data recovery operations. Each operation is described in detail below.

In this embodiment, the high-performance distributed key-value storage method based on master-slave replica data decoupling is applied to a storage cluster comprising several nodes; when any node receives key-value data for the i-th time, it performs a decoupling operation and a differentiated storage operation on the i-th key-value data.

The decoupling operation includes:

Hash the key in the received i-th key-value data to obtain the i-th hash value; then, from the i-th hash value and the consistent hash ring, compute the node number to which the key in the i-th key-value data maps. If the computed node number equals the number of the node that received the i-th key-value data, identify the i-th key-value data as a master replica; otherwise, identify it as a slave replica.

FIG. 2 shows a schematic diagram of the decoupling flow of master-slave replica data in this embodiment. After determining whether received key-value data is a master replica or a slave replica, each node first writes the data into the corresponding write-ahead log and memory cache, and then notifies the client that the write succeeded. When the cache is full, its contents are written in batch into the log-structured merge tree or the two-tier log architecture. The decoupling of master and slave replica data is lightweight, because each node only needs to perform one additional hash computation on the I/O path. Experiments show that the time overhead of the decoupling operation is less than 0.4% of the total data write time. Moreover, the decoupling operation is performed independently on each node and requires no cross-node synchronization.
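The per-node write path just described (classify, log and cache, acknowledge, then batch-flush to the appropriate structure) can be sketched as follows. All class, field, and parameter names are illustrative assumptions; the LSM-tree and two-tier log are reduced to stand-in lists.

```python
class NodeWritePath:
    """Sketch of the embodiment's write path: one extra hash per write
    classifies the replica, the item is recorded in a write-ahead log and
    a memory cache, the client is acked, and a full cache flushes in
    batch to either the LSM-tree (master) or the two-tier log (slave)."""

    def __init__(self, classify, cache_limit=3):
        self.classify = classify     # key -> "master" | "slave"
        self.wal = []                # durability before the ack
        self.cache = {"master": [], "slave": []}
        self.lsm_tree = []           # stand-in for the LSM-tree
        self.two_tier_log = []       # stand-in for the two-tier log
        self.cache_limit = cache_limit

    def write(self, key, value):
        role = self.classify(key)            # the one extra hash step
        self.wal.append((role, key, value))
        self.cache[role].append((key, value))
        if len(self.cache[role]) >= self.cache_limit:
            self.flush(role)
        return "ack"                         # client notified here

    def flush(self, role):
        batch, self.cache[role] = self.cache[role], []
        target = self.lsm_tree if role == "master" else self.two_tier_log
        target.append(sorted(batch))         # sorted batch write
```

The ack is returned as soon as the item is in the WAL and cache, which is what keeps the extra classification step off the client-visible latency path.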

The differentiated storage operation includes:

If the i-th key-value data is a master replica, it is written into the node's log-structured merge tree for storage and management. This preserves the LSM-tree's efficient read, write, and range-scan characteristics, but in a more lightweight way: once slave replicas are excluded from the LSM-tree, its size is significantly reduced.

If the i-th key-value data is a slave replica, it is written into a two-tier log architecture for storage and management, supporting fast slave replica write performance as well as efficient data recovery performance.

In a specific implementation, the write procedure of the two-tier log architecture comprises the following steps:

(i) When the two-tier log architecture receives the i-th key-value data, check whether i > Nmax holds. If so, execute step (ii); otherwise, write the i-th key-value data into the first-tier global log. FIG. 3 shows a schematic diagram of the two-tier log architecture in this embodiment. To achieve fast writes of key-value data, each node first writes received key-value data into a memory cache and sorts it; when the cache is full, its contents are appended in batch to the head of the first-tier global log, forming a segment file. Since the global log maintains no additional index structure, this design guarantees high write performance for key-value data. However, although storing all key-value data in the global log achieves high write performance, it degrades data recovery performance. For the key-value data in the global log, the corresponding master replicas may be stored on different nodes; when a node loses data, only part of the key-value data in the global log needs to be read for recovery (namely, the items whose master replicas resided on the failed node). A recovery operation therefore accesses only part of the data in the global log, resulting in a large amount of random I/O to search it. To guarantee efficient data recovery performance, a background thread continuously splits the data in the global log into multiple local logs; see step (ii).

(ii) For the Nmax key-value data items in the first-tier global log, a background thread computes the node number to which the key of each item maps, establishes a corresponding local log for each distinct node number, and sets up several range groups within each local log. FIG. 4 shows a schematic diagram of range-based data grouping within each local log in this embodiment: each local log is further divided into multiple range groups, each corresponding to a range on the consistent hash ring. Note that the range groups within a local log do not overlap in keys, so they can be managed independently. For example, node N2 in FIG. 4 holds a local log LOG0 that stores the class of key-value data whose master replicas reside on node N0. Since node N0 owns two ranges, [0, 10] and [51, 60], the local log LOG0 on node N2 contains two range groups, holding the key-value data of the ranges [0, 10] and [51, 60] respectively.

(iii) Split the Nmax key-value data items:

A. Define a variable j and initialize j = 1;

B. For the j-th key-value data item, the background thread computes that its key maps to node number L and therefore maps the item to the L-th local log; it determines which range group in the L-th local log the key belongs to, and writes the j-th item into that range group. To determine the range group each item belongs to, the item's key can be compared against the boundary values of each range on the consistent hash ring. Since each local log keeps only one class of slave replica data (whose master replicas are stored on the same node), and each local log internally uses range-based data grouping, a data recovery operation only needs to access the key-value data of the relevant range groups in the local log corresponding to the failed node, rather than all local logs and all range groups, thereby avoiding a large number of useless random I/O operations;

C. After assigning j + 1 to j, check whether j > Nmax holds. If so, all Nmax key-value data items have been written into different range groups of different local logs, and the items written into each range group are saved in one file, forming the sorted segment corresponding to that range group; otherwise, return to step B.
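The boundary-value comparison in step B can be sketched with a binary search over the sorted lower bounds of one local log's range groups; the concrete bounds below follow the FIG. 4 example ([0, 10] and [51, 60]), while the function name is an illustrative assumption.

```python
from bisect import bisect_right

# Range groups of one hypothetical local log, as in the Fig. 4 example.
GROUP_BOUNDS = [(0, 10), (51, 60)]  # sorted, non-overlapping (lo, hi)

def range_group_of(key: int):
    """Compare the key against each group's boundary values (here via a
    binary search over the sorted lower bounds) to find its range group."""
    los = [lo for lo, _ in GROUP_BOUNDS]
    i = bisect_right(los, key) - 1
    if i >= 0 and GROUP_BOUNDS[i][0] <= key <= GROUP_BOUNDS[i][1]:
        return GROUP_BOUNDS[i]
    return None  # key belongs to a different node's local log
```

Because the groups are non-overlapping, one binary search plus one bounds check suffices per key, keeping the background split cheap.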

In this embodiment, the adjustable-ordering operation is performed on the key-value data in the two-tier log architecture as follows:

First, each range group of each local log may contain multiple sorted segments, and the key-value data across the sorted segments within a range group is not fully ordered. Reading data from a range group therefore has to start from the newest sorted segment and check each segment in turn until the queried data is found. If a range group contains too many sorted segments, the read performance of the two-tier log degrades. The number of sorted segments in each range group is therefore adjusted as follows; FIG. 5 shows a schematic diagram of the adjustable-ordering scheme within a range group:

If the high-performance distributed key-value storage method is applied in a read-intensive storage system, merge operations between the sorted segments within a range group are increased: when the number of sorted segments in a range group exceeds the threshold Δ1 (set to 10 in this embodiment), all sorted segments in that range group are merged into one sorted segment, keeping the number of sorted segments in the range group within the threshold Δ1;

If the high-performance distributed key-value storage method is applied in a write-intensive storage system, merge operations between the sorted segments within a range group are reduced: when the number of sorted segments in a range group exceeds Δ2 (set to 100 in this embodiment), all sorted segments in that range group are merged into one sorted segment, keeping the number of sorted segments in the range group within the threshold Δ2; Δ1 < Δ2.

The merge operation between ordered segments within a range group is similar to a compaction in a log-structured merge tree and consists of the following steps: (i) read all ordered segments of the range group; (ii) sort the key-value data of all the ordered segments; if key-value data with the same key exists in multiple ordered segments, keep only the entry from the newest ordered segment and discard the corresponding entries in the other segments; (iii) write the merged valid key-value data back to the range group.
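The three steps above can be sketched as follows (a minimal illustration under the assumption that each ordered segment is a list of (key, value) pairs sorted by key and that segments are passed oldest to newest; the function name is hypothetical):

```python
def merge_ordered_segments(segments):
    """Merge all ordered segments of a range group into one:
    (i) read every segment, (ii) deduplicate so that for a repeated key
    only the entry from the newest segment survives, (iii) return the
    single merged, sorted segment to be written back."""
    latest = {}
    for segment in segments:        # (i) iterate oldest to newest
        for key, value in segment:
            latest[key] = value     # (ii) newer segment overwrites older
    return sorted(latest.items())   # (iii) one sorted, deduplicated segment

# Usage: key "b" appears in both segments; the newest value survives.
old = [("a", 1), ("b", 2)]
new = [("b", 20), ("c", 30)]
print(merge_ordered_segments([old, new]))  # → [('a', 1), ('b', 20), ('c', 30)]
```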

With the above adjustment of the degree of data ordering within range groups, the present invention improves the read and write performance of a distributed key-value storage system under different workloads and system configurations. Experiments verify that the proposed two-tier log design with adjustable ordering improves the read and write performance of the distributed key-value storage system by 1.4x and 2.5x, respectively.

In this embodiment, parallel data recovery of the key-value data on a failed node proceeds as follows. Figure 6 shows a schematic diagram of the parallel data recovery operation in this embodiment, assuming that the data lost on failed node N0 is being recovered:

(i) Each node uses two threads to read data in parallel from its log-structured merge tree and its two-tier log, computes a hash value for each key-value datum read, and stores the hashes in the nodes of a Merkle tree, thereby building a Merkle tree on every storage node. In Figure 6, nodes N0 and N1 each use two threads to read from their log-structured merge tree and two-tier log in parallel and build their Merkle trees from the data read;

(ii) The Merkle tree built on the failed node is compared in turn with the Merkle trees built on the other nodes, checking whether the values at corresponding tree positions are identical. If they are identical, the key-value data corresponding to that tree node has not been lost and the process ends; otherwise, the corresponding key-value data has been lost. In Figure 6, the Merkle trees built on nodes N0 and N1 are compared; since all data on node N0 is lost, its Merkle tree is empty, and comparing it with the Merkle tree on node N1 detects the key-value data lost from N0;

(iii) A non-failed node uses two threads to read, in parallel from its log-structured merge tree and its two-tier log, the key-value data corresponding to the tree nodes where data was lost, and sends it to the failed node as recovery data. In Figure 6, non-failed node N1 uses two threads to read in parallel the key-value data lost from node N0 as detected in step (ii), such as the data in ranges [11,20] and [21,30], and sends it to failed node N0;

(iv) Upon receiving the recovery data, the failed node uses two threads to write it in parallel into its log-structured merge tree and its two-tier log. In Figure 6, after receiving the recovered data in ranges [11,20] and [21,30] from non-failed node N1, failed node N0 uses two threads to write the key-value data in ranges [11,20] and [21,30] into the log-structured merge tree and the two-tier log in parallel.
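The loss-detection part of the procedure (steps (i)–(ii)) can be sketched as follows. This is a simplified illustration only: a flat, one-level map of per-range hashes stands in for the full Merkle tree, integer keys are assumed, and all function names are hypothetical:

```python
import hashlib

def build_range_hashes(kv_pairs, range_size=10):
    """Step (i), simplified: bucket key-value data into fixed-size key
    ranges and hash each bucket (a one-level stand-in for the per-node
    Merkle tree)."""
    buckets = {}
    for key, value in sorted(kv_pairs):
        lo = (key - 1) // range_size * range_size + 1
        buckets.setdefault((lo, lo + range_size - 1), []).append((key, value))
    return {rng: hashlib.sha256(repr(data).encode()).hexdigest()
            for rng, data in buckets.items()}

def find_lost_ranges(failed_hashes, healthy_hashes):
    """Step (ii): key ranges whose hashes differ from (or are missing on)
    the failed node have lost data and must be sent as recovery data."""
    return sorted(rng for rng, h in healthy_hashes.items()
                  if failed_hashes.get(rng) != h)

# Usage mirroring Figure 6: N0 lost everything (empty tree), while N1
# holds the key-value data for ranges [11,20] and [21,30].
n1_data = [(k, k * 100) for k in range(11, 31)]
lost = find_lost_ranges({}, build_range_hashes(n1_data))
print(lost)  # → [(11, 20), (21, 30)]
```

In the real scheme the healthy node would then read those ranges from its log-structured merge tree and two-tier log with two threads and ship them to N0.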

By designing the above parallel data recovery operation, the present invention greatly improves data recovery performance; experiments show that the parallel data recovery operation reduces data recovery time by about half.

Claims (4)

1. A high-performance distributed key-value storage method based on master-slave replica data decoupling, characterized in that it is applied to a storage cluster comprising several nodes; if any node receives key-value data for the i-th time, a decoupling operation and a differentiated storage operation are performed on the i-th key-value datum;

the decoupling operation comprises: hashing the key of the received i-th key-value datum to obtain the i-th hash value, and then computing, from the i-th hash value and a consistent hash ring, the node number to which the key of the i-th key-value datum maps; if the computed node number equals the number of the node that received the i-th key-value datum, identifying the i-th key-value datum as a primary replica; otherwise, identifying it as a secondary replica;

the differentiated storage operation comprises: if the i-th key-value datum is a primary replica, writing it into the node's log-structured merge tree for storage and management; if the i-th key-value datum is a secondary replica, writing it into a two-tier log architecture for storage and management.

2. The high-performance distributed key-value storage method according to claim 1, characterized in that the writing steps of the two-tier log architecture comprise:

(i) when the two-tier log architecture receives the i-th key-value datum, checking whether i > Nmax holds; if so, executing step (ii); otherwise, writing the i-th key-value datum into the first-tier global log;

(ii) for the Nmax key-value data in the first-tier global log, a background thread computes the node number to which the key of each key-value datum maps, and establishes a corresponding local log for each distinct node number, with several range groups set up in each local log;

(iii) partitioning the Nmax key-value data:

A. defining a variable j and initializing j = 1;

B. for the j-th key-value datum, the background thread computes the node number L to which its key maps and maps it to the L-th local log; determining, within the L-th local log, the range group to which the key of the j-th key-value datum belongs; then writing the j-th key-value datum into the corresponding range group of the L-th local log;

C. after assigning j + 1 to j, checking whether j > Nmax holds; if so, the Nmax key-value data have all been written into the respective range groups of the respective local logs, and the key-value data written into each range group are saved in one file to form the ordered segment corresponding to that range group; otherwise, returning to step B.

3. The high-performance distributed key-value storage method according to claim 2, characterized in that an adjustable-ordering operation is performed on the key-value data in the two-tier log architecture as follows:

if the method is applied to a read-intensive storage system, merge operations between the ordered segments of a range group are performed more aggressively: when the number of ordered segments in a range group exceeds a threshold Δ1, all ordered segments in that range group are merged into one ordered segment, keeping the number of ordered segments within Δ1;

if the method is applied to a write-intensive storage system, merge operations between the ordered segments of a range group are performed less frequently: when the number of ordered segments in a range group exceeds a threshold Δ2, all ordered segments in that range group are merged into one ordered segment, keeping the number of ordered segments within Δ2; Δ1 < Δ2.

4. The high-performance distributed key-value storage method according to claim 1, characterized in that parallel data recovery of the key-value data on a failed node proceeds as follows:

(i) each node uses two threads to read data in parallel from its log-structured merge tree and its two-tier log, computes a hash value for each key-value datum read, and stores the hashes in the nodes of a Merkle tree, thereby building a Merkle tree on every node;

(ii) the Merkle tree built on the failed node is compared in turn with the Merkle trees built on the other nodes, checking whether the values at corresponding tree positions are identical; if they are identical, the key-value data corresponding to that tree node has not been lost and the process ends; otherwise, the corresponding key-value data has been lost;

(iii) a non-failed node uses two threads to read, in parallel from its log-structured merge tree and its two-tier log, the key-value data corresponding to the tree nodes where data was lost, and sends it to the failed node as recovery data;

(iv) upon receiving the recovery data, the failed node uses two threads to write the recovery data in parallel into its log-structured merge tree and its two-tier log.
CN202210064750.XA 2022-01-20 2022-01-20 High-performance distributed key value storage method based on master-slave copy data decoupling Active CN114490524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210064750.XA CN114490524B (en) 2022-01-20 2022-01-20 High-performance distributed key value storage method based on master-slave copy data decoupling


Publications (2)

Publication Number Publication Date
CN114490524A true CN114490524A (en) 2022-05-13
CN114490524B CN114490524B (en) 2024-07-30

Family

ID=81471670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210064750.XA Active CN114490524B (en) 2022-01-20 2022-01-20 High-performance distributed key value storage method based on master-slave copy data decoupling

Country Status (1)

Country Link
CN (1) CN114490524B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103502970A (en) * 2011-12-21 2014-01-08 华为技术有限公司 Method and device for key-value pair operation
US20170212680A1 (en) * 2016-01-22 2017-07-27 Suraj Prabhakar WAGHULDE Adaptive prefix tree based order partitioned data storage system
CN109521959A (en) * 2018-11-01 2019-03-26 西安交通大学 One kind being based on SSD-SMR disk mixing key assignments memory system data method for organizing
CN110825748A (en) * 2019-11-05 2020-02-21 北京平凯星辰科技发展有限公司 High-performance and easily-expandable key value storage method utilizing differential index mechanism
CN113821171A (en) * 2021-09-01 2021-12-21 浪潮云信息技术股份公司 Key value storage method based on hash table and LSM tree


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
游理通 (You Litong); 王振杰 (Wang Zhenjie); 黄林鹏 (Huang Linpeng): "A log-structured non-volatile memory key-value storage system", Journal of Computer Research and Development, no. 09, 15 September 2018 (2018-09-15) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033464A (en) * 2023-08-11 2023-11-10 上海鼎茂信息技术有限公司 Log parallel analysis algorithm based on clustering and application
CN117033464B (en) * 2023-08-11 2024-04-02 上海鼎茂信息技术有限公司 Log parallel analysis algorithm based on clustering and application
CN118277344A (en) * 2024-06-04 2024-07-02 华侨大学 Method and device for merging storage nodes between layers of distributed key-value storage system
CN118277344B (en) * 2024-06-04 2024-08-09 华侨大学 Storage node interlayer merging method and device of distributed key value storage system
CN118708547A (en) * 2024-07-03 2024-09-27 贝格迈思(深圳)技术有限公司 Distributed key-value storage system

Also Published As

Publication number Publication date
CN114490524B (en) 2024-07-30

Similar Documents

Publication Publication Date Title
CN114490524B (en) High-performance distributed key value storage method based on master-slave copy data decoupling
CN106777351B (en) ART Tree-Based Distributed System Graph Storage Computing System and Its Method
US10296498B2 (en) Coordinated hash table indexes to facilitate reducing database reconfiguration time
Lakshman et al. Cassandra: a decentralized structured storage system
CN100483420C (en) Fine grit document and catalogs version management method based on snapshot
US8635402B2 (en) Storage system and storage access method and program
CN109445702B (en) A block-level data deduplication storage system
CN112000846B (en) Method for grouping LSM tree indexes based on GPU
CN109783016A (en) A kind of elastic various dimensions redundancy approach in distributed memory system
US20120179723A1 (en) Data replication and failure recovery method for distributed key-value store
US20060259525A1 (en) Recovery method using extendible hashing-based cluster logs in shared-nothing spatial database cluster
CN102890678A (en) Gray-code-based distributed data layout method and query method
CN109165321B (en) Consistent hash table construction method and system based on nonvolatile memory
CN113268457B (en) Self-adaptive learning index method and system supporting efficient writing
CN106682184B (en) Lightweight merging method based on log merging tree structure
CN111400306A (en) RDMA (remote direct memory Access) -and non-volatile memory-based radix tree access system
CN115114294A (en) Adaptive method, device and computer equipment for database storage mode
CN107426315B (en) An improved method of distributed cache system Memcached based on BP neural network
CN110515897B (en) Method and system for optimizing reading performance of LSM storage system
CN117667937A (en) Storage engine optimization method based on LSM tree architecture of object storage
CN117349235A (en) A KV storage system, electronic equipment, and media based on LSM-Tree
CN115437836A (en) Metadata processing method and related equipment
Li et al. Research and implementation of a distributed transaction processing middleware
CN115438039A (en) Method and device for adjusting data index structure of storage system
CN106293537A (en) A kind of autonomous block management method of the data-intensive file system of lightweight

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant