CN117539389A - Cloud-edge-end vertically integrated deduplication storage system, method, device, and medium
- Publication number: CN117539389A
- Application number: CN202311496327.8A
- Authority: CN (China)
- Prior art keywords: cloud, edge, layer, data, storage
- Legal status: Pending (machine-assigned status; not a legal conclusion)
Classifications
- G06F3/0608 — Saving storage space on storage systems
- G06F16/134 — Distributed indices
- G06F16/137 — Hash-based file access structures
- G06F16/1748 — De-duplication implemented within the file system, e.g. based on file segments
- G06F16/182 — Distributed file systems
- G06F3/0626 — Reducing size or complexity of storage systems
- G06F3/0641 — De-duplication techniques
- G06F3/067 — Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Abstract
Description
Technical Field
The present invention belongs to the field of data processing technology, and relates to a deduplication storage system, method, device, and medium based on vertical cloud-edge-end integration.
Background
As the volume and value of digital information continue to grow, data protection has attracted widespread attention in the industry. Cloud backup services can provide an economical, efficient, on-demand, and always-available option for data protection by keeping continuous backup files of customers' important data. According to existing reports, the number of organizations adopting cloud data protection is rising rapidly. However, transferring these original files from terminals to remote cloud servers places a heavy data transmission burden on the backbone network and also incurs considerable transmission costs.
Under this trend, applying data deduplication technology in backup storage systems has become a new paradigm. Owing to the internal derivation relationships among consecutive backup files, there may be a large, non-negligible amount of redundant data among them. A large-scale study reports that the data redundancy of file system contents on desktop Windows machines can be as high as 87% of the original storage space. The common practice in data deduplication is to split a backup file into data blocks and compute a fingerprint for each block; two blocks with the same fingerprint are considered duplicates, without the need for a byte-by-byte comparison. Only one copy of multiple duplicate blocks is retained, thereby saving storage space in the storage system.
Some emerging data protection strategies already adopt data deduplication, for example performing deduplication at the cloud provider, or attempting to assign similar files to the same storage server to improve the deduplication ratio. However, in the deduplication storage process of the cloud-edge-end architecture, these traditional deduplication techniques still suffer from the technical problem of high backbone network resource overhead.
Summary of the Invention
To address the problems in the traditional methods described above, the present invention proposes a deduplication storage system with vertical cloud-edge-end integration, a deduplication storage method with vertical cloud-edge-end integration, a computer device, and a computer-readable storage medium, which can significantly reduce the backbone network resource overhead of the cloud-edge-end architecture.
To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions:
In one aspect, a deduplication storage system with vertical cloud-edge-end integration is provided, comprising a terminal layer, an edge layer, and a cloud layer. A terminal device in the terminal layer divides an original backup file into data blocks, generates the unprocessed file recipe corresponding to these data blocks, and uploads it to the edge layer.
An edge server in the edge layer hashes the fingerprint information contained in the unprocessed file recipe into a fixed-size compact sketch data structure to estimate data block popularity; after allocating edge storage locations for the newly identified hot data blocks, it records the storage addresses of these edge storage locations in the unprocessed file recipe and attaches an edge-upload tag.
The edge server in the edge layer matches the entries in the unprocessed file recipe against the entries in a designated partial index table, copies the storage locations corresponding to data blocks with matching fingerprints from the designated partial index table into the unprocessed file recipe, and then uploads the portion of the unprocessed file recipe corresponding to data blocks without matching fingerprints to the cloud layer. The designated partial index table is a user-aware and version-adjacent partial index table.
A cloud server in the cloud layer performs redundancy detection on the uploaded portion of the unprocessed file recipe against the global fingerprint index table maintained in the cloud; for entries in the unprocessed file recipe corresponding to unrecognized, not-yet-stored data blocks, it attaches a cloud-upload tag and allocates new cloud storage locations for those data blocks, and then returns the processed file recipe to the edge layer.
The edge server in the edge layer assembles the returned processed file recipe and the portion of the file recipe corresponding to data blocks with matching fingerprints into a complete processed file recipe, and returns it to the terminal layer.
The terminal device in the terminal layer uploads the new hot data blocks to the edge layer for storage according to the edge storage locations and edge-upload tags in the processed file recipe, and uploads the not-yet-stored data blocks to the cloud layer for storage according to the new cloud storage locations and cloud-upload tags in the processed file recipe.
In another aspect, a deduplication storage method with vertical cloud-edge-end integration is provided, comprising the steps of:
hashing the fingerprint information contained in the unprocessed file recipe uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data block popularity, allocating edge storage locations for the newly identified hot data blocks, recording the storage addresses of these edge storage locations in the unprocessed file recipe, and attaching an edge-upload tag;
matching the entries in the unprocessed file recipe against the entries in a designated partial index table, copying the storage locations corresponding to data blocks with matching fingerprints from the designated partial index table into the unprocessed file recipe, and uploading the portion of the unprocessed file recipe corresponding to data blocks without matching fingerprints to the cloud layer, the designated partial index table being a user-aware and version-adjacent partial index table;
assembling the processed file recipe and the portion of the file recipe corresponding to data blocks with matching fingerprints into a complete processed file recipe, and returning the processed file recipe to the terminal layer, wherein the processed file recipe is obtained and returned by the cloud server in the cloud layer, which performs redundancy detection on the uploaded portion of the unprocessed file recipe against the global fingerprint index table maintained in the cloud, attaches a cloud-upload tag to the entries corresponding to unrecognized, not-yet-stored data blocks, and allocates new cloud storage locations for those data blocks;
receiving and storing the new hot data blocks uploaded by the terminal device of the terminal layer according to the edge storage locations and edge-upload tags in the processed file recipe, the new cloud storage locations and cloud-upload tags in the processed file recipe also being used to instruct the terminal device to upload the not-yet-stored data blocks to the cloud layer for storage.
In yet another aspect, a computer device is provided, comprising a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the above deduplication storage method with vertical cloud-edge-end integration.
In a further aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the above deduplication storage method with vertical cloud-edge-end integration are implemented.
One of the above technical solutions has the following advantages and beneficial effects:
In the above deduplication storage system, method, device, and medium with vertical cloud-edge-end integration, the terminal layer is responsible for dividing files into blocks and uploading the generated metadata to the edge layer and the cloud layer for redundancy detection. The edge layer reduces the bandwidth overhead of the backbone network by storing the hotter data blocks. In addition, the uploaded metadata can be pre-deduplicated at the edge layer to further reduce the transmission volume. The cloud layer maintains a global fingerprint index table for global deduplication, and partitions the global fingerprint index table and the incoming metadata across different servers of the cloud data center to support distributed parallel indexing across cloud storage servers, thereby improving deduplication and storage performance globally. Compared with traditional techniques, this end-edge-cloud vertically integrated deduplication storage architecture effectively integrates the technical and storage resources of the different layers during file transmission, storage, and retrieval; while guaranteeing the best deduplication performance, it significantly reduces the bidirectional data transmission between layers, thereby greatly reducing the backbone network resource overhead of the cloud-edge-end architecture.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the traditional technology more clearly, the drawings needed in the description of the embodiments or the traditional technology are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic diagram of the CoopDedup architecture of the deduplication storage system with vertical cloud-edge-end integration in one embodiment;
Figure 2 is a schematic diagram of the data-sharing dependencies between backup files in one embodiment;
Figure 3 is a schematic diagram of the data deduplication ratios of backup files in one embodiment;
Figure 4 is a schematic diagram of the overall structure of the deduplication storage system with vertical cloud-edge-end integration in one embodiment;
Figure 5 is a schematic diagram of the data flow of the CoopDedup architecture in one embodiment;
Figure 6 is a schematic diagram of the evaluated performance during backup file upload in one embodiment;
Figure 7 is a schematic diagram of the data block transmission volume of backup files in one embodiment;
Figure 8 is a schematic diagram of the data transmission performance during backup file retrieval in one embodiment;
Figure 9 is a schematic diagram of the bandwidth saving ratio performance in one embodiment;
Figure 10 is a schematic flowchart of the deduplication storage method with vertical cloud-edge-end integration in one embodiment.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used in the description of the present application are only for the purpose of describing specific embodiments and are not intended to limit the application.
It should be noted that reference herein to an "embodiment" means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present invention. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive of other embodiments.
Those skilled in the art will appreciate that the embodiments described herein may be combined with other embodiments. As used in this specification and the appended claims, the term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Continuous files in personal computing environments require cloud backup services for data protection, but transmitting these original files from terminals to remote cloud servers places a heavy transmission burden on the backbone network. Even though backup services can use data deduplication to remove redundant data blocks, the limited computing and memory resources of terminals make it difficult to actually deploy and apply deduplication on the terminal side.
Among the existing techniques in this field, some perform deduplication on the terminal side and upload only new (not previously stored) data blocks to the cloud; their drawback is that the scarce resources of the terminal layer limit the applicability of deduplication. The current state-of-the-art deduplication storage architecture uploads the metadata of data blocks (such as fingerprints) to the cloud server for redundancy detection and uploads only new data blocks. However, it still fails to fully utilize edge-side storage resources, resulting in frequent data exchanges between terminals and the remote cloud.
To this end, the present application proposes CoopDedup, a deduplication storage architecture with end-edge-cloud vertical integration. This innovative architecture enables the terminal layer, the cloud layer, and the intermediate edge layer to cooperatively complete data transmission, storage, and index access. An example of the CoopDedup architecture is shown in Figure 1. The terminal layer splits files into data blocks and computes their metadata for upload; the cloud layer maintains the global index table and stores all deduplicated data blocks. The innovation of this architecture centers on the edge layer between the terminal and the cloud: it pre-deduplicates the uploaded metadata (UFR) and identifies the hottest data blocks to store at the edge. Based on this, when accessing backup files, these hotter data blocks can be fetched directly from the nearby edge layer, optimizing data retrieval and shortening the transmission distance of data blocks. In addition, the edge layer pre-deduplicates the uploaded UFR and transmits only the processed partial metadata (P-UFR) to the cloud, which greatly reduces metadata transmission and thus saves backbone bandwidth. The CoopDedup architecture fully fuses the storage and computing resources of the end, edge, and cloud layers, significantly reducing bidirectional data exchange between layers while still ensuring the best deduplication performance.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It should first be noted that data deduplication is a widely used data reduction technique that removes redundant data blocks and avoids repeatedly writing identical data. This reduces both the storage space occupied by the system and the network bandwidth consumed during data transmission and retrieval. The common practice is to split a file into multiple fixed-size or variable-size data blocks. Each block is uniquely identified by a fingerprint, which is essentially a cryptographically secure hash signature; two blocks with the same fingerprint are considered duplicates, without the need for a byte-by-byte comparison. The fingerprints of all stored data blocks are recorded in a fingerprint index table. By looking up this table, duplicate blocks are directly identified and discarded, and only blocks with unique fingerprints are stored in the storage system.
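As a concrete illustration of this fingerprint-based workflow, the following minimal Python sketch splits a file into fixed-size blocks (variable-size, content-defined chunking is also common), fingerprints each block with SHA-1, and stores only blocks whose fingerprints are not yet indexed. The function name and the in-memory dict index are assumptions made for illustration, not structures defined by the patent.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB blocks, matching the average block size assumed in the text

def dedup_store(path: str, index: dict[str, str], store: dict[str, bytes]) -> list[str]:
    """Split a file into fixed-size blocks, fingerprint each with SHA-1,
    and store only blocks whose fingerprints are not already indexed.

    `index` maps fingerprint -> storage address (here just the fingerprint
    itself); `store` holds the block payloads. The returned fingerprint list
    is a minimal "file recipe" recording the file's block composition.
    """
    recipe = []
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            fp = hashlib.sha1(block).hexdigest()
            if fp not in index:        # unique block: keep exactly one copy
                index[fp] = fp
                store[fp] = block
            recipe.append(fp)          # duplicate blocks are only referenced
    return recipe
```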
Although fingerprint comparison is an efficient redundancy detection method, storing the fingerprint index table also requires a large amount of extra space. For example, when storing an 800 TB file dataset, assuming an average block size of 4 KB, at least 4 TB of fingerprints will be generated (with SHA-1 encoding, each fingerprint is 20 B). Transmitting and storing such large volumes of fingerprints is a huge challenge for backup file systems. These effects are further aggravated when stored data is frequently retrieved, because the generated fingerprints must be transmitted back and forth between users and the cloud layer for duplicate block comparison, placing a heavy communication burden on the backbone network. Nevertheless, compared with transmitting the data blocks directly, it is already the relatively better choice.
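The 4 TB figure follows directly from the stated assumptions; a quick check:

```latex
\frac{800\,\text{TB}}{4\,\text{KB/block}} \approx 2\times10^{11}\ \text{blocks},
\qquad
2\times10^{11}\ \text{blocks}\times 20\,\text{B/fingerprint}
= 4\times10^{12}\,\text{B} = 4\,\text{TB}.
```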
It should be noted that the cloud storage cluster is assumed to consist of a large number of storage servers, and the resources reserved for the cloud backup service are relatively sufficient. The cloud layer can therefore store a global fingerprint index table (containing the fingerprints of all data blocks stored in the cloud storage cluster) for thorough redundancy detection. The stored data blocks should also adopt fault-tolerance mechanisms, such as replication and erasure coding, to ensure the reliability of the data stored in the cloud.
Reducing storage space can significantly lower data protection costs, so data deduplication is widely used in backup storage systems to reduce storage occupancy. Besides storage cost, the bandwidth consumed from the data-generating terminal side to the cloud layer is also a concern. Because the network throughput between layers is limited, large data transfers occupy substantial bandwidth and cause delays for other applications. Moreover, the cost of transmitting the data once may exceed the monthly cost of storing it.
For this reason, placing some data blocks at the network edge is an innovative and effective practical attempt. Because edge resources are mostly located close to users, relevant data blocks can be retrieved from the edge layer when accessing backup files, while only the data blocks absent from the edge layer are downloaded from the remote cloud data center. In addition, the present application can also use edge resources to pre-deduplicate the transmitted data and transmit only the pre-deduplicated data to the cloud layer. In this way, these edge-assisted backup service modes can effectively save the precious bandwidth resources of the backbone network.
However, the storage space and computing resources of edge clusters are limited, making it difficult to cope with the huge storage and computing demands brought by explosive file growth. For example, in a deduplicated storage scenario, as files keep arriving, the number of stored data blocks keeps growing and the corresponding fingerprint index table keeps getting larger. Therefore, how to effectively use the precious resources of edge clusters to place some data blocks and perform data pre-deduplication is a technical problem that urgently needs to be solved.
Observations on backup file characteristics: important digital information generally has a series of backup versions. For example, users regularly take snapshots of their virtual machines, where each snapshot corresponds to a backup file. Across these consecutive files, most data blocks remain unchanged. This document also provides some systematic observations on backup files, which help in using limited storage and computing resources efficiently for deduplication.
Observation 1: Owing to the diversity of data sources, the number of duplicate data blocks between backup files of different users is negligible.
In the big-data era, different users may back up files with different contents and formats. This document observes the data-sharing dependencies between such files; the test dataset is a set of snapshots of university students' home directories commonly used in deduplication research. To discover the amount of data redundancy within and across users, the datasets are divided into variable-length data blocks with an average size of 4 KB, and the respective amounts of duplicate blocks are recorded.
The intra-user and inter-user deduplication ratios were recorded. The experimental results show that, for five consecutive backup files of each of four users in this test, the intra-user deduplication ratio remains above 41%, and for individual users it reaches as high as 55.02%. This indicates that a large number of data blocks persist across multiple consecutive backup files of one user. In contrast, the inter-user deduplication ratio is typically around 2%, which is insignificant compared with the intra-user ratio. This confirms the observation that, owing to the diversity of data sources, data redundancy between different users is negligible.
Empirical observations show that the data redundancy within one user's consecutive backup files is high, while the amount of duplicate blocks between backup files from different users is relatively small, even negligible. This phenomenon motivates the study of the differences between users. The global fingerprint index table can be divided into several independent sub-index tables according to user origin, and each backup file is indexed against the specific sub-index table associated with its user. This partitioning facilitates separate index management for each user; moreover, it speeds up fingerprint indexing and avoids index-lookup bottlenecks, while index parallelism can improve indexing throughput. Most importantly, this user-aware index partitioning maintains the deduplication effect while occupying very little additional index memory, because the number of duplicate blocks between different users' backup files is negligible.
Observation 2: Most duplicate data blocks of a backup file come from its immediately preceding backup version, while two more distant backup versions share only a small number of duplicate blocks.
To observe the internal data-sharing dependencies between consecutive backup files and to examine their block composition, 30 consecutive backup versions of one user were chunked. A data block is identified as a duplicate when its fingerprint is detected in a previous version of the backup file. Four kinds of blocks in backup file B_j are distinguished: (1) internal duplicate blocks; (2) adjacent duplicate blocks, i.e., blocks of B_j that are also referenced by the adjacent version B_{j-1}; (3) skipping duplicate blocks, i.e., blocks of B_j referenced by a backup version earlier than B_{j-1}, such as B_{j-2}, but not by the adjacent B_{j-1}; and (4) unique blocks, i.e., non-duplicate blocks.
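As a hedged sketch of how such a classification might be computed from per-version fingerprint sets (the set-based logic below is an assumption for illustration, not a procedure specified by the patent):

```python
def classify_blocks(recipe: list[str], prev_versions: list[set[str]]) -> list[str]:
    """Classify each block occurrence of backup B_j against earlier versions.

    `recipe` lists B_j's block fingerprints in file order; `prev_versions`
    holds the fingerprint sets of B_1..B_{j-1}, oldest first. Returns one
    label per occurrence, in file order.
    """
    adjacent = prev_versions[-1] if prev_versions else set()
    older = set().union(*prev_versions[:-1])  # all versions before B_{j-1}
    seen, labels = set(), []
    for fp in recipe:
        if fp in seen:
            labels.append("internal")   # repeated inside B_j itself
        elif fp in adjacent:
            labels.append("adjacent")   # also referenced by B_{j-1}
        elif fp in older:
            labels.append("skipping")   # referenced only by a version before B_{j-1}
        else:
            labels.append("unique")     # not a duplicate anywhere
        seen.add(fp)
    return labels
```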
Figures 2 and 3 show the data-sharing dependencies among 30 consecutive backup files from one user. Figure 2 shows the block-type distribution of these backup files: most duplicate blocks of a backup file come from its previous version (i.e., adjacent duplicate blocks), accounting for about 55% of the total data volume of the backup file, while skipping duplicate blocks account for only a small fraction, less than 0.3% of all blocks. Figure 3 shows, for the current 30th backup file, the deduplication ratio obtained using the fingerprint index table of each of the previous 29 versions. The observations show that the deduplication ratio gradually increases from approximately 42% (based on the initial backup version 1) to over 65% (based on version 29, adjacent to the current backup).
These observations verify the data-sharing dependencies and version-derivation relationships among one user's consecutive backup files. Closer backup versions can detect more duplicate blocks from each other, whereas fingerprint indexing against more distant versions weakens the effect of redundancy detection. This motivates focusing on the adjacent versions of backup files when memory is insufficient to hold the fingerprints of all previous backup versions.
Referring to Figure 4, in one embodiment, a deduplication storage system with vertical cloud-edge-end integration is provided, comprising a terminal layer 12, an edge layer 14, and a cloud layer 16. After dividing an original backup file into data blocks, a terminal device of the terminal layer 12 generates the unprocessed file recipe corresponding to these data blocks and uploads it to the edge layer 14. An edge server in the edge layer 14 hashes the fingerprint information contained in the unprocessed file recipe into a fixed-size compact sketch data structure to estimate data block popularity; after allocating edge storage locations for the newly identified hot data blocks, it records the storage addresses of these edge storage locations in the unprocessed file recipe and attaches an edge-upload tag. The edge server in the edge layer 14 matches the entries in the unprocessed file recipe against the entries in a designated partial index table, copies the storage locations corresponding to data blocks with matching fingerprints from the designated partial index table into the unprocessed file recipe, and then uploads the portion of the unprocessed file recipe corresponding to data blocks without matching fingerprints to the cloud layer 16; the designated partial index table is a user-aware and version-adjacent partial index table. A cloud server in the cloud layer 16 performs redundancy detection on the uploaded portion of the unprocessed file recipe against the global fingerprint index table maintained in the cloud; for entries corresponding to unrecognized, not-yet-stored data blocks, it attaches a cloud-upload tag and allocates new cloud storage locations, and then returns the processed file recipe to the edge layer 14. The edge server of the edge layer 14 assembles the processed file recipe and the portion corresponding to data blocks with matching fingerprints into a complete processed file recipe and returns it to the terminal layer 12. The terminal device of the terminal layer 12 uploads the new hot data blocks to the edge layer 14 for storage according to the edge storage locations and edge-upload tags in the processed file recipe, and uploads the not-yet-stored data blocks to the cloud layer 16 for storage according to the new cloud storage locations and cloud-upload tags.
In the above deduplication storage system 100 with vertical cloud-edge-end integration, the terminal layer 12 is responsible for dividing files into blocks and uploading the generated metadata to the edge layer 14 and the cloud layer 16 for redundancy detection. The edge layer 14 reduces the bandwidth overhead of the backbone network by storing the hotter data blocks. In addition, the uploaded metadata can be pre-deduplicated at the edge layer 14 to further reduce transmission volume; the cloud layer 16 maintains a global fingerprint index table for global deduplication, and partitions the global fingerprint index table and the incoming metadata across different servers of the cloud data center to support distributed parallel indexing across cloud storage servers, thereby globally improving deduplication and storage performance. Compared with traditional techniques, this end-edge-cloud vertically integrated deduplication storage architecture effectively integrates the technical and storage resources of the different layers during file transmission, storage, and retrieval; while guaranteeing the best deduplication performance, it significantly reduces bidirectional inter-layer data transmission, thereby greatly reducing the backbone network resource overhead of the cloud-edge-end architecture.
Understandably, the CoopDedup architecture has several difficulties to resolve in the deduplication storage service. The first key question is how to reasonably estimate data block access frequency and store the most frequently accessed blocks (hot blocks) in the edge layer 14. Since the storage and computing resources of the edge layer 14 are limited, recording the access frequency of every data block is unrealistic. The second key question is how to effectively use the limited memory space at the edge to detect more redundant metadata. Maintaining a global fingerprint index table at the edge for redundancy detection poses a serious challenge to memory-scarce edge servers. Even if an edge server were equipped with large storage disks to help maintain this large index table, such slow disk-based indexing would become the main performance bottleneck of the deduplication system.
For the first difficulty, a space-friendly Count-Min Sketch (a counting algorithm that, when the data size is very large, trades accuracy for efficiency) is used to estimate the access popularity of data blocks; this fixed-size data structure can identify the hotter blocks with high accuracy. For the second difficulty, the backup-file derivation relationship is exploited to design an efficient lightweight index table that pre-deduplicates the uploaded metadata, with which most metadata redundancy can be detected and removed. These edge-assisted methods effectively save the precious bandwidth resources of the backbone network from the edge to the remote cloud.
Some concepts need to be explained first. Metadata structures: a fingerprint is the unique identifier of a data block; by comparing fingerprints, it can be determined whether two data blocks are duplicates. During file storage or retrieval, these fingerprints are usually assembled in order into a file recipe. A file recipe is essentially a sequential list of the metadata of every data block in a file, reflecting the order in which these blocks appear in the file. Even if a data block appears multiple times in a file, its corresponding metadata is still listed multiple times in the file recipe. Two types of recipes exist in the CoopDedup architecture: unprocessed file recipes (UFR) and processed file recipes (PFR).
The unprocessed file recipe (UFR) is generated when the terminal layer 12 chunks a backup file; its metadata contains only the fingerprints of the data blocks. The UFR records the block composition of the backup file and serves as the input of the subsequent deduplication process. Once the storage location of every data block has been determined at the edge layer 14 or the cloud layer 16, the UFR is converted into a processed file recipe (PFR), in which each entry additionally carries the storage address of the corresponding block.
The PFR plays two roles in backup protection. The first role is during backup file upload: data blocks can be uploaded according to the storage addresses recorded in the PFR. To avoid retransmitting blocks that are already stored, an uploading tag is additionally attached to each entry of a unique block that needs to be uploaded. Only blocks with tagged entries are uploaded; the other blocks are regarded as duplicates and need not be uploaded. The second role is during backup file access: the data blocks contained in the backup file are downloaded from the edge layer 14 or the cloud layer 16 according to the addresses recorded in the PFR, and the downloaded blocks are assembled, according to the block sequence in the PFR, into a complete backup file that is returned to the requesting client.
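As an illustration, the two recipe forms might be modeled as follows; the field names, the location encoding, and the tag values are assumptions made for this sketch, not structures defined by the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UFREntry:
    """One entry of an unprocessed file recipe: only the block fingerprint."""
    fingerprint: str

@dataclass
class PFREntry:
    """One entry of a processed file recipe: fingerprint plus resolved location.

    `tier` distinguishes edge vs. cloud storage; `upload_tag` is set only for
    unique blocks that still have to be transmitted (edge or cloud upload).
    """
    fingerprint: str
    tier: str                         # "edge" or "cloud"
    address: str                      # storage address within that tier
    upload_tag: Optional[str] = None  # None => duplicate, no upload needed
```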
Another important metadata structure is the fingerprint index table (FingIdx for short), which records the mapping from block fingerprints to block storage addresses. The capacity of FingIdx is generally smaller than that of the PFRs, because it contains only one entry per distinct data block. A global fingerprint index table records the metadata of all stored blocks. During backup file upload, when a data block's fingerprint is found in FingIdx, the block is identified as a duplicate; in that case, its storage location is passed directly to the UFR without reallocating an address.
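A minimal sketch of this UFR-to-PFR conversion against a FingIdx, reusing the hypothetical PFREntry defined above; the counter-based address allocation is an assumption of the sketch.

```python
def resolve_recipe(ufr: list[str], fingidx: dict[str, str], tier: str) -> list[PFREntry]:
    """Convert a UFR (fingerprint list) into PFR entries using FingIdx.

    Known fingerprints reuse their recorded address (duplicate blocks);
    unknown fingerprints get a freshly allocated address plus an upload tag.
    """
    pfr = []
    for fp in ufr:
        if fp in fingidx:                        # duplicate: reuse stored location
            pfr.append(PFREntry(fp, tier, fingidx[fp]))
        else:                                    # unique: allocate and tag for upload
            addr = f"{tier}-slot-{len(fingidx)}"
            fingidx[fp] = addr
            pfr.append(PFREntry(fp, tier, addr, upload_tag=tier))
    return pfr
```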
Terminal layer 12 of the CoopDedup architecture: the terminal layer 12 is the first layer of the architecture, and the terminal devices of different users generate large numbers of backup files. Uploading the original files of terminal devices to the remote cloud layer 16 for deduplication is uneconomical, because it would cause large amounts of redundant data to be transmitted repeatedly over the backbone network. Therefore, the deduplication process is pushed forward to the front-end terminal layer 12 and edge layer 14. The main task of a terminal device is to divide the generated backup file into data blocks of varying sizes and compute fingerprints for these blocks, as shown for the terminal layer 12 (also called the end layer) in Figure 5. The fingerprints computed from one backup file are assembled in order into a UFR, which is uploaded to the subsequent edge layer 14 for further processing. Transmitting the UFR instead of all the chunked data blocks effectively relieves the network transmission burden.
The innovation at the terminal layer 12 is therefore that duplicate blocks are not removed on the terminal device; instead, the UFR is uploaded for redundancy identification. First, given the limited storage and computing resources of terminal devices, it is impractical to store a global fingerprint index table recording the metadata of all data blocks ever chunked by the device, and scarce computing resources also hinder fingerprint-based redundancy detection. Second, performing isolated deduplication on each terminal device could ignore the data redundancy that may exist across multiple terminal devices, degrading the deduplication effect.
Edge layer 14 of the CoopDedup architecture: an edge server serves multiple terminal devices within a region, and all such edge servers constitute the edge layer 14 of the CoopDedup architecture. As the intermediate bridge between the terminal layer 12 and the cloud layer 16, the edge layer 14 is innovatively made to take over part of the tasks performed by the terminal layer 12 and the cloud layer 16 in traditional deduplication backup methods. Pushing terminal-layer tasks to the edge layer 14 relieves the computing and storage pressure of terminal devices; pulling cloud-layer tasks closer to the edge layer 14 reduces frequent data transmission between terminals and the remote cloud, thereby lowering network transmission overhead.
It should be noted that the storage space of the edge layer 14 cannot be compared with that of a cloud data center. Therefore, how to effectively utilize the limited resources of the edge layer 14 to maximize the deduplication effect and reduce transmission cost is the focus of the edge layer 14. This problem is addressed from two aspects. The first is to store at the network edge a portion of the hot data blocks with high access frequency: under the same space budget, storing hot blocks serves more end-side data requests than storing cold blocks, thereby minimizing the transmission cost of data retrieval. The second is to maintain a partial index table at the edge layer 14 to pre-deduplicate the uploaded metadata (UFR); the detected redundant metadata is not further transmitted to the remote cloud layer 16, reducing the data transmission volume of the backbone network. Both approaches use the storage and computing resources of the edge layer 14 to assist the deduplication of backup files, effectively saving backbone bandwidth and mitigating potential network transmission congestion.
Count-based, sketch-based block selection: recording the access popularity of data blocks by directly tracking reference counts is uneconomical, because every block would need to record its fingerprint and a counter of its reference frequency, incurring non-negligible memory overhead. Therefore, a fixed-size compact sketch data structure, the Count-Min Sketch, is used to estimate the reference count (popularity) of each data block. The Count-Min Sketch is a two-dimensional array whose width is denoted r and depth w (both configurable parameters). For each arriving data block, its fingerprint is mapped by w independent hash functions to one of the r counters in each of the w rows, as shown for the edge layer 14 in Figure 5; the counters at those hash positions are then incremented by 1. The reference count of a data block, i.e., its popularity, is estimated by the minimum value Min over all counters its fingerprint hashes to. The estimation error has been proven to be bounded by e·n/r with probability at least 1 − e^(−w), where n is the total number of data blocks and e is Euler's number.
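A compact sketch of a Count-Min Sketch in this setting; the row-seeded hashing via SHA-1 is an implementation assumption, not something prescribed by the text.

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch with w rows (depth) of r counters (width) each.

    A key's estimate is the minimum over its w counters; it never
    underestimates, and the overestimate is bounded by e*n/r with
    probability at least 1 - e^(-w) for n inserted items.
    """
    def __init__(self, width_r: int, depth_w: int):
        self.r, self.w = width_r, depth_w
        self.rows = [[0] * width_r for _ in range(depth_w)]

    def _pos(self, key: str, row: int) -> int:
        # One independent hash per row, derived by seeding SHA-1 with the row id.
        h = hashlib.sha1(f"{row}:{key}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.r

    def add(self, fingerprint: str) -> None:
        for i in range(self.w):
            self.rows[i][self._pos(fingerprint, i)] += 1

    def estimate(self, fingerprint: str) -> int:
        return min(self.rows[i][self._pos(fingerprint, i)] for i in range(self.w))
```

In this architecture, an edge server might call add() for every fingerprint in an arriving UFR and treat blocks whose estimate() exceeds a popularity threshold as hot candidates for edge placement.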
A simple analysis shows that this sketch-based popularity estimation saves substantial memory. For example, recording the reference counts of all partitioned data blocks directly (n = 2^12) would require 2^12 × (20 B + 4 B) of memory. Using a Count-Min Sketch with parameters r = 2^2 and w = 2^10 instead yields 2^(10+2) counters in total, occupying only 1/6 of the memory needed by direct reference-count tracking. As more data blocks are generated, the space saving of the sketch grows exponentially while its own footprint stays fixed.
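The arithmetic behind the 1/6 figure can be written out explicitly (assuming 20 B SHA-1 fingerprints and 4 B counters):

```latex
\underbrace{2^{12}\times(20\,\mathrm{B}+4\,\mathrm{B})}_{\text{exact counting}} = 96\,\mathrm{KiB},
\qquad
\underbrace{2^{2}\times 2^{10}\times 4\,\mathrm{B}}_{\text{Count-Min Sketch}} = 16\,\mathrm{KiB}
 = \tfrac{1}{6}\times 96\,\mathrm{KiB}.
```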
User-aware and version-adjacent partial fingerprint index: pre-deduplicating metadata relieves the network transmission burden. For example, with SHA-1 data-block fingerprints and an average block size of 4 KB, storing 800 TB of files requires uploading at least 4 TB of fingerprints. Because edge memory is limited, however, keeping a global fingerprint index table (containing the fingerprints of all data blocks of every processed file) is impractical. The key consideration is therefore how to choose a subset of the global fingerprint index for redundancy detection, maximizing the redundancy detection rate while shrinking the table. System observations show that backup files from different users differ substantially in content, whereas adjacent backup files of the same user share a large number of data blocks. Based on this, this application innovatively proposes the User-aware and Version-adjacent Partial Index Table (UVPIdx). The UVPIdx splits the fingerprint index into several independent sub-tables keyed by the owning user, each sub-table recording the metadata of one adjacent backup version of that user.
The user-aware fingerprint index speeds up fingerprint comparison and simplifies updating each user's sub-table. Note that each UVPIdx sub-table should always be updated to the latest version of that user's backup file, so that an overly large version gap does not hurt duplicate detection. A version-adjacent fingerprint index detects most duplicates while keeping the table small, giving high redundancy detection accuracy: as shown in Table 3, an index built from one adjacent version detects roughly 66.538%/66.568% = 99.95% of duplicates. When a UFR arrives at the edge layer 14, entries matched through the UVPIdx-based fingerprint index directly receive the storage addresses of their data blocks; only the unmatched part of the UFR is uploaded further to the remote cloud, effectively saving bandwidth between the edge layer 14 and the cloud layer 16.
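As an illustration only, the UVPIdx can be pictured as one small table per user that is always refreshed to that user's latest backup version. The entry layout (fingerprint mapped to a storage address) is an assumption of this sketch.

```python
class UVPIdx:
    """User-aware, version-adjacent partial index (illustrative sketch):
    one sub-table per user, holding only the metadata of that user's
    most recent backup version."""

    def __init__(self):
        self._tables: dict[str, dict[bytes, str]] = {}

    def lookup(self, user, fingerprint):
        # Returns the cloud storage address of a duplicate, or None.
        return self._tables.get(user, {}).get(fingerprint)

    def refresh(self, user, latest_version):
        # Replace the sub-table with the newest backup version, as
        # required to keep detection accuracy from degrading.
        self._tables[user] = dict(latest_version)
```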
Cloud layer 16 of the CoopDedup architecture: the cloud layer 16 is in essence a large storage cluster comprising many homogeneous storage servers. Its ample storage and computing resources allow the cloud layer 16 to run its own deduplication structure. The cloud layer 16 maintains a global fingerprint index table recording the fingerprints of all data blocks of every processed backup file. By comparing an uploaded UFR against this global index, data blocks corresponding to matched entries are deemed already stored in the cloud, while unmatched blocks are deemed not yet stored and must subsequently be uploaded from the terminal to the cloud for storage. Because the global fingerprint index of the cloud layer 16 compares data from all terminals, it detects every duplicate data block and achieves complete redundancy elimination.
In one embodiment, the edge servers of the edge layer 14 are further configured to upload a copy of each hot data block stored at the edge to the cloud layer 16 for storage.
It can be understood that, further, every data block, including the hot data blocks stored at the edge, must keep at least one copy in the cloud. This guarantees data-block availability even when edge resources are unreliable. In addition, cloud storage can employ fault-tolerance mechanisms, such as replication or erasure coding, to further ensure the reliability of data storage.
In one embodiment, the cloud servers of the cloud layer 16 perform redundancy detection between the uploaded partial unprocessed file recipe and the global fingerprint index table maintained in the cloud by means of a distributed index.
It can be understood that, further, fingerprint indexing on a single storage server is time-consuming and resource-intensive. To support distributed fingerprint indexing across cloud storage servers, this embodiment maps both the global fingerprint index table and the incoming UFR into different buckets based on data-block fingerprints, as shown in the distributed index of Figure 5. Entries mapped into the same bucket are assigned to one cloud server. Since the fingerprint uniquely identifies a data block, fingerprint-based bucket mapping guarantees that all matching entries (in the UFR and the global fingerprint index table) land in the same bucket, accelerating the indexing process without affecting the deduplication result. After the distributed indexing, new, not-yet-stored data blocks are identified and assigned new cloud storage locations.
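A possible form of this fingerprint-based bucketing is sketched below. Treating the leading fingerprint bytes as the bucket key is an illustrative choice; any deterministic function of the fingerprint preserves the matching guarantee.

```python
from collections import defaultdict


def bucket_of(fingerprint: bytes, num_buckets: int) -> int:
    # A fingerprint (e.g. a SHA-1 digest) is already uniformly
    # distributed, so its leading bytes can pick the bucket directly.
    return int.from_bytes(fingerprint[:4], "big") % num_buckets


def partition(entries, num_buckets):
    """Group (fingerprint, payload) pairs by bucket; each bucket is
    then handled independently by a single cloud storage server."""
    buckets = defaultdict(list)
    for fp, payload in entries:
        buckets[bucket_of(fp, num_buckets)].append((fp, payload))
    return buckets
```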
Overall, through the cooperation of the three end-edge-cloud layers, the CoopDedup architecture reduces bidirectional data exchange, saves bandwidth, and lowers transmission cost. Two kinds of data travel between the three layers: first, the metadata used for deduplication detection, namely the UFR and PFR introduced above; second, the new, not-yet-stored data blocks to be uploaded. When the file recipe has been processed and returned to the terminal, these new blocks are uploaded from the terminal layer 12 to the edge layer 14 or the cloud layer 16 for storage according to the address information in the PFR. The data communication model across the three layers of the CoopDedup architecture can therefore be summarized as follows.
At the terminal layer 12, the original backup file is first divided into data blocks, producing the unprocessed file recipe (UFR), i.e. the list of fingerprints of all data blocks in the file. The UFR is uploaded and serves as the input for further processing at the edge layer 14.
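A minimal sketch of this terminal-side step might look as follows. Fixed-size 4 KB chunking is assumed here purely for brevity (this passage does not fix a chunking policy), while the SHA-1 fingerprints match the parameter settings used later.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunking, for illustration only


def build_ufr(path: str) -> list[bytes]:
    """Split a backup file into data blocks and list their SHA-1
    fingerprints; the returned list is the UFR sent to the edge."""
    ufr = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            ufr.append(hashlib.sha1(chunk).digest())
    return ufr
```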
At the edge layer 14, the fingerprint information contained in the UFR is hashed into the Count-Min Sketch. Once the estimated popularity of a data block just exceeds the configured threshold, that block is identified as a new hot data block. In this embodiment, such hot blocks should keep a copy at the edge layer 14 and are assigned an edge storage location; this storage address is recorded in the corresponding UFR entry, with an edge-upload tag appended. In addition, the UFR entries are compared with the entries of the UVPIdx. Data blocks with matching fingerprints are considered duplicates, i.e. already stored in the cloud layer 16, in which case their locations are copied directly from the UVPIdx into the UFR. These edge-processed partial file recipes (P-PFR) wait at the edge for later assembly, while the unmatched part of the UFR is uploaded further to the resource-rich cloud layer 16.
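Putting the two edge mechanisms together, one pass over an incoming UFR could look like the following sketch. Here `cms` and `uvpidx` stand for the Count-Min Sketch and partial index pictured earlier, `alloc_edge_slot` is a hypothetical allocator for edge storage addresses, and the `edge_placed` set is a simplification that promotes a block to "hot" only once.

```python
def edge_process(ufr, user, cms, uvpidx, hot_threshold,
                 alloc_edge_slot, edge_placed):
    """Edge-layer pre-deduplication of one UFR (illustrative sketch).

    Returns the edge-resolved partial recipe (P-PFR) and the unmatched
    entries that must be forwarded to the cloud layer.
    """
    p_pfr, to_cloud = [], []
    for fp in ufr:
        cms.add(fp)
        entry = {"fp": fp, "tags": set()}
        if cms.estimate(fp) > hot_threshold and fp not in edge_placed:
            edge_placed.add(fp)                 # newly identified hot block
            entry["edge_addr"] = alloc_edge_slot(fp)
            entry["tags"].add("edge-upload")
        cloud_addr = uvpidx.lookup(user, fp)
        if cloud_addr is not None:              # duplicate already in the cloud
            entry["cloud_addr"] = cloud_addr
            p_pfr.append(entry)                 # waits at the edge for assembly
        else:
            to_cloud.append(entry)              # unknown: ask the cloud layer
    return p_pfr, to_cloud
```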
At the cloud layer 16, the uploaded partial UFR and the global index table undergo redundancy detection in a distributed-index fashion. UFR entries corresponding to data blocks identified as not yet stored are appended with a cloud-upload tag, and each such block is assigned a new cloud storage location. Once the locations (storage container and offset) of all unstored blocks have been determined, the cloud-processed file recipe is returned to the edge layer 14 and assembled with the edge-processed recipe into the complete PFR. Note that the entries of this complete PFR update the UVPIdx at the edge layer 14, because the UVPIdx should always record the metadata of each user's latest backup version to avoid degrading the deduplication effect.
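The corresponding cloud-side step, restricted to a single bucket for simplicity, might be sketched as follows. The `global_index` is modeled as a plain fingerprint-to-address mapping, and `alloc_cloud_slot` is a hypothetical allocator returning a (container, offset) location.

```python
def cloud_dedup(forwarded_entries, global_index, alloc_cloud_slot):
    """Check forwarded UFR entries against the global fingerprint
    index (illustrative sketch of one bucket's worth of work)."""
    for entry in forwarded_entries:
        addr = global_index.get(entry["fp"])
        if addr is None:                        # genuinely new data block
            addr = alloc_cloud_slot(entry["fp"])
            global_index[entry["fp"]] = addr
            entry["tags"].add("cloud-upload")   # terminal must upload it
        entry["cloud_addr"] = addr              # (container, offset) in the cloud
    return forwarded_entries                    # sent back for PFR assembly
```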
The complete PFR, carrying the location information and upload tags of the data blocks, is finally sent back to the terminal layer 12. The terminal is responsible for uploading the corresponding data blocks according to the storage locations and upload tags in the PFR: new data blocks are uploaded to the locations designated by the cloud layer 16 according to their cloud-upload tags, while hot data blocks are uploaded to the nearby edge storage locations designated by the edge layer 14 according to their edge-upload tags. Retrieval of the backup file is marked as ready only after all data blocks have been uploaded successfully.
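The terminal's final upload step can then be pictured as a dispatch over the PFR tags. The `edge_client` and `cloud_client` objects are hypothetical transport stubs assumed to expose a put(addr, data) method.

```python
def upload_by_tags(pfr, chunks, edge_client, cloud_client):
    """Terminal-side upload of the blocks the PFR asks for (sketch).

    `chunks` maps fingerprint -> raw block bytes held by the terminal.
    """
    for entry in pfr:
        data = chunks[entry["fp"]]
        if "cloud-upload" in entry["tags"]:    # new, not-yet-stored block
            cloud_client.put(entry["cloud_addr"], data)
        if "edge-upload" in entry["tags"]:     # copy of a newly hot block
            edge_client.put(entry["edge_addr"], data)
    # only after every upload succeeds is the backup marked retrievable
```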
In one embodiment, further, during file retrieval the terminal devices of the terminal layer 12 send block requests to the edge servers and/or cloud servers according to the location information of the data blocks in the processed file recipe, and assemble the retrieved data blocks in order into the original backup file.
It can be understood that during file retrieval the terminal sends block requests to the edge or cloud servers according to the data-block locations in the PFR and assembles the retrieved blocks in order into the original backup file. Frequently accessed hot blocks can be downloaded directly from a nearby edge server, reducing backbone bandwidth usage. Even if an edge server fails or otherwise becomes unreliable, a block request can still locate the block's cloud copy through the global fingerprint index table in the cloud, ensuring data-block availability and storage reliability.
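Retrieval can be sketched symmetrically: the edge is tried first for blocks that carry an edge address, with the cloud copy as the fallback. The clients are again hypothetical stubs, here assumed to expose a get(addr) method returning bytes or None.

```python
def retrieve(pfr, edge_client, cloud_client):
    """Reassemble a backup file from its processed recipe (sketch)."""
    parts = []
    for entry in pfr:                          # recipe order == file order
        data = None
        if "edge_addr" in entry:               # hot block: try the nearby edge
            data = edge_client.get(entry["edge_addr"])
        if data is None:                       # cold block, edge miss, or edge failure
            data = cloud_client.get(entry["cloud_addr"])
        parts.append(data)
    return b"".join(parts)
```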
In summary, the collaboration among the three vertically interconnected end-edge-cloud layers significantly reduces bidirectional data exchange between layers while guaranteeing optimal deduplication performance.
In some implementations, to demonstrate the effect of the above system more clearly and intuitively, several experimental examples are provided here; they serve only as auxiliary illustration and do not constitute the sole limitation of the above technical solutions of this application. These examples use a real-world dataset to evaluate the performance of the CoopDedup architecture empirically. Experimental setup: a desktop computer was used. Dataset: the real-world FSL dataset, containing consecutive snapshots of the file home directories of 13 students, was used to evaluate the generality of the CoopDedup architecture. Each snapshot corresponds to one backup file, and the snapshots cover a variety of typical workloads such as file-system snapshots and virtual-machine images. The total dataset size is 67.0 GB, with an average snapshot size of about 154.2 MB. The global deduplication rate is about 48.62%, i.e. roughly half of the data blocks are duplicates.
To characterize deduplication performance more comprehensively, four methods are compared. CoopDedup is the end-edge-cloud vertically fused deduplication storage architecture proposed above. InftyDedup is the current state-of-the-art deduplication storage architecture; it exchanges metadata between the cloud layer and terminals for redundancy detection but neglects efficient use of edge-layer resources. PostDedup is a widely adopted post-process deduplication strategy that transfers backup files directly to the remote cloud for storage and deduplication. CoopDedup_FIFO is a variant of the CoopDedup architecture in which both the partial index table and the stored data blocks at the edge layer are maintained on a first-in-first-out basis; the sizes of its index table and stored blocks follow the CoopDedup architecture, and on space overflow its data are evicted first-in-first-out.
Comparison metrics: during backup upload, the main experimental metric is the data transmission volume on the backbone network, consisting of metadata traffic and data-block traffic. The smaller the uploaded volume, the more redundancy has been eliminated before the data reaches the remote cloud. This saves precious backbone network resources and is crucial for network performance.
During backup retrieval, the focus is the Bandwidth Saving Ratio (BSR), which denotes the percentage reduction in backbone traffic achieved by fetching data blocks from the nearby edge; the volume of data downloaded from the remote cloud is also recorded. Of course, the edge layer's assistance to deduplication comes at the cost of extra storage at the edge, so the examples also record the edge layer's Extra Space Occupation (ESO), comprising the volume of data blocks stored at the edge plus the size of the CM-Sketch used.
Parameter settings: the files in the dataset are first decompressed and divided into data blocks, and each block's fingerprint is represented by its SHA-1 digest. The access popularity of backup files is generated with the widely used Zipf distribution, with the access-concentration (skew) parameter set to 1. The CM-Sketch parameters default to r = 1,000,000 and w = 10. In the CoopDedup architecture, the volume of data blocks stored at the edge layer defaults to 30% of the volume of all deduplicated blocks; the edge storage budget of the CoopDedup_FIFO baseline is always kept identical to that of CoopDedup.
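For concreteness, the Zipf popularity used here can be generated as below. The normalization over ranked files is a standard construction and not a detail fixed by this application.

```python
import random


def zipf_popularity(num_files: int, skew: float = 1.0) -> list[float]:
    """Access probability of each file rank under a Zipf law (sketch);
    with skew = 1, the file ranked i gets probability proportional to 1/i."""
    weights = [1.0 / (i ** skew) for i in range(1, num_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_requests(num_files: int, num_requests: int, skew: float = 1.0):
    # e.g. the 1,000 retrieval requests generated by popularity below
    probs = zipf_popularity(num_files, skew)
    return random.choices(range(num_files), weights=probs, k=num_requests)
```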
Numerical results: large-scale experiments tested the performance of the CoopDedup architecture and its baselines during backup upload and retrieval. Upload performance is shown in Figure 6. There, InftyDedup transfers the most metadata: uploading 450 files costs as much as 371.27 MB of metadata. The CoopDedup architecture transfers only about half as much, because InftyDedup ships the entire metadata between the terminal and the remote cloud for redundancy detection, whereas the edge pre-deduplication of the CoopDedup architecture effectively reduces the metadata volume traveling from edge to cloud. Although CoopDedup_FIFO also maintains an index table at the edge for metadata pre-deduplication, it ignores the key role of derivation relationships among backup files, so its metadata traffic remains large: about 297.27 MB for 450 uploaded files.
Note that the metadata traffic of the traditional PostDedup method stays at zero, since it ships the raw files directly to the cloud without computing any metadata for redundancy detection. As Figure 7 shows, however, PostDedup's data-block traffic is the highest of all compared methods: after redundancy detection on the metadata, the other methods only need to upload the deduplicated data blocks, roughly half of PostDedup's block traffic.
Data transmission performance during backup retrieval is shown in Figure 8. In this example, 1,000 file requests were generated at random according to file popularity. In Figure 8, the CoopDedup architecture holds an absolute advantage in download volume: when 60% of the data blocks can be stored at the edge layer, CoopDedup downloads only about 31.79 GB. By contrast, PostDedup and InftyDedup make no use of edge-layer resources and always download every involved block from the remote cloud, yielding a high download volume of about 150 GB for the 1,000 requests.
Bandwidth-saving performance is shown in Figure 9. With only 10% of the data blocks storable at the edge layer, the CoopDedup architecture already saves 32.77% of bandwidth, thanks to CM-Sketch's accurate selection of the hotter data blocks, which lets more blocks be fetched at the edge during retrieval without contacting the remote cloud. This growth in bandwidth saving slows as the stored volume increases: the last 10% of blocks (90%–100%) contributes only 4.33%, because the blocks selected last are usually the colder ones, serving only a few file access requests. Moreover, CoopDedup_FIFO, which maintains its stored blocks with a first-in-first-out rule, can only deliver a linearly growing bandwidth saving.
Finally, in the performance results of the CoopDedup architecture under different sketch (CM-Sketch) widths, the fraction of data blocks stored at the edge layer is fixed at 30%. As the sketch width grows from 100 to 10^8, the edge layer's extra space occupation increases gradually, since the CM-Sketch needs more space for its hash counters; the increase is nonetheless modest, because most of the extra space consists of the data blocks stored at the edge. As the CM-Sketch width increases, the bandwidth saving ratio also increases, confirming that a larger sketch reduces the bias in estimating block access frequency. Once the sketch width exceeds 10^7, the bandwidth saving ratio stays unchanged, indicating that a sketch of that width already estimates block popularity accurately.
In summary, the CoopDedup architecture makes full use of the three-layer end-edge-cloud resources and, compared with the state-of-the-art InftyDedup, halves the volume of uploaded metadata. Moreover, with only 10% of the data blocks stored at the edge, the bandwidth saving reaches about 33%.
In one embodiment, as shown in Figure 10, a cloud-edge-end vertically fused deduplication storage method 100 is provided, which may include the following processing steps S12 to S18:
S12: hash the fingerprint information contained in the unprocessed file recipe uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data-block popularity; after allocating an edge storage location for each newly identified hot data block, record the address of the edge storage location into the unprocessed file recipe and append an edge-upload tag.
S14: match the entries of the unprocessed file recipe against the entries of a given partial index table; after copying the storage locations of data blocks with matching fingerprints from the partial index table into the unprocessed file recipe, upload the part of the unprocessed file recipe corresponding to data blocks without matching fingerprints to the cloud layer. The given partial index table is a user-aware and version-adjacent partial index table.
S16: assemble the processed file recipe with the part of the unprocessed file recipe corresponding to data blocks with matching fingerprints into a complete processed file recipe, and return the processed file recipe to the terminal layer. The processed file recipe is obtained and returned after the cloud servers in the cloud layer perform redundancy detection between the uploaded partial unprocessed file recipe and the global fingerprint index table maintained in the cloud, append a cloud-upload tag to the entries of the unprocessed file recipe corresponding to data blocks identified as not yet stored, and allocate a new cloud storage location for each such block.
S18: receive and store the new hot data blocks uploaded by the terminal devices of the terminal layer according to the edge storage locations and edge-upload tags in the processed file recipe; the new cloud storage locations and cloud-upload tags in the processed file recipe further instruct the terminal devices of the terminal layer to upload the not-yet-stored data blocks to the cloud layer for storage.
It can be understood that, for the specific limitations of the cloud-edge-end vertically fused deduplication storage method, reference may be made to the corresponding limitations of the cloud-edge-end vertically fused deduplication storage system 100 above, which are not repeated here. The above method is described from the perspective of the edge layer, to make the technical solution of this application easier to understand intuitively.
In the above cloud-edge-end vertically fused deduplication storage method, the terminal layer is responsible for chunking files and uploading the generated metadata to the edge layer and the cloud layer for redundancy detection. The edge layer reduces backbone bandwidth overhead by storing the hotter data blocks; in addition, the uploaded metadata can be pre-deduplicated at the edge layer to further reduce transmission volume. The cloud layer maintains a global fingerprint index table for global deduplication and partitions the global index and the incoming metadata across the servers of the cloud data center, supporting distributed parallel indexing across cloud storage servers and thereby improving deduplication and storage performance globally. Compared with traditional techniques, this end-edge-cloud vertically fused deduplication storage architecture effectively integrates the techniques and storage resources of the different tiers involved in file transmission, storage, and retrieval; while guaranteeing the best deduplication performance, it significantly reduces bidirectional inter-layer data transmission, greatly lowering the backbone network overhead of the cloud-edge-end architecture.
In one embodiment, the cloud servers in the cloud layer perform redundancy detection between the uploaded partial unprocessed file recipe and the global fingerprint index table maintained in the cloud by means of a distributed index.
In one embodiment, the above cloud-edge-end vertically fused deduplication storage method may further include the following processing step:
the edge server uploads a copy of each hot data block stored at the edge to the cloud layer for storage.
In one embodiment, during file retrieval, the terminal devices of the terminal layer send block requests to the edge servers and/or cloud servers according to the location information of the data blocks in the processed file recipe, and assemble the retrieved data blocks in order into the original backup file.
It should be understood that although the steps in the above flowchart of Figure 10 are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and they may be executed in other orders. Moreover, at least some of the steps of Figure 10 may comprise multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times; their execution order likewise need not be sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, a computer device is further provided, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following processing steps: hashing the fingerprint information contained in the unprocessed file recipe uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data-block popularity, allocating an edge storage location for each newly identified hot data block, recording the address of the edge storage location into the unprocessed file recipe, and appending an edge-upload tag; matching the entries of the unprocessed file recipe against the entries of a given partial index table, copying the storage locations of data blocks with matching fingerprints from the partial index table into the unprocessed file recipe, and then uploading the part of the unprocessed file recipe corresponding to data blocks without matching fingerprints to the cloud layer, the given partial index table being a user-aware and version-adjacent partial index table; assembling the processed file recipe with the part of the unprocessed file recipe corresponding to data blocks with matching fingerprints into a complete processed file recipe and returning the processed file recipe to the terminal layer, wherein the processed file recipe is obtained and returned after the cloud servers in the cloud layer perform redundancy detection between the uploaded partial unprocessed file recipe and the global fingerprint index table maintained in the cloud, append a cloud-upload tag to the entries of the unprocessed file recipe corresponding to data blocks identified as not yet stored, and allocate a new cloud storage location for each such block; and receiving and storing the new hot data blocks uploaded by the terminal devices of the terminal layer according to the edge storage locations and edge-upload tags in the processed file recipe, the new cloud storage locations and cloud-upload tags in the processed file recipe further instructing the terminal devices of the terminal layer to upload the not-yet-stored data blocks to the cloud layer for storage.
It can be understood that, besides the memory and processor mentioned above, the computer device also includes other software and hardware components not listed in this specification, which can be determined according to the specific edge-server model in each application scenario and are not enumerated in detail here.
In one embodiment, when executing the computer program, the processor may also implement the steps or sub-steps added in the embodiments of the above cloud-edge-end vertically fused deduplication storage method.
In one embodiment, a computer-readable storage medium is further provided, storing a computer program which, when executed by a processor, implements the following processing steps: hashing the fingerprint information contained in the unprocessed file recipe uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data-block popularity, allocating an edge storage location for each newly identified hot data block, recording the address of the edge storage location into the unprocessed file recipe, and appending an edge-upload tag; matching the entries of the unprocessed file recipe against the entries of a given partial index table, copying the storage locations of data blocks with matching fingerprints from the partial index table into the unprocessed file recipe, and then uploading the part of the unprocessed file recipe corresponding to data blocks without matching fingerprints to the cloud layer, the given partial index table being a user-aware and version-adjacent partial index table; assembling the processed file recipe with the part of the unprocessed file recipe corresponding to data blocks with matching fingerprints into a complete processed file recipe and returning the processed file recipe to the terminal layer, wherein the processed file recipe is obtained and returned after the cloud servers in the cloud layer perform redundancy detection between the uploaded partial unprocessed file recipe and the global fingerprint index table maintained in the cloud, append a cloud-upload tag to the entries of the unprocessed file recipe corresponding to data blocks identified as not yet stored, and allocate a new cloud storage location for each such block; and receiving and storing the new hot data blocks uploaded by the terminal devices of the terminal layer according to the edge storage locations and edge-upload tags in the processed file recipe, the new cloud storage locations and cloud-upload tags in the processed file recipe further instructing the terminal devices of the terminal layer to upload the not-yet-stored data blocks to the cloud layer for storage.
In one embodiment, when the computer program is executed by the processor, the steps or sub-steps added in the embodiments of the above cloud-edge-end vertically fused deduplication storage method may also be implemented.
A person of ordinary skill in the art can understand that all or part of the processes in the above method embodiments can be implemented by instructing relevant hardware through a computer program; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random-access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application and are described in relative detail, but they should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of this application, all of which fall within its scope of protection. The scope of protection of this patent application shall therefore be determined by the appended claims.