CN117971789B - Distributed storage system based on cloud computing and file backup method thereof - Google Patents
Distributed storage system based on cloud computing and file backup method thereof
- Publication number
- CN117971789B (application number CN202410389633.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- backup
- version
- storage
- data item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1458—Management of the backup or restore process
- G06F11/1464—Management of the backup or restore process for networked environments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1873—Versioning file systems, temporal file systems, e.g. file system supporting different historic versions of files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Quality & Reliability (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data processing, and in particular to a distributed storage system based on cloud computing and a file backup method thereof. The method comprises: acquiring a data storage request and taking the data storage request as a data item; performing distribution processing on the storage nodes of the data item, and obtaining update operation information of the data item by using a data consistency algorithm based on version control; performing incremental data backup and deduplication processing to obtain backup operation confirmation information, while managing and maintaining the version history of the file; adjusting the storage location and backup strategy of the data item and generating a file backup operation instruction; and, based on the backup operation confirmation information and a data recovery request, recovering data from the backup when the distributed storage system fails or data is lost. The method and system address the low data access efficiency, poor data integrity, low backup and recovery efficiency, and poor flexibility of existing cloud-computing-based distributed storage systems.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a distributed storage system based on cloud computing and a file backup method thereof.
Background
In the current information technology age, the generation, storage, processing and backup of data have become an indispensable requirement for businesses and individuals. With the rapid development of cloud computing and big data technology, the data volume has shown an explosive growth, and how to effectively and safely manage the data becomes a challenge. Distributed storage systems based on cloud computing are popular because of their high scalability, high availability, and cost effectiveness, and become an ideal choice for handling large-scale data storage requirements. However, with the continuous expansion of the system scale and the continuous increase of the data volume, the efficiency and reliability of operations such as data storage, retrieval, backup and recovery have become a problem to be solved in system design and operation and maintenance. In a distributed storage environment, the efficiency of uniform distribution, fast access, version control, load balancing, and data backup and recovery of data directly affects the performance and stability of the system. In addition, the conventional data backup method often involves a large amount of data replication, which not only occupies a large amount of storage space, but also increases the time and cost of the backup operation. Therefore, how to reduce unnecessary data duplication and optimize the use of storage resources through incremental backup and data deduplication technology, and at the same time, ensure the integrity of data backup and recovery efficiency, become the key for improving the performance of the distributed storage system.
The Chinese invention patent with application number CN202010208182.7, "Data collaborative backup system and method based on cloud computing and distributed storage", mainly includes a cloud processor, a cloud storage and a plurality of cooperative user terminals, where each cooperative user terminal is provided with a cooperative storage space. That invention also discloses a corresponding data collaborative backup method based on cloud computing and distributed storage, which comprises the following steps: the cloud processor acquires the data to be backed up, allocates a cooperative storage space for the data to be backed up and stores the data there; the cloud processor then generates a storage record of the data to be backed up and stores the storage record in the cloud storage. That invention uses the storage space of the user terminals that themselves need to back up data to provide data backup for other users, so that only the data backup records need to be stored in the cloud storage. This greatly reduces the amount of data stored in the cloud storage and lowers the hardware requirements of cloud backup. In addition, because the data are scattered across the user terminals, a leak of the data in any one cooperative storage space does not cause serious consequences.
However, the above technology has at least the following technical problems: existing distributed storage systems based on cloud computing suffer from low data access efficiency, poor data integrity, low data backup and recovery efficiency, and poor flexibility.
Disclosure of Invention
The invention provides a distributed storage system based on cloud computing and a file backup method thereof, which are used for solving the technical problems of lower data access efficiency, poorer data integrity, lower data backup and recovery efficiency and poor flexibility in the existing distributed storage system based on cloud computing.
The invention discloses a distributed storage system based on cloud computing and a file backup method thereof, which concretely comprise the following technical scheme:
a cloud computing-based distributed storage system, comprising:
The system comprises a distributed hash storage module, a data consistency and version control module, a data backup optimization module, a dynamic data access scheduling module, a file version management module and a disaster recovery and data recovery module;
the distributed hash storage module takes the data storage request as a data item, performs distributed processing on storage nodes of the data item to obtain storage position information of the data item, and sends the storage position information of the data item to the data consistency and version control module and the dynamic data access scheduling module;
The data consistency and version control module is used for obtaining update operation information of the data item by using a data consistency algorithm based on version control, based on the storage position information of the data item, and for sending the update operation information of the data item to the data backup optimization module, the dynamic data access scheduling module and the file version management module;
The data backup optimization module is used for implementing incremental backup and data deduplication technology based on the update operation information of the data item to obtain backup operation confirmation information, and sending the backup operation confirmation information to the dynamic data access scheduling module and the disaster recovery and data recovery module;
The dynamic data access scheduling module is used for adjusting the storage position and the backup strategy of the data item based on the storage position information, the update operation information and the backup operation confirmation information of the data item, together with the history and patterns of data access requests, to obtain a data access and backup strategy adjustment operation;
the file version management module is used for managing and maintaining the version history record of the file based on the update operation information of the data item and the access request of the user to the file version;
and the disaster recovery and data recovery module is used for recovering data from the backup when the distributed storage system fails or the data is lost based on the backup operation confirmation information and the data recovery request.
A file backup method of a distributed storage system based on cloud computing comprises the following steps:
S1, acquiring a data storage request, and taking the data storage request as a data item; performing distribution processing on storage nodes of the data items by using an intelligent distribution algorithm based on hashing to obtain storage position information of the data items; further using a data consistency algorithm based on version control to obtain update operation information of the data item;
S2, performing incremental data backup and deduplication processing based on the update operation information of the data item to obtain backup operation confirmation information; meanwhile, based on the update operation information of the data item and the access request of the user to the version of the file, the version history of the file is managed and maintained;
S3, adjusting the storage position and the backup strategy of the data item based on the storage position information, the update operation information and the backup operation confirmation information of the data item, together with the history and patterns of data access requests; generating a file backup operation instruction according to the adjusted storage position and backup strategy of the data item; based on the backup operation confirmation information and the data recovery request, recovering data from the backup when the distributed storage system fails or the data is lost;
The file backup method is applied to the cloud computing-based distributed storage system described above.
Preferably, the S1 specifically includes:
in the implementation process of the intelligent distribution algorithm based on the hash, an intelligent data positioning algorithm is introduced to optimize the storage position information of the data item.
Preferably, the S2 specifically includes:
Based on the update operation information of the data item, incremental data backup and deduplication processing are performed by using an incremental data backup and deduplication algorithm, and backup operation confirmation information is obtained.
Preferably, in the step S2, the method further includes:
In the implementation process of managing and maintaining the version history of the file, a file version control and query algorithm is introduced.
Preferably, the S3 specifically includes:
And intelligently adjusting the storage position and the backup strategy of the data item by using a data storage and backup strategy optimization algorithm to obtain data access and backup strategy adjustment operation.
Preferably, in the step S3, the method further includes:
In the implementation process of the data storage and backup strategy optimization algorithm, firstly, the access frequency of each data item in each time period is calculated; further, based on the obtained access frequency, calculating the storage and backup priority of each data item in combination with the content importance score and the size of the data item; further, according to the calculated storage and backup priorities, dynamically adjusting the storage positions and the backup strategies of the data items; and finally, carrying out dynamic optimization processing on the storage resources.
Preferably, in the step S3, the method further includes:
in the process of dynamically optimizing the storage resources, introducing a load balance index and a storage efficiency factor to evaluate the load balance degree and the storage resource use efficiency of the current storage configuration; based on the evaluation result, dynamic reassignment of storage resources is performed.
Preferably, in the step S3, the method further includes:
based on the backup operation confirmation information and a data recovery request initiated by a user or the distributed storage system, the data recovery algorithm is used for recovering the data from the backup when the distributed storage system fails or the data is lost.
The technical scheme of the invention has the beneficial effects that:
1. The distributed storage system can automatically and uniformly distribute the data items to the storage nodes by adopting the intelligent distribution algorithm based on the hash, so that the overload risk of a single node is effectively reduced, and the efficiency of data access and the overall stability of the distributed storage system are improved; the intelligent data positioning algorithm is utilized to dynamically optimize the storage position of the data according to the size of the data item, the system capacity and the load condition, so that the speed and the efficiency of data retrieval are improved, and the adaptability of the distributed storage system to different data access modes is enhanced; the data version is effectively managed by adopting a data consistency algorithm based on version control, so that consistency and integrity of data updating are ensured; by introducing the concepts of the load factors and the node weights, the data storage can be flexibly adjusted according to the real-time system load and the storage capacity of each node, so that better load balance is realized, and the high availability of key data is ensured; the method combines the storage position information of the data item and a data consistency algorithm based on version control, can process the current data storage request, dynamically adjusts the storage of the data item according to the access and update frequency of the data item, and further improves the utilization efficiency of storage resources and the flexibility of data processing.
2. According to the invention, by calculating the timestamp difference and adjusting the increment version difference, the backup operation is only executed for the data changed since the last backup, so that unnecessary data replication is obviously reduced, and the backup efficiency is improved; by calculating and monitoring the storage efficiency index of the backup data, the saving effect brought by the duplication elimination technology is evaluated, and data support is provided for further storage optimization; the backup strategy is dynamically adjusted according to the actual change condition of the data item, so that the pertinence and the efficiency of the backup operation are improved, and the timely backup and recovery capability of key data are ensured; through intelligent incremental backup and data deduplication algorithm, data redundancy is reduced, storage utilization rate is optimized, cost is saved for enterprises, and data processing efficiency is improved.
3. According to the method, the time sequence analysis is carried out on the historical record and the mode of the data access request, and the access and update modes of the data items are finely identified, so that the storage position of the data is intelligently adjusted, the frequently accessed data is more easily accessed, and the overall data access efficiency is improved; the load balance index and the storage efficiency factor are introduced, so that the evaluation can be performed according to the current load and the use condition of the storage resources, and further the dynamic redistribution of the storage resources is performed, including the data migration and the adjustment of the capacity of the storage nodes, thereby realizing better load balance and being beneficial to saving the storage resources; the incremental data backup and the deduplication algorithm are used for backing up the data changed since the last backup, unnecessary storage space occupation is reduced through deduplication processing, and the storage efficiency maximization of the backup data is realized.
Drawings
FIG. 1 is a block diagram of a distributed storage system based on cloud computing according to an embodiment of the present invention;
fig. 2 is a flowchart of a file backup method of a distributed storage system based on cloud computing according to an embodiment of the present invention.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
Referring to FIG. 1, a block diagram of a distributed storage system based on cloud computing according to an embodiment of the present invention is shown, where the system includes the following parts:
The system comprises a distributed hash storage module, a data consistency and version control module, a data backup optimization module, a dynamic data access scheduling module, a file version management module and a disaster recovery and data recovery module;
the distributed hash storage module takes a data storage request as a data item, and uses an intelligent hash-based distribution algorithm to carry out distribution processing on storage nodes of each data item, so that the data items are uniformly distributed on different storage nodes, storage position information of the data items is obtained, and the storage position information of the data items is sent to the data consistency and version control module and the dynamic data access scheduling module;
the storage location information of the data item includes a key value of the data item and the data item;
The data consistency and version control module is used for ensuring consistency and integrity of the storage position information of the data items by using a data consistency algorithm based on version control, based on the storage position information of the data items output by the distributed hash storage module, and for obtaining update operation information of the data items; it sends the update operation information of the data item to the data backup optimization module, the dynamic data access scheduling module and the file version management module;
the update operation information of the data item includes the data item, version number, and update type (full or incremental);
the data backup optimization module is used for implementing an incremental backup and data deduplication technology based on the update operation information of the data items of the data consistency and version control module, identifying the changed data from the last backup, performing the incremental backup, reducing the storage space requirement through the data deduplication technology, obtaining backup operation confirmation information, and sending the backup operation confirmation information to the dynamic data access scheduling module and the disaster recovery and data recovery module;
The dynamic data access scheduling module takes as input the storage position information, the update operation information and the backup operation confirmation information of the data item, supplied respectively by the distributed hash storage module, the data consistency and version control module and the data backup optimization module, together with the history and patterns of data access requests. Based on these inputs, it intelligently adjusts the storage position and the backup strategy of the data item according to the data access frequency and the importance of the backup data, and dynamically optimizes the use of storage resources, ensuring high availability and quick recovery of key data. The result is the data access and backup strategy adjustment operation, through which the backup of the file is realized;
the data access and backup strategy adjustment operation comprises detailed information of data migration and backup frequency adjustment;
The file version management module is used for managing and maintaining the version history record of the file based on the update operation information of the data items of the data consistency and version control module and the access request of a user to a specific file version, allowing the user to access and recover to the specific version of the file, tracking the change history of each file, including creating, modifying and deleting operations, and distributing a unique version identifier for each version to realize the management of the file version;
And the disaster recovery and data recovery module is used for recovering data from the backup based on the backup operation confirmation information of the data backup optimization module and a data recovery request initiated by a user or the distributed storage system when the distributed storage system fails or the data is lost, so that the continuity and the integrity of the data are ensured, and the data recovery operation is realized.
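The file version management behaviour described above (tracking create, modify and delete operations, assigning a unique identifier to each version, and restoring a specific version) can be illustrated with a minimal sketch. The class and method names below (`FileHistory`, `record`, `restore`) are hypothetical, and the choice to restore by appending a new version is an assumption, not something the source specifies.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    version_id: int            # unique identifier of this version
    operation: str             # "create", "modify" or "delete"
    content: Optional[bytes]   # None for a delete operation


class FileHistory:
    """Hypothetical in-memory version history for a single file."""

    _OPERATIONS = {"create", "modify", "delete"}

    def __init__(self) -> None:
        self._versions: list[Version] = []
        self._ids = itertools.count(1)

    def record(self, operation: str, content: Optional[bytes] = None) -> int:
        """Append one change to the history and return its version identifier."""
        if operation not in self._OPERATIONS:
            raise ValueError(f"unknown operation: {operation}")
        version = Version(next(self._ids), operation, content)
        self._versions.append(version)
        return version.version_id

    def get(self, version_id: int) -> Version:
        """Look up a version by its identifier."""
        for version in self._versions:
            if version.version_id == version_id:
                return version
        raise KeyError(version_id)

    def restore(self, version_id: int) -> int:
        """Restore an old version by recording its content as a new version
        (assumption: history is append-only, nothing is overwritten)."""
        old = self.get(version_id)
        if old.content is None:
            raise ValueError("cannot restore a version that records a deletion")
        return self.record("modify", old.content)

    def current(self) -> Optional[bytes]:
        """Content of the latest version, or None if empty or deleted."""
        return self._versions[-1].content if self._versions else None

    def ids(self) -> list[int]:
        return [v.version_id for v in self._versions]
```

Keeping the history append-only means a restore is itself a tracked change, so the record of what happened to the file is never lost.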
Referring to fig. 2, a flowchart of a file backup method of a distributed storage system based on cloud computing according to an embodiment of the present invention is shown, including the following steps:
S1, acquiring a data storage request, and taking the data storage request as a data item; performing distribution processing on storage nodes of the data items by using an intelligent distribution algorithm based on hashing to obtain storage position information of the data items; further using a data consistency algorithm based on version control to obtain update operation information of the data item;
Firstly, receiving a data storage request through a monitoring interface, taking the data storage request as a data item, and carrying out distributed processing on storage nodes of each data item by using an intelligent distribution algorithm based on hash to obtain storage position information of the data item, so as to realize processing of the data storage request; the specific implementation process is as follows:
first, a hash function H is applied to the key value K of each data item, and a hash value is calculated:
h=H(K)
the hash function H maps the key value K to a hash value h of a fixed size;
Introducing a load factor based on the size of the data item and the load of the storage node; the load factor takes into account the current load of the storage node and the size of the data item; the load factor of each storage node is determined by the amount of data currently stored and the maximum capacity of the storage node:
where currentLoad_i is the current load of storage node i, and maxCapacity_i is the maximum capacity of storage node i; size_D is the size of the data item; L_i is the load factor, used to ensure that data is not distributed to storage nodes that are already overloaded;
Further, the storage node is determined. Let there be N storage nodes in the distributed storage system; node weights W_i are introduced, and a weighted hash value h′ is calculated to determine the storage node S of the data item:
h′ = (h · L_avg) mod N
where L_avg is the average of the load factors of all storage nodes, used to adjust the hash value so that it reflects the overall load state of the distributed storage system; W_i is the weight of storage node i, representing the storage capacity of the storage node; mod is the modulo operation; ⌊·⌋ denotes rounding down;
finally, the storage node S obtained through calculation is used as storage position information of the data item, and the storage position information of the data item is sent to a data consistency and version control module to realize the processing of a data storage request;
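The node selection steps above can be sketched as follows. This is only an illustration under stated assumptions: the text does not reproduce the load factor formula, so the form `(currentLoad_i + size_D) / maxCapacity_i` is assumed; SHA-256 truncated to 32 bits stands in for the unspecified hash function H; the mapping from h′ to a node index (rounding down, then skipping overloaded nodes) is assumed; and the node weights W_i are omitted. All function and class names are hypothetical.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Node:
    current_load: float  # amount of data currently stored
    max_capacity: float  # maximum capacity, same unit as current_load


def key_hash(key: str) -> int:
    """h = H(K): fixed-size hash of the key (first 32 bits of SHA-256, an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def load_factor(node: Node, size_d: float) -> float:
    """Assumed form of L_i: node load after accepting the item, relative to capacity."""
    return (node.current_load + size_d) / node.max_capacity


def choose_node(key: str, size_d: float, nodes: list[Node]) -> int:
    """Return the index of a storage node that can accept the data item."""
    n = len(nodes)
    factors = [load_factor(node, size_d) for node in nodes]
    l_avg = sum(factors) / n
    h_weighted = (key_hash(key) * l_avg) % n  # h' = (h * L_avg) mod N
    start = math.floor(h_weighted)            # assumed mapping from h' to a node index
    # Assumed fallback: probe the following nodes until one is not overloaded.
    for offset in range(n):
        idx = (start + offset) % n
        if factors[idx] <= 1.0:
            return idx
    raise RuntimeError("no storage node has enough free capacity")
```

The same key, item size and node state always yield the same node, while a node whose load factor would exceed 1 is never selected.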
In order to improve the data positioning efficiency, an intelligent data positioning algorithm is introduced, and the speed and the efficiency of data retrieval are improved by optimizing the calculation process of the storage position information of the data item; the specific implementation process is as follows:
The inputs are the weighted hash value h′, the data item size D, and the storage node capacity vector C = [c_1, c_2, ..., c_N], which contains the storage capacity of each storage node. First, the ratio R_D of the data item size to the total storage capacity is calculated:
R_D = D / TotalCapacity
where TotalCapacity is the total storage capacity of the distributed storage system, i.e., the sum of all storage node capacities:
TotalCapacity = c_1 + c_2 + ... + c_N
where c_i denotes the capacity of the i-th storage node.
The process evaluates the relationship between the size of the data item to be stored and the total storage capacity of the distributed storage system, and provides a basis for data positioning;
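As a small worked example of this step, the ratio of the data item size to the total storage capacity can be computed directly from the capacity vector. The function name `size_ratio` is hypothetical, and the sketch assumes that sizes and capacities are expressed in the same unit.

```python
def size_ratio(size_d: float, capacities: list[float]) -> float:
    """R_D: data item size divided by the sum of all storage node capacities."""
    total_capacity = sum(capacities)
    if total_capacity <= 0:
        raise ValueError("total storage capacity must be positive")
    return size_d / total_capacity
```

For instance, a data item of size 5 stored in a system with node capacities 100, 200 and 200 gives a ratio of 5 / 500 = 0.01.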
Further, based on the weighted hash value h′, a data location index DI is calculated by combining the ratio R_D of the data item size to the total storage capacity with a root taken over the product of all storage node capacities:
The above process integrates the data item size, the total storage capacity of the distributed storage system, and the weighted hash value to determine the optimal data storage location.
Finally, the final storage node F for the data item is determined by the data positioning index DI and the number of storage nodes N:
the above procedure ensures that data items will be distributed evenly and intelligently to specific storage nodes; and replacing S with the calculated final storage node F as final storage position information of the data item.
Based on the storage position information of the data item, a data consistency algorithm based on version control is used for ensuring consistency and integrity of the storage position information of the data item, and updating operation information of the data item is obtained; the specific implementation process is as follows:
The inputs are the storage location information F of the data items, the key value K of the data items, and a version vector V = [v_1, v_2, …, v_N] representing the current version information of all data items in the distributed storage system. First, based on the hash value h, the version number v_i of the data item is updated by introducing the time factor t:
where Δt is the time elapsed since the last update, and τ is a time constant that adjusts how strongly time influences the version update; the result is the updated version number of the data item;
Consistency verification is then carried out; a weighted global version consistency index GCI_w is calculated based on the version contribution w_i of each storage node:
Finally, the update operation information of the data item is generated. For data items that need to be updated, the update operation information includes, besides the storage position and the version number, the weight and the update time of the data item, which provides more context for subsequent operations:
where U_i is the update operation information of the data item, comprising the latest storage position, the version number, the version contribution of each storage node, and the time elapsed since the last update of the data item; F_i represents the latest storage location of the data item;
through the process, consistency and integrity of storage position information of the data item are ensured, and update operation information of the data item is obtained.
S2, performing incremental data backup and deduplication processing based on the update operation information of the data item to obtain backup operation confirmation information; meanwhile, based on the update operation information of the data item and the access request of the user to the version of the file, the version history of the file is managed and maintained;
Based on the update operation information of the data item, performing incremental data backup and deduplication processing by using an incremental data backup and deduplication algorithm to obtain backup operation confirmation information; the specific implementation process is as follows:
Incremental identification: first, the timestamp difference ΔT_j is calculated to determine whether the data item has been modified since the last backup:
$$\Delta T_j = T_j^{new} - T_j^{old}$$
where T_j^new is the latest modification timestamp of the j-th data item, and T_j^old is the modification timestamp of the j-th data item at its last backup.
The incremental version difference is then adjusted by the modification frequency ζ_j, so that it more accurately reflects the urgency of the data change and the priority of the backup:
where ζ_j is the modification frequency of the j-th data item, i.e. the number of times the data item is modified per unit time; the remaining quantities are the adjusted incremental version difference, the latest version number, the version number at the last backup, and the unadjusted incremental version difference.
For the j-th data item, if the adjusted incremental version difference is greater than 0, the data item is considered to have been updated since the last backup and needs to be backed up.
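By way of illustration only, incremental identification can be sketched as follows. The timestamp difference follows the formula given above; the adjustment of the version difference is not reproduced in the text, so scaling the raw difference by ζ_j is an assumption, as are the names `ItemState`, `adjusted_version_diff`, and `needs_backup`.

```python
from dataclasses import dataclass


@dataclass
class ItemState:
    key: str
    t_new: float  # latest modification timestamp
    t_old: float  # modification timestamp at the last backup
    v_new: int    # latest version number
    v_old: int    # version number at the last backup
    zeta: float   # modification frequency (modifications per unit time)


def timestamp_diff(item: ItemState) -> float:
    # Delta T_j = T_j^new - T_j^old, as stated in the text.
    return item.t_new - item.t_old


def adjusted_version_diff(item: ItemState) -> float:
    # ASSUMPTION: the raw version difference is scaled by zeta_j.
    return item.zeta * (item.v_new - item.v_old)


def needs_backup(item: ItemState) -> bool:
    # The timestamp check acts as a cheap pre-filter (a choice made for this
    # sketch); the decision rule from the text is "adjusted difference > 0".
    return timestamp_diff(item) > 0 and adjusted_version_diff(item) > 0


items = [
    ItemState("a", t_new=20.0, t_old=10.0, v_new=3, v_old=1, zeta=2.0),
    ItemState("b", t_new=10.0, t_old=10.0, v_new=1, v_old=1, zeta=5.0),
]
print([i.key for i in items if needs_backup(i)])
```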
Data deduplication: a deduplication hash value is calculated for each data item to be backed up using a hash function; duplicate data items are identified by comparing the hash values, and the backup operation is executed only on non-duplicate data items.
Finally, a backup operation is performed for each identified, non-duplicate data item, and the storage efficiency indicator SE of the backup data is calculated:
where B is the set of backed-up data items; SE denotes the proportion of storage space saved by deduplication; Data_j is the content of the j-th data item; Size(Data_j) is the size of the j-th data item; and M is the total number of data items.
Through the above processing, the backup operation confirmation information is obtained: a list of the backed-up data items, a backup timestamp, and the storage efficiency indicator SE.
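By way of illustration only, deduplication and the storage efficiency indicator can be sketched as follows. SHA-256 is chosen here as the hash function, and SE is taken to be 1 − Σ_{j∈B} Size(Data_j) / Σ_{j=1..M} Size(Data_j), which matches the description "proportion of storage space saved" but is an assumption, since the formula is not reproduced in the text. The function name and the dictionary layout are hypothetical.

```python
import hashlib
import time


def dedup_and_backup(items: dict[str, bytes]) -> dict:
    """Back up only the first item seen for each content hash."""
    seen: set[str] = set()
    backed_up: list[str] = []
    for key, data in items.items():
        digest = hashlib.sha256(data).hexdigest()  # deduplication hash value
        if digest in seen:
            continue  # duplicate content, skip the backup
        seen.add(digest)
        backed_up.append(key)

    total = sum(len(d) for d in items.values())      # all M data items
    stored = sum(len(items[k]) for k in backed_up)   # items in the set B
    se = 1 - stored / total if total else 0.0        # ASSUMED form of SE
    # Backup operation confirmation information: list, timestamp, SE.
    return {"backed_up": backed_up, "timestamp": time.time(), "SE": se}


info = dedup_and_backup({"a": b"x" * 100, "b": b"x" * 100, "c": b"y" * 50})
print(info["backed_up"], round(info["SE"], 2))
```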
Meanwhile, based on the update operation information of the data item and the access request of the user to the specific file version, the file version control and query algorithm is used for managing and maintaining the version history record of the file, so that the management of the file version is realized; the specific implementation process is as follows:
The inputs are the update operation information of the data items, which includes the key value K, the latest version number v_new, and the updated content of each data item; and the user's access request for a specific file version, which includes the key value of the file and the requested version number.
First, the version identifier is updated. For each data item update operation, a version update identification function Λ_u is defined for updating the version number and the content of the file:
$$\Lambda_u(K, v_{new}, Content) = (H(K),\ v_{new},\ H(Content))$$
where H(·) is a hash function that generates hash values of the file key value and of the content, ensuring their uniqueness and security while the version number is updated; Content denotes the content of the data item.
Next, the version history record is updated. A version history matrix M_v is maintained for each file, containing the hash values of all versions and the corresponding version numbers; when a new version is created, the new version information is added to M_v:
where the operator denotes the matrix expansion operation, which appends the new version information as a row of the version history matrix; accumulating each version update record in this way forms a complete version history.
Then the file version is queried and recovered. When a user requests access to a particular version of a file, the version query function Θ_q is used:
Based on the user request, the file content of the specified version is looked up in the version history matrix to realize version control and recovery; if the corresponding version record is found, the content of the requested version is returned; if not, a prompt that the version does not exist is returned.
Finally, two kinds of output are obtained. For a version update operation: confirmation information comprising the latest version number of the file and the update state of the version history matrix. For a file version access request: the file content of the requested version, or a prompt that the version does not exist.
S3, adjusting the storage position and the backup strategy of the data item based on the storage position information, the updating operation information, the backup operation confirmation information and the historical record and mode of the data access request of the data item; generating a file backup operation instruction according to the storage position of the data item and the adjustment result of the backup strategy; based on the backup operation confirmation information and the data recovery request, recovering data from the backup when the distributed storage system fails or the data is lost;
Based on the storage position information, the updating operation information and the backup operation confirmation information of the data items and the history record and mode of the data access request, the storage position and the backup strategy of the data items are intelligently adjusted by using a data storage and backup strategy optimization algorithm, the use of storage resources is dynamically optimized, the high availability and the quick recovery of key data are ensured, the data access and backup strategy adjustment operation is obtained, and the backup of files is realized; the specific implementation process is as follows:
Time series analysis is applied to the storage location information, the update operation information, and the backup operation confirmation information of the data items, together with the history and patterns of data access requests, to identify the access and update pattern of each data item. To capture these patterns at a finer granularity, time periods d are introduced, and the access frequency of each data item within each time period is calculated:
where d is the index of the time period; T is the length of each time period; P_d(K_j) is the access frequency of the data item K_j within the time window T; A(K_j, t) is the number of access and update activities at time point t; and α is an attenuation factor used to weight the most recent activity.
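By way of illustration only, a per-period access frequency with attenuation can be sketched as follows. The formula is not reproduced in the text, so the form P_d(K_j) = (1/T) · Σ_t A(K_j, t) · α^(t_end − t), with period d covering [d·T, (d+1)·T), is an assumption; with 0 < α ≤ 1 it gives activity near the end of the period more weight.

```python
def access_frequency(events: list[tuple[float, int]], d: int, T: float, alpha: float) -> float:
    """events holds (t, A(K_j, t)) pairs for one data item K_j.

    ASSUMED form: activity inside period d is weighted by alpha ** (age),
    where age is the distance from the end of the period, then divided by T.
    """
    start, end = d * T, (d + 1) * T
    weighted = 0.0
    for t, count in events:
        if start <= t < end:                  # only activity inside period d
            weighted += count * alpha ** (end - t)
    return weighted / T


events = [(0.0, 4), (9.0, 4), (15.0, 100)]
print(access_frequency(events, d=0, T=10.0, alpha=0.5))
```

With α = 1 the weighting disappears and the function reduces to a plain activity count per unit time.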
The data items are classified according to P_d(K_j), and the storage and backup priority Ω(K_j) of each data item is calculated by combining the access frequency with the content importance score Imp(K_j) and the size size(K_j) of the data item:
$$\Omega(K_j) = \lambda \cdot P_d(K_j) + \mu \cdot Imp(K_j) + \gamma \cdot size(K_j)^{-1}$$
where λ, μ, and γ are adjustment factors that balance the effects of access frequency, content importance, and data item size; Ω(K_j) is the storage and backup priority of the data item K_j.
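By way of illustration only, the priority formula above translates directly into code. The function names, the default factor values, and the tuple layout used by `rank` are assumptions.

```python
def priority(p_d: float, imp: float, size: float,
             lam: float = 1.0, mu: float = 1.0, gamma: float = 1.0) -> float:
    # Omega(K_j) = lambda * P_d(K_j) + mu * Imp(K_j) + gamma * size(K_j) ** -1
    return lam * p_d + mu * imp + gamma / size


def rank(items: dict[str, tuple[float, float, float]]) -> list[str]:
    """Order keys by descending priority; values are (P_d, Imp, size)."""
    return sorted(items, key=lambda k: priority(*items[k]), reverse=True)


print(rank({"a": (0.1, 0.2, 10.0), "b": (0.9, 0.9, 1.0)}))
```

Because the size term enters as an inverse, a smaller data item receives a higher priority than a larger one with the same access frequency and importance.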
According to the calculated storage and backup priority Ω(K_j), the storage location and the backup strategy of the data item are adjusted dynamically using existing data migration and backup frequency adjustment strategies. Operations such as migrating data and changing the backup frequency optimize the use of storage resources and improve the response speed and the data recovery capability of the system.
Next, dynamic resource optimization is performed: a load balancing index LBI and a storage efficiency factor SEF are introduced to evaluate the degree of load balancing and the storage resource utilization of the current storage configuration.
LBI measures how uniformly the storage load is distributed, and SEF evaluates the overall utilization efficiency of the storage resources.
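By way of illustration only, the two indicators can be sketched as follows. The text does not define LBI or SEF, so both definitions here are assumptions: LBI is taken as one minus the coefficient of variation of per-node utilization (1.0 means perfectly even load), and SEF as used capacity divided by total capacity.

```python
import statistics


def load_balance_index(used: list[float], capacity: list[float]) -> float:
    """ASSUMED definition: 1 - (population std / mean) of node utilization."""
    util = [u / c for u, c in zip(used, capacity)]
    mean = statistics.fmean(util)
    if mean == 0:
        return 1.0  # no load anywhere counts as perfectly balanced
    return 1.0 - statistics.pstdev(util) / mean


def storage_efficiency_factor(used: list[float], capacity: list[float]) -> float:
    """ASSUMED definition: fraction of total capacity currently in use."""
    return sum(used) / sum(capacity)


print(load_balance_index([90, 10], [100, 100]),
      storage_efficiency_factor([90, 10], [100, 100]))
```

A scheduler could trigger data migration when LBI falls below a chosen threshold; the threshold itself is a deployment decision and is not specified in the text.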
Based on the evaluation results of LBI and SEF, storage resources are dynamically reassigned using existing resource scheduling and load balancing algorithms and capacity adjustment optimization algorithms. Dynamic reassignment includes data migration and adjusting the capacity assignment of storage nodes; its aim is to achieve better load balance and higher storage efficiency, so that the distributed storage system can process key data efficiently.
Finally, according to the adjustment results for the storage location and the backup strategy of the data item, specific operation instructions are generated to carry out the file backup.
Based on the backup operation confirmation information and a data recovery request initiated by a user or a distributed storage system, recovering data from the backup when the distributed storage system fails or the data is lost by using a data recovery algorithm, ensuring the continuity and the integrity of the data, and realizing the data recovery operation; the specific implementation process is as follows:
First, all backup operation confirmation information is sorted and indexed to create a backup information index table for quick retrieval. The index table contains backup records ordered by backup timestamp and version number, so that the subsequent backup version locating can be executed efficiently.
In backup version locating, for each restore request, the backup version closest to the requested version v_restore or the requested timestamp T_restore is located:
or
where the two selected values are, respectively, the backup version number and the backup timestamp closest to the request, chosen from among the recorded backup version numbers and backup timestamps.
Based on the selected backup version number or timestamp, the corresponding backup data Data_restore is retrieved:
where the retrieval function returns the data content to be restored, given the backup information and the selected backup version number or timestamp.
Integrity verification is then performed on the retrieved backup data Data_restore: the actual hash value of the restored data is compared with the expected hash value recorded at backup time, which ensures that the data has not been tampered with or damaged and guarantees its consistency and security.
Finally, the restored data and the integrity verification result are processed using an existing result feedback technique to produce the final restore operation feedback. If the integrity verification succeeds, the restored data is provided to the requester; if it fails, error information and, where possible, resolution advice are provided to help the user or system administrator take follow-up action.
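By way of illustration only, the hash comparison and the restore feedback can be sketched as follows. SHA-256 is chosen here as the hash function; the function name, the shape of the feedback dictionary, and the wording of the advice are assumptions.

```python
import hashlib


def verify_and_report(data: bytes, expected_sha256: str) -> dict:
    """Compare the actual hash of restored data with the hash recorded at backup."""
    actual = hashlib.sha256(data).hexdigest()
    if actual == expected_sha256:
        # Verification succeeded: hand the restored data to the requester.
        return {"ok": True, "data": data}
    # Verification failed: return error information and resolution advice.
    return {
        "ok": False,
        "error": f"hash mismatch: expected {expected_sha256}, got {actual}",
        "advice": "retry from another backup version or another replica",
    }


recorded = hashlib.sha256(b"payload").hexdigest()
print(verify_and_report(b"payload", recorded)["ok"],
      verify_and_report(b"tampered", recorded)["ok"])
```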
In summary, this completes the description of the distributed storage system based on cloud computing and the file backup method thereof.
The sequence of the embodiments of the invention is merely for description and does not represent the advantages or disadvantages of the embodiments. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.
Claims (8)
1. A cloud computing-based distributed storage system, comprising:
The system comprises a distributed hash storage module, a data consistency and version control module, a data backup optimization module, a dynamic data access scheduling module, a file version management module and a disaster recovery and data recovery module;
The distributed hash storage module is used for taking the data to be stored as data items, and carrying out distributed processing on storage nodes of each data item by using an intelligent distribution algorithm based on hash to obtain storage position information of the data item; the storage position information of the data item is sent to a data consistency and version control module and a dynamic data access scheduling module; the specific implementation process of the intelligent distribution algorithm based on the hash is as follows:
Firstly, applying a hash function to the key value of each data item to calculate a hash value h; introducing a load factor determined by the current stored data amount and the maximum capacity of the storage node based on the data item size and the storage node load;
Next, the storage node is determined: the number of storage nodes in the distributed storage system is denoted N, node weights are introduced, and a weighted hash value h′ is calculated to determine the storage node S of the data item;
$$h' = (h \cdot L_{avg}) \bmod N$$
wherein L_avg is the average of the load factors of all storage nodes; w_i is the weight of storage node i; mod is the modulo operation; and ⌊·⌋ denotes rounding down;
finally, the calculated storage node S is used as the storage location information of the data item, and the storage location information of the data item is sent to the data consistency and version control module;
the data consistency and version control module is used for obtaining update operation information of the data item by using a data consistency algorithm based on version control based on storage position information of the data item; the specific implementation process is as follows:
Based on the storage location information F of the data items, the key value K of the data items, and a version vector V = [v_1, v_2, …, v_N] representing the current version information of all data items in the distributed storage system; first, based on the hash value h, the version number v_i of the data item is updated by introducing the time factor t:
wherein Δt is the time elapsed since the last update and τ is a time constant; the result is the updated version number of the data item;
consistency verification is performed; a weighted global version consistency index GCI_w is calculated based on the version contribution w_i of each storage node:
finally, the update operation information U_i of the data item is generated:
wherein F_i represents the latest storage location of the data item; the update operation information of the data item is sent to the data backup optimization module, the dynamic data access scheduling module and the file version management module;
The data backup optimization module is used for implementing incremental backup and data deduplication technology based on the update operation information of the data item to obtain backup operation confirmation information; the backup operation confirmation information is sent to the dynamic data access scheduling module and the disaster recovery and data recovery module;
The dynamic data access scheduling module is used for changing the storage position and the backup strategy of the data item based on the storage position information and the updating operation information of the data item, the backup operation confirmation information and the history record and the mode of the data access request to obtain the results of the data access and backup strategy adjustment operation;
the file version management module is used for managing and maintaining the version history record of the file based on the update operation information of the data item and the access request of the user to the file version;
and the disaster recovery and data recovery module is used for recovering data from the backup when the distributed storage system fails or the data is lost based on the backup operation confirmation information and the data recovery request.
2. The file backup method of the distributed storage system based on cloud computing, which is applied to the distributed storage system based on cloud computing as claimed in claim 1, is characterized by comprising the following steps:
S1, acquiring a data storage request, and taking the data to be stored as a data item; performing distribution processing on storage nodes of the data items by using an intelligent distribution algorithm based on hashing to obtain storage position information of the data items; further using a data consistency algorithm based on version control to obtain updating operation information of the data item;
S2, performing incremental data backup and deduplication processing based on the update operation information of the data item to obtain backup operation confirmation information; meanwhile, based on the update operation information of the data item and the access request of the user to the version of the file, the version history of the file is managed and maintained;
S3, adjusting the storage position and the backup strategy of the data item based on the storage position information, the updating operation information, the backup operation confirmation information and the historical record and mode of the data access request of the data item; generating a file backup operation instruction according to the storage position of the data item and the adjustment result of the backup strategy; based on the backup operation confirmation information and the data recovery request, when the distributed storage system fails or the data is lost, the data is recovered from the backup.
3. The method for backing up files in a distributed storage system based on cloud computing according to claim 2, wherein S1 specifically comprises:
In the implementation process of the intelligent distribution algorithm based on the hash, an intelligent data positioning algorithm is introduced to optimize the storage position information of the data item, and the specific implementation process is as follows:
firstly, calculating the proportion of the size of a data item relative to the total storage capacity;
further, based on the weighted hash value h′, a data location index DI is calculated by combining h′ with a root of the product of R_D (the ratio of the data item size to the total storage capacity) and the capacities of all storage nodes:
wherein c_i represents the capacity of the i-th storage node; N represents the number of storage nodes;
finally, the final storage node F of the data item is determined from the data location index DI and the number of storage nodes N:
4. The method for backing up files in a distributed storage system based on cloud computing according to claim 2, wherein S2 specifically comprises:
Based on the update operation information of the data items, performing incremental data backup and deduplication processing by using an incremental data backup and deduplication algorithm to obtain backup operation confirmation information, wherein the backup operation confirmation information comprises a list of backed-up data items, a backup time stamp and a storage efficiency index SE; the specific implementation process is as follows:
incremental identification: the timestamp difference ΔT_j is calculated:
$$\Delta T_j = T_j^{new} - T_j^{old}$$
wherein T_j^new is the latest modification timestamp of the j-th data item, and T_j^old is the modification timestamp of the j-th data item at its last backup;
the incremental version difference is adjusted by the modification frequency:
wherein ζ_j is the modification frequency of the j-th data item; the remaining quantities are the adjusted incremental version difference, the latest version number, and the version number at the last backup;
for the j-th data item, when the adjusted incremental version difference is greater than 0, the data item is considered to have been updated since the last backup and needs to be backed up;
data deduplication: a deduplication hash value is calculated for each data item to be backed up using a hash function, duplicate data items are identified by comparing the hash values, and the backup operation is executed only on non-duplicate data items;
finally, a backup operation is performed for each identified, non-duplicate data item, and the storage efficiency indicator SE of the backup data is calculated:
wherein B is the set of backed-up data items; Data_j is the content of the j-th data item; Size(Data_j) is the size of the j-th data item; M represents the total number of data items.
5. The method for backing up files in a distributed storage system based on cloud computing as recited in claim 2, further comprising, at S2:
In the implementation process of managing and maintaining the version history record of the file, introducing a file version control and query algorithm; the specific implementation process is as follows:
the inputs are the update operation information of the data items, which includes the key value K, the latest version number v_new and the updated content of each data item; and the user's access request for a specific file version, which includes the key value of the file and the requested version number;
first, the version identifier is updated; for each data item update operation, a version update identification function Λ_u is defined:
$$\Lambda_u(K, v_{new}, Content) = (H(K),\ v_{new},\ H(Content))$$
wherein H(·) is a hash function for generating hash values of the file key value and the content; Content represents the content of the data item;
the version history record is updated; a version history matrix M_v is maintained for each file, including the hash values of all versions and the corresponding version numbers; when a new version is created, the new version information is added to M_v:
wherein the operator denotes the matrix expansion operation, which appends the new version information as a row of the version history matrix;
the file version is queried and recovered; when a user requests access to a particular version of a file, the version query function Θ_q is used:
the file content of the specified version is looked up in the version history matrix based on the user request, and version control and recovery are performed; when the corresponding version record is found, the content of the requested version is returned; when the corresponding version record is not found, a prompt that the version does not exist is returned;
finally, for a version update operation, confirmation information comprising the latest version number of the file and the update state of the version history matrix is obtained; for a file version access request, the file content of the requested version, or a prompt that the version does not exist, is obtained.
6. The method for backing up files in a distributed storage system based on cloud computing according to claim 2, wherein S3 specifically comprises:
Intelligently changing the storage position and the backup strategy of the data item by using a data storage and backup strategy optimization algorithm to obtain the result of data access and backup strategy adjustment operation; in the implementation process of the data storage and backup strategy optimization algorithm, firstly, the access frequency of each data item in each time period is calculated; further, based on the obtained access frequency, calculating the storage and backup priority of each data item in combination with the content importance score and the size of the data item; further, according to the calculated storage and backup priorities, dynamically adjusting the storage positions and the backup strategies of the data items; and finally, carrying out dynamic optimization processing on the storage resources.
7. The method for backing up files in a cloud computing-based distributed storage system of claim 6, further comprising, at S3:
in the process of dynamically optimizing the storage resources, introducing a load balance index and a storage efficiency factor to evaluate the load balance degree and the storage resource use efficiency of the current storage configuration; based on the evaluation result, dynamic reassignment of storage resources is performed.
8. The method for backing up files in a distributed storage system based on cloud computing as recited in claim 2, further comprising, at S3:
Based on the backup operation confirmation information and a data recovery request initiated by a user or the distributed storage system, recovering data from the backup when the distributed storage system fails or the data is lost by using a data recovery algorithm; the specific implementation process is as follows:
sorting and indexing all the backup operation confirmation information, and creating a backup information index table; the index table comprises backup records ordered according to the time stamp and version number of the backup;
in backup version locating, for each restore request, the backup version closest to the requested version v_restore or the requested timestamp T_restore is located:
or
wherein the two selected values are, respectively, the backup version number and the backup timestamp closest to the request, chosen from among the recorded backup version numbers and backup timestamps;
based on the selected backup version number or timestamp, the corresponding backup data Data_restore is retrieved:
wherein the retrieval function returns the data content to be restored, given the backup information and the selected backup version number or timestamp;
integrity verification is performed on the retrieved backup data Data_restore by comparing the actual hash value of the restored data with the expected hash value recorded during backup;
Finally, processing the recovered data and the integrity verification result to obtain final recovery operation feedback; when the integrity verification is successful, providing the recovered data to the requester; when verification fails, error information and resolution advice are provided.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410389633.XA CN117971789B (en) | 2024-04-02 | 2024-04-02 | Distributed storage system based on cloud computing and file backup method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410389633.XA CN117971789B (en) | 2024-04-02 | 2024-04-02 | Distributed storage system based on cloud computing and file backup method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117971789A CN117971789A (en) | 2024-05-03 |
CN117971789B true CN117971789B (en) | 2024-07-05 |
Family
ID=90849953
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410389633.XA Active CN117971789B (en) | 2024-04-02 | 2024-04-02 | Distributed storage system based on cloud computing and file backup method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117971789B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118590503B (en) * | 2024-08-05 | 2024-10-18 | 浙江云针信息科技有限公司 | Multi-source data synchronous processing system and method |
CN118760551B (en) * | 2024-09-06 | 2024-11-05 | 深圳市欣茂鑫实业有限公司 | Data storage management method and system for digital twin production line |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117369971A (en) * | 2023-10-16 | 2024-01-09 | 南方电网数字企业科技(广东)有限公司 | Innovative business platform service data processing system based on cloud computing |
CN117454414A (en) * | 2023-09-26 | 2024-01-26 | 上海应用技术大学 | Dynamic searchable encryption method and system based on distributed storage |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8977598B2 (en) * | 2012-12-21 | 2015-03-10 | Zetta Inc. | Systems and methods for on-line backup and disaster recovery with local copy |
US20170262345A1 (en) * | 2016-03-12 | 2017-09-14 | Jenlong Wang | Backup, Archive and Disaster Recovery Solution with Distributed Storage over Multiple Clouds |
CN117608527A (en) * | 2023-10-30 | 2024-02-27 | 中冶焦耐(大连)工程技术有限公司 | File management system |
Also Published As
Publication number | Publication date |
---|---|
CN117971789A (en) | 2024-05-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |