CN104331478B

CN104331478B - Data consistency management method for self-compaction storage system

Info

Publication number: CN104331478B
Application number: CN201410614846.4A
Authority: CN
Inventors: 马春
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-11-05
Filing date: 2014-11-05
Publication date: 2017-09-22
Anticipated expiration: 2034-11-05
Also published as: CN104331478A

Abstract

The invention provides a data consistency management method for a self-compaction (thin-provisioned) storage system. It belongs to the technical field of thin provisioning and designs both a metadata structure for data block management and an implementation scheme for the metadata storage structure. For the metadata structure that manages data blocks, an improved B+Tree structure is designed; together with metadata such as the superblock, the metadata bitmap and the data bitmap, it implements thin-provisioned management of data blocks. Starting from the original B+Tree data structure, the space of each non-leaf node is doubled and divided into an active domain and an inactive domain, so that essentially no additional disk space has to be allocated while the B+Tree is modified, which reduces the complexity of metadata modification operations. The additionally allocated inactive-domain space can also serve as a metadata copy or as a record of historical operations, which reduces the cost of copy maintenance or log maintenance in the storage system.

Description

A data consistency management method for a thin-provisioned storage system

Technical field

The present invention relates to the field of thin provisioning, and specifically to a data consistency management method for a thin-provisioned storage system.

Background technology

The volume of data generated on the internet is growing explosively, which places higher demands on the capacity and performance of storage systems. Existing storage systems suffer from low disk utilization and wasted storage resources, and thin provisioning technology has emerged in recent years in response.

Thin provisioning uses an "allocate-on-write" strategy: by allocating storage resources on demand, it improves disk space utilization and storage system performance while reducing deployment cost and saving resources. "Allocate-on-write" means that storage space is allocated from the thin-provisioned storage pool only when data is written to a thin-provisioned logical volume. In practice, the space of the thin-provisioned storage pool is divided into equal-sized data blocks that are organized and managed with structures such as a B+Tree, covering operations such as allocating, reclaiming and looking up data blocks. The thin-provisioned storage pool is divided into a data area and a metadata area. The data area stores data. The metadata area contains the storage pool superblock, the metadata bitmap, the data bitmap, logical volume information and so on; it organizes and manages the thin-provisioned storage pool and is therefore critical. Once the metadata is corrupted or becomes inconsistent, user data is lost and the whole storage system may even crash. Moreover, during normal operation the metadata is held in memory and periodically flushed to disk; when an exception occurs, for example a hardware fault such as a controller failure, controller power loss, RAID failure or RAID power loss, the flush may fail and leave the metadata in error. How to guarantee metadata consistency is therefore a key concern for storage systems and for thin provisioning.

In implementations of thin provisioning, the B+Tree data structure is most often used to manage data blocks. To guarantee the consistency of the metadata B+Tree, one possible management method is as follows: when a modification is to be made to the B+Tree, another B+Tree identical to it is created, and the operation is carried out on the newly created tree. When the whole operation is complete, the pointer to the root node of the original B+Tree is redirected to the root node of the new B+Tree and the space of the original B+Tree is released, which achieves the metadata modification. The advantage of this method is that it guarantees the consistency of the metadata B+Tree: at every stage of the modification, the metadata stored on disk remains consistent, which guards well against inconsistencies caused by hardware faults such as a controller failure. Its drawback is equally clear: every modification of the metadata B+Tree requires rebuilding a B+Tree of the same size, and space must be allocated for every node of the tree during the rebuild. In addition, to ensure metadata availability, a copy is also stored elsewhere in the same RAID as the metadata B+Tree. The method is therefore expensive in both time and space.

Another metadata management method dedicates a single RAID in the storage system to storing metadata. This easily makes that RAID a data-access "hot spot" and a single point of failure in the system. One remedy is to spread the metadata over several RAIDs, but metadata access can still be severely affected if a controller in the system fails.

Summary of the invention

The present invention provides a data consistency management method for a thin-provisioned storage system that addresses the shortcomings of both methods above: it guarantees metadata consistency while also reducing the time and space complexity of metadata operations.

The present invention designs a metadata structure for data block management and an implementation scheme for the metadata storage structure. For the metadata structure that manages data blocks, an improved B+Tree structure is designed; together with metadata such as the superblock, the metadata bitmap and the data bitmap, it implements thin-provisioned management of data blocks. Starting from the original B+Tree data structure, the space of each non-leaf node is doubled and divided into two parts, an active domain and an inactive domain, so that essentially no additional disk space is allocated while the B+Tree is modified, which reduces the complexity of metadata modification operations. The additionally allocated inactive-domain space can also serve as a metadata copy or as a record of historical operations, which reduces the cost of copy maintenance or log maintenance in the storage system.

A data consistency management method for a thin-provisioned storage system, characterized by the following:

S1: In the B+Tree that organizes the metadata, the space of each non-leaf node is increased. Starting from the original B+Tree data structure, the space of each non-leaf node is doubled and divided into two parts, an active domain and an inactive domain. The active domain stores the data of the mapping B+Tree node, i.e. the (key, value) pairs. Depending on the strategy in use, the inactive domain can store either a copy of the active-domain data or the data as it was before the most recent modification of the node. A node is modified in its inactive domain, and once the modification is complete the active domain and the inactive domain are swapped. The start address of each non-leaf node is aligned to the node size when the node is allocated; for example, if the node size is 8 KB, with the active domain and the inactive domain taking 4 KB each, the node start address is 8 KB aligned. As a result, no additional storage space is allocated while metadata is modified, which reduces the complexity of metadata modification operations.

Modifying the metadata involves three kinds of operation: adding a data block mapping, deleting a data block mapping and modifying a data block mapping. Each operation proceeds as follows:

A. Adding a data block mapping

1. Search the mapping B+Tree for the parent node N of the newly added data block;

2. Copy the active-domain data of N to its inactive domain;

3. Modify the inactive domain of N: add the key and the index pointer, point the pointer at the newly added node, and insert the new node into N;

4. Determine whether N needs to be split. If not, go to step 7; if N needs to be split, go to step 5;

5. Split N to obtain node N' and node N''; after the split, the parent node of the original node N will point to N';

5.1. Search the metadata bitmap B+Tree to find a free metadata block;

5.2. Allocate and initialize the new node N'', and update the metadata bitmap B+Tree;

5.3. Calculate the metadata that each of the nodes N' and N'' contains after the split, i.e. the range of (key, value) pairs;

5.4. According to the result of the calculation, copy one part of the data in the active domain of the node N being split to its inactive domain, and copy the other part to the active domain of the newly allocated node N''; node N is now called N';

6. Go to step 2 and insert node N'' into the parent node M of the node that was split;

7. Point the pointer in the parent node of each modified node to the inactive domain of that modified node;

8. Update the bit corresponding to the newly added data block in the data bitmap B+Tree, setting it to used;

9. Update other metadata such as the superblock, and change the size of logical device objects such as the storage pool and the logical volume;

10. The operation is complete.
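
The path through the steps above that needs no split (steps 1 to 3 and step 7) can be sketched in Python as follows. This is only an illustrative in-memory model, not the patented on-disk layout: the `Node` and `Ref` classes and the `add_mapping` function are hypothetical names, and a dictionary stands in for the (key, value) pairs of a domain.

```python
class Node:
    """Hypothetical non-leaf node with two equally sized domains (index 0 = A, 1 = B)."""
    def __init__(self, entries=None):
        self.domains = [dict(entries or {}), {}]


class Ref:
    """Hypothetical parent-side pointer: the node plus which of its domains is active."""
    def __init__(self, node, active=0):
        self.node = node
        self.active = active


def add_mapping(ref, key, block_addr):
    """Add (key, block_addr) to the node behind ref without touching its active domain."""
    node = ref.node
    inactive = 1 - ref.active
    # Step 2: copy the active-domain data to the inactive domain.
    node.domains[inactive] = dict(node.domains[ref.active])
    # Step 3: modify only the inactive domain.
    node.domains[inactive][key] = block_addr
    # Step 7: redirect the parent's pointer; the modified domain becomes active.
    ref.active = inactive


n = Node({1: 100})
r = Ref(n)
add_mapping(r, 2, 200)
print(n.domains[r.active])       # entries visible through the parent pointer
print(n.domains[1 - r.active])   # previous version, still intact
```

Until the last assignment in `add_mapping` runs, readers that follow `ref` still see the old, consistent domain, which is the property the method relies on if a fault interrupts the modification.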

B. Deleting a data block mapping

1. Search the mapping B+Tree for the parent node N of the node to be deleted;

2. Copy the active-domain data of N to its inactive domain;

3. Modify the inactive domain of N: delete the key and the index pointer;

4. Determine whether N needs to be merged with another node. If not, go to step 7; if a merge is needed, go to step 5;

5. Find the node N' to merge with and perform the merge. After the merge, the parent node of N will point to the new merged node M;

5.1. Examine the nodes N and N' to be merged and determine the metadata that the merged node will contain;

5.2. According to the result of the calculation, copy the data of nodes N and N' to the inactive domain of node N'; the merged node is now called node M;

6. Go to step 2 and delete the node N that was merged away;

7. Point the pointer in the parent node of each modified node to the inactive domain of that modified node;

8. Release the space of the deleted node;

9. Update the bit corresponding to the deleted data block in the data bitmap B+Tree, setting it to unused;

10. Update other metadata such as the superblock, and change the size of logical device objects such as the storage pool and the logical volume;

11. The operation is complete.
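
A rough Python sketch of the deletion and merge steps above is given below. It is a simplified in-memory model under stated assumptions: the `Node` class, `delete_mapping` and `merge` are hypothetical names, the active-domain index is kept on the node itself instead of in the parent's pointer, and releasing the merged-away node and updating the bitmaps (steps 8 to 10) are left out.

```python
class Node:
    """Hypothetical non-leaf node; `active` stands in for the parent's pointer."""
    def __init__(self, entries=None):
        self.domains = [dict(entries or {}), {}]
        self.active = 0

    def entries(self):
        return self.domains[self.active]


def delete_mapping(node, key):
    """Steps 2, 3 and 7: copy to the inactive domain, delete there, then swap."""
    inactive = 1 - node.active
    node.domains[inactive] = dict(node.entries())
    del node.domains[inactive][key]
    node.active = inactive


def merge(n, sibling):
    """Step 5.2: write the data of both nodes into the sibling's inactive domain,
    then swap so that the sibling becomes the merged node M."""
    inactive = 1 - sibling.active
    sibling.domains[inactive] = {**n.entries(), **sibling.entries()}
    sibling.active = inactive
    return sibling  # n can now be removed from its parent and released


a = Node({1: 10, 2: 20})
b = Node({5: 50})
delete_mapping(a, 1)
m = merge(a, b)
print(m.entries())
```

When a node counts as too empty and must be merged is not specified in the text, so the sketch leaves that decision to the caller.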

C. Modifying a data block mapping

1. Search the mapping B+Tree to determine the parent node N to which the data block to be modified belongs and the parent node N' after the mapping is modified;

2. Delete the data block mapping under node N;

3. Insert the data block under node N';

4. The operation is complete.
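
Modification is thus a deletion under N followed by an insertion under N'. A minimal Python sketch, assuming a hypothetical `Node` class whose `update` method applies a change in the inactive domain and then swaps the domains (the names `Node`, `update` and `remap` are illustrative and do not come from the text):

```python
class Node:
    """Hypothetical non-leaf node; `active` stands in for the parent's pointer."""
    def __init__(self, entries=None):
        self.domains = [dict(entries or {}), {}]
        self.active = 0

    def entries(self):
        return self.domains[self.active]

    def update(self, change):
        """Copy active to inactive, apply `change` there, then swap the domains."""
        inactive = 1 - self.active
        self.domains[inactive] = dict(self.entries())
        change(self.domains[inactive])
        self.active = inactive


def remap(old_parent, new_parent, old_key, new_key, block_addr):
    # Step 2: delete the mapping under node N.
    old_parent.update(lambda d: d.pop(old_key))
    # Step 3: insert the data block under node N'.
    new_parent.update(lambda d: d.update({new_key: block_addr}))


n = Node({1: 100})
n_prime = Node({10: 900})
remap(n, n_prime, old_key=1, new_key=11, block_addr=100)
print(n.entries(), n_prime.entries())
```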

S2: The metadata is stored in a distributed manner across the underlying storage units of the storage pool.

The metadata is distributed over every RAID in the storage pool and organized by means such as a B+Tree. This both improves metadata access performance and reduces the risk of metadata loss caused by hardware faults.

Because the metadata is stored in every RAID of the storage pool, and the storage pool is expanded and shrunk in units of RAIDs, the metadata space in each RAID is doubled and likewise divided into an active space and an inactive space. When the system is expanded or shrunk, only the inactive metadata space is changed, so normal access to the active space is not affected. Once the inactive metadata space has been adjusted, the active and inactive metadata spaces in each RAID are swapped and the new metadata is enabled, which completes the expansion or shrinking of the storage system. Finally, the data in the active space and the inactive space is synchronized and the metadata copies across RAIDs are established.

The beneficial effects of the invention are as follows:

1) It favors flushing metadata to disk: compared with the existing way of managing data blocks with a B+Tree, it reduces how scattered the metadata is on disk.

2) It reduces the space-allocation overhead of metadata modification: apart from allocating a new node when a node is split, no operation needs additional space.

3) It is flexible to use: the inactive domain of the non-leaf nodes of the mapping B+Tree can serve as a copy of the mapping B+Tree metadata, or as a record of historical operations to support rollback. When the inactive domain is used as a copy, after an operation on the mapping B+Tree finishes, the active-domain data of each node is first synchronized to its inactive domain. The node pointers in the inactive domains are then rebuilt so that each pointer points to the inactive domain of the child node. The inactive domains of all nodes then form an independent copy of the mapping B+Tree; if the active-domain data of any node other than the root node is corrupted, merely changing the pointer to the root node switches quickly to the inactive-domain copy, and the mapping B+Tree metadata can be accessed normally. Because the copy is stored scattered alongside the original, the time and space overhead of the extra disk accesses needed to keep the copy consistent is reduced. When the inactive domain is used as an operation history record, it holds the data from when it was last the active domain, which reduces the amount of data the system log must record; this lowers the time and space overhead of logging and also the complexity of reconstructing data for a rollback.

4) It protects metadata consistency. Because the active-domain data of each node is copied to the inactive domain before an operation on the mapping B+Tree, and the operation is performed in the inactive domain, the data in each node's active domain remains consistent even if a controller failure, RAID power loss or similar event occurs during the operation; only the unfinished modification in the inactive domain is affected. Moreover, even if the data of a single RAID is lost, it can be reconstructed from the cross copies stored in other RAIDs.

5) It improves metadata access performance. Because the metadata is distributed over all RAIDs in the system, the concurrency of multiple RAIDs is fully exploited, the IOPS of metadata access rises and the single-point metadata performance bottleneck is removed. Online expansion of the storage system is also supported, with seamless switching between old and new metadata after the system is expanded or shrunk.

This method remedies the complex metadata operations that existing thin-provisioned storage systems use to keep metadata safe, and reduces the system overhead and storage space fragmentation caused by repeatedly requesting and releasing disk space as metadata is added and deleted. It improves metadata access performance while guaranteeing metadata consistency. The highly distributed metadata storage method also avoids a single point of failure for metadata access.

Brief description of the drawings

Fig. 1 is a schematic diagram of the metadata structure.

Fig. 2 is a schematic diagram of the mapping B+Tree structure.

Fig. 3 shows step 1 of inserting a node.

Fig. 4 shows step 2 of inserting a node.

Fig. 5 shows step 3 of inserting a node.

Fig. 6 shows step 1 of splitting a node.

Fig. 7 shows step 2 of splitting a node.

Fig. 8 shows step 3 of splitting a node.

Fig. 9 shows step 4 of splitting a node.

Fig. 10 shows step 5 of splitting a node.

Fig. 11 is a schematic diagram of the metadata storage structure.

Detailed description of the embodiments

With reference to the accompanying drawings, the insert-node and split-node operations on the mapping B+Tree are taken as examples to explain how the mapping B+Tree is operated on when data block mappings are added, deleted and modified in the present invention. The delete-node and merge-node operations are the inverse of the insert-node and split-node operations respectively and are not described again here. The metadata storage structure in each RAID and the operations on metadata during expansion and shrinking of the storage system are also explained.

Fig. 1 is a schematic diagram of the mapping B+Tree data structure. Each non-leaf node contains an active domain and an inactive domain of equal size that are adjacent in the address space, and the start address of a non-leaf node is aligned to the node size. A leaf node is a pointer to a data block. Which domain of a non-leaf node is active and which is inactive is determined by the pointer in the parent node that points to the current node. The two address-adjacent spaces in a node are called the A domain and the B domain, where the start address of the A domain is the start address of the node. Because the address of a non-leaf node is aligned to the node size, if the address stored in the parent node's pointer to the current node is the start address of the A domain, which is also the start address of the current node, then the A domain is the active domain and the B domain is the inactive domain. Conversely, if the address stored in the parent node's pointer to the current node is the start address of the B domain, which cannot be aligned to the node size, then the A domain is the inactive domain and the B domain is the active domain.
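
The alignment rule described above can be sketched in Python. The function name `locate_domains` and the returned field names are illustrative assumptions; the 8 KB default is the example node size given earlier in the description.

```python
NODE_SIZE = 8 * 1024  # example node size from the description: two 4 KB domains


def locate_domains(child_ptr: int, node_size: int = NODE_SIZE) -> dict:
    """Given the address stored in the parent's pointer, decide which domain is active."""
    half = node_size // 2
    node_start = child_ptr - (child_ptr % node_size)  # round down to the node boundary
    offset = child_ptr - node_start
    if offset == 0:
        # Pointer is aligned to the node size: it addresses the A domain.
        active, inactive_addr = "A", node_start + half
    elif offset == half:
        # Pointer is not aligned to the node size: it addresses the B domain.
        active, inactive_addr = "B", node_start
    else:
        raise ValueError("pointer is not the start of the A domain or the B domain")
    return {"node_start": node_start, "active": active, "inactive_addr": inactive_addr}


print(locate_domains(0x4000))  # aligned to 8 KB
print(locate_domains(0x5000))  # 4 KB past an 8 KB boundary
```

Because the active domain is encoded in the pointer value itself, no extra flag has to be stored in the node, and swapping the two domains amounts to rewriting one pointer in the parent.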

Figs. 3 to 5 illustrate the process of inserting a node into the mapping B+Tree:

1) As shown in Fig. 3, first search the mapping B+Tree to determine the parent node of the newly added leaf node, and copy the data in that node's active domain to its inactive domain;

2) As shown in Fig. 4, modify the inactive domain of the node: add the index of the new leaf node and change the key values;

3) As shown in Fig. 5, modify the pointer to the current node in the parent node of the current node so that it points to the inactive domain of the current node. This swaps the active domain and the inactive domain and enables the new node metadata, and the insert-node operation is complete.

After a new node is inserted into some non-leaf node of the mapping B+Tree, the number of index values in that node may exceed the node limit of the mapping B+Tree data structure. The node must then be split into two new nodes, each of which stores part of the data of the original node. Figs. 6 to 10 illustrate the process of splitting a node of the mapping B+Tree:

1) As shown in Fig. 6, after a node is inserted, the index values of a non-leaf node in the mapping B+Tree reach the maximum and the node needs to be split;

2) As shown in Fig. 7, copy the active-domain data of the current node to its inactive domain;

3) As shown in Fig. 8, allocate and initialize a new non-leaf node, and move part of the data in the inactive domain of the current node to the active domain of the newly allocated node;

4) As shown in Fig. 9, insert the newly allocated node into the mapping B+Tree as a new node;

5) As shown in Fig. 10, modify the index pointers of the nodes involved so that they point to the inactive domains of the nodes that hold new metadata. This swaps the active domain and the inactive domain of each such node and enables the new node metadata, and the split-node operation is complete.
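
The split just described can be sketched in Python under simplifying assumptions: the `Node` class and `split` function are hypothetical, the active-domain index is kept on the node instead of in the parent's pointer, the split point is taken as the middle key (the text does not specify how the ranges are chosen), and inserting the new node into the parent is left to the caller.

```python
class Node:
    """Hypothetical non-leaf node; `active` stands in for the parent's pointer."""
    def __init__(self, entries=None):
        self.domains = [dict(entries or {}), {}]
        self.active = 0

    def entries(self):
        return self.domains[self.active]


def split(node):
    """Split `node` in place and return (node, new_node)."""
    inactive = 1 - node.active
    # Fig. 7: copy the active-domain data to the inactive domain.
    node.domains[inactive] = dict(node.entries())
    # Fig. 8: allocate a new node and move part of the inactive-domain data
    # into the new node's active domain (upper half of the keys, by assumption).
    keys = sorted(node.domains[inactive])
    upper = keys[len(keys) // 2:]
    new_node = Node({k: node.domains[inactive].pop(k) for k in upper})
    # Fig. 10: swap domains so the trimmed copy becomes the active one.
    node.active = inactive
    return node, new_node


n = Node({1: "a", 2: "b", 3: "c", 4: "d"})
left, right = split(n)
print(left.entries(), right.entries())
```

The original active domain is never written during the split, so it still holds all four entries afterwards and can serve as the previous version of the node.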

Fig. 11 is a schematic diagram of the actual metadata storage structure. One copy of the metadata superblock is stored in every RAID in the system. The superblock holds the root node of the data block mapping B+Tree, the root node of the metadata bitmap B+Tree and the root node of the data bitmap B+Tree, together with other metadata such as the UUIDs, names, object indexes and attribute information of the devices in the system. The data block mapping B+Tree, the metadata bitmap B+Tree and the data bitmap B+Tree are not stored as one identical copy in every RAID; instead, the data blocks that make up each B+Tree are spread over all RAIDs according to a load-balancing strategy. The inactive metadata space in each RAID stores a copy of the metadata in the active metadata space of that RAID. In addition, according to the distribution strategy, two copies of the data block mapping B+Tree, metadata bitmap B+Tree and data bitmap B+Tree data held in a RAID are stored in other RAIDs, and the two copies are not placed in the same RAID. This ensures that the scattered metadata remains intact even when two RAIDs in the system fail.
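
The replica placement described above can be illustrated with a short Python sketch. The text only says that the copies are placed according to a distribution strategy; the round-robin rule below (copies on the next two RAIDs) is purely an assumption for illustration, and `replica_raids` and `survives` are hypothetical names.

```python
def replica_raids(home: int, n_raids: int, copies: int = 2) -> list:
    """Return the RAIDs that hold copies of the metadata whose home is RAID `home`.

    Assumed rule: the copies go to the next `copies` RAIDs in round-robin order,
    so they never share a RAID with each other or with the original.
    """
    if n_raids < copies + 1:
        raise ValueError("need at least copies + 1 RAIDs")
    return [(home + k) % n_raids for k in range(1, copies + 1)]


def survives(failed: set, home: int, n_raids: int) -> bool:
    """True if the original or at least one copy is on a RAID that has not failed."""
    holders = {home, *replica_raids(home, n_raids)}
    return bool(holders - failed)


print(replica_raids(0, 4))   # RAIDs holding copies of RAID 0's metadata
print(survives({0, 1}, 0, 4))
```

With one original and two copies on three distinct RAIDs, any two simultaneous RAID failures leave at least one holder available, which matches the fault tolerance stated in the text.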

When the storage system is expanded, the metadata is processed as follows:

1. Initialize the newly added RAID;

2. Calculate the metadata distribution after expansion according to the load-balancing strategy;

3. Synchronize the active metadata space and the inactive metadata space in each RAID so that the metadata stored in the two is consistent;

4. According to the result of step 2, copy metadata to the active metadata space of the newly added RAID;

5. According to the result of step 2, change the metadata in the inactive metadata space of each RAID to its post-expansion state;

6. Update the superblock, the metadata bitmap B+Tree and the data bitmap B+Tree in each RAID;

7. Make the inactive metadata space of each original RAID the active space;

8. Synchronize the active metadata space and the inactive metadata space of each RAID, and re-establish the metadata copies across RAIDs;

9. The operation is complete.
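
The expansion procedure above can be sketched in Python as a toy in-memory model. Everything named here is hypothetical: `layout` uses "block i lives on RAID i mod n" as a stand-in for the unspecified load-balancing strategy, `RaidMeta` models the two metadata spaces as sets of block ids, and step 6 (superblock and bitmap updates) and the cross-RAID copies of step 8 are left out.

```python
def layout(block_ids, n_raids):
    """Assumed load-balancing rule: metadata block i is placed on RAID i % n_raids."""
    out = [set() for _ in range(n_raids)]
    for b in block_ids:
        out[b % n_raids].add(b)
    return out


class RaidMeta:
    """Hypothetical per-RAID metadata area with an active and an inactive space."""
    def __init__(self, blocks=()):
        self.spaces = [set(blocks), set(blocks)]
        self.active = 0

    def active_blocks(self):
        return self.spaces[self.active]


def expand(raids, block_ids):
    target = layout(block_ids, len(raids) + 1)        # step 2: post-expansion layout
    for r in raids:                                   # step 3: sync inactive with active
        r.spaces[1 - r.active] = set(r.active_blocks())
    new_raid = RaidMeta(target[-1])                   # steps 1 and 4: new RAID gets its share
    for r, blocks in zip(raids, target):              # step 5: rewrite inactive spaces only
        r.spaces[1 - r.active] = set(blocks)
    for r in raids:                                   # step 7: inactive becomes active
        r.active = 1 - r.active
    for r in raids:                                   # step 8: resynchronize both spaces
        r.spaces[1 - r.active] = set(r.active_blocks())
    return raids + [new_raid]


blocks = range(6)
pool = [RaidMeta(s) for s in layout(blocks, 2)]
pool = expand(pool, blocks)
print([sorted(r.active_blocks()) for r in pool])
```

Up to step 7 the active space of every existing RAID is left untouched, so metadata reads continue to see the pre-expansion layout while the new one is prepared.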

The shrinking operation of the storage system is similar to the expansion operation and is not described again here.

The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention.

Claims

1. A data consistency management method for a thin-provisioned storage system, characterized in that a metadata structure for data block management and an implementation scheme for a metadata storage structure are designed, wherein

the metadata structure includes:

(1) an improved structure of the B+Tree data structure;

(2) add-data-block-mapping, delete-data-block-mapping and modify-data-block-mapping operations implemented with the improved B+Tree data structure;

(3) the way of determining the active domain and the inactive domain of a non-leaf node in the improved B+Tree data structure;

the metadata is distributed over all RAIDs in the storage system according to different allocation strategies and is organized and managed by means such as a B+Tree, and the metadata is cross-backed-up in different RAIDs;

the metadata storage structure includes: (1) storage and backup of the metadata across all RAIDs; (2) storage system expansion and shrinking operations that use the metadata storage structure;

in the B+Tree that organizes the metadata, the space of each non-leaf node is increased: starting from the original B+Tree data structure, the space of each non-leaf node is doubled and divided into two parts, an active domain and an inactive domain, wherein the active domain stores the data of the mapping B+Tree node, i.e. the (key, value) pairs, and the inactive domain, depending on the strategy, can store a copy of the active-domain data or the data as it was before the most recent modification of the node; a node is modified in its inactive domain, and after the modification is complete the active domain and the inactive domain are swapped; the start address of each non-leaf node is aligned to the node size when the node is allocated;

each non-leaf node contains an active domain and an inactive domain of equal size that are adjacent in the address space, and the start address of a non-leaf node is aligned to the node size; a leaf node is a pointer to a data block; which domain of a non-leaf node is active and which is inactive is determined by the pointer in the parent node that points to the current node; the two address-adjacent spaces in a node are called the A domain and the B domain, where the start address of the A domain is the start address of the node; because the address of a non-leaf node is aligned to the node size, if the address stored in the parent node's pointer to the current node is the start address of the A domain, which is also the start address of the current node, then the A domain is the active domain and the B domain is the inactive domain; conversely, if the address stored in the parent node's pointer to the current node is the start address of the B domain, which cannot be aligned to the node size, then the A domain is the inactive domain and the B domain is the active domain.

2. The method according to claim 1, characterized in that the improved structure of the B+Tree data structure allocates extra space to the non-leaf nodes of the original B+Tree data structure, so that the space of an improved non-leaf node is twice the original size and the node start address is aligned to the node size; the node space is divided into an adjacent active domain and inactive domain; the active domain is used for normal metadata query operations; the inactive domain is used to store a copy of the active domain or the record of the previous operation.

3. The method according to claim 1, characterized in that, in the data block management operations implemented with the improved structure of the B+Tree data structure, when a mapping operation on a thin-provisioned data block is carried out, the data in the active domain of a node of the improved B+Tree structure is copied to its inactive domain, and all operations that modify data in the node are carried out in the inactive domain of the node; after the data in the nodes has been modified, the pointers that point to each node with modified data are changed to point to the original inactive domain of that node, so that the inactive domain of each node holding modified data becomes the active domain and the original active domain of each such node becomes the inactive domain.

4. The method according to claim 1, characterized in that, for the storage and backup of the metadata across all RAIDs, the metadata space that stores metadata in each RAID is divided into two address-adjacent spaces of the same size, an active metadata space and an inactive metadata space; the data structure with the smaller data volume in the metadata, i.e. the superblock, is stored as one identical copy in the active metadata space of every RAID in the storage system; the data structures with the larger data volume, i.e. the metadata mapping B+Tree, the metadata bitmap B+Tree and the data bitmap B+Tree, are distributed over the active metadata spaces of the RAIDs according to a load-balancing strategy, each RAID storing one part of the metadata, and the inactive metadata space in a RAID stores a copy of its active metadata space; in addition, each RAID stores, according to a given strategy, copies of the metadata of two other RAIDs.

5. The method according to claim 1, characterized in that, in the storage system expansion and shrinking operations that use the metadata storage structure, the active metadata space and the inactive metadata space are first synchronized, the inactive metadata space in each RAID is then modified according to the metadata distribution strategy, and finally the inactive metadata space of each RAID is changed into the active space, which enables the new metadata and completes the expansion or shrinking operation.