CN113641681B

CN113641681B - Space self-adaptive mass data query method

Info

Publication number: CN113641681B
Application number: CN202111189827.8A
Authority: CN
Inventors: 许扬汶; 刘天鹏; 韩冬; 孙腾中; 刘灵娟
Original assignee: Nanjing Big Data Group Co ltd
Current assignee: Nanjing Big Data Group Co ltd
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2022-02-22
Anticipated expiration: 2041-10-13
Also published as: CN113641681A

Abstract

The invention discloses a space self-adaptive mass data query method, which comprises the following steps: step 1, dividing all records in a data set into a plurality of disjoint data set sub-blocks according to the storage position of the data set, and then constructing a spatial adaptive filter for all elements in the same domain to be queried of all records in each data set sub-block; step 2, a certain target element in the domain to be queried is given, all space adaptive filters corresponding to the domain to be queried are traversed, and a data set sub-block set with the target element is obtained; and traversing each data set subblock in the data set subblock set to obtain all records with target elements in the domain to be queried. The method provides stable query acceleration in a mode of reducing the query quantity of the disk data in the whole life cycle, and avoids the influence on the spatial position and the mapping relation of other data when the data is inserted and deleted.

Description

Space self-adaptive mass data query method

Technical Field

The invention belongs to the field of databases, and particularly relates to a space self-adaptive mass data query method.

Background

With the explosive increase of data volume, the time consumed by data query in mass data is rapidly increased, and many actual services cannot be completed due to too long operation time, so how to quickly retrieve data in a mass data scene is an urgent need of the existing system.

Aiming at the problems, a filter (filter) is used for pre-analyzing data in the mass data query process in advance, unnecessary data block reading is pruned, the disk reading overhead can be obviously reduced, and the data query speed is improved. In particular, filter data structures are a class of data structures having a compact and efficient data organization that are capable of efficiently determining whether a record to be queried exists in a known collection. Existing filter structures such as bloom filters, quotient filters, cuckoo filters, xor filters, and variations thereof have been widely used to represent collections having a large amount of data.

However, since insertion and deletion operations are often performed in real scenes, the number of elements in the filter data structure changes dynamically and frequently, and therefore the filter data structure must be resized in time to better accommodate more or fewer elements. This spatial adaptability improves the utility of the filter data structure, but inevitably leads to performance degradation in existing systems, mainly classified into two categories:

(1) short term operation is blocked. When the filter data structure performs a resizing operation, other operations need to wait until the resizing operation is complete. During this time, the throughput of other operations drops to zero. We call this problem short term operational blockage. If the blocked operation has a strict expiration date, the system will not respond in a timely manner. Therefore, solving this problem is crucial for applications with strict real-time requirements, such as industrial control systems, real-time databases, interactive databases, and interactive web applications.

(2) The long-term performance deteriorates. After the resizing operation is performed, the performance of other operations may gradually deteriorate to such an extent that the level before the resizing operation cannot be reached. This performance degradation is due to the increased time complexity of the resizing operation leading to other operations. For example, dynamic bloom filters and dynamic cuckoo filters have longer approximate member query paths after resizing. Solving this problem is crucial for applications that need to perform query operations frequently, such as query speed optimization, denial of service attacks, privacy protection, distributed database systems, and cloud storage.

However, none of the existing filter data structures that support spatial adaptation can simultaneously alleviate the problems of short-term operation blocking and long-term performance degradation. Existing filter data structures that support spatial adaptation can be divided into the following two categories, but they all have their limitations:

(1) filter data structures that do not need to be re-hashed during resizing (e.g., dynamic bloom filters, scalable bloom filters, and dynamic cuckoo filters). This type of filter structure is expanded by adding homogeneous filters directly. Since there is no need to reinsert data that has already been inserted, the capacity expansion operation may be completed without blocking. However, this mechanism, on the one hand, causes long-term performance degradation as the lookup operation requires checking all filters in turn, and the lookup delay increases with the expansion of the filter data structure; on the other hand, directly increasing the amount of space will greatly reduce the space utilization of the filter data structure. In the extreme case, space utilization is directly halved after insertion of an element.

(2) Filter data structures that need to be re-hashed when resizing (e.g., morton filters, elastic bloom filters, and vacuum filters). During each resizing, all inserted elements in the filter data structure need to be re-inserted. Obviously, the resizing operation may block other operations for a long time. In the worst case, some tasks that are critical to real-time will time out, which is unacceptable for some real-time applications.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to provide a mass data query method which can simultaneously relieve short-term operation blockage and long-term performance deterioration.

The technical scheme is as follows: the invention relates to a space self-adaptive mass data query method, which comprises the following steps:

(1) dividing all records in the data set into a plurality of disjoint data set sub-blocks according to the storage positions of the data set, constructing a spatial adaptive filter for all elements in the same domain to be queried of all records in each data set sub-block, mapping and coding each element one by one, and then inserting the elements into the corresponding spatial adaptive filter to complete the construction of all the spatial adaptive filters;

(2) a certain target element in a domain to be queried is given, all spatial adaptive filters corresponding to the domain to be queried are traversed, whether the target element exists in a corresponding data set subblock is judged through the spatial adaptive filters, and a data set subblock set with the target element is obtained; and traversing each data set subblock in the data set subblock set to obtain all records with target elements in the domain to be queried.

Preferably, the hash table of the spatial adaptive filter is composed of a plurality of segments, each segment comprises a plurality of buckets, the elements are stored in the buckets in a form of encoded element fingerprints, and each bucket comprises a plurality of element fingerprint storage locations.

Further, when the space self-adaptive filter is constructed, firstly, the segment address and two candidate bucket addresses of each element in the domain to be inquired in the corresponding space self-adaptive filter are obtained through calculation, and the element fingerprint of the element is obtained according to the segment address and the candidate bucket addresses of the element

Randomly storing the data in a certain candidate bucket; fingerprinting an element if there is a free location in any one of the candidate buckets

Storing directly into the free position; if no free position exists in the two candidate buckets, the filter randomly selects one candidate bucket and uses the element fingerprint

Replacing a stored fingerprint of an element in the bucket

Then fingerprint the element

Save to another candidate bucket for the element fingerprint: if element fingerprint

Is present in another candidate bucketAt the free position, the element is fingerprinted

Storing in a free location if the element fingerprint

Does not have a free location, then element fingerprinting is used

Replacing a stored fingerprint of an element in the bucket

Repeating the above process until finding the idle position; if the storage position in the segment is saturated, generating a blank segment which has the same format as the segment and does not store any element fingerprint, sharing the same segment address with the saturated segment, and randomly storing the element fingerprint of which the free position is not found in a certain candidate bucket of the blank segment.

When the repeated times of element fingerprint position replacement in the process of storing the element fingerprints into the segment reach a predefined threshold value and certain element fingerprints still have no idle position for storage, judging that the storage positions in the segment are saturated, wherein the element storage positions in each bucket in the saturated segment have sufficient element fingerprints, and the segment can not contain more element fingerprints; the newly added blank segment is linked at the tail part of the hash table of the original filter, the blank segment is linked behind the saturated segment, so that the blank segment and the saturated segment share the same segment address, and the blank segment and the saturated segment are mutually linked by using a linked list data structure.

Further, in step 2, a segment address, two candidate bucket addresses and an element fingerprint of a certain target element in a given domain to be queried are calculated, whether the two candidate buckets of the corresponding segments of all the spatial adaptive filters corresponding to the domain to be queried contain the element fingerprint of the target element is queried, and if yes, the existence is returned; if not, checking whether the segment is linked to a blank segment due to saturation, if so, continuing to check whether the element fingerprint of the target element is contained in two candidate buckets of the linked segment; if there are no linked segments, then the return is not present.

And further, acquiring all target data set sub-blocks corresponding to the space self-adaptive filter with the fingerprints of the elements to be inquired, performing sequential search on each target data set sub-block, judging whether the value of the recorded field elements to be inquired is the same as the value of the target elements, and outputting the record if the value of the recorded field elements to be inquired is the same as the value of the target elements.

Further, when deleting the record in the data set, simultaneously calculating the segment address, the two candidate bucket addresses and the element fingerprint of the element in the to-be-queried domain of the deleted record in the corresponding spatial adaptive filter, and deleting the element fingerprint which is stored in one candidate bucket and is the same as the element fingerprint of the to-be-deleted element; if the same element fingerprint is not found in both candidate buckets, the same element fingerprint is found in the segment linked by saturation, and the above process is repeated until the element fingerprint is successfully found and deleted.

Furthermore, in step 1, the size and the number of the data set sub-blocks are set according to the physical read-write characteristics of the actual storage device, and the storage address ranges of different data set sub-blocks are not crossed.

Further, after the size of the data set subblock is set, if the record to be inserted is inserted into the free position in the data set subblock, updating the space adaptive filter corresponding to the data set subblock; if no free locations exist within all data set sub-blocks, new data set sub-blocks are partitioned for storing records to be inserted.

Furthermore, the data set is composed of a plurality of records, each record is composed of a plurality of domains, and the number of the records in the data set is large, so that all the record contents in the data set cannot be simultaneously called into a memory for filtering; the space self-adaptive filter has extremely high data density and can be quickly constructed, the space self-adaptive filter is stored in a memory, and whether a target element in a domain to be queried is in a corresponding data set sub-block can be quickly judged by using a small amount of space. And then, checking the constructed space self-adaptive filter to quickly judge whether the record with the target element is in the data set subblock on the premise of not reading the data set subblock. If the record with the target element is not in a dataset sub-block, the read overhead of subsequent dataset sub-blocks can be saved, making the number of dataset sub-blocks that need to be read much smaller than the number of total dataset sub-blocks. The quick query of the records meeting the conditions in the data set is realized.

Furthermore, after the spatial adaptive filter is constructed, when the data in the data set needs to be updated, the data structure of the spatial adaptive filter also needs to be resized; the resizing performed by the spatially adaptive filter occurs when data needs to be inserted or deleted from the data set.

When data in a data set needs to be inserted, elements in a domain to be queried of the inserted new record need to be inserted into a corresponding spatial adaptive filter; if the new record is inserted into the free position in the existing data set subblock, inserting a new element into the space self-adaptive filter corresponding to the data set subblock; and if no free position exists in all the data set sub-blocks, dividing new data set sub-blocks for storing new records to be inserted, constructing a space adaptive filter for the newly divided data set sub-blocks, and inserting new elements into the newly constructed space adaptive filter. The method for inserting the new element into the space adaptive filter under the two working conditions is the same as the method for inserting the element during the construction of the space adaptive filter.

Furthermore, in the process of inserting the element, when no idle position exists in the candidate bucket, the position replacement operation is carried out on the element fingerprint, the operation greatly improves the utilization rate of the storage position in the bucket, reduces the increasing speed of the space self-adaptive filter, also reduces the frequency of the link of the blank section and the saturated section, and relieves the degree of long-term performance deterioration to a great extent.

When the data in the data set needs to be deleted, deleting elements in the domain to be queried of the record to be deleted from the corresponding space adaptive filter; before deletion, the positions of the element fingerprints of the elements in the domain to be queried of the record to be deleted need to be queried in the corresponding space self-adaptive filter, the query method is the same as the target element query method, the element fingerprints of the target elements are deleted from the corresponding space self-adaptive filter, and if a plurality of the same element fingerprints exist in the same segment and the same barrel in the corresponding space self-adaptive filter, one element fingerprint is randomly deleted.

Furthermore, in the process of carrying out element deletion operation by the spatial adaptive filter, the positions of the fingerprints of other elements cannot be changed, so that short-term operation blockage cannot be caused in the data insertion process. The position where the deleted element fingerprint is located becomes a free position, and can be detected again and utilized when a subsequent element fingerprint is inserted, so that the degree of long-term performance deterioration is reduced. The reason why one of the fingerprints of the same elements in one bucket is randomly deleted does not affect subsequent data search is that the space adaptive filter is used as a filter to judge whether a record containing a target element exists in a data set sub-block or not, a specific record is not directly obtained, and a record which specifically meets the condition is obtained by sequentially searching the target data set sub-block subsequently.

The present invention provides a computer-readable storage medium, in which a computer program is stored, and when being executed by a processor, the computer program implements the method steps of any of the above mass data query methods.

Has the advantages that: according to the method and the device, the data set in the hard disk is divided into the plurality of disjoint data set sub-blocks, the data are divided, the target elements are inquired in the areas to be inquired recorded in the different data set sub-blocks respectively by using the plurality of space self-adaptive filters when the target records are inquired, so that all the data set sub-blocks containing the records meeting the conditions are obtained, the read overhead of the disk is reduced, and the data inquiry efficiency is improved. Because the size of the space self-adaptive filter is adjustable, the overhead of disk reading can be reduced to the maximum extent even along with the increase of data volume, stable query acceleration is provided in the whole process, and the degree of long-term performance deterioration is relieved.

The influence range of the element insertion process of the spatial adaptive filter on other elements in the bucket is limited in the candidate bucket, and even if the element fingerprint is changed in position, the changed position is still in the candidate position, so that the influence on the query process is completely avoided. The element deletion process of the spatial adaptive filter has no effect on other elements in the bucket. Therefore, when the record in the data set is inserted or deleted, the adaptive change of the spatial adaptive filter has no influence on the query of the data, and the occurrence of short-term operation blockage is avoided.

Drawings

FIG. 1 is a flow chart of the operation of the present invention;

FIG. 2 is a schematic flow chart of the construction of the spatial adaptive filter according to the present invention;

FIG. 3 is a schematic flow chart of the spatial adaptive filter query and deletion in the present invention;

FIG. 4 is a graph illustrating throughput comparison between a spatial adaptive filter according to the present invention and a conventional filter;

FIG. 5 is a diagram illustrating a comparison between the throughput of the spatial adaptive filter of the present invention and the throughput of the conventional filter when performing a search operation;

fig. 6 is a schematic diagram illustrating the comparison of the throughput rates of the spatial adaptive filter of the present invention and the existing filter when performing the deleting operation.

Detailed Description

The technical solution of the present invention is further described in detail with reference to the accompanying drawings and examples.

The data set S stored in the database is composed of a plurality of records

Each record is composed of several query domain data, for example:one record in the dataset is "Zhang three, Man, 18", where "Zhang three" is the element of the record in the "name" query field, "Man" is the element of the record in the "gender" query field, and "18" is the element of the record in the "age" query field.

For a given data set

And domain to be queried

And value of domain to be queried

Data queries can be simplified to find datasets

In the query domain

All the values in the query are the values of the domain to be queried

Is recorded

Set of (2)

The process of (1). As shown in fig. 1, the present embodiment discloses a method for querying mass data in a space adaptive manner, which includes the following steps: the method comprises a filter construction phase, a target record query phase and a data set updating phase.

In this embodiment, the filter construction stage includes the following specific steps: for a given data set

The original data set is partitioned according to each storage location recorded on disk in the data setDivided into several disjoint sets, i.e.

Wherein

Is as follows

A data set sub-block; for each data set sub-block, according to the domain to be queried

And constructing a space adaptive filter. For each record

Will be

Is encoded and then inserted into the filter, i.e. the domain to be queried

Encoding each element data to form element fingerprint and inserting into filter to form sub-block of the data set in query domain

And the spatial adaptive filter is stored in the memory.

In this embodiment, the hash table of each spatial adaptive filter is composed of a plurality of segments, each segment includes a plurality of buckets and elements

Are stored in buckets encoded in the form of element fingerprints, each bucket containing a number of element fingerprint storage locations.

In this embodiment, when the spatial adaptive filter is constructed, the segment address of each element in the domain to be queried in the corresponding spatial adaptive filter is first obtained through calculationAnd two candidate bucket addresses, as shown in FIG. 2, fingerprinting an element of the element based on its segment address and the candidate bucket addresses

Replacing a stored fingerprint of an element in the bucket

Then fingerprint the element

Has a free location in another candidate bucket, then the element is fingerprinted

Storing in a free location if the element fingerprint

Does not have a free location, then element fingerprinting is used

Replacing a stored fingerprint of an element in the bucket

And is combined withRepeating the above process until an idle position is found; when the repeated times of element fingerprint position replacement in the process of storing the element fingerprints into the segment reach a predefined threshold value and no idle position of a certain element fingerprint is stored, judging that the storage position in the segment is saturated, generating a blank segment which has the same format as the segment but does not store any element fingerprint, linking the newly added blank segment to the tail part of the original filter hash table, and linking the blank segment to the back of the saturated segment, so that the blank segment and the saturated segment share the same segment address, and randomly storing the element fingerprint of which the idle position is not found in a certain candidate bucket of the blank segment.

In this embodiment, the target record query stage includes the following specific steps: when the value of the element of the domain to be queried in the domain to be queried is required to be queried

When recording correspondingly, firstly searching the data set sub-block

In the query domain

All corresponding spatial adaptive filters, and then as shown in fig. 3, the spatial adaptive filter first calculates the value of the target element in the domain to be queried as

Segment address, two candidate bucket addresses and element fingerprint of the element, judging the query domain

Whether the two candidate buckets of the corresponding section of the corresponding all spatial adaptive filters contain the element fingerprint of the target element or not is judged, if yes, the existence is returned, if not, whether the section is linked to a blank section due to saturation or not is judged, and if yes, whether the element fingerprint of the target element is contained or not is continuously judged in the two candidate buckets of the linked section; if there are no linked segments, then the return is not present.

And through the screening operation of the data set subblocks, acquiring target data set subblocks corresponding to the space self-adaptive filter which all contain the element fingerprint to be inquired, executing sequential search on each target data set subblock, judging whether the value of the recorded field element to be inquired is the same as the value of the target element, and outputting the record if the value of the recorded field element to be inquired is the same as the value of the target element.

In this embodiment, the data set update stage specifically includes the following steps: the data set updating comprises two types of operations of inserting records and deleting records; when data in a data set needs to be inserted, elements in a domain to be queried of the inserted new record need to be inserted into a corresponding spatial adaptive filter; if the new record is inserted into the free position in the existing data set subblock, inserting a new element into the space self-adaptive filter corresponding to the data set subblock; and if no free position exists in all the data set sub-blocks, dividing new data set sub-blocks for storing new records to be inserted, constructing a space adaptive filter for the newly divided data set sub-blocks, and inserting new elements into the newly constructed space adaptive filter. The method for inserting the new element into the space adaptive filter under the two working conditions is the same as the method for inserting the element during the construction of the space adaptive filter.

Further, the performance of the space adaptive mass data query method of the present invention is verified through simulation experiments, and the C + + is used to implement the space adaptive mass data query method and compare the performance with the dynamic bloom filter, the scalable bloom filter, the elastic bloom filter, and the dynamic cuckoo filter, as shown in fig. 4 to 6, DBF represents the dynamic bloom filter, SBF represents the scalable bloom filter, EBF represents the elastic bloom filter, DCF represents the dynamic cuckoo filter, and BBF represents the space adaptive filter in the present invention. All programs were compiled in g + + 9.2.1 on a standard server equipped with Intel (R) Xeon (R) Gold 5218R CPU @ 2.10GHz,64GB RAM, 1TB SSD, and CentOS Linux. The experimental results are as follows, wherein fig. 4 shows the throughput rate comparison of the present invention with that of other filters performing the insertion operation, and the results show that the method has the highest insertion throughput rate; FIG. 5 shows the throughput rate comparison of the present invention with other filters performing query operations, compared to the wide use of dynamic cuckoo filters, the present invention can achieve a 2.63-fold improvement in finding throughput rate indicators; fig. 6 shows the throughput of the present invention compared with the throughput of other filters when performing the deleting operation, and the result shows that the method has the optimal deleting throughput.

In conclusion, the space-adaptive mass data query method screens the data set subblocks by utilizing the constructed space-adaptive filter, so that the query quantity of data in a disk can be greatly reduced, and the mass data query speed is increased. Meanwhile, when data insertion and deletion are carried out, the spatial position and the mapping relation of other data are not influenced, and stable query acceleration can be provided in the whole life cycle of system operation. According to the space self-adaptive mass data query method, the rapid query of the recorded data is realized after the space self-adaptive filter corresponding to the data set is constructed, meanwhile, when the data set executes the updating operation, the updating of the space self-adaptive filter cannot influence other query processes, and after the data set executes a large amount of updating operations, the structure of the space self-adaptive filter cannot be changed greatly, so that the rapid query of the recorded data in the whole life cycle can be realized.

Claims

1. A space self-adaptive mass data query method is characterized by comprising the following steps: the method comprises the following steps:

(2) a certain target element in a domain to be queried is given, all spatial adaptive filters corresponding to the domain to be queried are traversed, whether the target element exists in a corresponding data set subblock is judged through the spatial adaptive filters, and a data set subblock set with the target element is obtained; traversing each data set subblock in the data set subblock set to obtain all records with target elements in a domain to be queried;

the hash table of the spatial adaptive filter is composed of a plurality of segments, each segment comprises a plurality of buckets, elements are stored in the buckets in a form of being encoded into element fingerprints, and each bucket comprises a plurality of element fingerprint storage positions;

when the space self-adaptive filter is constructed, firstly, the segment address and two candidate barrel addresses of each element in the domain to be inquired in the corresponding space self-adaptive filter are obtained through calculation, and the element fingerprint of the element is obtained according to the segment address and the candidate barrel addresses of the element

Replacing a stored fingerprint of an element in the bucket

Then fingerprint the element

Storing in a free location if the element fingerprint

Does not have a free location, then element fingerprinting is used

Replacing a stored fingerprint of an element in the bucket

2. The method for querying mass data in a space self-adaption mode according to claim 1, wherein the method comprises the following steps: step 2, calculating a segment address, two candidate bucket addresses and an element fingerprint of a certain target element in a given domain to be queried, querying whether two candidate buckets of corresponding segments of all space adaptive filters corresponding to the domain to be queried contain the element fingerprint of the target element, and if so, returning to exist; if not, checking whether the segment is linked to a blank segment due to saturation, if so, continuing to check whether the element fingerprint of the target element is contained in two candidate buckets of the linked segment; if there are no linked segments, then the return is not present.

3. The method for querying mass data in a space self-adaption mode according to claim 2, wherein the method comprises the following steps: acquiring target data set sub-blocks corresponding to all spatial adaptive filters with element fingerprints to be queried, performing sequential search on each target data set sub-block, judging whether the value of the recorded field element to be queried is the same as the value of the target element, and outputting the record if the value of the recorded field element to be queried is the same as the value of the target element.

4. The method for querying mass data in a space self-adaption mode according to claim 1, wherein the method comprises the following steps: when deleting the records in the data set, simultaneously calculating the segment address, the two candidate bucket addresses and the element fingerprint of the element in the domain to be inquired of the deleted record in the corresponding space self-adaptive filter, and deleting the element fingerprint which is stored in one candidate bucket and is the same as the element fingerprint of the element to be deleted; if the same element fingerprint is not found in both candidate buckets, the same element fingerprint is found in the segment linked by saturation, and the above process is repeated until the element fingerprint is successfully found and deleted.

5. The method for querying mass data in a space self-adaption mode according to claim 1, wherein the method comprises the following steps: in the step 1, the size and the number of the data set subblocks are set according to the physical read-write characteristics of the actual storage equipment, and the storage address ranges of different data set subblocks are not crossed.

6. The method for querying mass data in a space self-adaption mode according to claim 5, wherein the method comprises the following steps: after the size of the data set subblock is set, if the record to be inserted is inserted into the free position in the data set subblock, updating a space adaptive filter corresponding to the data set subblock; if no free locations exist within all data set sub-blocks, new data set sub-blocks are partitioned for storing records to be inserted.

7. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any of the claims 1-5.