[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

CN113641681B - Space self-adaptive mass data query method - Google Patents

Space self-adaptive mass data query method Download PDF

Info

Publication number
CN113641681B
CN113641681B CN202111189827.8A CN202111189827A CN113641681B CN 113641681 B CN113641681 B CN 113641681B CN 202111189827 A CN202111189827 A CN 202111189827A CN 113641681 B CN113641681 B CN 113641681B
Authority
CN
China
Prior art keywords
data set
fingerprint
segment
data
queried
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111189827.8A
Other languages
Chinese (zh)
Other versions
CN113641681A (en
Inventor
许扬汶
刘天鹏
韩冬
孙腾中
刘灵娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Big Data Group Co ltd
Original Assignee
Nanjing Big Data Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Big Data Group Co ltd filed Critical Nanjing Big Data Group Co ltd
Priority to CN202111189827.8A priority Critical patent/CN113641681B/en
Publication of CN113641681A publication Critical patent/CN113641681A/en
Application granted granted Critical
Publication of CN113641681B publication Critical patent/CN113641681B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a space self-adaptive mass data query method, which comprises the following steps: step 1, dividing all records in a data set into a plurality of disjoint data set sub-blocks according to the storage position of the data set, and then constructing a spatial adaptive filter for all elements in the same domain to be queried of all records in each data set sub-block; step 2, a certain target element in the domain to be queried is given, all space adaptive filters corresponding to the domain to be queried are traversed, and a data set sub-block set with the target element is obtained; and traversing each data set subblock in the data set subblock set to obtain all records with target elements in the domain to be queried. The method provides stable query acceleration in a mode of reducing the query quantity of the disk data in the whole life cycle, and avoids the influence on the spatial position and the mapping relation of other data when the data is inserted and deleted.

Description

Space self-adaptive mass data query method
Technical Field
The invention belongs to the field of databases, and particularly relates to a space self-adaptive mass data query method.
Background
With the explosive increase of data volume, the time consumed by data query in mass data is rapidly increased, and many actual services cannot be completed due to too long operation time, so how to quickly retrieve data in a mass data scene is an urgent need of the existing system.
Aiming at the problems, a filter (filter) is used for pre-analyzing data in the mass data query process in advance, unnecessary data block reading is pruned, the disk reading overhead can be obviously reduced, and the data query speed is improved. In particular, filter data structures are a class of data structures having a compact and efficient data organization that are capable of efficiently determining whether a record to be queried exists in a known collection. Existing filter structures such as bloom filters, quotient filters, cuckoo filters, xor filters, and variations thereof have been widely used to represent collections having a large amount of data.
However, since insertion and deletion operations are often performed in real scenes, the number of elements in the filter data structure changes dynamically and frequently, and therefore the filter data structure must be resized in time to better accommodate more or fewer elements. This spatial adaptability improves the utility of the filter data structure, but inevitably leads to performance degradation in existing systems, mainly classified into two categories:
(1) short term operation is blocked. When the filter data structure performs a resizing operation, other operations need to wait until the resizing operation is complete. During this time, the throughput of other operations drops to zero. We call this problem short term operational blockage. If the blocked operation has a strict expiration date, the system will not respond in a timely manner. Therefore, solving this problem is crucial for applications with strict real-time requirements, such as industrial control systems, real-time databases, interactive databases, and interactive web applications.
(2) The long-term performance deteriorates. After the resizing operation is performed, the performance of other operations may gradually deteriorate to such an extent that the level before the resizing operation cannot be reached. This performance degradation is due to the increased time complexity of the resizing operation leading to other operations. For example, dynamic bloom filters and dynamic cuckoo filters have longer approximate member query paths after resizing. Solving this problem is crucial for applications that need to perform query operations frequently, such as query speed optimization, denial of service attacks, privacy protection, distributed database systems, and cloud storage.
However, none of the existing filter data structures that support spatial adaptation can simultaneously alleviate the problems of short-term operation blocking and long-term performance degradation. Existing filter data structures that support spatial adaptation can be divided into the following two categories, but they all have their limitations:
(1) filter data structures that do not need to be re-hashed during resizing (e.g., dynamic bloom filters, scalable bloom filters, and dynamic cuckoo filters). This type of filter structure is expanded by adding homogeneous filters directly. Since there is no need to reinsert data that has already been inserted, the capacity expansion operation may be completed without blocking. However, this mechanism, on the one hand, causes long-term performance degradation as the lookup operation requires checking all filters in turn, and the lookup delay increases with the expansion of the filter data structure; on the other hand, directly increasing the amount of space will greatly reduce the space utilization of the filter data structure. In the extreme case, space utilization is directly halved after insertion of an element.
(2) Filter data structures that need to be re-hashed when resizing (e.g., morton filters, elastic bloom filters, and vacuum filters). During each resizing, all inserted elements in the filter data structure need to be re-inserted. Obviously, the resizing operation may block other operations for a long time. In the worst case, some tasks that are critical to real-time will time out, which is unacceptable for some real-time applications.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a mass data query method which can simultaneously relieve short-term operation blockage and long-term performance deterioration.
The technical scheme is as follows: the invention relates to a space self-adaptive mass data query method, which comprises the following steps:
(1) dividing all records in the data set into a plurality of disjoint data set sub-blocks according to the storage positions of the data set, constructing a spatial adaptive filter for all elements in the same domain to be queried of all records in each data set sub-block, mapping and coding each element one by one, and then inserting the elements into the corresponding spatial adaptive filter to complete the construction of all the spatial adaptive filters;
(2) a certain target element in a domain to be queried is given, all spatial adaptive filters corresponding to the domain to be queried are traversed, whether the target element exists in a corresponding data set subblock is judged through the spatial adaptive filters, and a data set subblock set with the target element is obtained; and traversing each data set subblock in the data set subblock set to obtain all records with target elements in the domain to be queried.
Preferably, the hash table of the spatial adaptive filter is composed of a plurality of segments, each segment comprises a plurality of buckets, the elements are stored in the buckets in a form of encoded element fingerprints, and each bucket comprises a plurality of element fingerprint storage locations.
Further, when the space self-adaptive filter is constructed, firstly, the segment address and two candidate bucket addresses of each element in the domain to be inquired in the corresponding space self-adaptive filter are obtained through calculation, and the element fingerprint of the element is obtained according to the segment address and the candidate bucket addresses of the element
Figure 172259DEST_PATH_IMAGE001
Randomly storing the data in a certain candidate bucket; fingerprinting an element if there is a free location in any one of the candidate buckets
Figure 413884DEST_PATH_IMAGE001
Storing directly into the free position; if no free position exists in the two candidate buckets, the filter randomly selects one candidate bucket and uses the element fingerprint
Figure 485745DEST_PATH_IMAGE001
Replacing a stored fingerprint of an element in the bucket
Figure 660375DEST_PATH_IMAGE002
Then fingerprint the element
Figure 526700DEST_PATH_IMAGE002
Save to another candidate bucket for the element fingerprint: if element fingerprint
Figure 521200DEST_PATH_IMAGE002
Is present in another candidate bucketAt the free position, the element is fingerprinted
Figure 600015DEST_PATH_IMAGE002
Storing in a free location if the element fingerprint
Figure 629151DEST_PATH_IMAGE002
Does not have a free location, then element fingerprinting is used
Figure 135219DEST_PATH_IMAGE002
Replacing a stored fingerprint of an element in the bucket
Figure 617015DEST_PATH_IMAGE003
Repeating the above process until finding the idle position; if the storage position in the segment is saturated, generating a blank segment which has the same format as the segment and does not store any element fingerprint, sharing the same segment address with the saturated segment, and randomly storing the element fingerprint of which the free position is not found in a certain candidate bucket of the blank segment.
When the repeated times of element fingerprint position replacement in the process of storing the element fingerprints into the segment reach a predefined threshold value and certain element fingerprints still have no idle position for storage, judging that the storage positions in the segment are saturated, wherein the element storage positions in each bucket in the saturated segment have sufficient element fingerprints, and the segment can not contain more element fingerprints; the newly added blank segment is linked at the tail part of the hash table of the original filter, the blank segment is linked behind the saturated segment, so that the blank segment and the saturated segment share the same segment address, and the blank segment and the saturated segment are mutually linked by using a linked list data structure.
Further, in step 2, a segment address, two candidate bucket addresses and an element fingerprint of a certain target element in a given domain to be queried are calculated, whether the two candidate buckets of the corresponding segments of all the spatial adaptive filters corresponding to the domain to be queried contain the element fingerprint of the target element is queried, and if yes, the existence is returned; if not, checking whether the segment is linked to a blank segment due to saturation, if so, continuing to check whether the element fingerprint of the target element is contained in two candidate buckets of the linked segment; if there are no linked segments, then the return is not present.
And further, acquiring all target data set sub-blocks corresponding to the space self-adaptive filter with the fingerprints of the elements to be inquired, performing sequential search on each target data set sub-block, judging whether the value of the recorded field elements to be inquired is the same as the value of the target elements, and outputting the record if the value of the recorded field elements to be inquired is the same as the value of the target elements.
Further, when deleting the record in the data set, simultaneously calculating the segment address, the two candidate bucket addresses and the element fingerprint of the element in the to-be-queried domain of the deleted record in the corresponding spatial adaptive filter, and deleting the element fingerprint which is stored in one candidate bucket and is the same as the element fingerprint of the to-be-deleted element; if the same element fingerprint is not found in both candidate buckets, the same element fingerprint is found in the segment linked by saturation, and the above process is repeated until the element fingerprint is successfully found and deleted.
Furthermore, in step 1, the size and the number of the data set sub-blocks are set according to the physical read-write characteristics of the actual storage device, and the storage address ranges of different data set sub-blocks are not crossed.
Further, after the size of the data set subblock is set, if the record to be inserted is inserted into the free position in the data set subblock, updating the space adaptive filter corresponding to the data set subblock; if no free locations exist within all data set sub-blocks, new data set sub-blocks are partitioned for storing records to be inserted.
Furthermore, the data set is composed of a plurality of records, each record is composed of a plurality of domains, and the number of the records in the data set is large, so that all the record contents in the data set cannot be simultaneously called into a memory for filtering; the space self-adaptive filter has extremely high data density and can be quickly constructed, the space self-adaptive filter is stored in a memory, and whether a target element in a domain to be queried is in a corresponding data set sub-block can be quickly judged by using a small amount of space. And then, checking the constructed space self-adaptive filter to quickly judge whether the record with the target element is in the data set subblock on the premise of not reading the data set subblock. If the record with the target element is not in a dataset sub-block, the read overhead of subsequent dataset sub-blocks can be saved, making the number of dataset sub-blocks that need to be read much smaller than the number of total dataset sub-blocks. The quick query of the records meeting the conditions in the data set is realized.
Furthermore, after the spatial adaptive filter is constructed, when the data in the data set needs to be updated, the data structure of the spatial adaptive filter also needs to be resized; the resizing performed by the spatially adaptive filter occurs when data needs to be inserted or deleted from the data set.
When data in a data set needs to be inserted, elements in a domain to be queried of the inserted new record need to be inserted into a corresponding spatial adaptive filter; if the new record is inserted into the free position in the existing data set subblock, inserting a new element into the space self-adaptive filter corresponding to the data set subblock; and if no free position exists in all the data set sub-blocks, dividing new data set sub-blocks for storing new records to be inserted, constructing a space adaptive filter for the newly divided data set sub-blocks, and inserting new elements into the newly constructed space adaptive filter. The method for inserting the new element into the space adaptive filter under the two working conditions is the same as the method for inserting the element during the construction of the space adaptive filter.
Furthermore, in the process of inserting the element, when no idle position exists in the candidate bucket, the position replacement operation is carried out on the element fingerprint, the operation greatly improves the utilization rate of the storage position in the bucket, reduces the increasing speed of the space self-adaptive filter, also reduces the frequency of the link of the blank section and the saturated section, and relieves the degree of long-term performance deterioration to a great extent.
When the data in the data set needs to be deleted, deleting elements in the domain to be queried of the record to be deleted from the corresponding space adaptive filter; before deletion, the positions of the element fingerprints of the elements in the domain to be queried of the record to be deleted need to be queried in the corresponding space self-adaptive filter, the query method is the same as the target element query method, the element fingerprints of the target elements are deleted from the corresponding space self-adaptive filter, and if a plurality of the same element fingerprints exist in the same segment and the same barrel in the corresponding space self-adaptive filter, one element fingerprint is randomly deleted.
Furthermore, in the process of carrying out element deletion operation by the spatial adaptive filter, the positions of the fingerprints of other elements cannot be changed, so that short-term operation blockage cannot be caused in the data insertion process. The position where the deleted element fingerprint is located becomes a free position, and can be detected again and utilized when a subsequent element fingerprint is inserted, so that the degree of long-term performance deterioration is reduced. The reason why one of the fingerprints of the same elements in one bucket is randomly deleted does not affect subsequent data search is that the space adaptive filter is used as a filter to judge whether a record containing a target element exists in a data set sub-block or not, a specific record is not directly obtained, and a record which specifically meets the condition is obtained by sequentially searching the target data set sub-block subsequently.
The present invention provides a computer-readable storage medium, in which a computer program is stored, and when being executed by a processor, the computer program implements the method steps of any of the above mass data query methods.
Has the advantages that: according to the method and the device, the data set in the hard disk is divided into the plurality of disjoint data set sub-blocks, the data are divided, the target elements are inquired in the areas to be inquired recorded in the different data set sub-blocks respectively by using the plurality of space self-adaptive filters when the target records are inquired, so that all the data set sub-blocks containing the records meeting the conditions are obtained, the read overhead of the disk is reduced, and the data inquiry efficiency is improved. Because the size of the space self-adaptive filter is adjustable, the overhead of disk reading can be reduced to the maximum extent even along with the increase of data volume, stable query acceleration is provided in the whole process, and the degree of long-term performance deterioration is relieved.
The influence range of the element insertion process of the spatial adaptive filter on other elements in the bucket is limited in the candidate bucket, and even if the element fingerprint is changed in position, the changed position is still in the candidate position, so that the influence on the query process is completely avoided. The element deletion process of the spatial adaptive filter has no effect on other elements in the bucket. Therefore, when the record in the data set is inserted or deleted, the adaptive change of the spatial adaptive filter has no influence on the query of the data, and the occurrence of short-term operation blockage is avoided.
Drawings
FIG. 1 is a flow chart of the operation of the present invention;
FIG. 2 is a schematic flow chart of the construction of the spatial adaptive filter according to the present invention;
FIG. 3 is a schematic flow chart of the spatial adaptive filter query and deletion in the present invention;
FIG. 4 is a graph illustrating throughput comparison between a spatial adaptive filter according to the present invention and a conventional filter;
FIG. 5 is a diagram illustrating a comparison between the throughput of the spatial adaptive filter of the present invention and the throughput of the conventional filter when performing a search operation;
fig. 6 is a schematic diagram illustrating the comparison of the throughput rates of the spatial adaptive filter of the present invention and the existing filter when performing the deleting operation.
Detailed Description
The technical solution of the present invention is further described in detail with reference to the accompanying drawings and examples.
The data set S stored in the database is composed of a plurality of records
Figure 765100DEST_PATH_IMAGE004
Each record is composed of several query domain data, for example:one record in the dataset is "Zhang three, Man, 18", where "Zhang three" is the element of the record in the "name" query field, "Man" is the element of the record in the "gender" query field, and "18" is the element of the record in the "age" query field.
For a given data set
Figure 383163DEST_PATH_IMAGE005
And domain to be queried
Figure 591291DEST_PATH_IMAGE006
And value of domain to be queried
Figure 294804DEST_PATH_IMAGE007
Data queries can be simplified to find datasets
Figure 981001DEST_PATH_IMAGE008
In the query domain
Figure 719150DEST_PATH_IMAGE006
All the values in the query are the values of the domain to be queried
Figure 832599DEST_PATH_IMAGE007
Is recorded
Figure 23409DEST_PATH_IMAGE009
Set of (2)
Figure 513296DEST_PATH_IMAGE010
The process of (1). As shown in fig. 1, the present embodiment discloses a method for querying mass data in a space adaptive manner, which includes the following steps: the method comprises a filter construction phase, a target record query phase and a data set updating phase.
In this embodiment, the filter construction stage includes the following specific steps: for a given data set
Figure 105951DEST_PATH_IMAGE008
The original data set is partitioned according to each storage location recorded on disk in the data setDivided into several disjoint sets, i.e.
Figure 655882DEST_PATH_IMAGE011
Wherein
Figure 333988DEST_PATH_IMAGE012
Is as follows
Figure 361986DEST_PATH_IMAGE013
A data set sub-block; for each data set sub-block, according to the domain to be queried
Figure 809148DEST_PATH_IMAGE006
And constructing a space adaptive filter. For each record
Figure 529980DEST_PATH_IMAGE014
Will be
Figure 429802DEST_PATH_IMAGE015
Is encoded and then inserted into the filter, i.e. the domain to be queried
Figure 527071DEST_PATH_IMAGE006
Encoding each element data to form element fingerprint and inserting into filter to form sub-block of the data set in query domain
Figure 563161DEST_PATH_IMAGE006
And the spatial adaptive filter is stored in the memory.
In this embodiment, the hash table of each spatial adaptive filter is composed of a plurality of segments, each segment includes a plurality of buckets and elements
Figure 454893DEST_PATH_IMAGE015
Are stored in buckets encoded in the form of element fingerprints, each bucket containing a number of element fingerprint storage locations.
In this embodiment, when the spatial adaptive filter is constructed, the segment address of each element in the domain to be queried in the corresponding spatial adaptive filter is first obtained through calculationAnd two candidate bucket addresses, as shown in FIG. 2, fingerprinting an element of the element based on its segment address and the candidate bucket addresses
Figure 107591DEST_PATH_IMAGE001
Randomly storing the data in a certain candidate bucket; fingerprinting an element if there is a free location in any one of the candidate buckets
Figure 477393DEST_PATH_IMAGE001
Storing directly into the free position; if no free position exists in the two candidate buckets, the filter randomly selects one candidate bucket and uses the element fingerprint
Figure 899147DEST_PATH_IMAGE001
Replacing a stored fingerprint of an element in the bucket
Figure 961781DEST_PATH_IMAGE002
Then fingerprint the element
Figure 836196DEST_PATH_IMAGE002
Save to another candidate bucket for the element fingerprint: if element fingerprint
Figure 275268DEST_PATH_IMAGE002
Has a free location in another candidate bucket, then the element is fingerprinted
Figure 551528DEST_PATH_IMAGE002
Storing in a free location if the element fingerprint
Figure 785063DEST_PATH_IMAGE002
Does not have a free location, then element fingerprinting is used
Figure 146775DEST_PATH_IMAGE002
Replacing a stored fingerprint of an element in the bucket
Figure 858379DEST_PATH_IMAGE003
And is combined withRepeating the above process until an idle position is found; when the repeated times of element fingerprint position replacement in the process of storing the element fingerprints into the segment reach a predefined threshold value and no idle position of a certain element fingerprint is stored, judging that the storage position in the segment is saturated, generating a blank segment which has the same format as the segment but does not store any element fingerprint, linking the newly added blank segment to the tail part of the original filter hash table, and linking the blank segment to the back of the saturated segment, so that the blank segment and the saturated segment share the same segment address, and randomly storing the element fingerprint of which the idle position is not found in a certain candidate bucket of the blank segment.
In this embodiment, the target record query stage includes the following specific steps: when the value of the element of the domain to be queried in the domain to be queried is required to be queried
Figure 254725DEST_PATH_IMAGE007
When recording correspondingly, firstly searching the data set sub-block
Figure DEST_PATH_IMAGE016
In the query domain
Figure 659161DEST_PATH_IMAGE006
All corresponding spatial adaptive filters, and then as shown in fig. 3, the spatial adaptive filter first calculates the value of the target element in the domain to be queried as
Figure 773748DEST_PATH_IMAGE007
Segment address, two candidate bucket addresses and element fingerprint of the element, judging the query domain
Figure 289043DEST_PATH_IMAGE006
Whether the two candidate buckets of the corresponding section of the corresponding all spatial adaptive filters contain the element fingerprint of the target element or not is judged, if yes, the existence is returned, if not, whether the section is linked to a blank section due to saturation or not is judged, and if yes, whether the element fingerprint of the target element is contained or not is continuously judged in the two candidate buckets of the linked section; if there are no linked segments, then the return is not present.
And through the screening operation of the data set subblocks, acquiring target data set subblocks corresponding to the space self-adaptive filter which all contain the element fingerprint to be inquired, executing sequential search on each target data set subblock, judging whether the value of the recorded field element to be inquired is the same as the value of the target element, and outputting the record if the value of the recorded field element to be inquired is the same as the value of the target element.
In this embodiment, the data set update stage specifically includes the following steps: the data set updating comprises two types of operations of inserting records and deleting records; when data in a data set needs to be inserted, elements in a domain to be queried of the inserted new record need to be inserted into a corresponding spatial adaptive filter; if the new record is inserted into the free position in the existing data set subblock, inserting a new element into the space self-adaptive filter corresponding to the data set subblock; and if no free position exists in all the data set sub-blocks, dividing new data set sub-blocks for storing new records to be inserted, constructing a space adaptive filter for the newly divided data set sub-blocks, and inserting new elements into the newly constructed space adaptive filter. The method for inserting the new element into the space adaptive filter under the two working conditions is the same as the method for inserting the element during the construction of the space adaptive filter.
When the data in the data set needs to be deleted, deleting elements in the domain to be queried of the record to be deleted from the corresponding space adaptive filter; before deletion, the positions of the element fingerprints of the elements in the domain to be queried of the record to be deleted need to be queried in the corresponding space self-adaptive filter, the query method is the same as the target element query method, the element fingerprints of the target elements are deleted from the corresponding space self-adaptive filter, and if a plurality of the same element fingerprints exist in the same segment and the same barrel in the corresponding space self-adaptive filter, one element fingerprint is randomly deleted.
Further, the performance of the space adaptive mass data query method of the present invention is verified through simulation experiments, and the C + + is used to implement the space adaptive mass data query method and compare the performance with the dynamic bloom filter, the scalable bloom filter, the elastic bloom filter, and the dynamic cuckoo filter, as shown in fig. 4 to 6, DBF represents the dynamic bloom filter, SBF represents the scalable bloom filter, EBF represents the elastic bloom filter, DCF represents the dynamic cuckoo filter, and BBF represents the space adaptive filter in the present invention. All programs were compiled in g + + 9.2.1 on a standard server equipped with Intel (R) Xeon (R) Gold 5218R CPU @ 2.10GHz,64GB RAM, 1TB SSD, and CentOS Linux. The experimental results are as follows, wherein fig. 4 shows the throughput rate comparison of the present invention with that of other filters performing the insertion operation, and the results show that the method has the highest insertion throughput rate; FIG. 5 shows the throughput rate comparison of the present invention with other filters performing query operations, compared to the wide use of dynamic cuckoo filters, the present invention can achieve a 2.63-fold improvement in finding throughput rate indicators; fig. 6 shows the throughput of the present invention compared with the throughput of other filters when performing the deleting operation, and the result shows that the method has the optimal deleting throughput.
In conclusion, the space-adaptive mass data query method screens the data set subblocks by utilizing the constructed space-adaptive filter, so that the query quantity of data in a disk can be greatly reduced, and the mass data query speed is increased. Meanwhile, when data insertion and deletion are carried out, the spatial position and the mapping relation of other data are not influenced, and stable query acceleration can be provided in the whole life cycle of system operation. According to the space self-adaptive mass data query method, the rapid query of the recorded data is realized after the space self-adaptive filter corresponding to the data set is constructed, meanwhile, when the data set executes the updating operation, the updating of the space self-adaptive filter cannot influence other query processes, and after the data set executes a large amount of updating operations, the structure of the space self-adaptive filter cannot be changed greatly, so that the rapid query of the recorded data in the whole life cycle can be realized.

Claims (7)

1. A space self-adaptive mass data query method is characterized by comprising the following steps: the method comprises the following steps:
(1) dividing all records in the data set into a plurality of disjoint data set sub-blocks according to the storage positions of the data set, constructing a spatial adaptive filter for all elements in the same domain to be queried of all records in each data set sub-block, mapping and coding each element one by one, and then inserting the elements into the corresponding spatial adaptive filter to complete the construction of all the spatial adaptive filters;
(2) a certain target element in a domain to be queried is given, all spatial adaptive filters corresponding to the domain to be queried are traversed, whether the target element exists in a corresponding data set subblock is judged through the spatial adaptive filters, and a data set subblock set with the target element is obtained; traversing each data set subblock in the data set subblock set to obtain all records with target elements in a domain to be queried;
the hash table of the spatial adaptive filter is composed of a plurality of segments, each segment comprises a plurality of buckets, elements are stored in the buckets in a form of being encoded into element fingerprints, and each bucket comprises a plurality of element fingerprint storage positions;
when the space self-adaptive filter is constructed, firstly, the segment address and two candidate barrel addresses of each element in the domain to be inquired in the corresponding space self-adaptive filter are obtained through calculation, and the element fingerprint of the element is obtained according to the segment address and the candidate barrel addresses of the element
Figure DEST_PATH_IMAGE001
Randomly storing the data in a certain candidate bucket; fingerprinting an element if there is a free location in any one of the candidate buckets
Figure 727438DEST_PATH_IMAGE001
Storing directly into the free position; if no free position exists in the two candidate buckets, the filter randomly selects one candidate bucket and uses the element fingerprint
Figure 853526DEST_PATH_IMAGE001
Replacing a stored fingerprint of an element in the bucket
Figure 255688DEST_PATH_IMAGE002
Then fingerprint the element
Figure 531424DEST_PATH_IMAGE002
Save to another candidate bucket for the element fingerprint: if element fingerprint
Figure 728050DEST_PATH_IMAGE002
Has a free location in another candidate bucket, then the element is fingerprinted
Figure 747959DEST_PATH_IMAGE002
Storing in a free location if the element fingerprint
Figure 281708DEST_PATH_IMAGE002
Does not have a free location, then element fingerprinting is used
Figure 70673DEST_PATH_IMAGE002
Replacing a stored fingerprint of an element in the bucket
Figure DEST_PATH_IMAGE003
Repeating the above process until finding the idle position; if the storage position in the segment is saturated, generating a blank segment which has the same format as the segment and does not store any element fingerprint, sharing the same segment address with the saturated segment, and randomly storing the element fingerprint of which the free position is not found in a certain candidate bucket of the blank segment.
2. The method for querying mass data in a space self-adaption mode according to claim 1, wherein the method comprises the following steps: step 2, calculating a segment address, two candidate bucket addresses and an element fingerprint of a certain target element in a given domain to be queried, querying whether two candidate buckets of corresponding segments of all space adaptive filters corresponding to the domain to be queried contain the element fingerprint of the target element, and if so, returning to exist; if not, checking whether the segment is linked to a blank segment due to saturation, if so, continuing to check whether the element fingerprint of the target element is contained in two candidate buckets of the linked segment; if there are no linked segments, then the return is not present.
3. The method for querying mass data in a space self-adaption mode according to claim 2, wherein the method comprises the following steps: acquiring target data set sub-blocks corresponding to all spatial adaptive filters with element fingerprints to be queried, performing sequential search on each target data set sub-block, judging whether the value of the recorded field element to be queried is the same as the value of the target element, and outputting the record if the value of the recorded field element to be queried is the same as the value of the target element.
4. The method for querying mass data in a space self-adaption mode according to claim 1, wherein the method comprises the following steps: when deleting the records in the data set, simultaneously calculating the segment address, the two candidate bucket addresses and the element fingerprint of the element in the domain to be inquired of the deleted record in the corresponding space self-adaptive filter, and deleting the element fingerprint which is stored in one candidate bucket and is the same as the element fingerprint of the element to be deleted; if the same element fingerprint is not found in both candidate buckets, the same element fingerprint is found in the segment linked by saturation, and the above process is repeated until the element fingerprint is successfully found and deleted.
5. The method for querying mass data in a space self-adaption mode according to claim 1, wherein the method comprises the following steps: in the step 1, the size and the number of the data set subblocks are set according to the physical read-write characteristics of the actual storage equipment, and the storage address ranges of different data set subblocks are not crossed.
6. The method for querying mass data in a space self-adaption mode according to claim 5, wherein the method comprises the following steps: after the size of the data set subblock is set, if the record to be inserted is inserted into the free position in the data set subblock, updating a space adaptive filter corresponding to the data set subblock; if no free locations exist within all data set sub-blocks, new data set sub-blocks are partitioned for storing records to be inserted.
7. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any of the claims 1-5.
CN202111189827.8A 2021-10-13 2021-10-13 Space self-adaptive mass data query method Active CN113641681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111189827.8A CN113641681B (en) 2021-10-13 2021-10-13 Space self-adaptive mass data query method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111189827.8A CN113641681B (en) 2021-10-13 2021-10-13 Space self-adaptive mass data query method

Publications (2)

Publication Number Publication Date
CN113641681A CN113641681A (en) 2021-11-12
CN113641681B true CN113641681B (en) 2022-02-22

Family

ID=78426465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111189827.8A Active CN113641681B (en) 2021-10-13 2021-10-13 Space self-adaptive mass data query method

Country Status (1)

Country Link
CN (1) CN113641681B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891858B (en) * 2024-03-14 2024-07-05 苏州大学 Space-time efficient parallel approximate member query method and system
CN117932128B (en) * 2024-03-21 2024-07-09 苏州大学 Approximate member query method and system capable of dynamically adjusting size

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521303A (en) * 2011-11-30 2012-06-27 北京人大金仓信息技术股份有限公司 Single-table multi-column sequence storage method for column database
CN110046164A (en) * 2019-04-16 2019-07-23 中国人民解放军国防科技大学 Index independent grain distribution filter, consistency grain distribution filter and operation method
CN111259049A (en) * 2020-01-17 2020-06-09 中国平安人寿保险股份有限公司 Information query method, information query device and terminal equipment
CN111552692A (en) * 2020-04-30 2020-08-18 南方科技大学 Plus-minus cuckoo filter
CN111858651A (en) * 2020-09-22 2020-10-30 中国人民解放军国防科技大学 Data processing method and data processing device
EP3889797A1 (en) * 2018-11-27 2021-10-06 Alibaba Group Holding Limited Database index and database query processing method, apparatus, and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630955B (en) * 2015-12-24 2019-01-29 华中科技大学 A kind of data acquisition system member management method of high-efficiency dynamic
US11507590B2 (en) * 2019-09-13 2022-11-22 Oracle International Corporation Techniques for in-memory spatial object filtering
CN111552693B (en) * 2020-04-30 2023-04-07 南方科技大学 Tag cuckoo filter
CN112632337B (en) * 2020-12-28 2023-12-22 南方科技大学 Element management method applied to firework filter and firework filter

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521303A (en) * 2011-11-30 2012-06-27 北京人大金仓信息技术股份有限公司 Single-table multi-column sequence storage method for column database
EP3889797A1 (en) * 2018-11-27 2021-10-06 Alibaba Group Holding Limited Database index and database query processing method, apparatus, and device
CN110046164A (en) * 2019-04-16 2019-07-23 中国人民解放军国防科技大学 Index independent grain distribution filter, consistency grain distribution filter and operation method
CN111259049A (en) * 2020-01-17 2020-06-09 中国平安人寿保险股份有限公司 Information query method, information query device and terminal equipment
CN111552692A (en) * 2020-04-30 2020-08-18 南方科技大学 Plus-minus cuckoo filter
CN111858651A (en) * 2020-09-22 2020-10-30 中国人民解放军国防科技大学 Data processing method and data processing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于负载均衡的高效布谷鸟过滤器研究;王飞越;《中国优秀硕士学位论文全文数据库信息科技辑》;20200315;第I138-241页 *

Also Published As

Publication number Publication date
CN113641681A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
KR102564170B1 (en) Method and device for storing data object, and computer readable storage medium having a computer program using the same
US20070124277A1 (en) Index and Method for Extending and Querying Index
US20100228914A1 (en) Data caching system and method for implementing large capacity cache
CN104156380A (en) Distributed memory Hash indexing method and system
CN113641681B (en) Space self-adaptive mass data query method
CN110196847A (en) Data processing method and device, storage medium and electronic device
CN103019887A (en) Data backup method and device
CN113867627B (en) Storage system performance optimization method and system
CN110888837B (en) Object storage small file merging method and device
CN105653609A (en) Memory-based data processing method and device
US11461239B2 (en) Method and apparatus for buffering data blocks, computer device, and computer-readable storage medium
CN113535670B (en) Virtual resource mirror image storage system and implementation method thereof
CN105468644B (en) Method and equipment for querying in database
CN117633105A (en) Time-series data storage management method and system based on time partition index
CN108021562B (en) Disk storage method and device applied to distributed file system and distributed file system
CN111831691B (en) Data reading and writing method and device, electronic equipment and storage medium
KR20230026946A (en) Key value storage device with hashing
CN113326262B (en) Data processing method, device, equipment and medium based on key value database
CN110825747B (en) Information access method, device and medium
CN112328587A (en) Data processing method and device for ElasticSearch
CN116975006A (en) Data deduplication method, system and medium based on disk cache and B-tree index
CN115576947A (en) Data management method and device, combined library, electronic equipment and storage medium
CN111723266B (en) Mass data processing method and device
CN115934583A (en) Hierarchical caching method, device and system
CN113360551A (en) Method and system for storing and rapidly counting time sequence data in shooting range

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant