WO2024078122A1 - 数据库表扫描的方法、装置以及设备 - Google Patents
数据库表扫描的方法、装置以及设备 Download PDFInfo
- Publication number
- WO2024078122A1 WO2024078122A1 PCT/CN2023/113246 CN2023113246W WO2024078122A1 WO 2024078122 A1 WO2024078122 A1 WO 2024078122A1 CN 2023113246 W CN2023113246 W CN 2023113246W WO 2024078122 A1 WO2024078122 A1 WO 2024078122A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- boolean
- data set
- row
- filtering
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 75
- 238000001914 filtration Methods 0.000 claims abstract description 139
- 230000008569 process Effects 0.000 claims description 31
- 239000002131 composite material Substances 0.000 claims description 24
- 230000000717 retained effect Effects 0.000 claims description 11
- 238000012217 deletion Methods 0.000 claims description 6
- 230000037430 deletion Effects 0.000 claims description 6
- 238000010586 diagram Methods 0.000 description 15
- 238000012545 processing Methods 0.000 description 11
- 230000006870 function Effects 0.000 description 9
- 230000006872 improvement Effects 0.000 description 9
- 238000004590 computer program Methods 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 7
- 101001121408 Homo sapiens L-amino-acid oxidase Proteins 0.000 description 4
- 102100026388 L-amino-acid oxidase Human genes 0.000 description 4
- 150000001875 compounds Chemical class 0.000 description 4
- 101000827703 Homo sapiens Polyphosphoinositide phosphatase Proteins 0.000 description 2
- 102100023591 Polyphosphoinositide phosphatase Human genes 0.000 description 2
- 238000013459 approach Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 101100233916 Saccharomyces cerevisiae (strain ATCC 204508 / S288c) KAR5 gene Proteins 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 239000006227 byproduct Substances 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000001413 cellular effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 229920001296 polysiloxane Polymers 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 239000000047 product Substances 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 239000010979 ruby Substances 0.000 description 1
- 229910001750 ruby Inorganic materials 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
Definitions
- the present invention relates to the field of database technology, and in particular to a method, device and equipment for scanning a database table.
- a compound filtering condition containing multiple single filtering conditions is used.
- the single filtering conditions are combined through logical operators such as AND and OR.
- the primary key or physical address of the data row is usually used for representation and comparison, resulting in a large amount of data, a large amount of calculation, a large consumption of memory and hard disk space, and a large consumption of CPU.
- One or more embodiments of the present specification provide a database table scanning method, apparatus, device, and storage medium to solve the following technical problem: a more efficient database table scanning solution is needed.
- one or more embodiments of the present specification provide a database table scanning method, wherein the database table is stored through multiple data sets, and the multiple data sets include a baseline data set and an incremental data set.
- the method includes: determining the data column involved in the set filtering condition in the database table as the target column, and the database table has a corresponding Boolean string for each of the data sets, and the Boolean bit in the Boolean string corresponds to the virtual row number of the data row in the corresponding data set; for each data row corresponding to the target column, executing: searching in the incremental data set and the baseline data set to determine the Boolean string corresponding to the data set where the latest data of the data row is located; judging whether the latest data meets the filtering condition, and assigning a value to the Boolean bit corresponding to the virtual row number of the data row in the data set according to the judgment result in the Boolean string; and determining the filtering result according to each of the assigned Boolean strings.
- One or more embodiments of the present specification provide a database table scanning device, wherein the database table is stored in multiple data sets, and the multiple data sets include a baseline data set and an incremental data set.
- the device includes: a target column determination module, which determines the data column involved in the set filtering condition in the database table as the target column, and the database table has a corresponding Boolean string for each of the data sets, and the Boolean bit in the Boolean string corresponds to the virtual row number of the data row in the corresponding data set; a target column scanning module, which executes, for each data row corresponding to the target column: searching in the incremental data set and the baseline data set to determine the Boolean string corresponding to the data set where the latest data of the data row is located; judging whether the latest data meets the filtering condition, and assigning a value to the Boolean bit corresponding to the virtual row number of the data row in the data set according to the judgment result in the Boolean string; and a filtering result determination module,
- One or more embodiments of this specification provide a database table scanning device, wherein the database table is stored through multiple data sets, wherein the multiple data sets include a baseline data set and an incremental data set, and the device includes at least one processor and a memory connected to the at least one processor.
- Instructions that can be executed by the at least one processor are executed by the at least one processor so that the at least one processor can: determine the data column involved in the set filtering condition in the database table as the target column, the database table has a corresponding Boolean string for each data set, and the Boolean bit in the Boolean string corresponds to the virtual row number of the data row in the corresponding data set; for each data row corresponding to the target column, respectively, execute: search in the incremental data set and the baseline data set to determine the Boolean string corresponding to the data set where the latest data of the data row is located; determine whether the latest data meets the filtering condition, and according to the judgment result, assign a value to the Boolean bit corresponding to the virtual row number of the data row in the data set in the Boolean string; and determine the filtering result according to each Boolean string after the assignment.
- One or more embodiments of the present specification provide a non-volatile computer storage medium, wherein a database table is stored through multiple data sets, wherein the multiple data sets include a baseline data set and an incremental data set, and the medium stores computer executable instructions, wherein the computer executable instructions are configured to: determine a data column involved in a set filtering condition in the database table as a target column, wherein the database table has a corresponding Boolean string for each of the data sets, wherein the Boolean bit in the Boolean string corresponds to a virtual row number of a data row in the corresponding data set; and for each data row corresponding to the target column, execute: search in the incremental data set and the baseline data set to determine a Boolean string corresponding to the data set where the latest data of the data row is located; determine whether the latest data satisfies the filtering condition, and according to the determination result, assign a value to the Boolean bit in the Boolean string corresponding to the virtual row number of the data row
- At least one of the above-mentioned technical solutions adopted in one or more embodiments of the present specification can achieve the following beneficial effects: based on the coordination of virtual row numbers and Boolean strings, various operations involved in compound filtering conditions can be efficiently performed through Boolean bit operations, and based on multi-way scanning (the multi-way here can include the processing of multiple target columns, and can also include the processing of multiple Boolean strings), the scanning and filtering efficiency is effectively improved; not only that, it is particularly suitable for databases that use storage structures such as LSM-Tree that include baseline data sets and incremental data sets, and the distribution of virtual row numbers and Boolean strings and the scanning process are adjusted to make the virtual row numbers and Boolean strings involved in the scanning process more lightweight, and the latest data can be scanned more quickly, which helps to further improve the scanning and filtering efficiency.
- FIG1 is a schematic flow chart of a database table scanning method provided by one or more embodiments of this specification.
- FIG. 2a and FIG. 2b are schematic diagrams showing the principle of a database multi-layer storage structure provided by one or more embodiments of this specification.
- FIG3 is a schematic diagram of a correspondence between a virtual row number and a Boolean string provided by one or more embodiments of this specification.
- FIG. 4 is a schematic flow chart of a data column scanning solution provided by one or more embodiments of this specification.
- FIG5 is a schematic flow chart of a physical-chemical solution provided in one or more embodiments of this specification.
- FIG. 6 is a schematic diagram of the structure of a database table scanning device provided by one or more embodiments of this specification.
- FIG. 7 is a schematic diagram of the structure of a database table scanning device provided by one or more embodiments of this specification.
- the embodiments of this specification provide a database table scanning method, device, equipment and storage medium.
- Boolean strings are represented by data structures such as arrays or strings. Boolean strings contain multiple Boolean bits, each of which corresponds to a virtual row number.
- the Boolean bit takes a value of 1 or 0 (usually 1 means true and 0 means false), which is used to indicate whether the data indicated by the corresponding virtual row number meets the filtering condition.
- AND, OR and other operations between multiple data columns can be efficiently performed through the corresponding AND, OR and other operations of Boolean strings.
- LSM-Tree Log Structured Merge Tree
- FIG1 is a flowchart of a database table scanning method provided by one or more embodiments of the present specification. The process is executed, for example, on a database server or a business processing device connected to the database.
- the database table is stored through multiple data sets, which include a baseline data set and an incremental data set.
- LSM-Tree is a typical structure of this type.
- the baseline data set is usually represented as a baseline data layer
- the incremental data set is represented as one or more incremental data layers.
- FIGS. 2a and 2b are schematic diagrams showing the principle of a database multi-layer storage structure provided in one or more embodiments of this specification.
- a baseline data layer constitutes the above-mentioned baseline data set
- one or more incremental data layers constitute the above-mentioned incremental data set.
- the baseline data layer In the baseline data layer, most of the data corresponding to the baseline moment that is relatively far away from the current moment and stable is often stored.
- the data in the baseline data layer is usually stored on the disk. After the baseline moment, the newly added data (the changed data caused by data insertion, data deletion and other operations) is temporarily stored in the incremental data layer. When appropriate, the data in the incremental data layer will be merged into the baseline data layer.
- the data in the incremental data layer is usually in memory, and some of it can be stored on the disk. In general, the time series corresponding to the baseline data layer is older than the time series corresponding to the incremental data layer.
- FIG. 2b shows that data can be stored in a data set according to data columns or data column groups.
- Each data column can be stored separately, for example, column group 2 and column group 3 each contain only one data column.
- Multiple data columns can also be stored together through column groups, for example, column group 1 contains two data columns. The number of data rows corresponding to each data column can be different.
- the process in FIG1 includes the following steps.
- Step S102 Determine the data columns involved in the set filtering conditions in the database table as target columns.
- the database table has a corresponding Boolean string for each data set, and the Boolean bit in the Boolean string corresponds to the virtual row number of the data row in the corresponding data set.
- the Boolean strings corresponding to each data set may be independent of each other.
- the baseline data set has its corresponding Boolean string. Assuming that there are N rows of data rows sorted by primary key in the baseline data set, their virtual row numbers are 1 to N, and the Boolean string corresponding to the baseline data set has N Boolean bits and a length of N/8 bytes; the incremental data set also has its corresponding Boolean string. Assuming that there are M rows of data rows sorted by primary key in the incremental data set (generally speaking, M is much smaller than N), their virtual row numbers are 1 to M, and the Boolean string corresponding to the incremental data set has M Boolean bits.
- each incremental data layer can correspond to the same Boolean string (the data of these incremental data layers need to be integrated so as to be uniformly represented by a Boolean string).
- the advantages of this approach include a small number of Boolean strings that need to be iterated subsequently; or, each incremental data layer can have its own corresponding independent Boolean string (or only a number of non-all incremental data layers correspond to the same Boolean string).
- the advantages of this approach include ease of controlling the scale of virtual row numbers. For ease of description, some of the following embodiments are mainly illustrated by taking the case where the incremental data set as a whole has only one corresponding Boolean string as an example. For the case where multiple incremental data layers correspond to different Boolean strings, the subsequent processing of the incremental data set can be referred to, and each incremental data layer can be processed similarly.
- the value of the Boolean bit in the Boolean string indicates whether the corresponding data row meets the current single filter condition or the entire composite filter condition.
- the value of the Boolean bit is the first value, it means that the condition is met and it is retained after filtering.
- the value is the second value, it means that the condition is not met and it is discarded after filtering.
- the Boolean bit is a binary variable with a value of 1 or 0. For the convenience of description, it is assumed that the first value is 1 and the second value is 0, and vice versa.
- FIG. 3 is a schematic diagram of the correspondence between a virtual row number and a Boolean string provided in one or more embodiments of this specification.
- the left side shows the virtual row numbers in the baseline data set or the incremental data set, starting from 1.
- 1 to 6 are counted, and the corresponding Boolean string is on the right, represented by an array.
- Each element of the array is a Boolean bit, which corresponds to the virtual row number on the left one by one. It is assumed that each Boolean bit has been assigned a value. It can be seen that the value of the Boolean bit corresponding to the current virtual row number 2, 4, and 5 is 1, which means that these rows meet the filtering conditions, and the other rows do not meet the filtering conditions.
- the Boolean string has a small amount of data and a small storage burden. The low-cost effect can be further improved by compressing the Boolean string.
- the target columns may be scanned in parallel to improve efficiency.
- Step S104 For each data row corresponding to the target column, execute: searching in the incremental data set and the baseline data set to determine the Boolean string corresponding to the data set where the latest data of the data row is located; judging whether the latest data satisfies the filtering condition, and assigning a value to the Boolean bit corresponding to the virtual row number of the data row in the data set according to the judgment result.
- Step S104 shows the scanning process of the target column. It includes the sub-steps of judging whether the filtering condition is met and assigning a value to the Boolean bit. It should be noted that when there are multiple single filtering conditions, these two steps can be cross-coordinated. Specifically, when judging whether the latest data meets the filtering condition, it can be judged whether a part of the data (for example, the row data belonging to a certain target column) meets the corresponding single filtering condition, and then the Boolean bit is assigned accordingly. However, the assignment at this time may not be the final result. It is necessary to continue to judge and assign values for other target columns and other single filtering conditions according to the assignment. Finally, the final assignment of the relevant Boolean bit to the composite filtering condition as a whole is obtained to obtain the subsequent filtering result.
- the latest data of the data row should be used as the basis, and the latest data of a data row may be in a baseline data set (for example, the data row has not been updated recently) or in an incremental data set (for example, the data row has been updated recently).
- the latest data specifically refers to the latest data of the corresponding data row on the current target column.
- the latest data of the same data row on a specified data column is in one of the data sets.
- the latest data can be searched in the incremental data set first in the order from new to old. If the latest data is found, there is no need to search in the baseline data set. If not found, search in the baseline data set again.
- the virtual row number of the data row in the data set is assigned a first value to the corresponding Boolean bit in the Boolean string, to indicate that it is retained after filtering; if the judgment result is no, the virtual row number of the data row in the data set is assigned a second value to the corresponding Boolean bit in the Boolean string, to indicate that it is discarded after filtering.
- the process in the previous paragraph is concise and easy to understand.
- the filter condition is a composite filter condition involving multiple data columns (i.e., multiple target columns) in the database table
- the process will also involve Boolean bit operations. Specifically, to determine the same Boolean bit,
- the multiple single filter conditions included in the composite filter condition respectively correspond to the assignments, and according to the composite operation of the multiple single filter conditions in the composite filter condition, Boolean operation is performed between the corresponding assignments accordingly, and according to the result of the Boolean operation, it is determined whether the latest data meets the composite filter condition.
- the first target column determine whether the corresponding single-item filtering condition is satisfied and assign a value accordingly, and then the obtained Boolean string is given to the next target column so as to perform a Boolean operation with the value corresponding to the next target column.
- the AND operator the data row with the Boolean bit value of 0 can be directly skipped, while for the OR operator, each corresponding data row needs to be determined and assigned a value, and then a bitwise OR operation is performed with the previous Boolean string.
- the implementation method is not limited to this one, but is diverse.
- a Boolean string copy can also be generated for each target column, and then the Boolean string copy is assigned a value according to the target column corresponding to it, and then a Boolean bit operation is performed between these Boolean string copies to obtain a Boolean string with a final assignment.
- Step S106 Determine the filtering result according to the assigned Boolean strings.
- the filtering result is determined according to the Boolean bit assigned to the first value in the Boolean string. Without considering the data order, the data rows corresponding to the Boolean bits with the first value in each Boolean string after the assignment can be determined as the reserved data rows, and the filtering result is determined according to each reserved data row. If the relevant redundant data and unexpected data are excluded more cleanly, it can be considered to directly determine each reserved data row as the filtering result.
- a scheme for determining filtering results is further provided, specifically including: in the Boolean string corresponding to the baseline data set, determining the virtual row number corresponding to the first Boolean bit whose value is the first value as the first row number, in the Boolean string corresponding to the incremental data set, determining the virtual row number corresponding to the first Boolean bit whose value is the first value as the second row number, determining the first data row identifier corresponding to the first row number, and the second data row identifier corresponding to the second row number, the first data row identifier and the second data row identifier are both primary keys or physical addresses that can uniquely identify a data row, comparing the sizes of the first data row identifier and the second data row identifier, taking the data row corresponding to the smaller one as the retained data row, determining the virtual row number corresponding to the next Boolean bit whose value is the first value corresponding to the smaller one, and continuing the above
- the filtering results obtained in this way are in the order of primary keys or physical addresses.
- this processing method is particularly efficient, making full use of the partial order that the virtual row number can represent, thereby effectively reducing redundant sorting actions and avoiding centralized sorting of a large number of primary keys or physical addresses.
- a more intuitive supplementary explanation will be given later in combination with actual scenarios.
- redundant data and unexpected data are mentioned above, and redundant data includes old data corresponding to the latest data and Boolean bits corresponding to the old data, and unexpected data includes deleted data and its corresponding Boolean bits.
- the corresponding Boolean bits can be actively assigned to second values to prevent these Boolean bits from still being assigned values at a previous time that cannot correctly reflect the latest situation.
- the baseline data set After searching in the incremental data set and the baseline data set, determine whether the latest data found exists in the incremental data set or in the baseline data set. If it exists in the incremental data set and the data of the data row also exists in the baseline data set (which means that the data in the baseline data set is old data), then determine the Boolean string corresponding to the baseline data set, which is the virtual row number of the data row in the baseline data set, and assign the corresponding Boolean bit in the Boolean string to the second value to indicate that it is discarded after filtering.
- the Boolean string corresponding to the data set where the latest data of the data row is located determines whether the latest data of the data row contains a deletion mark.
- the operation of deleting the mark on the data row this time is a data deletion operation. If so, the virtual row number of the data row in the data set is used, and the corresponding Boolean bit in the Boolean string is assigned a second value to indicate that it is discarded after filtering.
- this specification also provides some specific implementation schemes and extension schemes of the method, which will be described below.
- the data reading in the database where the above-mentioned database table is located adopts a snapshot reading method, thereby being able to introduce virtual row numbers similar to read-only data.
- the above-mentioned data rows that can be obtained are also snapshot data, which facilitates the use of virtual row numbers and facilitates efficient execution of data filtering operations involving multiple data columns.
- FIG. 4 is a schematic flow chart of a data column scanning solution provided by one or more embodiments of this specification.
- FIG4 shows a scanning process for one of the data columns C1.
- the process may include the following steps: start scanning the data column C1, and initiate a merge of the baseline data set and the incremental data set based on each Boolean string through the primary key or ROWID.
- the Boolean bit corresponding to the corresponding virtual row number in the Boolean string corresponding to the incremental data set is assigned a value of 1, if not, it is assigned a value of 0; if not, it means that the latest data of the data row is in the baseline data set, then it is determined whether the latest data in the baseline data set meets the filtering condition corresponding to C1, if so, the Boolean bit corresponding to the corresponding virtual row number in the Boolean string corresponding to the baseline data set is assigned a value of 1, if not, it is assigned a value of 0; continue to iterate the above steps for the next data row corresponding to C1 until the entire C1 is scanned.
- C1 filtering condition corresponding to C1
- Fig. 5 is a schematic diagram of a process flow of a materialization solution provided by one or more embodiments of this specification. Materialization here means obtaining data that meets the filtering conditions from the data table.
- the Boolean strings corresponding to the baseline data set and the incremental data set are obtained.
- these Boolean strings can be merged with the primary key or ROWID as the key.
- start materialization for the Boolean strings corresponding to the baseline data set and the incremental data set, respectively, take out the virtual row numbers corresponding to their first non-zero Boolean bits, denoted as base_vid and inc_vid; find the primary keys or ROWIDs corresponding to base_vid and inc_vid, respectively, denoted as base_pk and inc_pk;
- the current base_vid is consumed, and the virtual row number corresponding to the next non-zero Boolean bit of the Boolean string corresponding to the baseline data set is taken as base_vid again; iteratively execute the steps of comparison and data acquisition to determine whether the Boolean strings corresponding to the baseline data set and the incremental data set have been processed. If not, continue to iterate. If so, execute the next step; take the taken snapshot data as the materialized result (i.e., the filtering result), output the materialized result, and return it to the required user.
- the materialized result i.e., the filtering result
- one or more embodiments of this specification also provide devices and equipment corresponding to the above method, as shown in Figures 6 and 7.
- FIG. 6 is a schematic diagram of the structure of a database table scanning device provided by one or more embodiments of the present specification, wherein the database table is stored in a plurality of data sets, wherein the plurality of data sets include a baseline data set and an incremental data set, and the device comprises: a target column determination module 602, which determines a data column involved in a set filtering condition in the database table as a target column, wherein the database table has a corresponding Boolean string for each of the data sets, and a Boolean bit in the Boolean string corresponds to a virtual row number of a data row in the corresponding data set; a target column scanning module 604, which executes, for each data row corresponding to the target column, the following steps: searching in the incremental data set and the baseline data set to determine a Boolean string corresponding to the data set where the latest data of the data row is located; determining whether the latest data satisfies the filtering condition, and assigning a value to a Boolean
- the target column scanning module 604 judges that the virtual row number of the data row in the data set is yes, the corresponding Boolean bit in the Boolean string is assigned a first value to indicate that it is retained after filtering; if the target column scanning module 604 judges that the virtual row number of the data row in the data set is yes, the corresponding Boolean bit in the Boolean string is assigned a second value to indicate that it is discarded after filtering.
- the target column scanning module 604 after searching in the incremental data set and the baseline data set, determines whether the latest data exists in the incremental data set or in the baseline data set; if it exists in the incremental data set and the data of the data row also exists in the baseline data set, then determines the Boolean string corresponding to the baseline data set, which is the virtual row number of the data row in the baseline data set, and assigns the corresponding Boolean bit in the Boolean string a second value to indicate that it is discarded after filtering.
- the filtering condition is a composite filtering condition involving multiple data columns of the database table; the target column scanning module 604 determines that the same Boolean bit is included in multiple single-item filtering conditions.
- the filter conditions are assigned values respectively; according to the composite operation of the multiple single filter conditions in the composite filter condition, a Boolean operation is performed on each of the corresponding assignments; and according to the result of the Boolean operation, it is determined whether the latest data meets the composite filter condition.
- the target column scanning module 604 determines whether the latest data of the data row contains a deletion mark after determining the Boolean string corresponding to the data set in which the latest data of the data row is located; if so, the virtual row number of the data row in the data set is assigned a second value to the corresponding Boolean bit in the Boolean string to indicate that it is discarded after filtering.
- the filtering result determination module 606 determines the data rows corresponding to the Boolean bits whose values are the first value in each of the Boolean strings after assignment as the reserved data rows; and determines the filtering result according to each of the reserved data rows.
- the filtering result determination module 606 determines, in the Boolean string corresponding to the baseline data set, a virtual row number corresponding to the first Boolean bit whose value is the first value, as the first row number; determines, in the Boolean string corresponding to the incremental data set, a virtual row number corresponding to the first Boolean bit whose value is the first value, as the second row number; determines a first data row identifier corresponding to the first row number, and a second data row identifier corresponding to the second row number, wherein the first data row identifier and the second data row identifier are both primary keys or both physical addresses; compares the sizes of the first data row identifier and the second data row identifier, and takes the data row corresponding to the smaller one as the retained data row; determines the virtual row number corresponding to the next Boolean bit whose value is the first value, corresponding to the smaller one, to continue the above-mentioned comparison and data row taking process until the Boole
- the data row is snapshot data.
- FIG. 7 is a schematic diagram of the structure of a database table scanning device provided by one or more embodiments of the present specification, wherein the database table is stored in a plurality of data sets, wherein the plurality of data sets include a baseline data set and an incremental data set, and the device includes: at least one processor; and a memory connected to the at least one processor in communication; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: determine the data column involved in the set filtering condition in the database table as the target column, wherein the database table has a corresponding Boolean string for each of the data sets, and the Boolean bit in the Boolean string corresponds to the virtual row number of the data row in the corresponding data set; respectively, for each data row corresponding to the target column, perform: searching in the incremental data set and the baseline data set to determine the Boolean string corresponding to the data set where the latest data of the data row is located; judging whether the
- one or more embodiments of the present specification also provide a non-volatile computer storage medium corresponding to the method in Figure 1, wherein the database table is stored through multiple data sets, wherein the multiple data sets include a baseline data set and an incremental data set, and the medium stores computer executable instructions, wherein the computer executable instructions are configured to: determine the data column involved in the set filtering condition in the database table as the target column, wherein the database table has a corresponding Boolean string for each of the data sets, wherein the Boolean bit in the Boolean string corresponds to the virtual row number of the data row in the corresponding data set; and for each data row corresponding to the target column, execute: search in the incremental data set and the baseline data set to determine the data set where the latest data of the data row is located.
- Corresponding Boolean string judging whether the latest data satisfies the filtering condition, and assigning a value to the corresponding Boolean bit in the Boolean string for the virtual row number of the data row in the data set according to the judging result; and determining the filtering result according to the assigned Boolean strings.
- a programmable logic device such as a field programmable gate array (FPGA)
- FPGA field programmable gate array
- HDL Hardware Description Language
- HDL Very-High-Speed Integrated Circuit Hardware Description Language
- ABEL Advanced Boolean Expression Language
- AHDL Altera Hardware Description Language
- HDCal Joint CHDL
- JHDL Java Hardware Description Language
- Lava Lava
- Lola MyHDL
- PALASM RHDL
- VHDL Very-High-Speed Integrated Circuit Hardware Description Language
- Verilog Verilog
- the controller may be implemented in any suitable manner, for example, the controller may take the form of a microprocessor or processor and a computer readable medium storing a computer readable program code (e.g., software or firmware) executable by the (micro)processor, a logic gate, a switch, an application specific integrated circuit (ASIC), a programmable logic controller, and an embedded microcontroller, examples of which include but are not limited to the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320, and the memory controller may also be implemented as part of the control logic of the memory.
- a computer readable program code e.g., software or firmware
- the controller may be implemented in the form of a logic gate, a switch, an application specific integrated circuit, a programmable logic controller, and an embedded microcontroller by logically programming the method steps. Therefore, such a controller may be considered as a hardware component, and the means for implementing various functions included therein may also be considered as a structure within the hardware component. Or even, the means for implementing various functions may be considered as both a software module for implementing the method and a structure within the hardware component.
- a typical implementation device is a computer.
- the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
- the embodiments of this specification may be provided as methods, systems, or computer program products. Therefore, the embodiments of this specification may be in the form of complete hardware embodiments, complete software embodiments, or embodiments in combination with software and hardware. Moreover, the embodiments of this specification may be in the form of a computer program product implemented in one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
- computer-usable storage media including but not limited to disk storage, CD-ROM, optical storage, etc.
- These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
- These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
- a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
- processors CPU
- input/output interfaces network interfaces
- memory volatile and non-volatile memory
- Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
- RAM random access memory
- ROM read-only memory
- flash RAM flash memory
- Computer readable media include permanent and non-permanent, removable and non-removable media that can be implemented by any method or technology to store information.
- Information can be computer readable instructions, data structures, program modules or other data.
- Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device.
- computer readable media does not include temporary computer readable media (transitory media), such as modulated data signals and carrier waves.
- program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
- This specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network.
- program modules may be located in local and remote computer storage media, including storage devices.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
本说明书实施例公开了一种数据库表扫描方法、装置以及设备。数据库表通过多个数据集合存储,多个数据集合包含基线数据集合、增量数据集合,方案包括:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
Description
本说明书涉及数据库技术领域,尤其涉及数据库表扫描的方法、装置以及设备。
在对数据库表的数据进行分析处理时,通常涉及到多个数据列的扫描以得到过滤结果,采用了包含多个单项过滤条件的复合过滤条件,单项过滤条件之间通过AND、OR等逻辑运算符进行组合,比如,某复合过滤条件表示为“C1<10AND(C2>5 OR C3==8)”,C1、C2和C3分别表示一个数据列。
目前,在这些AND、OR等各种组合的运算中,通常使用数据行的主键或物理地址进行表示和比较,导致数据量大、运算量大,内存和硬盘空间消耗大、CPU消耗大。
基于此,需要效率更高的数据库表扫描方案。
发明内容
本说明书一个或多个实施例提供一种数据库表扫描方法、装置、设备以及存储介质,用以解决如下技术问题:需要效率更高的数据库表扫描方案。
为解决上述技术问题,本说明书一个或多个实施例提供一种数据库表扫描方法,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述方法包括:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
本说明书一个或多个实施例提供的一种数据库表扫描装置,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述装置包括:目标列确定模块,确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;目标列扫描模块,分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;过滤结果确定模块,根据赋值后的各所述布尔串,确定过滤结果。
本说明书一个或多个实施例提供的一种数据库表扫描设备,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述设备包括至少一个处理器以及与所述至少一个处理器通信连接的存储器。其中,所述存储器存储有
可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
本说明书一个或多个实施例提供的一种非易失性计算机存储介质,数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述介质存储有计算机可执行指令,所述计算机可执行指令设置为:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
本说明书一个或多个实施例采用的上述至少一个技术方案能够达到以下有益效果:基于虚拟行号和布尔串的配合,能够通过布尔位运算高效地执行对复合过滤条件涉及的各种运算,并基于多路扫描(这里的多路可以包括对多个目标列的处理,也可以包括对多个布尔串的处理),有效地提高了扫描过滤效率;不仅如此,尤其适配于诸如采用了LSM-Tree等包含了基线数据集合、增量数据集合的存储结构的数据库,调整了虚拟行号和布尔串的分布以及扫描过程,以使得扫描过程中涉及的虚拟行号和布尔串更加轻量化,且更快地扫描到最新数据,有助于进一步提高扫描过滤效率。
图1为本说明书一个或多个实施例提供的一种数据库表扫描方法的流程示意图。
图2a和图2b为本说明书一个或多个实施例提供的一种数据库多层存储结构的原理示意图。
图3为本说明书一个或多个实施例提供的一种虚拟行号与布尔串的对应关系示意图。
图4为本说明书一个或多个实施例提供的一种数据列扫描方案的流程示意图。
图5为本说明书一个或多个实施例提供的一种物化方案的流程示意图。
图6为本说明书一个或多个实施例提供的一种数据库表扫描装置的结构示意图。
图7为本说明书一个或多个实施例提供的一种数据库表扫描设备的结构示意图。
本说明书实施例提供一种数据库表扫描方法、装置、设备以及存储介质。
为了使本技术领域的人员更好地理解本说明书中的技术方案,下面将结合本说明书实施例中的附图,对本说明书实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本说明书实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本申请保护的范围。
针对背景技术中的问题,考虑引入虚拟行号和对应的布尔串,来针对过滤条件进行扫描过滤。
在数据库表中,将所有的数据行按照主键或物理地址的顺序,从0或1(后面描述时假定为1)开始用整数进行顺序连续计数,作为虚拟行号。主键是数据库表中指定的若干列,组合后其中没有重复行,数据库数据可按主键排序和聚集。物理地址指数据行实际存储的位置,比如用ROWID表示。布尔串比如采用数组或字符串等数据结构表示,布尔串包含多个布尔位,每个布尔位分别对应于一个虚拟行号,布尔位取值为1或0(通常1表示真,0表示假),用于表示对应的虚拟行号所指示的数据是否满足过滤条件。如此,多个数据列之间的AND、OR等操作,则可以通过布尔串相应的AND、OR等操作高效进行。不仅如此,考虑到该方案针对相对静态的数据效果较好,但是,对于持续修改的数据,特别是日志结构合并树(Log Structured Merge Tree,LSM-Tree)的数据,由于数据行在不断地增加或删除,动态性很强,导致虚拟行号无法直接使用,因此适配于LSM-Tree的场景,对该方案进行了进一步的改进。下面针对这样的思路,继续详细说明。
图1为本说明书一个或多个实施例提供的一种数据库表扫描方法的流程示意图。该流程比如在数据库服务器或连接数据库的业务处理设备上执行。在图1的场景下,数据库表通过多个数据集合存储,这多个数据集合包含基线数据集合、增量数据集合,LSM-Tree即属于较为典型的这类结构,在LSM-Tree中,通常将基线数据集合表示为一个基线数据层,将增量数据集合表示为一个或多个增量数据层。
参见图2a和图2b,为本说明书一个或多个实施例提供的一种数据库多层存储结构的原理示意图。
在图2a中,有多个层级的数据层,一个基线数据层构成了上述的基线数据集合,一个或多个增量数据层构成了上述的增量数据集合。
在基线数据层中,往往保存着距离当前时刻相对远一些和稳定的基线时刻对应的绝大部分数据,基线数据层中的数据通常在磁盘中。在该基线时刻以后,新增的数据(数据插入、数据删除等操作导致的变化数据)暂存于增量数据层中,则合适的时候,增量数据层中的数据会合并入基线数据层中,增量数据层中的数据通常在内存中,可以有部分在磁盘中。整体而言,基线数据层对应的时序旧于增量数据层对应的时序,而当存在多个增量数据层时,它们之间也有时序新旧关系,在图2a中表示为:越远离基线数据层的增量数据层的时序相对越新,则level 1指代的增量数据层是时序最新的,level n指代的基线数据层是时序最旧的。
在图2b中,示出了数据集合中可以按照数据列或者数据列组存储数据,每个数据列可以单独存储,比如,列组2和列组3中分别只包含了一个数据列,也可以有多个数据列通过列组一起存储,比如,列组1中包含了两个数据列。各数据列分别对应的数据行数量可以是差异化的。
图1中的流程包括以下步骤。
步骤S102:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应。
在本说明书一个或多个实施例中,各数据集合分别对应的布尔串可以是相互独立的。基线数据集合有其对应的布尔串,假定基线数据集合中有N行按照主键排序的数据行,则它们的虚拟行号依次为1~N,则基线数据集合对应的布尔串相应地有N个布尔位,长度为N/8个字节;增量数据集合也有其对应的布尔串,假定增量数据集合中有M行按照主键排序的数据行(一般而言,M远小于N),则它们的虚拟行号为1~M,则增量数据集合对应的布尔串相应地有M个布尔位。
需要说明的是,对于包含多个增量数据层的增量数据集合,这多个增量数据层可以对应于同一个布尔串(需要整合这些增量数据层的数据,以便用一个布尔串统一表示),这种方式的优点包括后续需迭代的布尔串数量少;或者,也可以有各增量数据层可以分别有自己对应的相互独立的布尔串(或者只有非全部的若干个增量数据层对应于同一个布尔串),这种方式的优点包括便于控制虚拟行号的规模。为了便于描述,下面的一些实施例主要以增量数据集合整体只有一个对应的布尔串为例说明,而对于多个增量数据层分别对应于不同布尔串的这类情况,可以参照后续对增量数据集合的处理,类似地对各增量数据层进行处理。
过滤条件涉及一个或多个数据列(即,需要根据这些数据列中的取值,来判断是否满足过滤条件),对于其中每个数据列又有一个或多个单项过滤条件。若过滤条件同时包含多个单项过滤条件,则将其称为复合过滤条件。以背景技术中提到的复合过滤条件“C1<10AND(C2>5 OR C3==8)”为例,其涉及C1、C2和C3这三个数据列,C1的单项过滤条件为“C1<10”,C2的单项过滤条件为“C2>5”,C3的单项过滤条件为“C3==8”,这三个单项过滤条件又通过AND、OR以及括号逻辑运算符连接,从而构成复合过滤条件。
布尔串中的布尔位的取值,则表示对应的数据行在是否满足当前的某个单项过滤条件,或整个复合过滤条件。布尔位的取值为第一值时,表示满足条件,过滤后保留,取值为第二值时,表示不满足条件,过滤后丢弃。布尔位是一个二值变量,取值为1或0,为了便于描述,假定第一值为1,第二值为0,反过来也是可以的。
更直观地,参见图3,图3为本说明书一个或多个实施例提供的一种虚拟行号与布尔串的对应关系示意图。
在图3中,左侧为基线数据集合或者增量数据集合中的虚拟行号,从1开始计数,
示例性地计了1~6,右侧是对应的布尔串,用数组进行表示,该数组的每个元素分别为1个布尔位,与左侧的虚拟行号一一对应,假定各布尔位已经赋值完成。可以看到当前虚拟行号2、4、5对应的布尔位的取值为1,则表示这几行满足过滤条件,其他几行不满足过滤条件。布尔串数据量较小,存储负担小,还可以通过压缩布尔串进一步地提高低成本效果。
在本说明书一个或多个实施例中,可以有多个目标列,这这种情况下,接下来可以并行地扫描各目标列以提高效率。
步骤S104:分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值。
步骤S104示出了对目标列的扫描过程。其中包含了判断是否满足过滤条件和为布尔位赋值的子步骤,需要说明的是,当存在多个单项过滤条件的情况下,这两个步骤可以是交叉配合进行的,具体地,在判断该最新数据是否满足过滤条件时,可以先判断其中的一部分数据(比如,属于某一个目标列的行数据)是否满足对应的某个单项过滤条件,之后相应地给布尔位赋值,但是此时的赋值未必是最终结果,还需要继续根据该赋值,针对其他的目标列和其他的单项过滤条件,继续进行判断和赋值,最后,得到相关的布尔位对于复合过滤条件整体的最终赋值,以得到后面的过滤结果。
在本说明书一个或多个实施例中,过滤数据时,应当以数据行的最新数据的情况为准,而某数据行的最新数据可能处于基线数据集合中(比如,近期该数据行并未更新过),也有可能处于增量数据集合中(比如,近期该数据行有更新过)。
该最新数据具体指对应的数据行在当前的目标列上的最新数据,同一个数据行在指定的某个数据列上的最新数据,处于各数据集合的其中一个数据集合中。为了提高查找效率,对于某目标列对应的某数据行,在扫描时,可以按照时序从新到旧的顺序,优先在增量数据集合中查找最新数据,若找到了,则无需接着在基线数据集合中查找,若没找到,再接着在基线数据集合中查找。
基于最新数据判断是否满足过滤条件,以及对最新数据的所在数据集合对应的布尔串中对应的布尔位相应赋值,使得布尔位的取值能够及时正确、且有序以及低冗余地表现出对应数据与过滤条件之间的关系,从而有助于接下来高效地整理出过滤结果。
在本说明书一个或多个实施例中,若判断结果为是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第一值,以表示过滤后保留;若判断结果为否,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
在过滤条件属于单项过滤条件的情况下,上一段中的处理过程简洁明了,比较容易理解。而在过滤条件属于涉及数据库表的多个数据列(即为多个目标列)的复合过滤条件的情况下,则该处理过程中具体还会涉及布尔位运算。具体地,确定同一个布尔位,
在复合过滤条件包含的多个单项过滤条件分别对应的赋值,根据复合过滤条件中多个单项过滤条件的复合运算,相应地对各对应的赋值之间进行布尔位运算,根据布尔位运算的结果,判断该最新数据是否满足复合过滤条件。
例如,对于第一个目标列,判断相应的单项过滤条件是否满足并相应赋值,然后得到的布尔串给到下一个目标列,以便与下一个目标列对应的赋值进行布尔运算,比如,对于AND运算符,则可以直接跳过布尔位取值为0的数据行,而对于OR运算符,需要相应的每个数据行都判断下得到赋值,再与上一个布尔串做位或运算。当然,实施方式不限于这一种,是多样的,比如,对于一个布尔串,也可以为每个目标列分别生成一个布尔串副本,然后分别适应于自己所对应的目标列为布尔串副本赋值,再在这些布尔串副本之间进行布尔位运算,得到最终赋值确定的布尔串。
步骤S106:根据赋值后的各所述布尔串,确定过滤结果。
在本说明书一个或多个实施例中,根据布尔串中赋值为第一值的布尔位,确定过滤结果。在不考虑数据顺序的情况下,可以确定赋值后的各布尔串中取值为第一值的布尔位对应的数据行,作为保留数据行,根据各保留数据行,确定过滤结果,若相关的冗余数据和意外数据排除得较干净,则可以考虑直接将各保留数据行确定为过滤结果。
为了提高用户体验,高效有序地输出过滤结果,进一步地提供了一种确定过滤结果的方案,具体包括:在基线数据集合对应的布尔串中,确定第一个取值为第一值的布尔位对应的虚拟行号,作为第一行号,在增量数据集合对应的布尔串中,确定第一个取值为第一值的布尔位对应的虚拟行号,作为第二行号,确定第一行号对应的第一数据行标识,以及第二行号对应的第二数据行标识,第一数据行标识、第二数据行标识均为主键或均为物理地址等能够唯一标识一个数据行的标识,比较第一数据行标识与第二数据行标识的大小,取较小一方对应的数据行,作为保留数据行,确定较小一方对应的下一个取值为第一值的布尔位对应的虚拟行号,以继续进行上述的比较和取数据行的过程,直至基线数据集合和增量数据集合对应的布尔串处理完毕,可以将各保留数据行确定为过滤结果。如此得到的过滤结果,是符合主键或物理地址顺序的,在过滤结果数据体量较大的情况下,这样的处理方式尤其高效,充分利用了虚拟行号自身能表示出的一部分顺序,据此有效减少了冗余的排序动作,还避免了对大量主键或物理地址集中进行排序。后面会结合实际场景更直观地补充说明。
在本说明书一个或多个实施例中,上面提到了冗余数据和意外数据,冗余数据比如包括最新数据对应的老数据,以及老数据对应的布尔位,意外数据比如包括被删除的数据及其对应的布尔位。对于这些数据,可以主动地将对应的布尔位赋值为第二值,以防止这些布尔位仍然处于在以前某个时刻被赋的值不能正确反映最新情况。
例如,在增量数据集合和基线数据集合中查找之后,判断查找到的上述的最新数据是存在于增量数据集合中,还是存在于基线数据集合中,若存在于增量数据集合中,且基线数据集合中也存在该数据行的数据(则表示基线数据集合中的该数据为旧数据),则确定基线数据集合对应的布尔串,为该数据行在基线数据集合的虚拟行号,在布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
再例如,在确定数据行的最新数据的所在数据集合对应的布尔串之后,判断该数据行的最新数据是否包含删除标记,删除标记本次对该数据行的操作是数据删除操作,若是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
通过图1的方法,基于虚拟行号和布尔串的配合,能够通过布尔位运算高效地执行对复合过滤条件涉及的各种运算,并基于多路扫描,有效地提高了扫描过滤效率;不仅如此,尤其适配于诸如采用了LSM-Tree等包含了基线数据集合,以及一个或多个增量数据集合的存储结构的数据库,调整了虚拟行号和布尔串的分布以及扫描过程,以使得扫描过程中涉及的虚拟行号和布尔串更加轻量化,且更快地扫描到最新数据,有助于进一步提高扫描过滤效率。
基于图1的方法,本说明书还提供了该方法的一些具体实施方案和扩展方案,下面继续进行说明。
在本说明书一个或多个实施例中,前面有提到,在LSM-Tree的场景下,由于数据行在不断地增加或删除,动态性很强,导致虚拟行号无法直接使用。为了解决这个问题,在上述的数据库表所处于的数据库中的数据读取,采用快照读的方式,进而能够引入类似于只读数据的虚拟行号,在读取过程中,即使是在内存中正在修改的最新的增量数据,其数据对于所用的快照也是不变的,因此可以认为所有增量和基线数据都是不变的。在这种情况下,能够获取到的上述的数据行也是快照数据,从而便于使用虚拟行号,也便于高效地执行涉及多个数据列的数据过滤操作。
前面对本方案进行了说明,为了便于理解,结合具体场景和更完整的例子,对上面比较重要的两部分内容示例性的实施方案进行展示,参见图4、图5。
图4为本说明书一个或多个实施例提供的一种数据列扫描方案的流程示意图。
在图4中,示出了针对其中一个数据列C1的扫描过程,该流程可以包括以下步骤:开始扫描数据列C1,基于各布尔串,通过主键或ROWID对基线数据集合和增量数据集合发起归并,通过这个过程确定C1对应的某一个数据行的最新数据在增量数据集合中是否存在;若存在,即表示该数据行的最新数据处于增量数据集合中,可以进一步地判断该数据行的数据在基线数据集合中是否也存在,若在基线数据集合中也存在,则将基线数据集合对应的布尔串中对应于相应虚拟行号的布尔位赋值为0,还需要判断增量数据集合中的该最新数据是否满足C1对应的过滤条件(比如,C1>5),若满足,则将增量数据集合对应的布尔串中对应于相应虚拟行号的布尔位赋值为1,若不满足,则赋值为0;若不存在,即表示该数据行的最新数据处于基线数据集合中,则判断基线数据集合中的该最新数据是否满足C1对应的过滤条件,若满足,则将基线数据集合对应的布尔串中对应于相应虚拟行号的布尔位赋值为1,若不满足,则赋值为0;继续针对C1对应的下一个数据行,迭代执行上述步骤,直至对整个C1扫描完毕。
图5为本说明书一个或多个实施例提供的一种物化方案的流程示意图。这里的物化即指从数据表中获得满足过滤条件的数据。
基于图4中的流程,对需要扫描的数据列都扫描完毕后,获得基线数据集合和增量数据集合分别对应的布尔串,在需要物化时(比如,要将过滤结果返回给用户时),则可以将这些布尔串以主键或ROWID为key,做归并操作,具体如图5中的流程所示,可以包括以下步骤:开始物化,针对基线数据集合和增量数据集合分别对应的布尔串,分别取出它们的第一个非零布尔位对应的虚拟行号,记作base_vid和inc_vid;找到base_vid、inc_vid分别对应的主键或ROWID,记作base_pk和inc_pk;
若base_pk>inc_pk,则输出inc_pk对应的数据行的快照数据,同时,当前的inc_vid被消费掉,取增量数据集合对应的布尔串的下一个非零布尔位对应的虚拟行号,重新作为inc_vid;若base_pk<inc_pk,则输出base_pk对应的数据行的快照数据,同时,当前的base_vid被消费掉,取基线数据集合对应的布尔串的下一个非零布尔位对应的虚拟行号,重新作为base_vid;迭代执行比较和取数据的步骤,判断基线数据集合和增量数据集合分别对应的布尔串是否都处理完毕,若否,则继续迭代,若是,则执行下一步骤;将所取的快照数据作为物化结果(即,过滤结果),并输出物化结果,返回给所需的用户。
基于同样的思路,本说明书一个或多个实施例还提供了上述方法对应的装置和设备,如图6、图7所示。
图6为本说明书一个或多个实施例提供的一种数据库表扫描装置的结构示意图,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述装置包括:目标列确定模块602,确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;目标列扫描模块604,分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;过滤结果确定模块606,根据赋值后的各所述布尔串,确定过滤结果。
可选地,所述目标列扫描模块604,若判断结果为是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第一值,以表示过滤后保留;若判断结果为否,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
可选地,所述目标列扫描模块604,在所述增量数据集合和所述基线数据集合中查找之后,判断该最新数据是存在于所述增量数据集合中,还是存在于所述基线数据集合中;若存在于所述增量数据集合中,且所述基线数据集合中也存在该数据行的数据,则确定所述基线数据集合对应的布尔串,为该数据行在所述基线数据集合的虚拟行号,在所述布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
可选地,所述过滤条件属于涉及所述数据库表的多个数据列的复合过滤条件;所述目标列扫描模块604,确定同一个所述布尔位,在所述复合过滤条件包含的多个单项过
滤条件分别对应的赋值;根据所述复合过滤条件中所述多个单项过滤条件的复合运算,相应地对各所述对应的赋值之间进行布尔位运算;根据所述布尔位运算的结果,判断该最新数据是否满足所述复合过滤条件。
可选地,所述目标列扫描模块604,确定该数据行的最新数据的所在数据集合对应的布尔串之后,判断该数据行的最新数据是否包含删除标记;若是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
可选地,所述过滤结果确定模块606,确定赋值后的各所述布尔串中取值为所述第一值的布尔位对应的数据行,作为保留数据行;根据各所述保留数据行,确定过滤结果。
可选地,所述过滤结果确定模块606,在所述基线数据集合对应的布尔串中,确定第一个取值为所述第一值的布尔位对应的虚拟行号,作为第一行号;在所述增量数据集合对应的布尔串中,确定第一个取值为所述第一值的布尔位对应的虚拟行号,作为第二行号;确定所述第一行号对应的第一数据行标识,以及所述第二行号对应的第二数据行标识,所述第一数据行标识、所述第二数据行标识均为主键或者均为物理地址;比较所述第一数据行标识与所述第二数据行标识的大小,取较小一方对应的数据行,作为保留数据行;确定所述较小一方对应的下一个取值为所述第一值的布尔位对应的虚拟行号,以继续进行上述的比较和取数据行的过程,直至所述基线数据集合和所述增量数据集合对应的布尔串处理完毕;将各所述保留数据行确定为过滤结果。
可选地,所述数据行是快照数据。
图7为本说明书一个或多个实施例提供的一种数据库表扫描设备的结构示意图,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述设备包括:至少一个处理器;以及,与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
基于同样的思路,本说明书一个或多个实施例还提供了对应于图1中方法的一种非易失性计算机存储介质,数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述介质存储有计算机可执行指令,所述计算机可执行指令设置为:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合
对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
在20世纪90年代,对于一个技术的改进可以很明显地区分是硬件上的改进(例如,对二极管、晶体管、开关等电路结构的改进)还是软件上的改进(对于方法流程的改进)。然而,随着技术的发展,当今的很多方法流程的改进已经可以视为硬件电路结构的直接改进。设计人员几乎都通过将改进的方法流程编程到硬件电路中来得到相应的硬件电路结构。因此,不能说一个方法流程的改进就不能用硬件实体模块来实现。例如,可编程逻辑器件(Programmable Logic Device,PLD)(例如现场可编程门阵列(Field Programmable Gate Array,FPGA))就是这样一种集成电路,其逻辑功能由用户对器件编程来确定。由设计人员自行编程来把一个数字系统“集成”在一片PLD上,而不需要请芯片制造厂商来设计和制作专用的集成电路芯片。而且,如今,取代手工地制作集成电路芯片,这种编程也多半改用“逻辑编译器(logic compiler)”软件来实现,它与程序开发撰写时所用的软件编译器相类似,而要编译之前的原始代码也得用特定的编程语言来撰写,此称之为硬件描述语言(Hardware Description Language,HDL),而HDL也并非仅有一种,而是有许多种,如ABEL(Advanced Boolean Expression Language)、AHDL(Altera Hardware Description Language)、Confluence、CUPL(Cornell University Programming Language)、HDCal、JHDL(Java Hardware Description Language)、Lava、Lola、MyHDL、PALASM、RHDL(Ruby Hardware Description Language)等,目前最普遍使用的是VHDL(Very-High-Speed Integrated Circuit Hardware Description Language)与Verilog。本领域技术人员也应该清楚,只需要将方法流程用上述几种硬件描述语言稍作逻辑编程并编程到集成电路中,就可以很容易得到实现该逻辑方法流程的硬件电路。
控制器可以按任何适当的方式实现,例如,控制器可以采取例如微处理器或处理器以及存储可由该(微)处理器执行的计算机可读程序代码(例如软件或固件)的计算机可读介质、逻辑门、开关、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程逻辑控制器和嵌入微控制器的形式,控制器的例子包括但不限于以下微控制器:ARC 625D、Atmel AT91SAM、Microchip PIC18F26K20以及Silicone Labs C8051F320,存储器控制器还可以被实现为存储器的控制逻辑的一部分。本领域技术人员也知道,除了以纯计算机可读程序代码方式实现控制器以外,完全可以通过将方法步骤进行逻辑编程来使得控制器以逻辑门、开关、专用集成电路、可编程逻辑控制器和嵌入微控制器等的形式来实现相同功能。因此这种控制器可以被认为是一种硬件部件,而对其内包括的用于实现各种功能的装置也可以视为硬件部件内的结构。或者甚至,可以将用于实现各种功能的装置视为既可以是实现方法的软件模块又可以是硬件部件内的结构。
上述实施例阐明的系统、装置、模块或单元,具体可以由计算机芯片或实体实现,或者由具有某种功能的产品来实现。一种典型的实现设备为计算机。具体的,计算机例如可以为个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任何设备的组合。
为了描述的方便,描述以上装置时以功能分为各种单元分别描述。当然,在实施本说明书时可以把各单元的功能在同一个或多个软件和/或硬件中实现。
本领域内的技术人员应明白,本说明书实施例可提供为方法、系统、或计算机程序产品。因此,本说明书实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本说明书实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本说明书是参照根据本说明书实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。
本说明书可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本说明书,在这些分布式计算环境中,由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置、设备、非易失性计算机存储介质实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
以上所述仅为本说明书的一个或多个实施例而已,并不用于限制本说明书。对于本领域技术人员来说,本说明书的一个或多个实施例可以有各种更改和变化。凡在本说明书的一个或多个实施例的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本说明书的权利要求范围之内。
Claims (17)
- 一种数据库表扫描方法,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述方法包括:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
- 如权利要求1所述的方法,所述根据判断结果,为该数据行在该所在数据集合的虚拟行号在该布尔串中对应的布尔位赋值,具体包括:若判断结果为是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第一值,以表示过滤后保留;若判断结果为否,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
- 如权利要求1或2所述的方法,所述在所述增量数据集合和所述基线数据集合中查找之后,所述方法还包括:判断该最新数据是存在于所述增量数据集合中,还是存在于所述基线数据集合中;若存在于所述增量数据集合中,且所述基线数据集合中也存在该数据行的数据,则确定所述基线数据集合对应的布尔串,为该数据行在所述基线数据集合的虚拟行号,在所述布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
- 如权利要求1所述的方法,所述过滤条件属于涉及所述数据库表的多个数据列的复合过滤条件;所述判断该最新数据是否满足所述过滤条件,具体包括:确定同一个所述布尔位,在所述复合过滤条件包含的多个单项过滤条件分别对应的赋值;根据所述复合过滤条件中所述多个单项过滤条件的复合运算,相应地对各所述对应的赋值之间进行布尔位运算;根据所述布尔位运算的结果,判断该最新数据是否满足所述复合过滤条件。
- 如权利要求1或2所述的方法,所述确定该数据行的最新数据的所在数据集合对应的布尔串之后,所述方法还包括:判断该数据行的最新数据是否包含删除标记;若是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
- 如权利要求2所述的方法,所述根据赋值后的各所述布尔串,确定过滤结果,具体包括:确定赋值后的各所述布尔串中取值为所述第一值的布尔位对应的数据行,作为保留数据行;根据各所述保留数据行,确定过滤结果。
- 如权利要求2或6所述的方法,所述确定过滤结果,具体包括:在所述基线数据集合对应的布尔串中,确定第一个取值为所述第一值的布尔位对应的虚拟行号,作为第一行号;在所述增量数据集合对应的布尔串中,确定第一个取值为所述第一值的布尔位对应的虚拟行号,作为第二行号;确定所述第一行号对应的第一数据行标识,以及所述第二行号对应的第二数据行标识,所述第一数据行标识、所述第二数据行标识均为主键或均为物理地址;比较所述第一数据行标识与所述第二数据行标识的大小,取较小一方对应的数据行,作为保留数据行;确定所述较小一方对应的下一个取值为所述第一值的布尔位对应的虚拟行号,以继续进行上述的比较和取数据行的过程,直至所述基线数据集合和所述增量数据集合对应的布尔串处理完毕;将各所述保留数据行确定为过滤结果。
- 如权利要求1所述的方法,所述数据行是快照数据。
- 一种数据库表扫描装置,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述装置包括:目标列确定模块,确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;目标列扫描模块,分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;过滤结果确定模块,根据赋值后的各所述布尔串,确定过滤结果。
- 如权利要求9所述的装置,所述目标列扫描模块,若判断结果为是,则为该数 据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第一值,以表示过滤后保留;若判断结果为否,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
- 如权利要求9或10所述的装置,所述目标列扫描模块,在所述增量数据集合和所述基线数据集合中查找之后,判断该最新数据是存在于所述增量数据集合中,还是存在于所述基线数据集合中;若存在于所述增量数据集合中,且所述基线数据集合中也存在该数据行的数据,则确定所述基线数据集合对应的布尔串,为该数据行在所述基线数据集合的虚拟行号,在所述布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
- 如权利要求9所述的装置,所述过滤条件属于涉及所述数据库表的多个数据列的复合过滤条件;所述目标列扫描模块,确定同一个所述布尔位,在所述复合过滤条件包含的多个单项过滤条件分别对应的赋值;根据所述复合过滤条件中所述多个单项过滤条件的复合运算,相应地对各所述对应的赋值之间进行布尔位运算;根据所述布尔位运算的结果,判断该最新数据是否满足所述复合过滤条件。
- 如权利要求9或10所述的装置,所述目标列扫描模块,确定该数据行的最新数据的所在数据集合对应的布尔串之后,判断该数据行的最新数据是否包含删除标记;若是,则为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值为第二值,以表示过滤后丢弃。
- 如权利要求10所述的装置,所述过滤结果确定模块,确定赋值后的各所述布尔串中取值为所述第一值的布尔位对应的数据行,作为保留数据行;根据各所述保留数据行,确定过滤结果。
- 如权利要求10或14所述的装置,所述过滤结果确定模块,在所述基线数据集合对应的布尔串中,确定第一个取值为所述第一值的布尔位对应的虚拟行号,作为第一行号;在所述增量数据集合对应的布尔串中,确定第一个取值为所述第一值的布尔位对应的虚拟行号,作为第二行号;确定所述第一行号对应的第一数据行标识,以及所述第二行号对应的第二数据行标识,所述第一数据行标识、所述第二数据行标识均为主键或均为物理地址;比较所述第一数据行标识与所述第二数据行标识的大小,取较小一方对应的数据行,作为保留数据行;确定所述较小一方对应的下一个取值为所述第一值的布尔位对应的虚拟行号,以继 续进行上述的比较和取数据行的过程,直至所述基线数据集合和所述增量数据集合对应的布尔串处理完毕;将各所述保留数据行确定为过滤结果。
- 如权利要求9所述的装置,所述数据行是快照数据。
- 一种数据库表扫描设备,所述数据库表通过多个数据集合存储,所述多个数据集合包含基线数据集合、增量数据集合,所述设备包括:至少一个处理器;以及,与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够:确定设定的过滤条件在所述数据库表中涉及的数据列,作为目标列,所述数据库表对于各所述数据集合分别有对应的布尔串,所述布尔串中的布尔位与对应的所述数据集合中的数据行的虚拟行号相对应;分别对所述目标列对应的各数据行,执行:在所述增量数据集合和所述基线数据集合中查找,确定该数据行的最新数据的所在数据集合对应的布尔串;判断该最新数据是否满足所述过滤条件,根据判断结果,为该数据行在该所在数据集合的虚拟行号,在该布尔串中对应的布尔位赋值;根据赋值后的各所述布尔串,确定过滤结果。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211239876.2 | 2022-10-11 | ||
CN202211239876.2A CN115563116A (zh) | 2022-10-11 | 2022-10-11 | 一种数据库表扫描方法、装置以及设备 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024078122A1 true WO2024078122A1 (zh) | 2024-04-18 |
Family
ID=84744236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/113246 WO2024078122A1 (zh) | 2022-10-11 | 2023-08-16 | 数据库表扫描的方法、装置以及设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115563116A (zh) |
WO (1) | WO2024078122A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115563116A (zh) * | 2022-10-11 | 2023-01-03 | 北京奥星贝斯科技有限公司 | 一种数据库表扫描方法、装置以及设备 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049521A (zh) * | 2012-12-19 | 2013-04-17 | 广东电子工业研究院有限公司 | 一种支持多属性复合条件查询的虚拟表索引机制及方法 |
US20140052713A1 (en) * | 2012-08-20 | 2014-02-20 | Justin Schauer | Hardware implementation of the aggregation/group by operation: filter method |
US9092484B1 (en) * | 2015-03-27 | 2015-07-28 | Vero Analyties, Inc. | Boolean reordering to optimize multi-pass data source queries |
CN106970936A (zh) * | 2017-02-09 | 2017-07-21 | 阿里巴巴集团控股有限公司 | 数据处理方法及装置、数据查询方法及装置 |
CN108920695A (zh) * | 2018-07-13 | 2018-11-30 | 星环信息科技(上海)有限公司 | 一种数据查询方法、装置、设备及存储介质 |
CN114647635A (zh) * | 2022-03-31 | 2022-06-21 | 苏州浪潮智能科技有限公司 | 数据处理系统 |
CN115563116A (zh) * | 2022-10-11 | 2023-01-03 | 北京奥星贝斯科技有限公司 | 一种数据库表扫描方法、装置以及设备 |
-
2022
- 2022-10-11 CN CN202211239876.2A patent/CN115563116A/zh active Pending
-
2023
- 2023-08-16 WO PCT/CN2023/113246 patent/WO2024078122A1/zh unknown
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140052713A1 (en) * | 2012-08-20 | 2014-02-20 | Justin Schauer | Hardware implementation of the aggregation/group by operation: filter method |
CN103049521A (zh) * | 2012-12-19 | 2013-04-17 | 广东电子工业研究院有限公司 | 一种支持多属性复合条件查询的虚拟表索引机制及方法 |
US9092484B1 (en) * | 2015-03-27 | 2015-07-28 | Vero Analyties, Inc. | Boolean reordering to optimize multi-pass data source queries |
CN106970936A (zh) * | 2017-02-09 | 2017-07-21 | 阿里巴巴集团控股有限公司 | 数据处理方法及装置、数据查询方法及装置 |
CN108920695A (zh) * | 2018-07-13 | 2018-11-30 | 星环信息科技(上海)有限公司 | 一种数据查询方法、装置、设备及存储介质 |
CN114647635A (zh) * | 2022-03-31 | 2022-06-21 | 苏州浪潮智能科技有限公司 | 数据处理系统 |
CN115563116A (zh) * | 2022-10-11 | 2023-01-03 | 北京奥星贝斯科技有限公司 | 一种数据库表扫描方法、装置以及设备 |
Also Published As
Publication number | Publication date |
---|---|
CN115563116A (zh) | 2023-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11169978B2 (en) | Distributed pipeline optimization for data preparation | |
CN107038206B (zh) | Lsm树的建立方法、lsm树的数据读取方法和服务器 | |
US10552378B2 (en) | Dividing a dataset into sub-datasets having a subset of values of an attribute of the dataset | |
JP6598996B2 (ja) | データ準備のためのシグニチャベースのキャッシュ最適化 | |
US20130268770A1 (en) | Cryptographic hash database | |
CN105989015B (zh) | 一种数据库扩容方法和装置以及访问数据库的方法和装置 | |
CN107436911A (zh) | 模糊查询方法、装置及查询系统 | |
US20170109389A1 (en) | Step editor for data preparation | |
JP6598997B2 (ja) | データ準備のためのキャッシュ最適化 | |
CN114090695A (zh) | 分布式数据库的查询优化的方法和装置 | |
WO2024078122A1 (zh) | 数据库表扫描的方法、装置以及设备 | |
KR102354343B1 (ko) | 블록체인 기반의 지리공간 데이터를 위한 공간 데이터 인덱싱 방법 및 장치 | |
WO2024159575A1 (zh) | 数据处理方法、装置、电子设备及存储介质 | |
CN111125216A (zh) | 数据导入Phoenix的方法及装置 | |
CN116010345A (zh) | 一种实现流批一体数据湖的表服务方案的方法、装置及设备 | |
US20210056090A1 (en) | Cache optimization for data preparation | |
US11288447B2 (en) | Step editor for data preparation | |
CN116521734A (zh) | 一种数据查询的方法、装置、介质及设备 | |
CN118193032A (zh) | 消除无效依赖库的方法、装置、设备、介质和程序产品 | |
CN113704289A (zh) | 一种基于dbio接口的方法、系统、设备及介质 | |
KR20210052148A (ko) | 데이터 키 값 변환 방법 및 장치 | |
CN114201487A (zh) | 智能合约的存储装置和方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23876327 Country of ref document: EP Kind code of ref document: A1 |