CN108874849B

CN108874849B - Optimization method and system for non-equivalent associated sub-query

Info

Publication number: CN108874849B
Application number: CN201810097136.7A
Authority: CN
Inventors: 何文婷; 程学旗; 郑天祺; 张志斌; 郭嘉丰; 赵鹏
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2018-01-31
Filing date: 2018-01-31
Publication date: 2020-12-25
Anticipated expiration: 2038-01-31
Also published as: CN108874849A

Abstract

The invention discloses a method and a system for optimizing non-equivalence association sub-queries, which are characterized by comprising the following steps: obtaining a value set of the appearance association column of the association sub-query; establishing a mapping relation from an exterior association column to an interior association column partition of the association sub-query according to the type of the operational character and the value set in the association sub-query; partitioning the inner table of the associated sub-query according to the obtained partition set, and acquiring intermediate result state information of the associated sub-query in each partition according to a query aggregation function of the inner table in the associated sub-query; and traversing the exterior association column according to the mapping relation, and obtaining the sub-query result corresponding to each association column in the exterior by aggregating the intermediate result state information of the corresponding partition set. The invention has the technical effects that: the inner table is partitioned, and intermediate results of all partitions are repeatedly utilized to obtain a final sub-query result set, so that query performance is improved.

Description

Optimization method and system for non-equivalent associated sub-query

Technical Field

The invention relates to the field of database relational systems, in particular to a method and a system for optimizing non-equivalence association sub-queries.

Background

A sub-query refers to a query statement appearing as a query condition of another statement, and an associated sub-query refers to a query condition of a sub-query being dependent on a parent query. A typical non-equal value association sub-query is as follows:

select X.a,X.b from X where X.c>

(select avg(Y.c)from Y where Y.d【OPERATOR】X.d)

wherein an OPERATOR of the association column of the association sub-query and the outer query, i.e. the above [ OPERATOR ], includes: | A Here, >, <, >, in, not in, between and, etc. Since the sub-query uses the result of the outer query, the current implementation techniques include the following three implementation schemes:

1. each time a value d is obtained from the exterior X, the value d is transmitted to the sub-query, and then the sub-query is executed to obtain the result of the sub-query. For the case of duplicate values in X.d, there are two ways to avoid the calculation: one is that the exterior list to the sub-query result is cached, when a repeated exterior association list exists, the hit can be cached, thereby avoiding calculation; another way is to sort the outer table association columns to put the outer table columns of the same value together for one-time identification, which also avoids duplicate calculations. If the associated column of the inner table has an index, a certain acceleration effect can be achieved.

2. And (5) making Cartesian product of the outer surface X and the inner surface Y, screening records meeting the association condition from the Cartesian product, and then obtaining a result list corresponding to the sub-query.

3. If the OPERATOR is <, >, and the select of the sub-query is a polymerization function of max taking the maximum value/min taking the minimum value, the association columns of the outer table and the inner table can be sorted respectively, then the results of the sub-query are obtained by a method similar to merging, sorting and comparing, and multiplexing of partial results can be realized by some cache strategies. For example, the sub-query is select max (Y.c) from Y where Y.d > X.d when the X and Y tables sort the d columns in descending order, respectively, and assuming that line k-1 of X.d is processed, line Y.d is processed to line m, and the internal query result is maxk, then when line k of X.d is processed, the internal query result is max (maxk, max (Y.c) of Y.d (m-n), where line n is the number of rows in Y.d where the value of line k is the smallest > X.d.

The prior art has respective problems:

the first method has poor performance, although repeated computation of sub-queries by the outer columns of the same value can be avoided, for the computation of sub-queries, the outer columns of different values still need to scan the inner tables completely, and the time overhead for obtaining the results of all sub-queries is the time overhead for comparing the inner tables scanned by the number of different values of the outer associated columns and finally computing. The invention uses the intermediate result of the sub-query, and only needs to scan the inner table once, thereby greatly improving the performance.

The second method also has the problem in 1).

The third method has limited use conditions: firstly, the association condition of the inner and outer surfaces can only be one expression, and the types can only be <, >, and four, and the method does not support the aggregation operation in, not support beween and not equal to! Operations and combinations of expressions such as Y.d > X.d and Y.d < X.d + 10. The query column for its next sub-query only supports max/min/sum/count. Furthermore, the technology needs to sort the inner and outer surfaces according to the associated columns, and is time-consuming. The sub-query associated operators supported by the method mostly comprise sets, and the method supports the combination of multiple comparison operator expressions, so that the method is more universal.

The inventor finds that the defect in the prior art is caused by cycle traversal of an inner table in a sub-query and repeated calculation of most record sets when carrying out sub-query association optimization research, particularly when aiming at non-equivalence association sub-query optimization research, and finds that solving the defect can be realized by a method of caching an intermediate state of a calculation result of the sub-query, storing the intermediate state with the minimum granularity, and then combining the intermediate result state with the minimum granularity to form a final sub-query result. The invention provides a high-performance optimization method which supports a general association expression, does not need to sort an internal table and only needs to sort an external association column set under a specific condition aiming at non-equivalent association sub-queries. The method and the device only need to sort the association column set of the outer surface under a specific condition, do not need to sort the outer surface at all, do not need to have index support aiming at the association column, and have the query performance expense of scanning the inner surface and scanning the outer surface. Compared with the query performance of the prior art, the optimization technical performance of the invention is improved by orders of magnitude.

Disclosure of Invention

The invention aims to solve (overcome) the problems of the prior art that the supporting sub-query correlation operation is not universal and the query performance is low, and provides an optimization method for non-equivalence correlation sub-queries, which comprises the following steps:

step 1, obtaining a value set of an appearance association column of an association sub-query;

step 2, establishing a mapping relation from an exterior association column of the association sub-query to an interior association column partition according to the type of the operational character and the value set in the association sub-query;

step 3, partitioning the inner table of the associated sub-query according to the mapping relation, and acquiring intermediate result state information of the associated sub-query in each partition according to a query aggregation function of the inner table in the associated sub-query;

and 4, traversing the exterior association column according to the mapping relation, and obtaining the sub-query result corresponding to each association column in the exterior by aggregating the intermediate result state information of each partition.

The optimization method of the non-equal-value associated sub-query, wherein the step 2 further comprises:

and step 21, constructing an automatic merging partition tree according to the type of the operator in the association sub-query and the value set of the exterior association column, and establishing the mapping relation through the automatic merging partition tree, wherein each leaf node of the automatic merging partition tree corresponds to a partition, and the partition is collected to be used as an interior association column partition.

if the operator is not equal to the operation, dividing the inner table into k +1 partitions according to the value number k of the outer association columns, and corresponding the value of each outer association column to k partitions except the outer association column to be used as the mapping relation;

if the operational character is a comparison operation, dividing the inner table into k +1 subareas according to the value number k of the outer association columns, and corresponding the values of the outer association columns to the corresponding subareas according to the comparison operational character to serve as the mapping relation;

if the operator is set operation, dividing the inner table into n +1 partitions according to the greatest common divisor of the values of the outer table associated columns, and corresponding the values of the associated columns to the corresponding partitions according to the set operator as the mapping relation, wherein n is the greatest common divisor.

In the optimization method of the non-equivalence association sub-query, in step S3, if the query aggregation function is Avg, the intermediate result status information is Sum + count; and if the query aggregation function is Sum/max/min/count, the intermediate result state information is Sum/max/min/count.

The optimization method of the non-equal-value associated sub-query, wherein the step 4 further comprises: and circularly processing the related judgment of the sub-query results of the outer table and the inner table to obtain the final result of the related sub-query.

The invention also provides an optimization system of the non-equivalence association sub-query, which comprises the following steps:

the partition mapping module is used for acquiring the values of the exterior association columns of the association sub-queries, collecting the values as a value collection, and establishing the mapping relation from the exterior association columns of the association sub-queries to the interior association column partitions according to the types of the operational characters in the association sub-queries and the value collection;

and the result merging module is used for partitioning the inner table of the associated sub-query according to the mapping relation, acquiring the intermediate result state information of the associated sub-query in each partition according to the query aggregation function of the inner table in the associated sub-query, traversing the outer associated columns according to the mapping relation, and aggregating the intermediate result state information of each partition to obtain the sub-query results corresponding to each associated column in the outer.

The optimization system of the non-equal value associated sub-query, wherein the partition mapping module further comprises: and constructing an automatic merging partition tree according to the type of the operational character in the association sub-query and the value set of the appearance association column, and establishing the mapping relation through the automatic merging partition tree, wherein each leaf node of the automatic merging partition tree corresponds to one partition, and the partition is collected to be used as the inner table association column partition.

The optimization system of the non-equal value associated sub-query, wherein the partition mapping module further comprises:

In the optimization system of the non-equivalence correlation sub-query, in the result merging module, if the query aggregation function is Avg, the intermediate result state information is Sum + count; and if the query aggregation function is Sum/max/min/count, the intermediate result state information is Sum/max/min/count.

The optimization system of the non-equivalence association sub-query, wherein the result merging module further comprises: and circularly processing the related judgment of the sub-query results of the outer table and the inner table to obtain the final result of the related sub-query.

The invention mainly comprises two stages of dynamic partition mapping and partition result merging. The method is characterized in that sorting of the inner and outer tables is not needed (sorting of the outer table association columns is needed only when the association operation is an expression and the operator is a comparison operator), sub-query calculation intermediate result states with the minimum granularity capable of being multiplexed can be obtained by scanning the inner table and the outer table once, and a final sub-query result is obtained by finally combining a plurality of intermediate result states.

And a dynamic partition mapping stage: and (4) constructing AMP-TREE according to different values of the outer table associative columns and the associative operators to obtain partitions of the inner table associative columns and mapping relations between the outer table columns and the partitions.

And a partition result merging stage: and partitioning the inner table according to the association columns. And traversing all the partitions to obtain the intermediate result state information of each partition. And obtaining a final sub-query result corresponding to each appearance association column according to the mapping relation obtained in the dynamic partition mapping stage.

The invention has the technical effects that: the non-equivalent correlation sub-query optimization method has strong adaptability, supports most correlation operations, can adaptively and dynamically partition the inner table, and repeatedly utilizes the intermediate results of all partitions to obtain a final sub-query result set, and has greatly improved performance compared with the existing technical method.

Drawings

FIG. 1 is a diagram of a model of the present invention;

FIG. 2 is a diagram illustrating a mapping relationship between partitions corresponding to the association relationships;

FIG. 3 is a schematic diagram of the structure of AMP-TREE interval classes;

FIG. 4 is a schematic diagram of the construction of AMP-TREE sets.

Detailed Description

Aiming at the defects of the existing non-equivalent correlation sub-query optimization method, the invention designs a novel universal, self-adaptive and high-performance optimization processing method. As shown in FIG. 1, the design effectively solves the problem of low performance of non-equivalence associated sub-queries aiming at different sub-query associated expressions and sub-query aggregation result functions.

The invention mainly comprises the following steps:

Wherein the step 2 further comprises: step 21, constructing an automatic merging partition tree according to the type of the operator in the association sub-query and the value set of the exterior association column, and establishing the mapping relation through the automatic merging partition tree, wherein each leaf node of the automatic merging partition tree corresponds to a partition, and the partition is collected to be used as an interior association column partition;

wherein the step 2 further comprises: if the operator is not equal to the operation, dividing the inner table into k +1 partitions according to the value number k of the outer association columns, and corresponding the value of each outer association column to k partitions except the outer association column to be used as the mapping relation; if the operational character is a comparison operation, dividing the inner table into k +1 subareas according to the value number k of the outer association columns, and corresponding the values of the outer association columns to the corresponding subareas according to the comparison operational character to serve as the mapping relation; if the operator is set operation, dividing the inner table into n +1 partitions according to the greatest common divisor of the values of the outer table associated columns, and corresponding the values of the associated columns to the corresponding partitions according to the set operator as the mapping relation, wherein n is the greatest common divisor.

Wherein the step 4 further comprises: and circularly processing the related judgment of the sub-query results of the outer table and the inner table to obtain the final result of the related sub-query.

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

The invention supports various non-equivalent associated expressions, and the operator specifically comprises the following steps: | A For example, a "set" may be defined as a set of values, such as "set" and "set" or "set" of values. Multiple sub-query aggregation functions are supported: max/min/sum/count/avg, and the like. And the multiplexing of the intermediate results of the sub-queries is realized, so that the query performance is improved.

Where Table X is the exterior table, Table Y is the interior table, and the associative operator is operator. Y.d-partition means partition information of the associated column d of the inner table Y. The specific content of the present invention is to implement a model M, where M is equivalent to a function, and the parameters include two: the function returns two y.d-partition, map (x.d, y.d-partition) results. Namely: m (partition x.d, operator) (y.d-partition, map (x.d, y.d-partition)), which is input as all the different d-valued sets of the outer table X, partition set of the associated column d for the inner table Y and a map mapping relationship between partitions of x.partition d to Y.d. The specific operation steps are as follows:

s1, acquiring a value set of x.d and an operator associated with the value set from an exterior X of the database;

s2, based on the result of S1, obtaining a sort by using a topological sort method for the value set of x.d according to the full-order or partial-order relation of operator, as shown in FIG. 2, x.d1 to x.dk represent k distinct values of x.d, wherein the sort is for determining the subsequent partition range, the topological sort is used for traversing all values of the exterior association column to form an AMP tree, and the value of the exterior association column needs to be sorted only when the exterior association column is a non-set and the association operation is a comparison operator.

The S2 further includes step S21, in the S2 sorting process, generating an Auto-merge-Partition Tree (AMP Tree). Each leaf node of the tree is a partition. And obtaining the mapping relation between the interval and the appearance association column while constructing AMP-TREE.

For! An operator, the number of partitions is k +1, which is k X.d different values and one other value respectively;

for >, <, < + and the like comparison operators, the number of the partitions is k +1, and the partitions are sorted according to the appearance association column x.d to obtain a plurality of different interval ranges, wherein each interval range is one partition. And processing the multiple expressions formed by combining the between and operators and the comparison operators in a type mode, wherein the number of the partitions is calculated according to the minimum interval obtained by sorting. The process uses an AMP-TREE TREE structure.

For the in and not in operators of the set class, obtaining a plurality of trees at the lowest in fig. 2, extracting the maximum convention sets of all x.distintint d as much as possible, and finally determining the number of partitions as n +1 from the graph, wherein a leaf node in a forest is a partition.

The corresponding relation between distint x.d and y.d-partiton is obtained.

For not equal! X.di corresponds to k partitions except x.di, i is a positive integer less than k, and x.di is the value d for the ith associated column in table X.

For >, <, Y.d > X.d is taken as an example. X.d corresponds to all intervals meeting > X.d. For example X.d has values of 1,5, 10. Then X.d ═ 1 for the interval (1,5), (5,10), (10, positive infinity), X.d ═ 5 for the interval (5,10), (10, positive infinity), and X.d ═ 10 for the interval (10, positive infinity). Between and similar treatments.

For in, not in operators of set classes, it can be seen from FIG. 2 that the mapping relationships for in are X.d1- > (s123), X.d2- > (s2345), and for not in the opposite mapping relationships X.d1- > (not s123), X.d2- > (not s2345)

S3, partitioning d columns of the sub-query Y table, namely the association columns d of the inner table Y, according to the value partition set of the inner table association columns obtained in the S2, and meanwhile, calculating and storing the intermediate state information of the result avg (y.c) of the sub-query: sum + count. The state information for different aggregation functions is shown in the following table:

s4, traversing the data content of each row in the outer surface, and according to the mapping information from x.d to y.d-partiton obtained in S2, aggregating the intermediate states of the partitions to calculate the sub-query result for judgment, for example, the judgment that the related judgment of the related sub-query in the following embodiments is x.c > is provided.

Note: the method comprises the following steps of for the sub-query conditions: x.d in Y.d, wherein X.d is a single-value type, Y.d is a set type, because the complexity of the algorithm is consistent with the complexity of direct X Join Y re-judgment X.d in Y.d, the optimization method does not increase the overhead of such query, but does not have optimization effect. The present optimization method does not consider the situation of such sub-query conditions for the moment.

First, the present invention describes the comparison operators of the scalar class and the inclusive operators between sets and single values, respectively:

for the comparison operator > of the non-set class, >, and a combination of a plurality of algebraic expressions of the inner table associative columns corresponding to the outer table associative columns, the formation process of the partitions and the mapping is as follows: suppose a query is

select X.a,X.b from X where X.c>

(select avg(Y.c)from Y where Y.d>X.d and Y.d<X.d+20)

S1, assuming that x.distintint d ═ {1, 10, 20 }is obtained

S2, circularly traversing D.dinstint d to obtain all values of X.d and X.d +20, mapping the values to a pause-TREE, and if the AMP-TREE is empty, dividing the AMP-TREE into a plurality of sections according to data; if the AMP-TREE is not empty, the matched node is searched according to two halves, if the AMP-TREE is partially intersected with a certain leaf node, the leaf node is segmented to generate two new sub leaf nodes, and the specific process is shown in the following figure 3. In this step, a mapping relationship between each sub-query outer table association column and the corresponding AMP-TREE upper child node, i.e., the inner table association column partition, is also obtained.

In the above process, the exterior association class and the partition mapping relation are obtained simultaneously, and the corresponding node on the AMP-TREE is marked (indicating that it will be used subsequently):

that is, the partition sets are (1,21), (10,21), [21,30], (20,21), [30,40] in the present embodiment.

S3: the inner table is mapped into partitions of the AMP-TREE according to the values of the associative columns, and the intermediate states of the AMP-TREE to be stored by each of the marked nodes are calculated. And simultaneously calculating a sub-query result corresponding to the X.distint d:

the aggregation function is the query result of the sub-query, for example, in the embodiment, the avg in the select avg (id) from table is the aggregation function of the query. The corresponding status information is the contents of avgsum + count in the previous table. And in order to explain the aggregated corresponding partition sets in the above step 4, the corresponding partition set when X.d is 1 is (1,21), the corresponding partition set when X.d is 10 is (10,21), [21,30) in the present embodiment, and the corresponding partition set when X.d is 20 is (20,21), [21,30), [30, 40).

S4: and recycling the related judgment of the sub-query results of the outer table and the inner table to obtain the final result of the query.

For the in, not in operator type of the set class, the data type of X.d is a set, while the inner table association column type is a single value type. Suppose the query is:

select X.a,X.b from X where X.c>

(select avg(Y.c)from Y where Y.d in X.d)

s1, assuming that x.distinting d ═ abc }, { ab }, { cd }, { def }, { cef }// a, b, c, d, and the like are obtained as one element value, respectively

S2, circularly traversing the X.dinstint d, mapping the X.dinstint d to a pause-TREE, and if the AMP-TREE is empty, determining the value of the root node as the set; if the AMP-TREE is not empty, it is determined whether there is a matching element in the TREE, and whether the matching element can be combined by multiple node elements, and if the matching element is partially intersected with a leaf node, the leaf node is segmented to generate two new sub-leaf nodes, which is specifically shown in fig. 3 below. In this step, the mapping relationship between each sub-query outer association column and the partition on the corresponding AMP-TREE is also obtained.

X.d	partition
		{abc}	{ab}{c}
{ab}	{ab}
		{cd}	{c}{d}
{def},	{d}{ef}
		{cef}	{c}{ef}

s3: the inner table is mapped into partitions of the AMP-TREE according to the values of the associative columns, and the intermediate states of the AMP-TREE to be stored by each of the marked nodes are calculated. And then calculating the sub-query results corresponding to all the X.distintint d.

S4: and recycling the relevant judgment of the appearance to obtain the final result of the query.

In the embodiment of the set operator, the greatest common divisor of the values of the table association column is described, and in the embodiment, the content of the table partition column indicates that the number of common parts is 4, and is { ab } { c } { d } { ef }, respectively, so that the greatest common divisor is 4, n is 4, and a set including { ab } { c } { d } { ef } is the greatest common set.

Description of the drawings: the size of the AMP-TREE can be configured, and when the configured value is large enough, most of the appearance association columns can correspond to the partition nodes of the AMP-TREE one by one. When the configuration value is small, only a part of the appearance correlation column is ensured to directly correspond to the nodes of the AMP-TREE, and the rest part needs to be merged through Partition with small granularity. When the configuration value is extremely small, most of the appearance association columns cannot be merged with the partitions for mapping, and the results of the associated sub-queries need to be obtained by adopting the existing sub-query calculation method, namely, no intermediate calculation result can be reused. Thus, in practice, it is recommended to use a larger value of AMP-TREE size configuration to ensure high multiplexing of the sub-query intermediate result sets.

The following is a system example corresponding to the above method example, and the present implementation system can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in the present implementation system, and are not described herein again for the sake of reducing repetition. Accordingly, the related-art details mentioned in the present embodiment system can also be applied to the above-described embodiments.

In summary, the present invention can adaptively partition the internal table according to the association columns, and for the whole sub-query, the internal table of the sub-query only needs to be scanned once without the support of the index, and the external table association columns only need to be sorted when the association operation is an expression and the operation is a comparison operation; the partitioning and mapping process constructs an AMP-TREE (automatic merging partition TREEs) TREE structure. The data structure AMP-TREE is characterized by: a) all nodes are an interval; b) all leaf node interval ranges are disjoint and combined to form a complete set; c) the interval range of the parent node is equal to the intersection of the interval ranges of all the child nodes so as to assist the combination and multiplexing of the intermediate result states of the child queries; a partition algorithm is used in the construction process of the AMP-TREE, and the AMP-TREE is obtained according to values and association operations of different association columns of the exterior, wherein topological ordering aiming at a set and partial and full ordering aiming at a non-set are involved, so that a mapping relation between the exterior association column and the interior partition is formed; and after the inner table is partitioned according to the inner table association column as a main key, recording the intermediate result state of each partition, multiplexing the obtained partial results according to the mapping relation between the outer table association column and the inner table partition set, and merging the partial results to obtain the final calculation result of the sub-query.

The present invention may be embodied in other specific forms without departing from the spirit or scope thereof, and it is therefore intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

Claims

1. A method for optimizing a non-equivalence association sub-query, comprising:

step 3, obtaining a partition set according to the inner table association list partition, partitioning the inner table of the association sub-query, and simultaneously obtaining intermediate result state information of the association sub-query in each partition according to a query aggregation function of the inner table in the association sub-query;

step 4, traversing the exterior association column according to the mapping relation, and obtaining the sub-query result corresponding to each association column in the exterior by aggregating the intermediate result state information of the corresponding partition sets;

wherein, this step 2 still includes:

2. The method for optimizing a non-equivalue associative sub-query as claimed in claim 1, wherein the step 2 further comprises:

3. The method according to claim 1, wherein in step 3, if the query aggregation function is Avg, the intermediate result status information is Sum + count; and if the query aggregation function is Sum/max/min/count, the intermediate result state information is Sum/max/min/count.

4. The method for optimizing a non-equivalue associated sub-query as claimed in claim 1, wherein the step 4 further comprises: and circularly processing the related judgment of the sub-query results of the outer table and the inner table to obtain the final result of the related sub-query.

5. A system for optimizing non-equi-value associative sub-queries, comprising:

a result merging module, configured to partition the inner table of the association sub-query according to the mapping relationship, obtain intermediate result state information of the association sub-query in each partition according to a query aggregation function of the inner table in the association sub-query, traverse the outer association columns according to the mapping relationship, and obtain sub-query results corresponding to each association column in the outer surface by aggregating the intermediate result state information of each partition;

wherein, the partition mapping module further comprises: and constructing an automatic merging partition tree according to the type of the operator in the association sub-query and the value set of the appearance association column, determining all partitions through the automatic merging partition tree, and establishing the mapping relation, wherein each leaf node of the automatic merging partition tree corresponds to one partition, and the partitions are collected to be used as the inner appearance association column partitions.

6. The system for optimizing a non-equivalence association sub-query as recited in claim 5, wherein the partition mapping module further comprises:

7. The system according to claim 5, wherein in the result merging module, if the query aggregation function is Avg, the intermediate result status information is Sum + count; and if the query aggregation function is Sum/max/min/count, the intermediate result state information is Sum/max/min/count.

8. The system for optimizing a non-equivalence association sub-query as recited in claim 5, wherein the result merging module further comprises: and circularly processing the related judgment of the sub-query results of the outer table and the inner table to obtain the final result of the related sub-query.