CN115033616A

CN115033616A - Data screening rule verification method and device based on multi-round sampling

Info

Publication number: CN115033616A
Application number: CN202210648307.7A
Authority: CN
Inventors: 王尧舒; 谢珉; 樊文飞
Original assignee: Shenzhen Institute of Computing Sciences
Current assignee: Shenzhen Institute of Computing Sciences
Priority date: 2022-06-09
Filing date: 2022-06-09
Publication date: 2022-09-09
Also published as: WO2023236239A1

Abstract

The application provides a data screening rule verification method and a device thereof based on multi-round sampling, which are used for carrying out data screening on target data in a large database to determine a data screening rule; constructing a relation graph G according to the tuples; sampling the vertexes in the vertex set V for K times to generate K pieces of sampling data; searching K sampling data layer by layer to build an integral synchronous parallel computing model; and screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule. The recall rate of the data screening rules can be improved through multiple rounds of sampling; in the process of constructing the relational graph G, the influence of predicates is considered in edge construction, so that useful data which can be used for rule discovery can be sampled at a higher priority; the sampling accuracy is ensured; and the data screening operation efficiency is improved.

Description

Data screening rule verification method and device based on multi-round sampling

Technical Field

The application relates to the field of data processing, in particular to a data screening rule verification method and device based on multi-round sampling.

Background

Rule discovery in relational data is a time consuming and laborious process. Today with large data, the data size is growing at a multiple rate, which presents an unprecedented challenge for rule mining. The regular mining using the existing algorithm may take the user several days or even weeks, which is far from the usage requirements of many scenarios.

The existing Sampling technique is usually Uniform Sampling on a single table (Uniform Sampling). After sampling, single-machine rule mining is performed on the sampled data.

The uniform sampling technique mainly consists of several disadvantages: (1) the uniform sampling is mainly carried out on a single-table rule, and the effect of uniform sampling is greatly reduced; (2) the existing uniform sampling is basically single-round sampling, however, the single-round sampling can cause a great loss of the rule, and therefore, the rule recall rate is low; (3) uniform sampling does not distinguish whether data is useful, all data are treated identically, such an approach is not scientific, and some useful data that can be used for rule discovery should be sampled with higher priority; (4) the existing uniform sampling is basically a heuristic method, and the sampling accuracy cannot be ensured; the existing rule mining algorithm is mainly a single machine algorithm; even if more computing resources are used, the computational efficiency of the rule mining cannot be guaranteed to be improved.

Disclosure of Invention

In view of the problems, the present application is proposed to provide a data screening rule verification method based on multiple sampling rounds and a device thereof, which overcome or at least partially solve the problems, and comprises:

a data screening rule validation method based on multiple sampling rounds for data screening of target data in a large database to determine a data screening rule, the method comprising:

acquiring the target data, and determining a corresponding data relation table according to the target data, wherein each row in the data relation table generates a tuple and at least comprises one tuple;

constructing a relation graph G according to the tuples, wherein the relation graph G comprises a vertex set V and an edge set E;

sampling the vertexes in the vertex set V for K times to generate K pieces of sampling data;

searching K sampling data layer by layer to construct an integral synchronous parallel computing model, wherein the integral synchronous parallel computing model comprises a scheduling node and a plurality of working nodes;

and screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

Further, the step of constructing a relationship graph G according to the tuples, wherein the relationship graph G includes a vertex set V and an edge set E, and includes:

generating a plurality of vertexes according to the tuples;

a plurality of edges are generated by connecting a plurality of vertexes;

constructing an edge set E according to a plurality of vertexes and a plurality of edges; when one edge E in the edge set E is respectively connected with one vertex t and the other vertex s in the vertex set v, tuple pairs constructed corresponding to the vertex t and the vertex s at least meet an equality predicate;

and constructing the relation graph G according to the vertex set V and the edge set E.

Further, the step of sampling the vertices in the vertex set V K times to generate K sampled data includes:

performing K times of random walk sampling on the vertexes in the vertex set V to generate K sampling data; or the like, or, alternatively,

and performing breadth-first sampling on the vertexes in the vertex set V for K times to generate K pieces of sampling data.

Further, the step of performing K random walk samples on the vertices in the vertex set V to generate K sample data includes:

acquiring an initial vertex in the vertex set V through uniform sampling;

performing iterative sampling on the initial vertex according to the random walk sampling to generate a plurality of sampling tuples, wherein each step in the random walk sampling stops at a preset probability epsilon or moves to an adjacent vertex of the current vertex, and the iterative process of the iterative sampling is less than or equal to a preset sample amount;

and generating the K sampling data according to a plurality of sampling tuples.

Further, the step of performing breadth-first sampling on the vertices in the vertex set V K times to generate K pieces of sample data includes:

obtaining an initial tuple pair through uniform sampling in the vertex set V, wherein the initial tuple pair comprises a vertex z and a vertex y;

iteratively sampling the initial element group pair according to the breadth-first sampling to generate a plurality of adjacent vertexes, wherein the target distance of the adjacent vertexes is smaller than or equal to a preset distance between the vertex z and the vertex y;

and generating the K sampling data according to a plurality of adjacent vertexes.

Further, the step of screening the K sampled data according to the scheduling node and the plurality of working nodes to generate a target data screening rule includes:

the scheduling node correspondingly generates a plurality of tasks according to the K sampling data, wherein each task in the plurality of tasks is a triple, and the triple comprises a selected predicate set, a candidate predicate set and an effective predicate;

distributing a plurality of tasks to the working nodes evenly according to the scheduling nodes, wherein the number of the working nodes at least comprises one; when the tasks with the task quantity larger than the preset task quantity exist in the working nodes, distributing half of the task quantity in the working nodes to other working nodes; or when only the tasks with the preset task quantity of 1 exist in the work nodes, splitting data corresponding to the tasks;

and generating the target data screening rule according to the working node, the scheduling node and the K sampling data.

Further, the step of generating the target data screening rule according to the working node, the scheduling node, and the K sampling data includes:

acquiring K sampling data, and determining all predicates to be selected in the predicates to be selected set in the K sampling data;

the working node iteratively adds the predicate to be selected to the selected predicate set;

when the predicate set to be selected is an empty set, the working node stops iteration, and the predicate set to be selected is determined to be the empty set as a first data screening rule; or the like, or, alternatively,

when at least one to-be-selected predicate in the to-be-selected predicate set is related to the selected predicate set, the working node stops iteration, and then it is determined that at least one to-be-selected predicate in the to-be-selected predicate set is related to the selected predicate set as the second data screening rule;

and summarizing the first data screening rule and the second data screening rule to the scheduling node, and generating the target data screening rule by the scheduling node according to the first data screening rule and the second data screening rule.

The application also discloses data screening rule verifying attachment based on many rounds of sampling for target data in big database carries out the data screening and confirms data screening rule, the device includes:

the acquisition module is used for acquiring the target data and determining a corresponding data relation table according to the target data, wherein each row in the data relation table generates a tuple and at least comprises one tuple;

the first construction module is used for constructing a relation graph G according to the tuples, wherein the relation graph G comprises a vertex set V and an edge set E;

the first generation module is used for sampling the vertexes in the vertex set V for K times to generate K sampling data;

the second construction module is used for searching the K sampling data layer by layer to construct an integral synchronous parallel computation model, wherein the integral synchronous parallel computation model comprises a scheduling node and a plurality of working nodes;

and the second generation module is used for screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

The application also discloses a device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, wherein the computer program when executed by the processor implements the steps of a data screening rule validation method based on multiple sampling rounds as described above.

The present application also discloses a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a data screening rule validation method based on multiple sampling rounds as described above.

The application has the following advantages:

in an embodiment of the application, the target data is acquired, and a corresponding data relation table is determined according to the target data, wherein each row in the data relation table generates a tuple and at least comprises one tuple; constructing a relation graph G according to the tuples, wherein the relation graph G comprises a vertex set V and an edge set E; sampling the vertexes in the vertex set V for K times to generate K sampling data; searching K sampling data layer by layer to construct an integral synchronous parallel computing model, wherein the integral synchronous parallel computing model comprises a scheduling node and a plurality of working nodes; and screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule. The sampling method can be carried out on a cross-table rule such as REE supporting a plurality of tuple variables; the sampling method is multi-round sampling, so that the recall rate of the data screening rules can be improved; in the process of constructing the relational graph G, the influence of predicates is considered in edge construction, so that useful data which can be used for rule discovery can be sampled at a higher priority; the sampling accuracy is ensured; therefore, more computing resources are used, and the operation efficiency of rule mining is improved.

Drawings

In order to more clearly illustrate the technical solutions of the present application, the drawings needed to be used in the description of the present application will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.

FIG. 1 is a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application;

FIG. 6 is a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application;

FIG. 7 is a flowchart illustrating steps of a method for multi-sampling-based data screening rule validation according to an embodiment of the present application;

fig. 8 is a block diagram illustrating a structure of a data screening rule verification apparatus based on multiple sampling rounds according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the rule utilized by the present invention is an Entity Enhancing rule (refer to below as "REE"). The basic component of REE is the predicate p, defined as follows:

p:＝R(t)|t.A◎c|t.A◎s.B|M(t.A,s.B)

wherein ∈ is an operator, which may or may not be equal; r (t) represents that t is a tuple variable in the relational table R; t.A denotes the A attribute of variable t; m is a machine learning model that will return true if t.A and s.B are related, otherwise it will return false. t.A c has a constant, called constant predicate; t.A ^ s.B has no constant and is called variable predicate; m (t.A, s.B) is called a machine-learned predicate.

Based on predicates, the definition of REE is: x- > e. Where (1) X is the union of multiple predicates, called the condition for this REE; (2) e is a predicate, called the result of this REE.

One specific REE example is as follows:

express delivery (t) ^ express delivery(s) ^ t. addressee ^ t. address ^ s-zip code ^ 510000 "

The scenario described by this REE is that if the addressees of the express t and the express s are the same person, and the address of the express t is in "shenzhen city, guangdong province", the zip code of the express s must be "510000".

The conventional rule mining algorithm discovers REE in data through a depth-first or breadth-first single-machine mode, and the essence of the REE is a process for enumerating all predicate permutation combinations on the full data D. For all possible predicates, any extraction of one or more of the possible outcomes constitutes a valid rule with the REE result e. Therefore, in order to perform rule mining, the existing method needs to try all permutation combinations of predicates. The efficiency of such brute-force trial stand-alone algorithms is undoubtedly low when the data size is large.

The invention aims to provide a parallel rule discovery technology based on a brand-new sampling method; on the premise of ensuring sampling accuracy, the method supports parallel discovery of cross-table rules such as REE in big data, and solves the problem of efficiency of rule mining in data with huge scale.

By performing rule discovery on the full data D, the process of rule discovery takes much time due to the excessively large amount of data; thus, the prior art proposes that a small data set D can be upsampled from the full data D _s ；D _s May be only one tenth or even one hundredth of D; instead of performing rule discovery on the full data D, by sampling the data D _s And (4) performing rule discovery, wherein although part of rules may be lost when performing rule discovery on the sampled data, the efficiency of rule discovery can be greatly improved.

Referring to fig. 1, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

a data screening rule validation method based on multiple rounds of sampling for data screening target data in a large database to determine a data screening rule, the method comprising:

s110, acquiring the target data, and determining a corresponding data relation table according to the target data, wherein each row in the data relation table generates a tuple and at least comprises one tuple;

s120, constructing a relation graph G according to the tuples, wherein the relation graph G comprises a vertex set V and an edge set E;

s130, sampling the vertexes in the vertex set V for K times to generate K sampling data;

s140, searching K sampling data layer by layer to construct an integral synchronous parallel computing model, wherein the integral synchronous parallel computing model comprises a scheduling node and a plurality of working nodes;

s150, screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

Next, a data screening rule verification method based on multiple sampling rounds in the present exemplary embodiment will be further described.

The purpose of data sampling is to extract a part of sample data D used for rule discovery from the full data D, that is, the target data _s . Specifically, the sampling data D is decimated from the full data D _s Translates into a problem of sampling vertices in the graph G.

It should be noted that the target data is obtained, and a corresponding data relationship table is determined according to the target data, where each row in the data relationship table generates a tuple and includes at least one tuple; namely, several tuples are determined by the data relation table corresponding to the target data, i.e. the full data D.

As stated in step S120, a relationship graph G is constructed according to the tuples, wherein the relationship graph G includes a vertex set V and an edge set E.

In an embodiment of the present invention, the specific process of "building a relationship graph G according to the tuples" in step S120 can be further described with reference to the following description, where the relationship graph G includes a vertex set V and an edge set E ".

Referring to fig. 2, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

as will be described in the following steps,

s210, generating a plurality of vertexes according to the tuples;

s220, connecting the plurality of vertexes to generate a plurality of edges;

s230, constructing an edge set E according to a plurality of vertexes and a plurality of edges; when one edge E in the edge set E is respectively connected with one vertex t and the other vertex s in the vertex set v, the tuple pair constructed corresponding to the vertex t and the vertex s at least meets an equality predicate;

s240, constructing the relation graph G according to the vertex set V and the edge set E.

It should be noted that, a plurality of vertices are generated according to a plurality of tuples, and a plurality of vertices are generated from the plurality of tuples, wherein each vertex corresponds to one tuple in the full data D.

It should be noted that a plurality of edges are determined by a plurality of the vertices, and different vertices are connected to form a plurality of edges.

It should be noted that, an edge set E is constructed according to a plurality of the vertices and a plurality of the edges; when one edge E in the edge set E is respectively connected with one vertex t and the other vertex s in the vertex set V, tuple pairs constructed corresponding to the vertex t and the vertex s at least meet an equality predicate; a plurality of vertexes are collected to form a vertex set V; and a plurality of edges are collected to form an edge set E.

It should be noted that the relationship graph G is constructed according to the vertex set V and the edge set E; thereby constructing a relationship graph G ═ (V, E) for the full data D.

As stated in step S130, K times of sampling the vertices in the vertex set V generate K pieces of sampling data.

In an embodiment of the present invention, a specific process of "sampling K times to generate K sample data for vertices in the vertex set V" in step S130 may be further described with reference to the following description.

Referring to fig. 3, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

as will be described in the following steps,

s310, performing random walk sampling on the vertexes in the vertex set V for K times to generate K sampling data; or the like, or, alternatively,

and S320, performing breadth-first sampling on the vertexes in the vertex set V for K times to generate K sampling data.

It should be noted that, based on the constructed relationship graph G, the vertices in the relationship graph G may be sampled by using any one of random walk sampling and breadth-first sampling. Specifically, random walk sampling or breadth-first sampling is adopted at random, and a user can decide to perform sampling only in a random walk mode, only in a breadth-first mode, or alternatively in two modes of random walk sampling or breadth-first sampling.

In one implementation, the single round of sampling has the disadvantage of poor recall, i.e., a rule globally applicable to the full data D may not be derived from the sample data D _s Is found in (A); therefore, the recall rate can be improved by adopting a multi-round sampling strategy; by running K times of single-round sampling method (random walk sampling or breadth-first sampling), the method obtainsK sample data. We perform rule discovery on each sample data (rule discovery refers to discovery of REEs in the sample data) and then combine the discovered REEs.

As stated in step S310, K random walk samples are performed on the vertices in the vertex set V to generate K sample data.

In an embodiment of the present invention, a specific process of "performing K random walk samples on vertices in the vertex set V to generate K sample data" in step S310 may be further described with reference to the following description.

Referring to fig. 4, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

as will be described in the following steps,

s410, acquiring an initial vertex in the vertex set V through uniform sampling;

s420, performing iterative sampling on the initial vertex according to the random walk sampling to generate a plurality of sampling tuples, wherein each step in the random walk sampling stops or moves to an adjacent vertex of the current vertex by a preset probability epsilon, and the iterative process of the iterative sampling is less than or equal to a preset sample size;

s430, generating the K sampling data according to the sampling tuples.

It should be noted that an initial vertex is obtained through uniform sampling in the vertex set V, a vertex is obtained through uniform sampling in the vertex set V, and the vertex is marked as an initial vertex t, so that a plurality of tuples included in the full data D can be sampled with the same probability.

It should be noted that, iterative sampling is performed on the initial vertex according to the random walk sampling to generate a plurality of sample tuples, and iterative sampling is performed from the initial vertex t through the random walk sampling, so that more sample tuples are obtained.

As an example, for random walk sampling, an initial vertex t is first selected in the vertex set V based on uniform sampling, such that all tuples in the full data DAre all sampled with the same probability; then, iteratively sampling more sampling tuples from an initial vertex t by a mode of simulating random walk, wherein the longest walk length is len; at each step, the random walk will terminate with a probability ∈ or move to the neighboring vertex of the current vertex with a probability of (1 ∈); all vertices in the walk sequence are included in the sample data D _s Performing the following steps; the whole process iterates until | D _s L exceeds the maximum sample size.

As stated in step S320, K breadth-first sampling is performed on the vertices in the vertex set V to generate K pieces of sampling data.

In an embodiment of the present invention, a specific process of "performing breadth-first sampling on the vertices in the vertex set V K times to generate K pieces of sample data" in step S320 may be further described with reference to the following description.

Referring to fig. 5, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

as will be described in the following steps,

s510, acquiring an initial tuple pair in the vertex set V through uniform sampling, wherein the initial tuple pair comprises a vertex z and a vertex y;

s520, performing iterative sampling on the initial element group pair according to the breadth-first sampling to generate a plurality of adjacent vertexes, wherein the target distance of the adjacent vertexes is smaller than or equal to a preset distance between the vertex z and the vertex y;

s530, generating the K sampling data according to a plurality of adjacent vertexes.

It should be noted that, an initial tuple pair is obtained through uniform sampling in the vertex set V, and the initial tuple pair includes a vertex z and a vertex y; the initial tuple set < y, z > is obtained by uniformly sampling in the vertex set V.

It should be noted that, the initial element group is iteratively sampled according to the breadth-first sampling to generate a plurality of adjacent vertexes; and acquiring adjacent nodes of which the distance between the vertex z and the vertex does not exceed a preset distance through the breadth-first sampling pair initial element group < y, z >, wherein the adjacent nodes at least comprise one.

As an example, for breadth-first sampling, we get neighboring nodes with a distance between vertex z and vertex y of no more than (len-1) by uniformly sampling a tuple pair < y, z > within vertex set V, and then using breadth-first sampling; that is, if < y, z > conforms to a REE of length len, breadth first sampling will ensure that at least one REE-conforming data will be sampled.

In step S140, performing a layer-by-layer search on the K sampled data to construct an overall synchronous parallel computation model, where the overall synchronous parallel computation model includes a scheduling node and a plurality of working nodes;

it should be noted that an overall synchronous parallel computation model is constructed by searching layer by layer according to the K sampling data, wherein the overall synchronous parallel computation model comprises a scheduling node and a plurality of working nodes; and performing a parallel rule mining algorithm, namely searching layer by layer on the K sampling data to construct an integral synchronous parallel computing model, wherein the integral synchronous parallel computing model comprises a scheduling node and a plurality of working nodes.

As stated in step S150, the K sampled data are screened according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

In an embodiment of the present invention, a specific process of step S150 "generating target data screening rules by screening data of K pieces of the sample data according to the scheduling node and a plurality of the working nodes" may be further described with reference to the following description.

Referring to fig. 6, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

as will be described in the following steps,

s610, the scheduling node correspondingly generates a plurality of tasks according to the K sampling data, wherein each task in the plurality of tasks is a triple, and the triple comprises a selected predicate set, a to-be-selected predicate set and an effective predicate;

s620, evenly distributing a plurality of tasks to the working nodes according to the scheduling nodes, wherein at least one working node is included; when the tasks with the task quantity larger than the preset task quantity exist in the working nodes, distributing half of the task quantity in the working nodes to other working nodes; or when only the tasks with the preset task quantity of 1 exist in the work nodes, splitting data corresponding to the tasks;

s630, generating the target data screening rule according to the working node, the scheduling node and the K sampling data.

It should be noted that the scheduling node generates a plurality of tasks according to the K sampling data, where each task in the plurality of tasks is a triple, and the triple includes a selected predicate set, a candidate predicate set, and a discovery result; and the scheduling node generates a plurality of corresponding tasks from the plurality of sampling data.

It should be noted that, a plurality of discovery rules are generated by averagely distributing a plurality of tasks to a plurality of work nodes according to the scheduling node; the dispatching nodes equally distribute the tasks to the work nodes, and a plurality of discovery rules, namely REE results e, are generated through the work nodes.

In a specific implementation, the parallel rule mining algorithm is based on layer-by-layer search and consists of one dispatcher, namely a dispatching node, and a plurality of workers, namely work nodes; under the integral synchronous parallel computing model, a dispatcher, namely a dispatching node, is responsible for generating and distributing tasks and is responsible for maintaining rules and balancing loads; workers, namely the working nodes, are responsible for discovering the rules in parallel; the overall computation is divided into a plurality of supersteps, where each superstep corresponds to each level in the rule mining.

Specifically, the dispatcher firstly generates a plurality of tasks according to the REE result e to be discovered, wherein each task consists of a triple marked as<P _sel ，P _re ，e>. Each task consisting of a tripletConsists of the following components:<P _sel ，P _re ，e>(ii) a Wherein P is _sel What is stored is the predicate that has been selected to compose the REE condition, i.e., the selected predicate, and P _re Storing predicates to be selected; each worker transmits the newly discovered rules to the dispatcher, who generates the target rules from the summary.

In each over-step, splitting the tasks of workers with the workload greater than the preset workload, and distributing half of the tasks to other idle workers; if the worker is assumed to be called W, namely a work node W, if a plurality of tasks exist in the W, the W distributes half of the tasks; if only one task remains in W, namely the preset task amount is 1, the task is split by splitting the data corresponding to the task.

As stated in step S630, the target data screening rule is generated according to the working node, the scheduling node, and K pieces of the sampling data.

In an embodiment of the present invention, a specific process of step S630 "generating the target data screening rule according to the working node, the scheduling node, and K pieces of the sampling data" may be further described with reference to the following description.

Referring to fig. 7, a flowchart illustrating steps of a method for verifying a data screening rule based on multiple sampling rounds according to an embodiment of the present application is shown;

as will be described in the following steps,

s710, acquiring K sampling data, and determining all predicates to be selected in the predicate set to be selected in the K sampling data;

s720, the working node iteratively adds the predicate to be selected to the selected predicate set;

s730, when the predicate set to be selected is an empty set, the working node stops iteration, and the predicate set to be selected is determined to be an empty set as a first data screening rule; or the like, or, alternatively,

s740, when at least one to-be-selected predicate in the to-be-selected predicate set is related to the selected predicate set, the working node stops iteration, and it is determined that the at least one to-be-selected predicate in the to-be-selected predicate set is related to the selected predicate set as the second data screening rule;

and S750, summarizing the first data screening rule and the second data screening rule to the scheduling node, and generating the target data screening rule by the scheduling node according to the first data screening rule and the second data screening rule.

It should be noted that the working node iteratively adds the predicate to be selected to the selected predicate set; determining predicates to be selected in the predicate set to be selected through iteration of the working nodes, and then adding the predicates to be selected to the selected predicate set in an iteration mode; by decimating K of said sample data and iteratively selecting P _re Adding the predicate to be selected in the predicate to be selected into P _sel In (1).

It should be noted that, the predicate to be selected is added to the set of selected predicates, and when the set of predicate to be selected is an empty set or the set of predicates to be selected meets a preset condition, the iteration is stopped, and a plurality of discovery rules are generated; wherein the predetermined condition is P _sel → e are effective REEs.

In a specific implementation, initially, P _sel I.e. the selected predicate set is an empty set, P _re Namely, the predicate set to be selected is all possible predicate sets; after all tasks are evenly distributed to all workers, each worker performs parallel rule discovery: by extracting the required sample data and iteratively selecting P _re The predicate of the inner is added into P _sel Until the following conditions are met: p _re Forming an empty set, namely the predicate set to be selected is an empty set, and determining that the predicate set to be selected is an empty set as a first data screening rule; or, P _sel The rule "→ e" is valid REE, namely at least one candidate predicate in the candidate predicate set is related to the selected predicate set, and the fact that the at least one candidate predicate in the candidate predicate set is related to the selected predicate set is determined to be a second data screening rule; the first data screening rule and the second data screening rule are gathered to the dispatching node, and the dispatching node generates a target data screening rule according to the first data screening rule and the second data screening rule.

The invention has the technical effects that:

(1) the proposed sampling method can be performed on a cross-table rule such as REE that supports multiple tuple variables.

(2) The sampling method is multi-round sampling, and compared with single-round sampling, the method has better data screening rule recall rate. That is, a data screening rule globally applicable to D can be applied to small samples D _s Is found in (a).

(3) In the process of constructing the relational graph G, the influence of predicates is considered in the edge construction. In this way, useful data that can be used for rule discovery is sampled with higher priority.

(4) The sampling accuracy is ensured; specifically, the size of the sampling data needs, and the lower bound of the sampling rule recall rate and accuracy can be accurately calculated.

(5) The mining algorithm is a parallel rule discovery algorithm under an integral synchronous parallel computing model.

Therefore, more computing resources are used, and the operation efficiency of data screening rule mining is improved.

In the plurality of public data, the efficiency of mining the data screening rule by using the full data D is compared, and the accuracy of mining the data screening rule by using uniform sampling is also compared; according to the result, it is shown that,

(1) in the sampling data D _s The upper rule shows that the operation efficiency can be improved by 12.2 times on average. On a large DBLP dataset with 3 relational tables, 18 attributes, 180 million pieces of data, the run time is 406 seconds using the method of the present invention, while the run time for data screening rule mining using full data D is 2096 seconds.

(2) After only 10% of data is sampled, the average accuracy rate can reach 90% and the average recall rate can reach 82% by using the method of the invention, which are respectively improved by 2% and 7% compared with the uniform sampling.

(3) The rule discovery algorithm is parallel scalable: if the number of processors used is changed from 4 to 20, the efficiency is improved by 3.38 times.

For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

Referring to fig. 8, a block diagram of a data screening rule verification apparatus based on multiple sampling rounds according to an embodiment of the present application is shown;

a data screening rule verifying device based on multi-round sampling is used for carrying out parallel rule mining on target data in a large database, and specifically comprises the following steps:

an obtaining module 810, configured to obtain the target data and determine a corresponding data relationship table according to the target data, where each row in the data relationship table generates a tuple and each tuple includes at least one tuple;

a first construction module 820, configured to construct a relationship graph G according to the tuples, where the relationship graph G includes a vertex set V and an edge set E;

a first generating module 830, configured to sample vertices in the vertex set V for K times to generate K pieces of sampled data;

a second construction module 840, configured to perform layer-by-layer search on the K sampled data to construct an overall synchronous parallel computation model, where the overall synchronous parallel computation model includes a scheduling node and a plurality of working nodes;

and a second generating module 850, configured to perform data screening on the K sampled data according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

In an embodiment of the present invention, the first building module 820 includes:

the first generation submodule is used for generating a plurality of vertexes according to the tuples;

the second generation submodule is used for generating a plurality of edges by connecting a plurality of vertexes;

the first construction submodule is used for constructing an edge set E according to a plurality of vertexes and a plurality of edges; when one edge E in the edge set E is respectively connected with one vertex t and the other vertex s in the vertex set v, tuple pairs constructed corresponding to the vertex t and the vertex s at least meet an equality predicate;

and the second construction submodule is used for constructing the relationship graph G according to the vertex set V and the edge set E.

In an embodiment of the present invention, the first generating module 830 includes:

a third generation submodule, configured to perform K times of random walk sampling on vertices in the vertex set V to generate K times of sampling data; or the like, or a combination thereof,

and the fourth generation submodule is used for carrying out K times of breadth-first sampling on the vertexes in the vertex set V to generate K times of sampling data.

In an embodiment of the present invention, the third generation sub-module includes:

a first obtaining unit, configured to obtain an initial vertex in the vertex set V through uniform sampling;

a first generating unit, configured to perform iterative sampling on the initial vertex according to the random walk sampling to generate a plurality of sample tuples, where each step in the random walk sampling stops or moves to an adjacent vertex of a current vertex with a preset probability e, and an iterative process of the iterative sampling is less than or equal to a preset sample amount;

and the second generating unit is used for generating the K sampling data according to a plurality of sampling tuples.

In an embodiment of the present invention, the fourth generating sub-module includes:

a second obtaining unit, configured to obtain an initial tuple pair through uniform sampling within the vertex set V, where the initial tuple pair includes a vertex z and a vertex y;

a third generating unit, configured to perform iterative sampling on the initial tuple pair according to the breadth-first sampling to generate a plurality of adjacent vertices, where a target distance of the adjacent vertices is smaller than or equal to a preset distance between the vertex z and the vertex y;

and the fourth generating unit is used for generating the K sampling data according to a plurality of adjacent vertexes.

In an embodiment of the present invention, the second generating module 850 includes:

a fifth generating sub-module, configured to correspondingly generate, by the scheduling node, a plurality of tasks from the K sampled data, where each task in the plurality of tasks is a triple, and the triple includes a selected predicate set, a to-be-selected predicate set, and an effective predicate;

the first allocating submodule is used for averagely allocating a plurality of tasks to the working nodes according to the scheduling nodes, wherein the working nodes at least comprise one; when the tasks with the task quantity larger than the preset task quantity exist in the working nodes, distributing half of the task quantity in the working nodes to other working nodes; or when only the tasks with the preset task quantity of 1 exist in the work nodes, splitting data corresponding to the tasks;

and the sixth generation submodule is used for generating the target data screening rule according to the working node, the scheduling node and the K sampling data.

In an embodiment of the present invention, the sixth generating sub-module includes:

a third obtaining unit, configured to obtain K pieces of the sample data, and determine all predicates to be selected in the predicate set to be selected in the K pieces of the sample data;

the iteration unit is used for iteratively adding the predicates to be selected to the selected predicate set by the working nodes;

the first determining unit is used for determining that the predicate set to be selected is an empty set as a first data screening rule if the working node stops iteration when the predicate set to be selected is an empty set; or the like, or, alternatively,

a second determining unit, configured to determine that at least one predicate to be selected in the predicate sets to be selected is related to the selected predicate set if the working node stops iteration when the at least one predicate to be selected in the predicate sets to be selected is related to the selected predicate set, and then determine that the at least one predicate to be selected in the predicate sets to be selected is related to the second data screening rule;

a fifth generating unit, configured to collect the first data screening rule and the second data screening rule to the scheduling node, where the scheduling node generates the target data screening rule according to the first data screening rule and the second data screening rule.

Referring to fig. 9, a computer device of a data screening rule verification method based on multiple sampling rounds according to the present invention is shown, which may specifically include the following:

the computer device 12 described above is embodied in the form of a general purpose computing device, and the components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus 18 structures, including a memory bus 18 or memory controller, a peripheral bus 18, an accelerated graphics port, and a processor or local bus 18 using any of a variety of bus 18 architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus 18, micro-channel architecture (MAC) bus 18, enhanced ISA bus 18, audio Video Electronics Standards Association (VESA) local bus 18, and Peripheral Component Interconnect (PCI) bus 18.

Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (commonly referred to as "hard drives"). Although not shown in FIG. 9, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42, with the program modules 42 configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules 42, and program data, each of which examples or some combination thereof may include an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, camera, etc.), with one or more devices that enable an operator to interact with the computer device 12, and/or with any device (e.g., network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN)), a Wide Area Network (WAN), and/or a public network (e.g., the Internet) via network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be appreciated that although not shown in FIG. 9, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34, among others.

The processing unit 16 executes programs stored in the system memory 28 to perform various functional applications and data mining, such as implementing a multi-round sampling-based data screening rule validation method provided by embodiments of the present invention.

That is, the processing unit 16 implements, when executing the program,: acquiring the target data, and determining a corresponding data relation table according to the target data, wherein each row in the data relation table generates a tuple and at least comprises one tuple; constructing a relation graph G according to the tuples, wherein the relation graph G comprises a vertex set V and an edge set E; sampling the vertexes in the vertex set V for K times to generate K sampling data; searching K sampling data layer by layer to construct an integral synchronous parallel computing model, wherein the integral synchronous parallel computing model comprises a scheduling node and a plurality of working nodes; and screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

In an embodiment of the present invention, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements a data screening rule validation method based on multiple sampling cycles as provided in all embodiments of the present application:

that is, the program when executed by the processor implements: acquiring the target data, and determining a corresponding data relation table according to the target data, wherein each row in the data relation table generates a tuple and at least comprises one tuple; constructing a relation graph G according to the tuples, wherein the relation graph G comprises a vertex set V and an edge set E; sampling the vertexes in the vertex set V for K times to generate K sampling data; searching K sampling data layer by layer to construct an integral synchronous parallel computing model, wherein the integral synchronous parallel computing model comprises a scheduling node and a plurality of working nodes; and screening the K sampling data according to the scheduling node and the plurality of working nodes to generate a target data screening rule.

Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the operator's computer, partly on the operator's computer, as a stand-alone software package, partly on the operator's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the operator's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true scope of the embodiments of the present application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The data screening rule verification method and the data screening rule verification device based on multi-round sampling provided by the application are introduced in detail, specific examples are applied in the text to explain the principle and the implementation mode of the application, and the description of the above embodiments is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A data screening rule verification method based on multi-round sampling is used for carrying out data screening on target data in a large database to determine a data screening rule, and is characterized by comprising the following steps:

sampling the vertexes in the vertex set V for K times to generate K sampling data;

and screening the K sampling data according to the scheduling nodes and the working nodes to generate a target data screening rule.

2. The method of claim 1, wherein the step of constructing a relationship graph G from the tuples, wherein the relationship graph G comprises a vertex set V and an edge set E, comprises:

generating a plurality of vertexes according to the tuples;

a plurality of edges are generated by connecting a plurality of vertexes;

3. The method of claim 1, wherein the step of sampling the vertices in the set of vertices V K times to generate K sampled data comprises:

performing random walk sampling on the vertexes in the vertex set V for K times to generate K sampling data; or the like, or, alternatively,

and performing breadth-first sampling on the vertexes in the vertex set V for K times to generate K sampling data.

4. The method of claim 3, wherein the step of performing K random walk samples on the vertices in the vertex set V to generate K sample data comprises:

acquiring an initial vertex in the vertex set V through uniform sampling;

and generating the K sampling data according to a plurality of sampling tuples.

5. The method according to claim 3, wherein the step of K breadth-first sampling the vertices in the set of vertices V to generate K sampled data comprises:

6. The method according to claim 1, wherein the step of generating target data screening rules by screening the K sampled data according to the scheduling node and a plurality of the working nodes comprises:

distributing a plurality of tasks to the working nodes according to the scheduling nodes, wherein the working nodes at least comprise one; when the tasks with the task quantity larger than the preset task quantity exist in the working nodes, distributing half of the task quantity in the working nodes to other working nodes; or when only the tasks with the preset task quantity of 1 exist in the work nodes, splitting data corresponding to the tasks;

7. The method of claim 6, wherein said step of generating said target data screening rule based on said working node, said dispatch node, and K of said sampled data comprises:

when the predicate set to be selected is an empty set, the working node stops iteration, and the predicate set to be selected is determined to be an empty set as a first data screening rule; or the like, or, alternatively,

8. A data screening rule verification device based on multi-round sampling is used for carrying out data screening on target data in a large database to determine a data screening rule, and is characterized by comprising the following steps:

9. A computer device comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the method of any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.