CN109634949A - A hybrid data cleaning method based on multiple data versions - Google Patents
A hybrid data cleaning method based on multiple data versions
- Publication number
- CN109634949A
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- versions
- rule
- score
- Prior art date
- 2018-12-28
- Legal status
- Granted
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a hybrid data cleaning method based on multiple data versions. Using a Markov logic network probabilistic graphical model together with the minimal-repair principle, the invention integrates qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result both removes the dirty data that violates the rule constraints at minimal change cost to the data set and complies with the data's statistical properties. The invention first partitions the entire data set into blocks and groups with a Markov logic indexing technique, and then performs a two-stage data cleaning. The first stage introduces a confidence-score criterion and cleans the data in each group, producing multi-version cleaning results; the second stage introduces a fusion-score criterion and merges the multi-version results produced by the first stage into a final, unified cleaning result.
Description
Technical field
The present invention relates to techniques for cleaning erroneous data in the field of computer databases, and in particular to a hybrid data cleaning method based on multiple data versions.
Background art
The purpose of data cleaning is to find the content in a data set that is most likely to be erroneous and to provide a reliable method for correcting it. Dirty data is data in a data set that contains errors.
Today, with the continual emergence of new channels for publishing information, represented by social networks and e-commerce, and with the rise of cloud computing and Internet-of-Things technologies, data is growing and accumulating at an unprecedented rate. In data analysis, dirty data not only leads to wrong decisions and unreliable analyses, it can also inflict economic damage on enterprises. Data cleaning has therefore attracted great interest in both industry and academia. Data cleaning is the process of detecting and repairing erroneous data; its goals are to delete redundancy, correct existing errors, and maintain the consistency of the data.
Scholars at home and abroad have done some work on data cleaning methods. The current mainstream methods can be roughly divided into two classes, qualitative and quantitative. (1) Qualitative methods mainly clean erroneous data that violates integrity-constraint rules; their evaluation criterion is the minimal-cost principle, which requires the cleaning to minimize the changes made to the data set. Their drawback is that they cannot clean erroneous data that fails to satisfy the minimal-cost principle even though it still violates the integrity constraints. (2) Quantitative methods build a suitable model over the data's probability distribution to determine the cleaning strategy. Their drawback is a strong dependence on the training set: enough clean, known data must be provided as a training set to build a reliable model, which is impractical in today's big-data environments. Moreover, most current quantitative methods clean data less accurately than qualitative methods, and existing methods have long running times.
Summary of the invention
In view of the above deficiencies, the present invention provides a hybrid data cleaning method based on multiple data versions. By combining qualitative and quantitative techniques, the method both guarantees that the data violating the ICs is cleaned and makes the cleaning result comply with statistical properties. The method is based on Markov logic networks: it first partitions the entire data set into blocks and groups with a Markov logic indexing technique, and then performs a two-stage data cleaning. In the first stage, each block is cleaned independently, yielding multi-version cleaning results; in the second stage, the conflicts among the multiple data versions are resolved, yielding a final, globally unified cleaning result. The Markov logic indexing technique narrows the detection range of dirty data, so the cleaning can be carried out efficiently.
To achieve the above objective, the technical solution adopted by the present invention is as follows. A hybrid data cleaning method based on multiple data versions comprises the following steps (a control-flow sketch in Python follows this list):
(1) Obtain the dirty data set and the relevant integrity-constraint rules (ICs).
(2) Convert the different types of integrity-constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice.
(3) Build a Markov logic index structure over the dirty data set: first partition the dirty data set into different data blocks according to the rules, one block per rule, the minimum unit within each block being a data slice; then further partition each data block into different data groups.
(4) On the basis of step (3), perform the first-stage cleaning: introduce the confidence-score criterion and clean each data group independently, obtaining multiple data versions of preliminary cleaning results.
(5) Perform the second-stage cleaning: introduce the fusion-score criterion, merge the data versions of the multiple preliminary cleaning results produced in the first stage, and resolve the conflicts among the versions, producing a final, unified cleaning result.
(6) Mark the duplicate entries in the dirty data set and delete the duplicates that still remain after the two-stage cleaning above.
(7) Output the cleaned data set.
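The seven steps can be summarized in the following control-flow sketch. It is an illustration only: every function body is a trivial placeholder for the procedures elaborated below, and all names are ours rather than the patent's.

```python
# A control-flow sketch of the seven steps. Each body is a placeholder.

def normalize(ic):                          # step (2): IC -> MLN rule
    return ic

def markov_logic_index(tuples, rules):      # step (3): one block per rule
    return [list(tuples) for _ in rules]

def clean_block(block):                     # step (4): first-stage cleaning
    return block

def fuse(versions):                         # step (5): second-stage fusion
    return versions[0]

def drop_duplicates(rows):                  # step (6): elaborated later
    return rows

def pipeline(tuples, ics):
    rules = [normalize(ic) for ic in ics]
    blocks = markov_logic_index(tuples, rules)
    versions = [clean_block(b) for b in blocks]
    return drop_duplicates(fuse(versions))  # step (7): cleaned data set

print(pipeline([{"CT": "BOAZ", "ST": "AL"}], ["CT -> ST"]))
```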
Further, step (2) is specifically:
(2.1) normalize the different types of input integrity constraints into Markov logic network rules via conjunctive-normal-form transformation;
(2.2) replace all variables in the normalized rules with the corresponding constants from the data set.
Further, step (3) is specifically:
(3.1) partition the entire dirty data set into multiple data blocks according to the integrity-constraint rules it involves, one block per rule, each block containing several data slices;
(3.2) within each data block, divide the entries containing the same keyword in their attributes into the same group, where the keyword is the reason term of the rule; data slices with the same reason are placed in one group.
Further, step (4) is specifically:
(4.1) Handle abnormal data: the phenomenon in which an error in a data item causes its corresponding data slice to be assigned to the wrong group is called an "anomaly"; such mistaken data slices are re-assigned to their proper groups.
(4.2) Compute the confidence score (reliability score) of the abnormal data in each group using a similarity distance metric and the Markov logic network weight-learning method.
(4.3) Clean each data group independently: the cleaning unit is each group within a data block; select the data slice γ with the maximum confidence score as the replacement baseline and use it to replace the other doubtful data in the same group, until every data group in the block has been cleaned, which completes the independent cleaning of that block.
The same cleaning is likewise performed on the other data blocks. The multiple preliminary cleaning results produced in this stage are regarded as multiple data versions, one data version per data block.
Further, step (5) is specifically:
(5.1) First, record all the different data versions at a conflicting position as baselines; then, starting from each baseline, find in every other data block (excluding the block containing the baseline) the data slice that does not conflict with the baseline and has the maximum Markov weight, and merge it with the baseline.
(5.2) Repeat the above merging operation until all data blocks have been traversed; then compute the fusion score of the merged result under that baseline, f-score(t) = w1 × … × wm, where wi denotes the Markov weight of the data slice merged from the i-th data block.
(5.3) Select another baseline as the start, perform the merging again, and compute and record its fusion score, until the fusion scores of the merged results under all baselines have been obtained; then select the merged result with the maximum fusion score as the final, globally unified cleaning result for that tuple.
Further, step (6) is specifically: after the two-stage cleaning is complete, scan the entire data set, build a hash table over its tuples, and reject any duplicate entry encountered during the scan.
The invention has the following advantages. The present invention is a hybrid data cleaning method based on qualitative and quantitative techniques. It unifies multiple types of integrity constraints through Markov logic network rules, and uses the Markov logic network weight-learning method together with a similarity distance metric as the joint basis for cleaning, so that the cleaning result both satisfies the minimal-cost principle required by qualitative techniques and complies with the statistical properties required by quantitative techniques. In addition, the optimization devised in this invention, the Markov logic index, narrows the detection range of dirty data and shortens the running time of the cleaning. Experiments on real and synthetic data sets show that the invention achieves higher cleaning efficiency and cleaning precision than currently popular systems.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation steps of the invention;
Fig. 2(a) is the Markov logic network index structure formed over the hospital data set according to rule r1 (FD: CT → ST);
Fig. 2(b) is the Markov logic network index structure formed over the hospital data set according to rule r2 (the DC requiring hospitals in different states to have different phone numbers);
Fig. 2(c) is the Markov logic network index structure formed over the hospital data set according to rule r3 (CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"]);
Fig. 3(a) is a schematic diagram of the Markov logic network index structure corresponding to rule r1 after the first-stage cleaning;
Fig. 3(b) is a schematic diagram of the Markov logic network index structure corresponding to rule r2 after the first-stage cleaning;
Fig. 3(c) is a schematic diagram of the Markov logic network index structure corresponding to rule r3 after the first-stage cleaning;
Fig. 4 is a schematic diagram of the second-stage cleaning process.
Detailed description of the embodiments
The technical solution of the present invention is now described further with reference to the drawings and a specific implementation.
As shown in Fig. 1, the specific implementation process and working principle of the invention are as follows:
Step (1): Input the integrity constraints (ICs) and the data set containing dirty data into the framework. The dirty data set and the integrity constraints are illustrated below with Table 1.
Table 1 shows records of a hospital data set with four attributes: hospital name (HN), city (CT), state (ST), and phone number (PN); the grey-shaded cells in Table 1 mark the erroneous data. Three integrity constraints are given (reconstructed here from the verbal descriptions that follow, the original formula images not being reproduced):
r1 (FD): ∀t1, t2 ∈ D: t1[CT] = t2[CT] → t1[ST] = t2[ST]
r2 (DC): ∀t1, t2 ∈ D: ¬(t1[ST] ≠ t2[ST] ∧ t1[PN] = t2[PN])
r3 (CFD): HN["ELIZA"], CT["BOAZ"] → PN["2567688400"]
where D denotes the data set and t1, t2 denote two different tuples. The functional dependency (FD) rule r1 states that a city can only belong to one state; the denial constraint (DC) rule r2 states that hospitals in different states have different phone numbers; and the conditional functional dependency (CFD) rule r3 states that a hospital's name together with its city and state determine the hospital's phone number.
Table 1: (sample hospital records; not reproduced here)
Step (2): Convert the different types of integrity-constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice.
The specific steps are:
1) normalize the different types of input integrity constraints into Markov logic network rules via conjunctive-normal-form transformation;
2) replace the variables in the normalized rules with the constants from the data set.
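As an illustration of this grounding, the sketch below instantiates an FD-style rule such as r1 (CT → ST) with the constants of every tuple pair, yielding one data slice per grounded clause. The data-slice representation, the function name, and the sample rows are our own assumptions, not the patent's encoding.

```python
from itertools import combinations

def ground_fd(tuples, lhs, rhs):
    """Instantiate the normalized rule lhs -> rhs with the constants of
    every tuple pair; each grounded clause becomes one data slice."""
    slices = []
    for (i, t1), (j, t2) in combinations(enumerate(tuples), 2):
        if t1[lhs] == t2[lhs]:                   # the clause premise is grounded
            slices.append({
                "rule": f"{lhs} -> {rhs}",
                "tuples": (i, j),
                "violated": t1[rhs] != t2[rhs],  # marks a dirty grounding
            })
    return slices

rows = [
    {"CT": "DOTHAN", "ST": "AL"},
    {"CT": "DOTHAN", "ST": "FL"},   # hypothetical error: same city, two states
]
print(ground_fd(rows, "CT", "ST"))
```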
Step (3): Build a Markov logic index structure over the dirty data set: first partition the dirty data set into different data blocks according to the rules, one block per rule, the minimum unit of each block being the data slice; then further partition each data block into different groups. The specific steps are:
1) partition the entire dirty data set into multiple data blocks according to the integrity-constraint rules it involves, one block per rule, each block containing several data slices γ;
2) within each data block, divide the entries containing the same keyword in their attributes into the same group, where the keyword is the reason term of the rule; slices γ with the same reason are placed in one group.
The construction of the Markov logic network index is illustrated with Fig. 2(a), Fig. 2(b), and Fig. 2(c). Taking the data set of Table 1 as a sample, the given constraint rules involve HN, CT, ST, and PN, so the data set is divided into three blocks B1, B2, B3 according to the three rules, taking care to distinguish the reason attributes from the result attributes in each constraint. Next, each of the three blocks is grouped: the arrays whose reason-attribute keywords are identical are placed in one group. For example, the three arrays of G13 in B1 all have the same reason keyword, so they form one group. The Markov logic network index structures corresponding to B1, B2, and B3 are shown in Fig. 2(a), Fig. 2(b), and Fig. 2(c), respectively.
Step (4): On the basis of step (3), perform the first-stage cleaning. Introduce the confidence-score criterion and clean each data group independently, obtaining multiple data versions (each data version coming from a different block). The details are as follows:
1) Handle abnormal data. The phenomenon in which an error in a data item causes its corresponding data slice to be assigned to the wrong group is called an "anomaly"; such mistaken data slices are re-assigned to their proper groups.
2) Compute the confidence score (reliability score, r-score) of the abnormal data in each group using a similarity distance metric and the Markov logic network weight-learning method. In the r-score formula (the original formula image is not reproduced here), d(γi, γ*) denotes the distance between a data slice γi and its candidate replacement γ*, and w(γi) denotes the Markov weight of the data slice γi.
3) Clean each data block independently. Specifically, the cleaning unit is each group within a data block: we select the data slice γ with the maximum confidence score as the replacement baseline and use it to replace the other doubtful data in the same group. When every group in the block has been cleaned, the independent cleaning of that block is complete. The same cleaning is likewise performed on the other data blocks. The multiple preliminary cleaning results produced in this stage are regarded as multiple data versions, one per data block. The Markov logic index structures after this stage of cleaning are shown in Fig. 3(a), Fig. 3(b), and Fig. 3(c).
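A sketch of the first-stage cleaning of a single group follows. Since the exact r-score formula is not reproduced in this text, the form used here, a Markov weight discounted by the distance, is purely our assumption for illustration, as is the choice of Levenshtein distance as the similarity metric.

```python
def edit_distance(a, b):
    """Levenshtein distance, one possible similarity distance metric."""
    if not a:
        return len(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def r_score(weight, distance):
    # ASSUMED functional form: weight discounted by distance (not the patent's).
    return weight / (1.0 + distance)

def clean_group(candidates):
    """candidates: list of (value, markov_weight) in one group. Pick the most
    reliable value and replace every doubtful value in the group with it."""
    scored = []
    for value, w in candidates:
        # Distance to the nearest differing candidate stands in for d(gamma_i, gamma*).
        d = min((edit_distance(value, v) for v, _ in candidates if v != value),
                default=0)
        scored.append((r_score(w, d), value))
    best = max(scored)[1]
    return [best] * len(candidates)

group = [("2567688400", 0.9), ("2567688400", 0.9), ("2567688499", 0.2)]
print(clean_group(group))   # every entry is replaced by the winning value
```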
Step (5): Because the first-stage cleaning produces multi-version data results, different data versions may conflict, i.e., the same position in the data set receives different cleaning results in different versions. Therefore, the fusion-score criterion is introduced to resolve the multi-version conflicts and obtain the final, globally unified cleaning result.
Take tuple t3 of Table 1 as an example. After the first-stage cleaning, the data slice related to t3 in B1 is {CT: DOTHAN, ST: AL} (the first data version), whereas the data slice related to t3 in B3 is {HN: ELIZA, CT: BOAZ, PN: 2567688400} (the third data version). Evidently, t3[CT] receives two different values after the first-stage cleaning ("DOTHAN" and "BOAZ"), coming from two different data versions. In other words, t3 has a conflict on attribute CT, and to obtain a final consistent cleaning result the conflict must be resolved.
The step proceeds as follows:
1) Detect all tuples containing conflicts and record the data slices where each conflict occurs. As shown in Fig. 4, t3 corresponds to two conflicting data slices, α1 ∈ B1 and α2 ∈ B3, which serve as the baselines that generate the different candidate schemes.
2) For each baseline, merge the corresponding data slices across the other data blocks. Two cases must be considered: if there is no conflict between the slice to be merged and the baseline, merge directly; if there is a conflict, find in the block of the slice to be merged another data slice that does not conflict with the baseline and has the maximum Markov weight, perform the merge, take the newly merged slice as the baseline, and repeat until all data blocks have been merged. Note that if no satisfactory data slice can be found during the merging, the merge is deemed impossible under that baseline.
3) After step 2), each tuple containing a conflict has several possible candidate schemes. A fusion score (f-score) is introduced to score each candidate scheme, and the highest-scoring scheme is selected as the final result; the fusion-score formula is f-score(t) = w1 × … × wm. As shown in Fig. 4, for the merging scheme based on α1 ∈ B1, no satisfactory data slice can be found when merging the corresponding slice in B3, so the merge is deemed impossible under that baseline and f-score(t3) = 0. With α2 ∈ B3 as the baseline, the merged result is t3 = {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, with f-score(t3) = 0.0678. The second merging scheme is therefore taken as the final cleaning result for t3.
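The candidate generation and f-score selection just described can be sketched as follows, modeling a data slice as a pair of an attribute-value dictionary and a Markov weight. The weights below are invented for illustration, which is why the printed score differs from the 0.0678 above; the structure of the example, failure under α1 and success under α2, mirrors the walkthrough.

```python
from math import prod

def conflicts(a, b):
    """Two slices conflict when they disagree on a shared attribute."""
    return any(k in b and b[k] != a[k] for k in a)

def fuse(base, other_blocks):
    """Grow the baseline slice with the heaviest compatible slice of every
    other block; return (f_score, merged), with f_score = 0 when some block
    offers no compatible slice (the merge is impossible under this baseline)."""
    merged, weights = dict(base[0]), [base[1]]
    for block in other_blocks:
        fits = [(w, s) for s, w in block if not conflicts(merged, s)]
        if not fits:
            return 0.0, None
        w, s = max(fits, key=lambda ws: ws[0])   # maximum-Markov-weight slice
        merged.update(s)
        weights.append(w)
    return prod(weights), merged

# t3's two conflicting baselines from the walkthrough (weights invented):
a1 = ({"CT": "DOTHAN", "ST": "AL"}, 0.5)                        # alpha1 in B1
a2 = ({"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}, 0.3)   # alpha2 in B3
b1_slices = [({"CT": "DOTHAN", "ST": "AL"}, 0.5), ({"CT": "BOAZ", "ST": "AL"}, 0.4)]
b3_slices = [({"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}, 0.3)]

print(fuse(a1, [b3_slices]))   # (0.0, None): fails, as in the walkthrough
print(fuse(a2, [b1_slices]))   # succeeds; f-score is the product of the weights
```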
Step (6): After the two-stage cleaning is complete, we scan the entire data set, build a hash table over its tuples, and reject any duplicate entry encountered during the scan.
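A minimal sketch of this duplicate scan, with a Python set standing in for the hash table; names are illustrative:

```python
def drop_duplicates(rows):
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))   # hashable fingerprint of the tuple
        if key not in seen:                # first occurrence: keep it
            seen.add(key)
            kept.append(row)
    return kept

cleaned = [{"CT": "BOAZ", "ST": "AL"}, {"CT": "BOAZ", "ST": "AL"}]
print(drop_duplicates(cleaned))            # the repeated entry is rejected
```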
Step (7): Output the cleaned data set.
Claims (6)
1. A hybrid data cleaning method based on multiple data versions, characterized in that the method comprises the following steps:
(1) obtaining a dirty data set and the relevant integrity-constraint rules (ICs);
(2) converting the different types of integrity-constraint rules into normalized Markov logic network rules, and instantiating the converted rules with the constants contained in each tuple of the dirty data set, each instantiated rule being called a data slice;
(3) building a Markov logic index structure over the dirty data set: first partitioning the dirty data set into different data blocks according to the rules, one block per rule, the minimum unit of each block being a data slice, and then further partitioning each data block into different data groups;
(4) on the basis of step (3), performing a first-stage cleaning: introducing a confidence-score criterion and cleaning each data group independently to obtain multiple data versions of preliminary cleaning results;
(5) performing a second-stage cleaning: introducing a fusion-score criterion, merging the data versions of the multiple preliminary cleaning results produced in the first stage, and resolving the conflicts among the versions to produce a final, unified cleaning result;
(6) marking the duplicate entries in the dirty data set and deleting the duplicates that still remain after the two-stage cleaning above;
(7) outputting the cleaned data set.
2. The hybrid data cleaning method based on multiple data versions according to claim 1, characterized in that step (2) is specifically:
(2.1) normalizing the different types of input integrity constraints into Markov logic network rules via conjunctive-normal-form transformation;
(2.2) replacing all variables in the normalized rules with the corresponding constants from the data set.
3. The hybrid data cleaning method based on multiple data versions according to claim 1, characterized in that step (3) is specifically:
(3.1) partitioning the entire dirty data set into multiple data blocks according to the integrity-constraint rules it involves, one block per rule, each block containing several data slices;
(3.2) within each data block, dividing the entries containing the same keyword in their attributes into the same group, wherein the keyword is the reason term of the rule and data slices with the same reason are placed in one group.
4. The hybrid data cleaning method based on multiple data versions according to claim 1, characterized in that step (4) is specifically:
(4.1) handling abnormal data: the phenomenon in which an error in a data item causes its corresponding data slice to be assigned to the wrong group is called an "anomaly", and such mistaken data slices are re-assigned to their proper groups;
(4.2) computing the confidence score (reliability score) of the abnormal data in each group using a similarity distance metric and the Markov logic network weight-learning method;
(4.3) cleaning each data group independently: the cleaning unit being each group within a data block, selecting the data slice γ with the maximum confidence score as the replacement baseline and using it to replace the other doubtful data in the same group, until every data group in the block has been cleaned, thereby completing the independent cleaning of the block;
performing the same cleaning on the other data blocks, and regarding the multiple preliminary cleaning results of this stage as multiple data versions, one data version per data block.
5. The hybrid data cleaning method based on multiple data versions according to claim 1, characterized in that step (5) is specifically:
(5.1) first recording all the different data versions at a conflicting position as baselines; then, starting from each baseline, finding in every other data block, excluding the block containing the baseline, the data slice that does not conflict with the baseline and has the maximum Markov weight, and merging it with the baseline;
(5.2) repeating the merging operation until all data blocks have been traversed, and then computing the fusion score f-score(t) = w1 × … × wm of the merged result under that baseline, wherein wi denotes the Markov weight of the data slice merged from the i-th data block;
(5.3) selecting another baseline as the start, performing the merging again, and computing and recording its fusion score, until the fusion scores of the merged results under all baselines are obtained; then selecting the merged result with the maximum fusion score as the final, globally unified cleaning result for the tuple.
6. The hybrid data cleaning method based on multiple data versions according to claim 1, characterized in that step (6) is specifically: after the two-stage cleaning is complete, scanning the entire data set, building a hash table over its tuples, and rejecting any duplicate entry encountered during the scan.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201811628044.3A | 2018-12-28 | 2018-12-28 | Mixed data cleaning method based on multiple data versions
Publications (2)
Publication Number | Publication Date
---|---
CN109634949A | 2019-04-16
CN109634949B | 2022-04-12