
CN105045807A - Data cleaning algorithm based on Internet trading information - Google Patents

Data cleaning algorithm based on Internet trading information

Info

Publication number
CN105045807A
Authority
CN
China
Prior art keywords
tuple
data
value
expert knowledge
knowledge library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510305440.2A
Other languages
Chinese (zh)
Inventor
陈海江
吕浩
邵奇可
颜世航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Li Shi Science And Technology Co Ltd
Original Assignee
Zhejiang Li Shi Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Li Shi Science And Technology Co Ltd
Priority to CN201510305440.2A
Publication of CN105045807A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for cleaning data from different data sources, i.e., different Internet trading platforms. The method first classifies the tuples in a database; tuple data confirmed to be correct interacts with an expert knowledge base through patterns, and fuzzy matching over the content retrieved from the knowledge base is used as a tool to obtain the corresponding pattern knowledge; the pattern knowledge found is then used to clean the applicable data that has quality problems. Suitable and efficient detection schemes are also provided for the different types of quality errors in mass data. A BP (back propagation) neural network method is adopted to realize a self-learning expert knowledge base, providing an efficient and safe way of cleaning Internet trading information data.

Description

Data cleaning algorithm for Internet trading information
Technical field
The present invention relates to the field of computer applications and, in particular, to a data cleaning algorithm for Internet trading information.
Background
Internet trading in China has continued to grow rapidly in recent years, with an average growth rate of 80% over the past five years. In 2013, total e-commerce turnover exceeded 10 trillion yuan (RMB), and China's online retail market overtook that of the United States to become the largest in the world. As e-commerce has developed, problems that the market cannot resolve on its own have also emerged, including false product advertising, widespread counterfeit goods, rampant online fraud and phishing websites, non-standard logistics and delivery services, difficult returns and poor reverse logistics, and leakage of users' personal information. One main cause is that the credit evaluation systems of different e-commerce platforms follow different specifications. At the same time, the data in e-commerce information systems keeps growing, reaching the TB or even PB scale. Once such mass data is aggregated, outdated content, input errors, duplicate entries, and conflicting attribute values seriously degrade data quality, so that the data cannot meet the needs of the supervisory system within the assurance system.
To overcome the problems caused by poor data quality, data processing techniques are essential. Many methods have been proposed for obtaining higher-quality data by processing it, and among them data cleaning is the most important.
Existing data cleaning methods mainly include the following:
1. Cleaning data by means of the functional dependencies between keys in relational data is the more direct method, but its rule mining is insufficient for mass data such as Internet trading information.
2. Methods based on conditional functional dependencies add semantic constraints to functional dependencies and can effectively clean data tuples among which functional dependencies exist. However, Internet trading information comes from different e-commerce platforms, the functional dependencies of much of the data are unclear, and for some data no functional relationship can be obtained before cleaning.
3. Data cleaning with human participation: when the system encounters a situation it cannot handle during cleaning, the next cleaning step is carried out through human feedback. The advantage is that human participation greatly improves accuracy, but the process is time-consuming. Moreover, different people cannot be guaranteed to apply identical criteria when judging dependency rules, so the method relies too heavily on subjective judgment.
4. Machine-learning feedback: the human feedback process is replaced by machine learning. The machine first learns correct cleaning operations before cleaning begins and then keeps accumulating knowledge during cleaning. This improves the time efficiency of the algorithm, but accuracy drops somewhat, the learning process adds system overhead, and the cleaning process still depends heavily on the dependencies among the data.
In summary, current data cleaning methods have certain limitations with respect to the requirements of processing Internet trading information.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a data cleaning algorithm for Internet trading information.
The data cleaning algorithm for Internet trading information provided by the present invention comprises:
Performing data quality problem detection on the Internet trading information data to be cleaned, to obtain clean tuples, correct tuples, and problem tuples;
For the clean tuples: sending them directly to the clean database;
For the correct tuples: generating the key statement needed to search the expert knowledge base, querying the expert knowledge base with the key statement to obtain an expert knowledge base pattern, the pattern comprising textual dependency statements, and sending the tuples to the clean database after data cleaning based on the expert knowledge base pattern;
For the problem tuples: performing a feasibility judgment to obtain feasible tuples, which are suitable for cleaning based on the expert knowledge base pattern, and infeasible tuples, which are not;
For the feasible tuples: generating the key statement for searching the expert knowledge base, querying the expert knowledge base to obtain an expert knowledge base pattern, and sending the tuples to the clean database after data cleaning;
For the infeasible tuples: sending them to the clean database after cleaning with other strategies.
As a preferred scheme, the expert knowledge base uses a BP neural network algorithm to achieve self-learning; the BP neural network algorithm is specifically as follows:
Consider a neural network with m layers and a given set of Internet trading information samples $X_i$ (i = 1, 2, ..., n). Let $U_i^k$ be the total input to the i-th neuron of layer k and $X_i^k$ its output, let $W_{ij}$ be the weight coefficient from the j-th neuron of layer k-1 to the i-th neuron of layer k, and let f() be the activation function of each neuron. The relations among these variables can then be expressed as:
$$X_i^k = f(U_i^k)$$
$$U_i^k = \sum_j W_{ij} X_j^{k-1}$$
In the formulas, the number of input-layer nodes is n, the number of hidden-layer nodes is h, and the number of output-layer nodes is o; the weight matrices linking the input layer to the hidden layer and the hidden layer to the output layer are $W_h$ and $W_o$, with thresholds $b_h$ and $b_o$, respectively;
As a preferred scheme, the error function of the expert knowledge base is the sum of squared differences between the desired output and the actual output:
$$e = \frac{1}{2} \sum_i \left( X_i^m - Y_i \right)^2$$
where $Y_i$ is the desired value of output unit i, layer m is the output layer, and $X_i^m$ is the actual output. The BP algorithm uses the steepest descent method of nonlinear programming and modifies the weight coefficients along the negative gradient direction of the error function e.
As a preferred scheme, the differences among all vectors in an Internet trading information sample are measured by the Mahalanobis distance used in machine learning. For l vectors $X_1, \dots, X_l$, the most reasonable vector $X_k$ is selected as the standard output of the BP neural network for sample training. Let S denote the covariance matrix of the vectors contained in the sample; the Mahalanobis distance between vectors $X_i$ and $X_j$ is:
$$D(X_i, X_j) = (X_i - X_j)^T S^{-1} (X_i - X_j)$$
$$\min \left\{ \sum_{i=0}^{l} D(X_k, X_i) \right\}$$
Each element of the covariance matrix S is the covariance $\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$ between vector elements, where E is the mathematical expectation of the vectors contained in the sample.
As a preferred scheme, a problem tuple contains missing values, and/or erroneous values, and/or conflicting values;
A missing value is an attribute value that is absent from the data. Missing values are detected as follows: for each tuple $T(A_1, A_2, \dots, A_m)$ in the Internet trading information data to be cleaned $D(T_1, T_2, \dots, T_n)$, each attribute A is checked; if an attribute value is missing, the tuple is a problem tuple containing a missing value;
An erroneous value is an attribute value that is present in the data but identified as wrong. Erroneous values are detected as follows: for each tuple $T(A_1, A_2, \dots, A_m)$ in the Internet trading information data to be cleaned $D(T_1, T_2, \dots, T_n)$, conditional dependency detection based on conditional functional dependencies is performed; if the attributes of the data do not satisfy the conditional functional dependencies, the tuple is a problem tuple containing an erroneous value;
A conflicting value arises when an attribute of the data has multiple corresponding values. Conflicting values are detected as follows: tuple matching is first performed on the Internet trading information data to be cleaned to find potentially conflicting tuple pairs, and these pairs are then clustered to obtain the problem tuples containing conflicting values.
As a preferred scheme, the tuple matching is specifically:
S1: Perform pairwise similarity matching on the tuples in the Internet trading information data to be cleaned. If the similarity of a tuple pair reaches the preset similarity threshold, the pair points to the same entity; tuples pointing to the same entity form a group;
S2: Create BloomFilter arrays for the group, one per tuple attribute, and check attribute by attribute whether each tuple in the group is in the corresponding BloomFilter array; for tuples found in the same BloomFilter array, accumulate the weight of the tuple;
S3: If the weight of a tuple exceeds the preset upper limit, extract the tuple as a potentially conflicting tuple.
As a preferred scheme, querying the expert knowledge base with the key statement to obtain the expert knowledge base pattern is specifically:
Send the key statement to the search engine of the expert knowledge base, obtain and parse the query results returned by the expert knowledge base, and perform pattern mining with the optimal fuzzy matching method to obtain the expert knowledge base pattern.
As a preferred scheme, the infeasible tuples include:
Tuples whose number of data attributes is less than the preset lower limit on the attribute count, or whose degree of association between attributes is weaker than the preset lower limit on the degree of association;
Tuples whose attributes have quality problems that cannot be repaired through the corresponding expert knowledge base pattern;
Tuples in which the data of different attributes are erroneous at the same time and cannot be repaired.
As a preferred scheme, the cleaning with other strategies includes:
If the data set provides constraint conditions, cleaning the data directly using the constraint functions;
If multiple descriptions of the same entity exist, the most accurate of these descriptions must be selected, and a truth discovery algorithm is used to select it and clean the data; that is, truth discovery is achieved by learning the accuracy of the data sources, so that the descriptions are not treated as equivalent but are given different weights.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention proposes a method for cleaning data from different Internet trading platforms. The tuples in the database are first classified; the tuple data confirmed to be correct interacts with the expert knowledge base through patterns, and fuzzy matching over the retrieved knowledge base content is used as a tool to obtain the corresponding pattern knowledge. The pattern knowledge found is then used to clean the data that has quality problems and is suitable for such cleaning. In addition, suitable and efficient detection schemes are proposed for the different types of quality errors in mass data. The self-learning expert knowledge base, implemented with a BP neural network method, provides a more efficient and safer way of cleaning Internet trading information data.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is the flow of the data cleaning algorithm for Internet trading information in the embodiment;
Fig. 2 is the tuple matching flow using BloomFilter;
Fig. 3 is the flow for obtaining the expert knowledge base pattern;
Fig. 4 is the cleaning process based on the expert knowledge base pattern using pattern mining;
Fig. 5 is the flow of the BP neural network method.
Embodiments
The present invention is described in detail below by way of specific embodiments with reference to the drawings. The following embodiments will help those skilled in the art to further understand the invention but do not limit it in any form. It should be noted that other embodiments may also be used, and structural and functional modifications may be made to the embodiments listed here, without departing from the scope and spirit of the invention.
In an embodiment of the data cleaning algorithm for Internet trading information provided by the present invention, as shown in Fig. 1, data quality problem detection is performed on the Internet trading information data to be cleaned, to obtain clean tuples, correct tuples, and problem tuples;
For the clean tuples: sending them directly to the clean database;
For the correct tuples: generating the key statement needed to search the expert knowledge base, querying the expert knowledge base with the key statement to obtain an expert knowledge base pattern, the pattern comprising textual dependency statements, and sending the tuples to the clean database after data cleaning based on the expert knowledge base pattern;
For the problem tuples: performing a feasibility judgment to obtain feasible tuples, which are suitable for cleaning based on the expert knowledge base pattern, and infeasible tuples, which are not;
For the feasible tuples: generating the key statement for searching the expert knowledge base, querying the expert knowledge base to obtain an expert knowledge base pattern, and sending the tuples to the clean database after data cleaning;
For the infeasible tuples: sending them to the clean database after cleaning with other strategies.
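The routing described above can be sketched roughly as follows. This is only an illustrative outline: the function names `classify`, `is_feasible`, `clean_with_kb`, and `clean_with_other` are hypothetical placeholders for the detection, feasibility judgment, knowledge-base cleaning, and other-strategy cleaning steps, none of which are named in the source.

```python
def route_tuples(tuples, classify, is_feasible, clean_with_kb, clean_with_other):
    """Send every tuple to the clean database along one of the four paths.

    classify(t) is assumed to return "clean", "correct", or "problem".
    """
    clean_db = []
    for t in tuples:
        label = classify(t)
        if label == "clean":
            clean_db.append(t)                    # sent directly
        elif label == "correct":
            clean_db.append(clean_with_kb(t))     # cleaned via the knowledge base pattern
        elif is_feasible(t):
            clean_db.append(clean_with_kb(t))     # feasible problem tuple
        else:
            clean_db.append(clean_with_other(t))  # infeasible problem tuple
    return clean_db
```

Passing the four steps in as functions keeps the sketch independent of how each step is actually realized.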
The Internet trading information in this embodiment comprises trading subject information and trading behavior information. Data quality problems here refer to a decline in data quality caused by data loss or garbled data during transmission, or by sequence errors or header errors during data crawling. Because data quality can seriously affect the analysis and decision-making of the supervisory system, the present invention proposes a cleaning algorithm for Internet trading information data to overcome the problems caused by data quality. The algorithm mainly comprises data quality problem detection, pattern interaction with the expert knowledge base, and cleaning.
One embodiment of the specific content of the above algorithm modules is shown in Fig. 1. A problem tuple contains missing values, and/or erroneous values, and/or conflicting values. Data quality problem monitoring checks whether the data tuples of the Internet trading information have problems; the quality problems in the data mainly include missing values, erroneous values, and conflicting values.
A missing value is an attribute value that is absent from the data. Missing values are detected as follows: for each tuple $T(A_1, A_2, \dots, A_m)$ in the Internet trading information data to be cleaned $D(T_1, T_2, \dots, T_n)$, each attribute A is checked; if an attribute value is missing, the tuple is a problem tuple containing a missing value. In other words, a missing value is a vacancy in an attribute of the trading information obtained from the Internet. This kind of error is often caused by data integration: for example, when two data sources with different numbers of attributes are integrated, the attribute values of some tuples will be vacant.
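A minimal sketch of the missing-value check, assuming each tuple is represented as a dictionary keyed by attribute name and that `None` or an empty string counts as a vacancy (both are assumptions made for illustration, not details from the source):

```python
def find_missing_value_tuples(dataset, attributes,
                              is_missing=lambda v: v is None or v == ""):
    """Return (tuple, missing_attributes) for every tuple with a vacant attribute."""
    problems = []
    for t in dataset:
        missing = [a for a in attributes if a not in t or is_missing(t[a])]
        if missing:
            problems.append((t, missing))
    return problems
```

An attribute that is absent altogether, as happens when two sources with different schemas are merged, is treated the same way as an empty one.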
An erroneous value is an attribute value that is present in the data but identified as wrong. Erroneous values are detected as follows: for each tuple $T(A_1, A_2, \dots, A_m)$ in the Internet trading information data to be cleaned $D(T_1, T_2, \dots, T_n)$, conditional dependency detection based on conditional functional dependencies is performed; if the attributes of the data do not satisfy the conditional functional dependencies, the tuple is a problem tuple containing an erroneous value. An erroneous value thus means that the attribute of the data exists but is identified as wrong; such errors are often caused by data crawling or input mistakes. For massive Internet trading data, conditional functional dependencies are used to judge erroneous values. Before the detection of erroneous values starts, the knowledge of the conditional functional dependencies, i.e. the standard set m, is known; conditional dependency detection is then performed on each tuple $T(A_1, A_2, \dots, A_m)$ in a data set $D(T_1, T_2, \dots, T_n)$. For example, if an attribute corresponds to trading behavior but its data is actually trading subject content, such as the seller's shop name, then the conditional functional dependency corresponding to the standard set cannot be satisfied; the data is an erroneous value, and the tuple containing it is a problem tuple.
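The check can be illustrated with a deliberately simplified sketch. It assumes the standard set is a list of constant rules of the form (condition, consequence), each a dictionary of attribute values, and it tests tuples one at a time; general conditional functional dependencies that compare pairs of tuples are not covered. The rule format and names are assumptions made here.

```python
def violates(t, rule):
    """A tuple violates a rule if it matches the condition but not the consequence."""
    condition, consequence = rule
    if all(t.get(a) == v for a, v in condition.items()):
        return any(t.get(a) != v for a, v in consequence.items())
    return False


def find_erroneous_tuples(dataset, rules):
    """Return the tuples that break at least one rule in the standard set."""
    return [t for t in dataset if any(violates(t, r) for r in rules)]
```

A tuple that does not match a rule's condition is left alone, which mirrors the "conditional" part of the dependency.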
A conflicting value arises when an attribute of the data has multiple corresponding values. Conflicting values are detected as follows: tuple matching is first performed on the Internet trading information data to be cleaned to find potentially conflicting tuple pairs, and these pairs are then clustered to obtain the problem tuples containing conflicting values.
A conflicting value means that an attribute of the data has several possible values, of which only one is correct. This kind of error is identified by entity recognition, and a true attribute value must be selected or constructed to eliminate the conflict. To do this for mass data, tuple matching is first performed to find potentially conflicting tuple pairs, and the pairs found are then clustered to finally obtain the set of conflicting tuples.
(1) The tuple matching is specifically:
S1: Perform pairwise similarity matching on the tuples in the Internet trading information data to be cleaned. If the similarity of a tuple pair reaches the preset similarity threshold, the pair points to the same entity; tuples pointing to the same entity form a group;
S2: Create BloomFilter arrays for the group, one per tuple attribute, and check attribute by attribute whether each tuple in the group is in the corresponding BloomFilter array; for tuples found in the same BloomFilter array, accumulate the weight of the tuple;
S3: If the weight of a tuple exceeds the preset upper limit, extract the tuple as a potentially conflicting tuple.
As one embodiment of the matching process, the upper and lower limits of the judgment are set first. If the similarity of two tuples reaches the upper limit, the two tuples are considered to point to the same entity; if the similarity of two tuples is below the lower limit at the first comparison, they are considered to point to different entities. Next, m BloomFilter arrays are created for each group, where m is the number of attributes in the first class of attributes, and the attributes of the N tuples in each group are inserted into the corresponding BloomFilter arrays. The first-class attributes of each tuple are then checked for duplicates in the corresponding BloomFilter; if the result returned is Yes, these attributes are redundant within the group. If two tuples have similar attributes, the weight is accumulated; when the weight exceeds the upper limit, the tuples are removed from the data source and output, and if the upper limit is not reached, the next attribute is compared. If the weight has still not reached the lower limit after all first-class attributes have been compared, the two tuples are considered different and the remaining attributes are not compared. When the weight lies between the lower and upper limits, the remaining attributes are compared sequentially to decide whether the tuples match. The matching process structures each BloomFilter as an array of linked lists: when an attribute is hashed to a number, only a node needs to be added to the corresponding linked list. In this way, comparison is restricted to the tuples for which the BloomFilter duplicate check returns Yes, which greatly improves comparison efficiency. The comparison process using BloomFilter arrays is shown in Fig. 2.
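The weight accumulation can be sketched as below. This is a simplified illustration rather than the procedure of Fig. 2: it uses a plain bit-array Bloom filter instead of the linked-list structure, counts only exact repeats of attribute values, and omits the lower-limit and sequential-comparison branches. The class, the function names, and the filter parameters are all assumptions.

```python
import hashlib


class BloomFilter:
    """Small bit-array Bloom filter; false positives are possible, false negatives are not."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = True

    def __contains__(self, value):
        return all(self.bits[p] for p in self._positions(value))


def potential_conflicts(group, attributes, upper_limit):
    """Flag tuples whose count of already-seen attribute values exceeds upper_limit."""
    filters = {a: BloomFilter() for a in attributes}  # one filter per attribute
    flagged = []
    for t in group:
        weight = 0
        for a in attributes:
            value = t.get(a)
            if value in filters[a]:
                weight += 1  # value was probably inserted by an earlier tuple
            filters[a].add(value)
        if weight > upper_limit:
            flagged.append(t)
    return flagged
```

Because the filters are filled while scanning, the first occurrence of a value is never flagged; only later tuples that repeat it accumulate weight.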
(2) Tuple clustering is carried out on the matching results and gathers all duplicate tuples together from the pairwise-similar tuples. Each tuple is regarded as a point, and an edge connects each pair of tuples matched earlier. Clustering the tuples is then equivalent to clustering all the points on the graph: closely connected points are found and placed in one community, which completes the tuple clustering operation.
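One simple way to realize this graph view is to take connected components with a union-find structure, as in the sketch below. Treating every connected component as one community is an assumption made here; the source does not say which community detection method is used, and a stricter method could split loosely connected components further.

```python
def cluster_matches(num_tuples, matched_pairs):
    """Group tuple indices that are linked, directly or indirectly, by a match."""
    parent = list(range(num_tuples))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in range(num_tuples):
        clusters.setdefault(find(i), []).append(i)
    # Singletons matched nothing, so they are not conflict clusters.
    return [members for members in clusters.values() if len(members) > 1]
```

For example, `cluster_matches(6, [(0, 1), (1, 2), (4, 5)])` groups tuples 0, 1, 2 together and tuples 4, 5 together, and leaves tuple 3 out.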
The expert knowledge base uses a BP neural network algorithm to achieve self-learning. The expert knowledge base is trained through the BP neural network method, and quality problem detection and cleaning are then performed on Internet e-commerce trading data according to the trained knowledge base.
In the learning process of the expert knowledge base using the BP neural network method, the initial data of Internet e-commerce transactions is obtained first. The number of input detection indicators and the number of output detection indicators are then entered, the output of each layer is calculated, the error e between the actual output and the target output is calculated, the local gradients are calculated, the output-layer weights are corrected, and the hidden-layer weights are corrected. It is then determined whether the training set still contains untrained samples. If so, the procedure returns to the step of entering the numbers of input and output detection indicators and continues. If all samples have been trained, it is determined whether the error satisfies its condition or whether the number of iterations satisfies its condition; if either is satisfied, learning ends. Otherwise, e is reset to 0 and the procedure returns to the step of entering the numbers of input and output detection indicators and is executed again.
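The training loop described above can be sketched for a network with one hidden layer. Several choices here are assumptions for illustration and are not stated in the source: the sigmoid activation, the learning rate `lr`, the stopping tolerance `tol`, the random initialization, and treating the thresholds as additive bias terms.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def train_bp(samples, n_in, n_hidden, n_out, lr=0.5, max_epochs=2000, tol=1e-3, seed=0):
    """Train a one-hidden-layer network; return the summed error e of each pass."""
    rng = random.Random(seed)
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    b_o = [0.0] * n_out
    history = []
    for _ in range(max_epochs):
        e = 0.0  # the error is reset to 0 before every pass over the training set
        for x, y in samples:
            # Compute the output of each layer.
            hid = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                   for row, b in zip(w_h, b_h)]
            out = [sigmoid(sum(w * hj for w, hj in zip(row, hid)) + b)
                   for row, b in zip(w_o, b_o)]
            e += 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))
            # Local gradients, computed before any weight is changed.
            d_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, y)]
            d_hid = [hj * (1.0 - hj) * sum(d_out[i] * w_o[i][j] for i in range(n_out))
                     for j, hj in enumerate(hid)]
            # Correct the output-layer weights, then the hidden-layer weights.
            for i in range(n_out):
                for j in range(n_hidden):
                    w_o[i][j] -= lr * d_out[i] * hid[j]
                b_o[i] -= lr * d_out[i]
            for j in range(n_hidden):
                for k in range(n_in):
                    w_h[j][k] -= lr * d_hid[j] * x[k]
                b_h[j] -= lr * d_hid[j]
        history.append(e)
        if e < tol:  # error condition met; max_epochs is the iteration condition
            break
    return history
```

Returning the per-pass error makes it easy to check that training is actually reducing e before the network is relied on.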
As shown in Fig. 5, in the BP neural network algorithm the number of input-layer nodes is n, the number of hidden-layer nodes is h, and the number of output-layer nodes is o; the weight matrices linking the input layer to the hidden layer and the hidden layer to the output layer are $W_h$ and $W_o$, with thresholds $b_h$ and $b_o$, respectively. Consider a neural network with m layers and a given set of Internet trading information samples $X_i$ (i = 1, 2, ..., n). Let $U_i^k$ be the total input to the i-th neuron of layer k and $X_i^k$ its output, let $W_{ij}$ be the weight coefficient from the j-th neuron of layer k-1 to the i-th neuron of layer k, and let f() be the activation function of each neuron. The relations among these variables can then be expressed as formulas (1) and (2):
$$X_i^k = f(U_i^k) \tag{1}$$
$$U_i^k = \sum_j W_{ij} X_j^{k-1} \tag{2}$$
The sum of squared differences between the desired output and the actual output is defined as the error function of the expert knowledge base, as shown in formula (3):
$$e = \frac{1}{2} \sum_i \left( X_i^m - Y_i \right)^2 \tag{3}$$
where $Y_i$ is the desired value of output unit i, layer m is the output layer, and $X_i^m$ is the actual output. The BP algorithm uses the steepest descent method of nonlinear programming and modifies the weight coefficients along the negative gradient direction of the error function e.
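For the weights feeding the output layer m, the steepest-descent correction follows from formulas (1) to (3) by the chain rule. The learning rate $\eta$ is a symbol introduced here for illustration and does not appear in the source.

```latex
\Delta W_{ij} = -\eta \, \frac{\partial e}{\partial W_{ij}}, \qquad
\frac{\partial e}{\partial W_{ij}}
  = \frac{\partial e}{\partial X_i^m} \cdot \frac{\partial X_i^m}{\partial U_i^m} \cdot \frac{\partial U_i^m}{\partial W_{ij}}
  = \left( X_i^m - Y_i \right) f'\!\left( U_i^m \right) X_j^{m-1}
```

The factor $(X_i^m - Y_i) f'(U_i^m)$ is the local gradient of output neuron i; corrections for earlier layers reuse it by propagating it backwards through the weights.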
Here the Mahalanobis distance used in machine learning measures the differences among all vectors in a sample. For l vectors $X_1, \dots, X_l$, the most reasonable vector $X_k$ is selected as the standard output of the BP neural network for sample training. The most reasonable vector $X_k$ is the vector whose sum of distances to all the other vectors is smallest. With the covariance matrix denoted S, the Mahalanobis distance between vectors $X_i$ and $X_j$ is defined in formula (4), and $X_k$ satisfies formula (5):
$$D(X_i, X_j) = (X_i - X_j)^T S^{-1} (X_i - X_j) \tag{4}$$
$$\min \left\{ \sum_{i=0}^{l} D(X_k, X_i) \right\} \tag{5}$$
Each element of the covariance matrix is the covariance $\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$ between vector elements, where E is the mathematical expectation of the Internet trading information.
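Selecting the most reasonable vector can be sketched in plain Python as follows. The sketch follows the formula as printed, without a square root, and estimates the covariance by averaging over the l vectors; it also assumes the covariance matrix is invertible. The function names are hypothetical.

```python
def covariance_matrix(vectors):
    """Covariance of the vector components, averaged over all vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[a] for v in vectors) / n for a in range(dim)]
    return [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / n
             for b in range(dim)] for a in range(dim)]


def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis(x, y, s):
    """(x - y)^T S^{-1} (x - y), computed without forming the inverse explicitly."""
    d = [a - b for a, b in zip(x, y)]
    return sum(di * zi for di, zi in zip(d, solve(s, d)))


def most_reasonable_vector(vectors):
    """Index k of the vector with the smallest summed distance to all vectors."""
    s = covariance_matrix(vectors)
    totals = [sum(mahalanobis(xk, xi, s) for xi in vectors) for xk in vectors]
    return min(range(len(vectors)), key=totals.__getitem__)
```

Including the distance from $X_k$ to itself in the sum adds zero, so it does not change which vector is selected.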
In the BP neural network training process for the data cleaning expert knowledge base, an initial value is set for the Mahalanobis distance between data tuples in the expert knowledge base. With the monitored Internet trading data divided by content into the two application scenarios of trading subjects and trading behavior, an initial value of 0.7 gives the best results.
Pattern interaction with the expert knowledge base divides the data that has passed quality detection into two major classes, correct tuples and problem tuples. Both are cleaned under the guidance of the expert knowledge base and placed in the clean database, providing high-quality data for subsequent analysis and processing.
For a correct tuple, the keyword generation module first generates the Query statement needed to search the expert knowledge base; the textual dependency statements are then obtained, and data cleaning based on the expert knowledge base is completed. For an incorrect tuple, a feasibility judgment is made first: feasible tuples undergo data cleaning based on the expert knowledge base, while infeasible tuples are cleaned with other strategies.
As one embodiment, querying the expert knowledge base with the key statement to obtain the expert knowledge base pattern is specifically:
Send the key statement to the search engine of the expert knowledge base, obtain and parse the query results returned by the expert knowledge base, and perform pattern mining with the optimal fuzzy matching method to obtain the expert knowledge base pattern.
The process of obtaining the expert knowledge base pattern is shown in Fig. 3. For the correct tuples output by the data quality detection module, the assembled Query statement is sent to the search engine of the expert knowledge base; the results returned by the expert knowledge base are parsed and stored; pattern mining is then carried out with the optimal fuzzy matching method; and finally data cleaning based on the expert knowledge base is performed. For a given attribute value Attrl and pattern knowledge Pattern, the two are combined and searched. The search results record the information containing the attribute value currently to be cleaned and the number of times each record appears. Through sorting, fuzzy matching, and containment comparison, the optimal fuzzy-matched sentence Sentence is found; splitting it with the given attribute value and the pattern knowledge then yields the recommended value. The cleaning process is shown in Fig. 4.
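The matching and splitting steps can be illustrated with Python's standard `difflib`. This is a sketch under assumptions: the source does not specify the similarity measure, so `SequenceMatcher.ratio()` stands in for the optimal fuzzy matching method, containment of the attribute value is used as the first ranking criterion, and the function names and sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def best_fuzzy_sentence(attr_value, pattern, sentences):
    """Pick the sentence that contains the attribute value and best matches the query."""
    query = f"{attr_value} {pattern}".lower()

    def score(sentence):
        lowered = sentence.lower()
        contains = attr_value.lower() in lowered
        ratio = SequenceMatcher(None, query, lowered).ratio()
        return (contains, ratio)  # containment first, similarity as tie-breaker

    return max(sentences, key=score)


def split_recommendation(sentence, attr_value, pattern):
    """Remove the attribute value and the pattern text; what remains is the recommendation."""
    rest = sentence.replace(attr_value, "", 1).replace(pattern, "", 1)
    return rest.strip(" .,;")
```

With the attribute value "Hangzhou" and the pattern "is the capital of", the sentence "Hangzhou is the capital of Zhejiang." is selected from a list of search results and split to give "Zhejiang" as the recommended value.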
For problem tuples, a feasibility judgment is performed first to select the data tuples suitable for cleaning based on the expert knowledge base pattern, and data cleaning is then carried out. The infeasible tuples include:
1. Tuples whose number of data attributes is less than the preset lower limit on the attribute count, or whose degree of association between attributes is weaker than the preset lower limit on the degree of association;
2. Tuples whose attributes have quality problems that cannot be repaired through the corresponding expert knowledge base pattern;
3. Tuples in which the data of different attributes are erroneous at the same time and cannot be repaired.
That is, the following three situations are judged as infeasible tuples: the number of data attributes is small or the relations between attributes are too weak, so that the dependencies between attributes cannot be obtained by searching the expert knowledge base; an attribute has quality problems and the data cannot be repaired through the corresponding expert knowledge base pattern; and the data of different attributes in a tuple are erroneous at the same time and cannot be repaired. Other cleaning means are needed for infeasible tuples. If the data set provides constraint conditions, the data is cleaned directly using the constraint functions. When multiple descriptions of the same entity exist, the most accurate of these descriptions must be selected, and a truth discovery algorithm is used to select it and clean the data; that is, truth discovery is achieved by learning the accuracy of the data sources, so that the descriptions are not treated as equivalent but are given different weights.
Truth discovery finds, among the conflicting values, the one that describes the real-world entity most accurately. A data-source-based method is used, which involves two important measures: the accuracy of a data source and the dominance relation between data sources.
The accuracy of a data source is the proportion of accurate descriptions among all the data in that source. A data cleaning technique based on data source accuracy must continually learn the accuracy of each data source; if the accuracy A(D) of data source D is the highest, D is used to clean the data.
The dominance relation between data sources is their transitive relation: if the description of a tuple T in data source D1 is identical to that in D2, D2 is said to dominate D1. In this embodiment, the data source is the Internet transaction information data to be cleaned.
The data cleaning by other strategies includes:
if the data set provides constraint conditions, cleaning the data directly using constraint functions;
if multiple descriptions of the same entity exist, selecting the most accurate of these descriptions with a truth discovery algorithm, i.e. learning the accuracy of the data sources and assigning the descriptions different weights so that they are not treated as equivalent, thereby achieving truth discovery.
Multiple descriptions of the same entity exist when several data tuples represent the same transaction. In that case the accuracy of the data sources must be learned and the descriptions weighted differently, so that they are not treated as equivalent, in order to achieve truth discovery.
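One common way to realize accuracy-weighted truth discovery is an iterative vote, sketched below. The text does not specify the algorithm, so the update rule, the initial accuracy of 0.8, the smoothing and the function name `discover_truth` are all assumptions made for illustration.

```python
from collections import defaultdict

def discover_truth(claims, rounds=10, initial_accuracy=0.8):
    """claims: {source: {entity: value}} -> (truth per entity, accuracy per source)."""
    accuracy = {source: initial_accuracy for source in claims}
    truth = {}
    for _ in range(rounds):
        # Each source votes for its value with a weight equal to its accuracy.
        votes = defaultdict(lambda: defaultdict(float))
        for source, described in claims.items():
            for entity, value in described.items():
                votes[entity][value] += accuracy[source]
        truth = {entity: max(v, key=v.get) for entity, v in votes.items()}
        # Re-estimate accuracy as the smoothed share of a source's
        # descriptions that agree with the current truth.
        for source, described in claims.items():
            hits = sum(truth[entity] == value for entity, value in described.items())
            accuracy[source] = (hits + 1) / (len(described) + 2)
    return truth, accuracy
```

Sources that repeatedly disagree with the weighted majority end up with a lower accuracy, so their descriptions carry less weight in later rounds.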
In one embodiment, a transaction in the JD.com (Jingdong) store reads: "Order No.: 8971959437" + "Transaction time: March 10, 2105, 15:00" + "Lenovo Computer Official Flagship Store" + "Lenovo G40-70MA 14.0-inch notebook computer" + "i5-4258U 4G 500G 2G discrete graphics GT820M graphics card DVD burner Win8" + "metallic black" + "Sun Xiang" + "Xihu District, Hangzhou, Zhejiang Province" + "Zhejiang University of Technology" + "310023" + "151XXXXXXXX" + "cash on delivery".
Depending on the source platform of the acquired Internet e-commerce transaction information, the data are first keyworded according to the pattern knowledge in the trained knowledge base.
The shop name and the product attribute values are concatenated, so the generated query statements resemble Query = {"Lenovo Computer Official Flagship Store + Lenovo G40-70MA 14.0-inch notebook computer", "Lenovo G40-70MA 14.0-inch notebook computer + metallic black"}. A missing value caused by data loss, or an erroneous value introduced during transmission, is repaired and cleaned according to the corresponding information in the BP expert knowledge base.
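The concatenation of attribute values into query statements can be sketched as follows. The field names (`shop`, `product`, `color`) and the list of attribute pairs are hypothetical; in the method they would come from the pattern knowledge of the trained knowledge base.

```python
def build_queries(record, attribute_pairs):
    """Join the values of each attribute pair into one query string."""
    queries = []
    for left, right in attribute_pairs:
        left_value, right_value = record.get(left), record.get(right)
        # Skip a pair when either value is absent: a missing value
        # cannot contribute to a search query.
        if left_value and right_value:
            queries.append(f"{left_value}+{right_value}")
    return queries

record = {
    "shop": "Lenovo Computer Official Flagship Store",
    "product": "Lenovo G40-70MA 14.0-inch notebook computer",
    "color": "metallic black",
}
queries = build_queries(record, [("shop", "product"), ("product", "color")])
```

Here `queries` holds the two statements of the example: shop plus product, and product plus color.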
If the order number is missing or erroneous, it can be repaired from the order transaction time according to the order naming rule of the corresponding e-commerce platform; the order number and the transaction time can also be checked against each other to determine whether either is an erroneous or conflicting value. The order naming rule of the corresponding e-commerce platform is one kind of conditional functional dependency.
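The mutual check between order number and transaction time can be illustrated with a purely hypothetical naming rule in which the order number begins with the transaction date as YYYYMMDD. Real platforms define their own rules, and the sample order number in the embodiment above does not follow this made-up one; the function names are likewise illustrative.

```python
from datetime import datetime

def expected_prefix(trade_time):
    """Hypothetical naming rule: the order number starts with YYYYMMDD."""
    return trade_time.strftime("%Y%m%d")

def check_order(order_no, trade_time):
    """Classify an (order number, transaction time) pair under the rule."""
    if not order_no:
        return "missing"
    if order_no.startswith(expected_prefix(trade_time)):
        return "consistent"
    return "conflict"

def repair_prefix(order_no, trade_time, serial_length=6):
    """Restore the date prefix; the trailing serial digits are kept as they are."""
    serial = (order_no or "")[-serial_length:]
    return expected_prefix(trade_time) + serial
```

Only the part of the order number that the rule derives from the transaction time can be repaired this way; the serial part has to come from the record itself or from another source.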
The present invention proposes a framework for cleaning data that originate from different Internet transaction platforms. The tuples in the database are first classified; the tuples determined to be correct interact with the expert knowledge base pattern, and fuzzy matching over the content retrieved from the knowledge base is used as the tool for obtaining the corresponding pattern knowledge. The pattern knowledge found is then used to clean the data that have quality problems and are suitable for this kind of cleaning. In addition, suitable and efficient detection schemes are proposed for the different types of quality errors in massive data.
At present, the cleaning of Internet transaction information is still limited to processing a single platform, and the handling of data from different platform sources is still at an exploratory stage. By learning through the expert knowledge base, the present invention can effectively resolve the data heterogeneity caused by inconsistent data sources.
The foregoing is only a preferred embodiment of the present invention. Those skilled in the art will appreciate that various changes or equivalent substitutions may be made to these features and embodiments without departing from the spirit and scope of the present invention. In addition, under the teaching of the present invention, these features and embodiments may be modified to adapt to a particular situation or material without departing from the spirit and scope of the present invention. Therefore, the present invention is not limited by the specific embodiments disclosed herein, and all embodiments falling within the scope of the claims of the present application belong to the protection scope of the present invention.

Claims (9)

1. A data cleaning algorithm for Internet transaction information, characterized by comprising:
performing data quality problem detection on the Internet transaction information data to be cleaned to obtain clean tuples, correct tuples and problem tuples;
for the clean tuples: sending them directly into a clean database;
for the correct tuples: generating the key sentences to be searched in an expert knowledge base, querying the expert knowledge base with the key sentences to obtain an expert knowledge base pattern, the expert knowledge base pattern comprising text dependency statements, and sending the tuples into the clean database after data cleaning according to the expert knowledge base pattern;
for the problem tuples: performing a feasible-tuple judgment to obtain feasible tuples, which are suitable for cleaning based on the expert knowledge base pattern, and infeasible tuples, which are not suitable for cleaning based on the expert knowledge base pattern,
generating for the feasible tuples the key sentences to be searched in the expert knowledge base, querying the expert knowledge base to obtain the expert knowledge base pattern, and sending the feasible tuples into the clean database after data cleaning,
sending the infeasible tuples into the clean database after data cleaning by other strategies.
2. The data cleaning algorithm for Internet transaction information according to claim 1, characterized in that the expert knowledge base uses a BP neural network algorithm to realize self-learning, the BP neural network algorithm being specifically:
For a neural network of m layers and a given Internet transaction information sample set $X_i$ $(i = 1, 2, \ldots, n)$, let the input sum of the i-th neuron of the k-th layer be $U_i^k$ and its output be $X_i^k$; let the weight coefficient from the j-th neuron of the (k-1)-th layer to the i-th neuron of the k-th layer be $W_{ij}$; and let the activation function of each neuron be $f(\cdot)$. The relations among the variables can then be expressed as:
$$X_i^k = f(U_i^k), \qquad U_i^k = \sum_j W_{ij} X_j^{k-1}$$
In the formula, the number of input-layer nodes is n, the number of hidden-layer nodes is h, and the number of output-layer nodes is o; the weight matrices connecting the input layer to the hidden layer and the hidden layer to the output layer are $W_h$ and $W_o$, respectively, with thresholds $b_h$ and $b_o$.
3. The data cleaning algorithm for Internet transaction information according to claim 2, characterized in that the sum of the squared differences between the desired output and the actual output is the error function of the expert knowledge base, the error function of the expert knowledge base being:
$$e = \frac{1}{2} \sum_i \left( X_i^m - Y_i \right)^2$$
where $Y_i$ is the desired value of the output unit, the m-th layer is the output layer, and $X_i^m$ is the actual output; the BP algorithm uses the steepest descent method of nonlinear programming and modifies the weight coefficients along the negative gradient direction of the error function e.
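The forward pass of claim 2 and the steepest-descent update of claim 3 can be sketched for a network with one hidden layer. This is a sketch under stated assumptions, not the claimed implementation: the sigmoid activation, the half-sum-of-squares error, the treatment of the thresholds as additive biases, the learning rate and all function names are illustrative choices.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, W_h, b_h, W_o, b_o):
    """Each neuron's output is f(weighted input sum + threshold)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_h, b_h)]
    output = [sigmoid(sum(w * hi for w, hi in zip(row, hidden)) + b)
              for row, b in zip(W_o, b_o)]
    return hidden, output

def error(output, target):
    """Half the sum of squared differences between actual and desired output."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

def train_step(x, target, W_h, b_h, W_o, b_o, lr=0.5):
    """One steepest-descent step; returns the error before the update."""
    hidden, output = forward(x, W_h, b_h, W_o, b_o)
    # Gradient of e with respect to each output neuron's input sum,
    # using f'(u) = f(u) * (1 - f(u)) for the sigmoid.
    d_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
    # Back-propagate through the output weights before they are changed.
    d_hid = [h * (1 - h) * sum(d_out[i] * W_o[i][j] for i in range(len(W_o)))
             for j, h in enumerate(hidden)]
    # Move every weight and threshold against its gradient.
    for i, row in enumerate(W_o):
        for j in range(len(row)):
            row[j] -= lr * d_out[i] * hidden[j]
        b_o[i] -= lr * d_out[i]
    for j, row in enumerate(W_h):
        for k in range(len(row)):
            row[k] -= lr * d_hid[j] * x[k]
        b_h[j] -= lr * d_hid[j]
    return error(output, target)
```

Calling `train_step` repeatedly on the training samples drives the error function down; the weight lists are updated in place.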
4. The data cleaning algorithm for Internet transaction information according to claim 2, characterized in that the difference between all the vectors in an Internet transaction information sample is measured by the Mahalanobis distance used in machine learning; for l vectors $X_1 \sim X_l$, the most reasonable vector $X_k$ is taken as the standard output of the BP neural network for sample training; the covariance matrix of the vectors contained in the sample is denoted S, and the Mahalanobis distance between vectors $X_i$ and $X_j$ is:
$$D(X_i, X_j) = \sqrt{(X_i - X_j)^T S^{-1} (X_i - X_j)}$$
Each element of the covariance matrix S is the covariance between vector elements, $\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$, where E is the mathematical expectation of the vectors contained in the sample.
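The Mahalanobis distance of claim 4 can be computed without inverting S explicitly, by solving a linear system instead. The sketch below uses only the standard library; the unbiased (n - 1) covariance estimate and the function names are assumptions made for illustration.

```python
import math

def covariance_matrix(vectors):
    """Sample covariance between vector elements (unbiased, divides by n - 1)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mahalanobis(x_i, x_j, S):
    """sqrt((x_i - x_j)^T S^-1 (x_i - x_j)), computed as diff . solve(S, diff)."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    z = solve(S, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

With S equal to the identity matrix the result reduces to the ordinary Euclidean distance, which is a convenient sanity check.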
5. The data cleaning algorithm for Internet transaction information according to claim 1, characterized in that the problem tuples contain missing values, and/or erroneous values, and/or conflicting values;
a missing value is a data attribute whose value is absent; the detection method for missing values is: for each tuple T(A1, A2, ..., Am) in the Internet transaction information data to be cleaned D(T1, T2, ..., Tn), the attributes A are examined, and if a missing attribute value exists, the tuple is a problem tuple containing a missing value;
an erroneous value is an attribute value of the data that is identified as wrong; the detection method for erroneous values is: for each tuple T(A1, A2, ..., Am) in the Internet transaction information data to be cleaned D(T1, T2, ..., Tn), a dependency check based on conditional functional dependencies is performed, and if the attributes of the data do not satisfy the conditional functional dependencies, the tuple is a problem tuple containing an erroneous value;
a conflicting value is an attribute of the data for which several corresponding values appear; the detection method for conflicting values is: first performing tuple matching on the Internet transaction information data to be cleaned to find potentially conflicting tuples, then clustering the potentially conflicting tuples to obtain the problem tuples containing conflicting values.
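The missing-value and erroneous-value checks of claim 5 can be sketched as follows. A conditional functional dependency is represented here as a hypothetical pair (condition, consequence) of attribute-to-value dictionaries; this representation, the function names and the postcode rule used in the example are assumptions made for illustration.

```python
def has_missing_value(record):
    """A value counts as missing when it is None or blank."""
    return any(v is None or str(v).strip() == "" for v in record.values())

def violates_dependency(record, dependency):
    """True when the condition holds but the required values do not."""
    condition, consequence = dependency
    applies = all(record.get(a) == v for a, v in condition.items())
    return applies and any(record.get(a) != v for a, v in consequence.items())

def classify(record, dependencies):
    """Label a tuple by the first kind of problem that is detected."""
    if has_missing_value(record):
        return "missing value"
    if any(violates_dependency(record, d) for d in dependencies):
        return "erroneous value"
    return "no problem detected"

# Hypothetical dependency: postcode 310023 implies city Hangzhou.
dependencies = [({"postcode": "310023"}, {"city": "Hangzhou"})]
label = classify({"postcode": "310023", "city": "Ningbo"}, dependencies)
```

Conflicting values are not covered by this sketch, since they require comparing several tuples with one another rather than checking one tuple in isolation.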
6. The data cleaning algorithm for Internet transaction information according to claim 5, characterized in that the tuple matching is specifically:
S1: performing pairwise similarity matching on the tuples in the Internet transaction information data to be cleaned; if the similarity of a tuple pair reaches a preset similarity threshold, the tuple pair points to the same entity, and the tuples pointing to the same entity are taken as one group;
S2: creating for the group BloomFilter arrays corresponding to the tuple attributes, checking attribute by attribute whether each tuple in the group is in the corresponding BloomFilter array, and accumulating the weight of each tuple that is found in the same BloomFilter array;
S3: if the weight of a tuple exceeds a preset upper limit, extracting the tuple as a potentially conflicting tuple.
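Steps S2 and S3 can be read in more than one way. The sketch below follows one possible reading, which is an assumption: one Bloom filter is kept per attribute, and a tuple's weight grows by one for every attribute value that the filter reports as already seen within the group. The `BloomFilter` class, its size and hash count, and the function names are illustrative, not taken from the claim.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def tuple_weights(group, attributes):
    """S2: count, per tuple, the attribute values already seen in the group."""
    filters = {a: BloomFilter() for a in attributes}
    weights = [0] * len(group)
    for idx, record in enumerate(group):
        for a in attributes:
            if record[a] in filters[a]:
                weights[idx] += 1
            else:
                filters[a].add(record[a])
    return weights

def potential_conflicts(group, attributes, upper_limit):
    """S3: keep the tuples whose weight exceeds the preset upper limit."""
    weights = tuple_weights(group, attributes)
    return [record for record, w in zip(group, weights) if w > upper_limit]
```

Because a Bloom filter can report false positives, a weight may occasionally be overstated; the later clustering step would have to tolerate that.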
7. The data cleaning algorithm for Internet transaction information according to claim 1, characterized in that the process of querying the expert knowledge base with the key sentences to obtain the expert knowledge base pattern is specifically:
sending the key sentences to the search engine of the expert knowledge base, obtaining and parsing the query results returned by the expert knowledge base, and performing pattern mining with an optimal fuzzy matching method to obtain the expert knowledge base pattern.
8. The data cleaning algorithm for Internet transaction information according to claim 1, characterized in that the infeasible tuples include:
tuples whose number of data attributes is below a preset lower limit, or whose inter-attribute correlation is weaker than a preset lower limit of correlation;
tuples in which an attribute has a quality problem that cannot be repaired by the corresponding expert knowledge base pattern;
tuples in which the data of different attributes are erroneous at the same time and cannot be repaired.
9. The data cleaning algorithm for Internet transaction information according to claim 8, characterized in that the data cleaning by other strategies includes:
if the data set provides constraint conditions, cleaning the data directly using constraint functions;
if multiple descriptions of the same entity exist, performing the cleaning with a truth discovery algorithm, i.e. learning the accuracy of the data sources and assigning the descriptions different weights so that they are not treated as equivalent, thereby achieving truth discovery.
CN201510305440.2A 2015-06-04 2015-06-04 Data cleaning algorithm based on Internet trading information Pending CN105045807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510305440.2A CN105045807A (en) 2015-06-04 2015-06-04 Data cleaning algorithm based on Internet trading information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510305440.2A CN105045807A (en) 2015-06-04 2015-06-04 Data cleaning algorithm based on Internet trading information

Publications (1)

Publication Number Publication Date
CN105045807A true CN105045807A (en) 2015-11-11

Family

ID=54452354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510305440.2A Pending CN105045807A (en) 2015-06-04 2015-06-04 Data cleaning algorithm based on Internet trading information

Country Status (1)

Country Link
CN (1) CN105045807A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055579A (en) * 2016-05-20 2016-10-26 上海交通大学 Vehicle performance data cleaning system based on artificial neural network, and method thereof
CN106227872A (en) * 2016-08-01 2016-12-14 浪潮软件集团有限公司 Data cleaning and verifying method based on e-commerce platform
CN106384219A (en) * 2016-10-13 2017-02-08 北京京东尚科信息技术有限公司 Warehouse partition assisted analysis method and device
CN107977412A (en) * 2017-11-22 2018-05-01 上海大学 It is a kind of based on iterative with interactive perceived age database cleaning method
CN108776697A (en) * 2018-06-06 2018-11-09 南京大学 A kind of multi-source data collection cleaning method based on predicate
CN108876270A (en) * 2018-09-19 2018-11-23 惠龙易通国际物流股份有限公司 Automatic source of goods auditing system and method
CN109784741A (en) * 2019-01-23 2019-05-21 北京理工大学 A kind of mobile gunz sensory perceptual system reward distribution method based on prestige prediction
CN111522807A (en) * 2020-04-28 2020-08-11 电子科技大学 Database error data recovery method
CN112364005A (en) * 2020-11-10 2021-02-12 平安科技(深圳)有限公司 Data synchronization method and device, computer equipment and storage medium
CN113138982A (en) * 2021-05-25 2021-07-20 黄柱挺 Big data cleaning method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986296A (en) * 2010-10-28 2011-03-16 浙江大学 Noise data cleaning method based on semantic ontology
CN102257496A (en) * 2009-12-07 2011-11-23 埃森哲环球服务有限公司 Method and system for accelerated data quality enhancement
CN102708180A (en) * 2012-05-09 2012-10-03 北京华电天仁电力控制技术有限公司 Data mining method in unit operation mode based on real-time historical library

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAN QU: "Support-Vector-Machine-Based Diagnostics and Prognostics for Rotating Systems", doctoral dissertation, University of Alberta, Canada *
N.A. SETIAWAN ET AL.: "Missing Attribute Value Prediction Based on Artificial Neural Network and Rough Set Theory", 2008 INTERNATIONAL CONFERENCE ON BIOMEDICAL ENGINEERING AND INFORMATICS *
李亚坤 (Li Yakun): "Research on Web-Based Data Cleaning Technology", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055579B (en) * 2016-05-20 2020-01-21 上海交通大学 Vehicle performance data cleaning system and method based on artificial neural network
CN106055579A (en) * 2016-05-20 2016-10-26 上海交通大学 Vehicle performance data cleaning system based on artificial neural network, and method thereof
CN106227872A (en) * 2016-08-01 2016-12-14 浪潮软件集团有限公司 Data cleaning and verifying method based on e-commerce platform
CN106384219A (en) * 2016-10-13 2017-02-08 北京京东尚科信息技术有限公司 Warehouse partition assisted analysis method and device
CN106384219B (en) * 2016-10-13 2021-09-07 北京京东振世信息技术有限公司 Storage sub-warehouse auxiliary analysis method and device
CN107977412A (en) * 2017-11-22 2018-05-01 上海大学 It is a kind of based on iterative with interactive perceived age database cleaning method
CN108776697A (en) * 2018-06-06 2018-11-09 南京大学 A kind of multi-source data collection cleaning method based on predicate
CN108776697B (en) * 2018-06-06 2020-06-09 南京大学 Multi-source data set cleaning method based on predicates
CN108876270A (en) * 2018-09-19 2018-11-23 惠龙易通国际物流股份有限公司 Automatic source of goods auditing system and method
CN108876270B (en) * 2018-09-19 2022-08-12 惠龙易通国际物流股份有限公司 Automatic goods source auditing system and method
CN109784741A (en) * 2019-01-23 2019-05-21 北京理工大学 A kind of mobile gunz sensory perceptual system reward distribution method based on prestige prediction
CN111522807A (en) * 2020-04-28 2020-08-11 电子科技大学 Database error data recovery method
CN111522807B (en) * 2020-04-28 2023-05-30 电子科技大学 Database error data repairing method
CN112364005A (en) * 2020-11-10 2021-02-12 平安科技(深圳)有限公司 Data synchronization method and device, computer equipment and storage medium
CN112364005B (en) * 2020-11-10 2024-02-27 平安科技(深圳)有限公司 Data synchronization method, device, computer equipment and storage medium
CN113138982A (en) * 2021-05-25 2021-07-20 黄柱挺 Big data cleaning method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151111
