CN109739939A

CN109739939A - Data fusion method and device for a knowledge graph

Info

Publication number: CN109739939A
Application number: CN201811635696.XA
Authority: CN
Inventors: 刘涛; 朱宏明; 顾江; 姜逸之; 王晓文; 周游
Original assignee: Yingtuo Information Technology (shanghai) Co Ltd
Current assignee: Yingtuo Information Technology (shanghai) Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-05-10
Also published as: WO2020135048A1

Abstract

This application provides a data fusion method and device for a knowledge graph. The system that executes the method includes a data platform configured with a unified access interface. The method includes: processing data from different data sources and converting it to triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform; dividing the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information; performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting the matched entity pairs that satisfy a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matched entity pairs to generate a unified entity representation. By these means, the application can effectively solve the problem that existing data fusion techniques cannot flexibly adapt to the data fusion of different knowledge bases.

Description

Data fusion method and device for a knowledge graph

Technical field

This application relates to the technical field of knowledge graphs, and in particular to a data fusion method and device for a knowledge graph.

Background art

A knowledge graph is a huge semantic network graph that describes the entities or concepts existing in the real world and the relationships among them. Its nodes represent entities or concepts, and its edges are formed by attributes or relationships. Today the term knowledge graph is used to refer to various large-scale knowledge bases. Here, an entity is something distinguishable that exists independently, such as a certain country, a certain company, or a certain person. An attribute is an intrinsic characteristic of an entity: for example, a country has attributes such as "population" and "area" (as shown in Fig. 4), and a company has attributes such as "name" and "legal representative". A relationship is an association between one entity and another: for example, a certain company is registered in a certain country, or a certain person holds a post at a certain company.

The nodes and edges of a knowledge graph are generally defined as triples (S-P-O, Subject-Property-Object), in forms such as (entity 1 - relationship - entity 2) and (entity - attribute - attribute value). A knowledge graph can therefore be expressed as a set of triples; in terms of data model it takes the form of a graph (as shown in Fig. 4), and a graph database is used to store and manage the data.
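
By way of illustration only, the triple representation described above can be sketched in Python as follows. The `Triple` type, the identifiers, and the sample values are hypothetical and are not taken from the application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One S-P-O statement: subject, property (attribute or relationship), object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical sample data: three attribute triples and one relationship triple.
triples = [
    Triple("company:1", "name", "Alibaba"),
    Triple("company:1", "legal_representative", "Person A"),
    Triple("company:1", "registered_in", "country:CN"),  # entity - relationship - entity
    Triple("country:CN", "name", "China"),
]


def attributes_of(entity_id, triples):
    """Collect every property/object pair whose subject is the given entity."""
    return {t.predicate: t.obj for t in triples if t.subject == entity_id}


print(attributes_of("company:1", triples))
```

In this sketch attribute triples and relationship triples share one structure; a relationship is simply a triple whose object is another entity identifier.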

Knowledge in the real world comes from a wide range of sources and varies greatly. Knowledge from different data sources suffers from problems such as inconsistent quality, duplication, and missing knowledge-base hierarchy. In addition, different data sources may represent the same entity differently. For example, a company entity in Baidu Baike has the name attribute 'Alibaba', while a company entity crawled from Google search has the name attribute 'alibaba'. These two entities may point to the same real-world entity, so they need to be merged with each other through their attributes and extended relationships, so as to generate a unique entity node in the knowledge graph, eliminate ambiguity, and produce a high-quality knowledge base.

Existing data fusion schemes generally comprise key steps such as partition indexing, similarity calculation, and entity fusion. In a concrete implementation, however, the partitioning algorithm, similarity matching algorithm, and entity alignment algorithm are selected according to the characteristics of the data sources and the knowledge base, and these are integrated into one complete system. When the range of data sources or knowledge bases changes, the data fusion system has to be rebuilt to meet the new requirements.

Summary of the invention

This application provides a data fusion method and device for a knowledge graph, to solve the problem that existing data fusion techniques cannot flexibly adapt to the data fusion of different knowledge bases.

In a data fusion method for a knowledge graph disclosed in this application, the system that executes the method includes a data platform configured with a unified access interface. The method includes: processing data from different data sources and converting it to triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform; dividing the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information; performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting the matched entity pairs that satisfy a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matched entity pairs to generate a unified entity representation.

Preferably, before the step of dividing the entities stored in the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further includes: aligning the entities that come from multiple data sources and are stored in the data platform after conversion to triple format, according to the actual meaning of their attributes.

Preferably, the sub-partitions are divided either by equal-value partitioning on a globally unique partition key generated from entity attributes, or on the basis of a preset clustering model.

Preferably, performing similarity calculation on the candidate entity pairs that fall into the same sub-partition and selecting the matched entity pairs that satisfy the preset similarity condition specifically comprises: setting different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and computing the overall similarity of each candidate entity pair as a weighted sum; and, if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, taking that candidate entity pair as a matched entity pair.

Preferably, missing entity attribute values are supplemented by obtaining them from the network with a crawler or by filling them in manually.

Preferably, the graph data index information is the storage address of the triple-format graph data in the data platform, together with its metadata.

A data fusion device for a knowledge graph disclosed in this application includes a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, in which: the data platform is configured with a unified access interface; the data preprocessing module is used to process data from different data sources and convert it to triple format, store it on the data platform through the unified access interface, and receive the graph data index information returned by the data platform; the entity partitioning module divides the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module; the entity matching module is used to perform similarity calculation on the candidate entity pairs that the entity partitioning module has placed in the same sub-partition, and to select the matched entity pairs that satisfy a preset similarity condition; and the entity fusion module is used to supplement and/or replace the entity attribute values of the matched entity pairs selected by the entity matching module, generating a unified entity representation.

Preferably, the entity partitioning module includes an equal-value partitioning submodule and/or a clustering partitioning submodule. The equal-value partitioning submodule performs equal-value partitioning of the entities stored in the data platform by a globally unique partition key generated from entity attributes. The clustering partitioning submodule divides the entities stored in the data platform on the basis of a preset clustering model.

Preferably, the entity matching module specifically includes a similarity calculation submodule and a comparison submodule. The similarity calculation submodule sets different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and computes the overall similarity of each candidate entity pair as a weighted sum. The comparison submodule judges whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, takes that candidate entity pair as a matched entity pair.

Preferably, the device further includes a data processing module and/or an attribute alignment module. The data processing module is used to process the node entity data and edge entity data in the data platform through the unified access interface, and to return the data processing result and pass it to the next module. The attribute alignment module is used to align the entities that come from multiple data sources and are stored in the data platform after processing by the data preprocessing module, according to the actual meaning of their attributes.

This application also discloses a storage medium on which a program for executing the above method is recorded.

Compared with prior art, the application has the following advantages:

In the preferred embodiments of this application, the stages on the pipeline have upstream and downstream dependencies, but different stages are constrained only by the data format. They are decoupled from one another through the unified interface provided by the data platform and can be developed independently. The algorithm of each stage can be replaced flexibly, and by implementing custom stages, new process stages can be inserted between existing stages to accommodate customized requirements. In addition, the application places no restriction on the architecture of the data platform; for example, the Hadoop distributed file system or a cloud computing architecture can be used, which makes it easier to scale computing and storage resources as the data volume grows.

Description of the drawings

The drawings serve only to illustrate preferred embodiments and are not to be considered a limitation on this application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 is a flow diagram of the first embodiment of the knowledge graph data fusion method of this application;

Fig. 2 is a flow diagram of the second embodiment of the knowledge graph data fusion method of this application;

Fig. 3 is a structural diagram of one embodiment of the knowledge graph data fusion device of this application;

Fig. 4 is a schematic diagram of the graph data model of a knowledge graph.

Detailed description of the embodiments

To make the above objects, features, and advantages of this application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.

In the description of this application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. "A plurality of" means two or more, unless specifically defined otherwise. The terms "include", "comprise", and similar terms are to be understood as open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "an embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment". Definitions of other terms are given in the description below.

Referring to Fig. 1, the flow of the first embodiment of the knowledge graph data fusion method of this application is shown. The system that executes this method embodiment is provided with a data platform that supplies the running environment and computing resources for each stage, and the stages interact through the unified access interface of the data platform. In a specific implementation, the data platform can be built on the Hadoop distributed file system, a cloud computing architecture (such as Amazon AWS EMR), or another architecture; the application places no limitation on this. The method embodiment includes the following stages:

1. Data preprocessing stage (InputStage): data from multiple homogeneous or heterogeneous data sources (such as structured data A and unstructured data B) is processed into the same entity-and-attribute format (SPO format) as the input of the subsequent stages.

By configuring different data source information and data models, data is extracted from the data sources, cleaned, transformed, and then stored on the data platform in a unified data format. For a relational database data source, for example, the required SPO data can be extracted by configuring the connection information, the entity types and entity tables, and the relationship types and relationship tables. For a graph database, nodes (entity - attribute - attribute value) and edges (entity - relationship - entity) are naturally SPO structures.

Some of the configuration parameters of the data preprocessing stage are shown in the table below.

In a specific implementation, different data sources can be preprocessed by means of a custom input stage (CustomInputStage), whose interface takes the following form:

The stage reads the configuration defined in the table above, reads and parses the remote data, and stores it. For an unstructured data source, for example, it can call machine learning interfaces, network interfaces, and so on to perform knowledge extraction, save the result as triple information, and return the address and metadata of the saved data.
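
A minimal sketch of what such a custom input stage might look like is given below, assuming a relational-style source whose rows have already been read into dictionaries. The `InMemoryPlatform` class, the `custom_input_stage` function, and the configuration keys `entity_type`, `id_column`, and `output_name` are hypothetical names chosen for illustration; they are not the interface or parameters defined by the application.

```python
class InMemoryPlatform:
    """Hypothetical stand-in for the data platform: stores triples, returns an address."""

    def __init__(self):
        self.storage = {}

    def write(self, name, triples):
        address = f"mem://{name}"
        self.storage[address] = list(triples)
        return address


def custom_input_stage(platform, config, rows):
    """Convert the rows of one entity table to (S, P, O) triples and store them."""
    entity_type = config["entity_type"]
    id_column = config["id_column"]
    triples = []
    for row in rows:
        subject = f"{entity_type}:{row[id_column]}"
        for column, value in row.items():
            # Every non-empty column other than the id becomes one attribute triple.
            if column != id_column and value is not None:
                triples.append((subject, column, value))
    address = platform.write(config["output_name"], triples)
    # The returned index information: where the data is, plus metadata about it.
    return {
        "address": address,
        "metadata": {"entity_type": entity_type, "triple_count": len(triples)},
    }


platform = InMemoryPlatform()
rows = [
    {"id": 1, "name": "Alibaba", "country": "CN"},
    {"id": 2, "name": "Example Corp", "country": None},
]
index_info = custom_input_stage(
    platform,
    {"entity_type": "company", "id_column": "id", "output_name": "source_a"},
    rows,
)
print(index_info)
```

The returned dictionary plays the role of the graph data index information: later stages would need only the address and metadata, not the source-specific extraction logic.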

2. Entity partitioning stage (BlockingStage): entities from multiple data sources are divided into different sub-partitions (blocks) according to their attributes, to reduce the number of candidate matching pairs.

For data sources S and T that need to be matched, suppose data source S contains m entities and data source T contains n entities; the number of pairs that must be checked is then m*n. In a big data scenario, matching at this scale is essentially infeasible, so the number of pairs to be matched has to be reduced.

In a specific implementation, entities from the two data sources that cannot possibly match are divided into different data partitions in advance. This greatly reduces the data scale within each data partition, and multiple data partitions can be computed in parallel.

For example, for company entities in S and T that need to be matched, entities registered in different countries are in general unlikely to be the same company in the real world, so they can be divided into more than 220 data partitions (one per country) by the company's country attribute. Each partition can then be further divided into sub-partitions by the same or similar attributes; for example, companies in the 'United States' partition can be assigned to new partitions by identical 'state' attribute. The number of pairs that finally need to be matched equals the sum over all data partitions, and in subsequent computation all data partitions can be processed in parallel, which substantially reduces the overall matching time.
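
The reduction in candidate pairs can be illustrated with a small Python sketch. The function names and the sample entities are hypothetical; the sketch only counts pairs and does not perform any matching.

```python
from collections import defaultdict


def block_by(entities, attribute):
    """Group entities by the value of one attribute."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[entity.get(attribute)].append(entity)
    return blocks


def candidate_pair_count(source_s, source_t, attribute):
    """Count S-T pairs that share a block, i.e. the sum of pairs over all partitions."""
    s_blocks = block_by(source_s, attribute)
    t_blocks = block_by(source_t, attribute)
    return sum(len(s_blocks[key]) * len(t_blocks[key]) for key in s_blocks if key in t_blocks)


# Hypothetical sample sources: m = 4 entities in S, n = 3 entities in T.
S = [{"country": "US"}, {"country": "US"}, {"country": "CN"}, {"country": "DE"}]
T = [{"country": "US"}, {"country": "CN"}, {"country": "CN"}]

print("without partitioning:", len(S) * len(T))               # m * n = 12
print("with partitioning:", candidate_pair_count(S, T, "country"))  # 2*1 + 1*2 = 4
```

Because each block is independent, the per-block comparisons could also be dispatched to separate workers, which is the parallelism the text refers to.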

Some of the configuration parameters of the entity partitioning stage are shown in the table below.

In addition, the partitioning methods of the entity partitioning stage (BlockingStage) can be extended with custom partitioning algorithms, for example through an interface of the following form:

A globally unique partition key (block key) can be generated from the partition in which the current entity resides and the attribute used for the next level of partitioning, so that the data is divided into the next partition. When the number of possible matching entity pairs in a partition reaches the minimum, or the total number of partitions reaches the maximum, that partition is not divided further.
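
One way to sketch this key generation in Python is shown below. The key format, the `min_pairs` parameter, and the function names are assumptions made for illustration. Only the minimum-pair stopping condition is modeled, with pairs counted inside a single merged partition; the limit on the total number of partitions is omitted for brevity.

```python
def block_key(parent_key, entity, attribute):
    """Build a globally unique key from the parent partition and the next attribute value."""
    value = str(entity.get(attribute, "")).strip().lower()
    return f"{parent_key}/{attribute}={value}"


def partition(entities, attributes, parent_key="root", min_pairs=1):
    """Recursively split entities, using one attribute per level."""
    n = len(entities)
    possible_pairs = n * (n - 1) // 2
    # Stop when no attributes remain or the partition is already small enough.
    if not attributes or possible_pairs <= min_pairs:
        return {parent_key: entities}
    groups = {}
    for entity in entities:
        groups.setdefault(block_key(parent_key, entity, attributes[0]), []).append(entity)
    result = {}
    for key, members in groups.items():
        result.update(partition(members, attributes[1:], key, min_pairs))
    return result


# Hypothetical sample companies.
companies = [
    {"name": "A", "country": "US", "state": "CA"},
    {"name": "B", "country": "US", "state": "CA"},
    {"name": "C", "country": "US", "state": "NY"},
    {"name": "D", "country": "CN", "state": "ZJ"},
]
blocks = partition(companies, ["country", "state"])
for key, members in sorted(blocks.items()):
    print(key, [m["name"] for m in members])
```

Because each key embeds the key of its parent partition, two entities receive the same key only if they agree on every attribute used along the path, which keeps keys unique across the whole hierarchy.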

A clustering-based partitioning algorithm can be implemented with a trained clustering model; the interface takes the following form:

The clustering model directly predicts the class of the current entity and assigns it to that class; the number of partitions then equals the number of classes in the clustering model. Partitions can of course be divided further on the basis of the clusters.
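
As a hedged illustration, the sketch below stands in for a trained clustering model with a fixed list of centroids and assigns each entity to the nearest one. The centroids, the feature function, and the sample entities are hypothetical; the application does not specify which clustering model is used.

```python
import math


def predict_cluster(centroids, vector):
    """Return the index of the centroid closest to the vector (Euclidean distance)."""
    distances = [math.dist(centroid, vector) for centroid in centroids]
    return distances.index(min(distances))


def cluster_blocks(centroids, entities, featurize):
    """Assign every entity to one partition per cluster of the model."""
    blocks = {index: [] for index in range(len(centroids))}
    for entity in entities:
        blocks[predict_cluster(centroids, featurize(entity))].append(entity)
    return blocks


# Hypothetical "trained" centroids and entities with two numeric features.
centroids = [(0.0, 0.0), (10.0, 10.0)]
entities = [
    {"id": "a", "x": 1, "y": 1},
    {"id": "b", "x": 9, "y": 11},
    {"id": "c", "x": 0, "y": 2},
]
blocks = cluster_blocks(centroids, entities, lambda e: (e["x"], e["y"]))
print({index: [e["id"] for e in members] for index, members in blocks.items()})
```

The number of partitions equals the number of centroids, matching the statement that the partition count equals the number of classes in the clustering model.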

3. Entity matching stage (MatchStage): for candidate entity pairs in the same partition, different weights can be set for the attributes of the entity itself and for the attributes of its related entities, and the overall similarity of the candidate entity pair is computed as a weighted sum. Candidate entity pairs whose similarity exceeds a certain threshold are selected as matched entity pairs.

Note that the process design allows rules based on strong association to be inserted so that a match can be decided directly. For example, for company data in two data sources, if both companies are listed and their stock codes are identical, they can be matched directly, skipping the similarity calculation and reducing the computational complexity of the matching stage.
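
The weighted-sum matching and the strong-association shortcut can be sketched as follows. The weights, the threshold, the attribute names (including `parent_name` as an attribute of a related entity), and the use of Python's `difflib.SequenceMatcher` as the per-attribute similarity are all assumptions for illustration; the application does not prescribe a particular similarity function.

```python
from difflib import SequenceMatcher

# Hypothetical weights: two attributes of the entity itself, one of a related entity.
WEIGHTS = {"name": 0.4, "stock_code": 0.4, "parent_name": 0.2}
THRESHOLD = 0.8


def text_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def strong_rule_match(a, b):
    """Strong-association rule: both records carry the same non-empty stock code."""
    return bool(a.get("stock_code")) and a.get("stock_code") == b.get("stock_code")


def overall_similarity(a, b, weights=WEIGHTS):
    """Weighted sum of the per-attribute similarities."""
    return sum(w * text_similarity(a.get(k), b.get(k)) for k, w in weights.items())


def is_match(a, b, threshold=THRESHOLD):
    if strong_rule_match(a, b):
        return True  # the similarity calculation is skipped entirely
    return overall_similarity(a, b) > threshold


listed_a = {"name": "Alibaba", "stock_code": "BABA"}
listed_b = {"name": "alibaba", "stock_code": "BABA"}
unlisted_a = {"name": "Alibaba", "parent_name": "Alibaba Holding"}
unlisted_b = {"name": "alibaba", "parent_name": "alibaba holding"}

print(is_match(listed_a, listed_b))      # decided by the strong rule
print(overall_similarity(unlisted_a, unlisted_b), is_match(unlisted_a, unlisted_b))
```

In the second pair the missing stock code contributes nothing to the weighted sum, so the pair stays below the threshold even though the names agree. This is the kind of case the weight and threshold tuning described in the text is meant to address.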

When a validation data set is provided, the results generated by the matching algorithm can be compared against it to verify the accuracy of the matching algorithm. By adjusting the attributes, weight parameters, and similarity threshold and comparing the results of multiple runs, accuracy can be improved continuously. For example, suppose two company entities are compared by a weighted similarity over name and stock code. If the names are expressed in different languages in the different data sources, the name similarity will be low, so the weight of the name should be turned down and the relative weight of the stock code similarity should be set higher.
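
A minimal sketch of comparing several thresholds against a validation set follows. Precision and recall are used here as the accuracy measures, which is an assumption; the scores and the validation pairs are hypothetical sample data.

```python
def evaluate(scored_pairs, true_matches, threshold):
    """Return (precision, recall) of the pairs whose score exceeds the threshold."""
    predicted = {pair for pair, score in scored_pairs.items() if score > threshold}
    true_positives = len(predicted & true_matches)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    return precision, recall


# Hypothetical overall similarities produced by the matching stage.
scored_pairs = {
    ("s1", "t1"): 0.95,
    ("s2", "t2"): 0.70,
    ("s3", "t9"): 0.75,
    ("s4", "t4"): 0.40,
}
# Hypothetical validation set: the pairs known to be the same entity.
true_matches = {("s1", "t1"), ("s2", "t2")}

for threshold in (0.9, 0.72, 0.6):
    precision, recall = evaluate(scored_pairs, true_matches, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Running the loop makes the trade-off visible: a high threshold misses the true pair scored 0.70, while a low threshold admits the false pair scored 0.75. In that situation, adjusting the attribute weights rather than the threshold alone is the remaining lever.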

The entity matching algorithm of this application can be iterated multiple times with adjusted parameters to improve the accuracy of the matching results.

Some of the configuration parameters of the entity matching stage (MatchStage) are shown in the table below.

A custom entity matching algorithm can compare whether two entities point to the same piece of knowledge. The interface takes the following form:

In the preceding example, a binary classification machine learning model trained in advance takes the vector of per-attribute similarities of the two entities as input and infers the probability that they can be classified as the same entity (1 if they are).

Finally, the matched entity pairs are output to the result set.

4. Entity fusion stage (MergeStage): for data in different data sources that actually points to the same entity, the entity attribute values are supplemented, replaced, and standardized according to a fusion algorithm, finally generating a unified entity representation.

A custom fusion algorithm is generally required; the interface takes the following form:

Data fusion can be implemented in combination with different business rules: for example, a name can be given multiple aliases, and mailboxes, addresses, and the like can use standardized formats. Missing attribute data can be filled in by a crawler or manually, building high-quality data that facilitates applications such as search and analysis on the knowledge graph.
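
The business rules mentioned above might be sketched as follows. The rule set (prefer the primary record, supplement missing values from the secondary record, keep differing names as aliases, normalize the email format) and all attribute names are hypothetical choices for illustration, not the fusion algorithm of the application.

```python
def merge_entities(primary, secondary, alias_attributes=("name",)):
    """Merge two records that refer to the same entity into one representation."""
    merged = dict(primary)
    aliases = set()
    for key, value in secondary.items():
        if value in (None, ""):
            continue
        if merged.get(key) in (None, ""):
            merged[key] = value        # supplement a missing attribute value
        elif key in alias_attributes and value != merged[key]:
            aliases.add(value)         # keep the alternative name as an alias
    if merged.get("email"):
        merged["email"] = merged["email"].strip().lower()  # standardized format
    if aliases:
        merged["aliases"] = sorted(aliases)
    return merged


# Hypothetical matched pair from two data sources.
record_a = {"name": "Alibaba", "email": " Info@Example.COM ", "address": None}
record_b = {"name": "alibaba", "address": "Hangzhou", "stock_code": "BABA"}

unified = merge_entities(record_a, record_b)
print(unified)
```

Attributes that remain empty after merging would be candidates for the crawler-based or manual filling mentioned in the text.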

In further embodiments, in addition to the stages defined above, stages with other functions (such as a data processing stage) can be arranged into the pipeline. An interface of the following form can be used:

The data to be processed is passed in through the input configuration parameter; after processing completes, the result is written to output and passed to the next stage, thereby extending the functions of the system.

By the above means, this application implements a general pipeline (Pipeline) for entity fusion in big data scenarios. The pipeline is composed of multiple stages (Stage). Each stage can be flexibly extended through configuration, and custom stages (CustomStage) can be programmed into the pipeline to adapt to different application scenarios. Except for the data preprocessing stage (InputStage), which has only an output configuration, every stage has an input configuration. The input configuration specifies the entity lists, relationship lists, data addresses, and related data meta-information (the schema, including table names, column names, and so on) from the different data sources that the stage needs in order to run. Once the stage has read its input data and run its algorithm, it writes the results to the data platform and exports all data addresses and metadata through output. The stages can therefore be run in series by chaining input and output, or run individually by specifying the input parameters.
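
The chaining of stages through input and output can be sketched as below. The class names mirror the stage names used in the text, but the method signature, the dictionary used as a stand-in data platform, and the `UppercaseNameStage` custom stage are hypothetical.

```python
class Stage:
    """Base class: read input from the platform, write results, return output info."""

    def run(self, platform, input_config):
        raise NotImplementedError


class InputStage(Stage):
    """Has no input configuration; only produces output."""

    def __init__(self, records):
        self.records = records

    def run(self, platform, input_config=None):
        platform["raw"] = list(self.records)
        return {"address": "raw", "metadata": {"count": len(self.records)}}


class UppercaseNameStage(Stage):
    """Hypothetical custom stage inserted into the pipeline."""

    def run(self, platform, input_config):
        records = platform[input_config["address"]]
        platform["normalized"] = [{**r, "name": r["name"].upper()} for r in records]
        return {"address": "normalized", "metadata": {"count": len(records)}}


def run_pipeline(stages, platform):
    """Run the stages in series: each stage's output is the next stage's input."""
    output = None
    for stage in stages:
        output = stage.run(platform, output)
    return output


platform = {}  # a plain dict stands in for the data platform
result = run_pipeline(
    [InputStage([{"name": "Alibaba"}]), UppercaseNameStage()],
    platform,
)
print(result, platform["normalized"])
```

Because stages exchange only addresses and metadata, a stage can also be run on its own by passing it an input configuration that points at data already on the platform.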

Referring to Fig. 2, the flow of the second embodiment of the knowledge graph data fusion method of this application is shown. It differs from the first method embodiment in that an attribute alignment stage (Attribute Matching) is added between the data preprocessing stage and the entity partitioning stage. This stage aligns the entities that come from multiple data sources and are stored in the data platform after preprocessing according to the actual meaning of their attributes, for example aligning the "address" field of data source A with the "Address" field of data source B. In the subsequent partitioning and matching stages, aligned fields are treated as fields with the same meaning.

In a specific implementation, the actual meaning of entity attributes can be set manually, or it can be provided by configuring an attribute meaning lookup table in the system; the application places no limitation on this.
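
An attribute meaning lookup table of this kind can be sketched as a simple mapping from source field names to canonical attribute names. The table contents and the function name are hypothetical.

```python
# Hypothetical lookup table: source field name -> canonical attribute name.
ATTRIBUTE_MAP = {
    "address": "address",
    "Address": "address",
    "Addr": "address",
    "company_name": "name",
}


def align_attributes(entity, mapping=ATTRIBUTE_MAP):
    """Rename known fields to their canonical names; unknown fields are kept as-is."""
    return {mapping.get(field, field): value for field, value in entity.items()}


entity_from_a = {"company_name": "Alibaba", "address": "Hangzhou"}
entity_from_b = {"name": "alibaba", "Address": "Hangzhou"}

print(align_attributes(entity_from_a))
print(align_attributes(entity_from_b))
```

After alignment both records expose the same `name` and `address` keys, so the partitioning and matching stages can compare them without knowing the source-specific field names.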

This application also discloses a storage medium on which a program for executing the above method is recorded. The storage medium includes any mechanism configured to store or transmit information in a form readable by a machine (for example, a computer). For example, storage media include read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash storage media, and electrical, optical, acoustic, or other forms of propagated signals (for example, carrier waves, infrared signals, digital signals).

Referring to Fig. 3, a structural block diagram of one embodiment of the knowledge graph data fusion device of this application is shown, including a data platform 10, a data preprocessing module 11, an entity partitioning module 12, an entity matching module 13, and an entity fusion module 14, in which:

The data platform 10 is configured with a unified access interface and provides computing and storage services for the other modules. The application places no restriction on the architecture of the data platform; to make it easier to scale computing and storage resources as the data volume grows, the Hadoop distributed file system or a cloud computing architecture can be used.

The data preprocessing module 11 is used to process data from different data sources and convert it to triple (S-P-O) format, store it on the data platform 10 through the unified access interface, and receive the graph data index information returned by the data platform 10. The graph data index information can be the storage address of the triple-format graph data in the data platform 10, together with its metadata.

The entity partitioning module 12 is used to divide the entities stored in the data platform 10 into one or more sub-partitions by attribute, through the unified access interface and according to the graph data index information. In a specific implementation, the entity partitioning module 12 may include an equal-value partitioning submodule that performs equal-value partitioning of the entities stored in the data platform by a globally unique partition key generated from entity attributes, a clustering partitioning submodule that divides the entities stored in the data platform on the basis of a preset clustering model, and/or submodules for other partitioning methods.

The entity matching module 13 is used to perform similarity calculation on the candidate entity pairs that fall into the same sub-partition, and to select the matched entity pairs that satisfy a preset similarity condition.

The entity fusion module 14 is used to supplement and/or replace the entity attribute values of the matched entity pairs, generating a unified entity representation.

The functional modules of the device embodiment have upstream and downstream dependencies on the pipeline, but different modules are constrained only by the data format. They are decoupled from one another through the unified interface provided by the data platform and can be developed independently. The algorithm of each module can be replaced flexibly, and by implementing custom stages, new modules can be inserted between existing modules to accommodate customized requirements. For example, to improve adaptability to various data sources and the accuracy of the subsequent entity partitioning, matching, and fusion, an attribute alignment module 15 can be inserted between the data preprocessing module 11 and the entity partitioning module 12. It aligns the entities that come from different data sources and are stored in the data platform 10 after processing by the data preprocessing module 11 according to the actual meaning of their attributes, for example aligning the "address" field of data source A with the "Address" field of data source B. In the subsequent partitioning and matching stages, aligned fields are treated as fields with the same meaning.

In a further preferred embodiment, the entity matching module 13 may specifically include a similarity calculation submodule and a comparison submodule. The similarity calculation submodule sets different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and computes the overall similarity of each candidate entity pair as a weighted sum. The comparison submodule judges whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, takes that candidate entity pair as a matched entity pair.

In another preferred embodiment, the device may further include a data processing module, used to process the node entity data and edge entity data in the data platform through the unified access interface and to return the data processing result and pass it to the next module.

The above data processing module can be implemented in the following form:

Here, the data to be processed is passed in through the input configuration parameter; after data processing completes, the result is written to output and passed to the functional module of the next stage, thereby extending the functions of the device.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Because the device embodiments of this application are substantially similar to the method embodiments, they are described relatively briefly; for related details, see the description of the method embodiments. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

Specific examples are used herein to explain the principles and implementation of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. At the same time, for a person of ordinary skill in the art, there will be changes in the specific implementation and scope of application in accordance with the idea of this application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims

1. A data fusion method for a knowledge graph, characterized in that the system that executes the method includes a data platform configured with a unified access interface, the method comprising:

processing data from different data sources and converting it to triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform;

dividing the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information;

performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting the matched entity pairs that satisfy a preset similarity condition;

supplementing and/or replacing the entity attribute values of the matched entity pairs to generate a unified entity representation.

2. The method according to claim 1, characterized in that, before the step of dividing the entities stored in the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further comprises: aligning the entities that come from multiple data sources and are stored in the data platform after conversion to triple format, according to the actual meaning of their attributes.

3. The method according to claim 1, characterized in that the sub-partitions are divided either by equal-value partitioning on a globally unique partition key generated from entity attributes, or on the basis of a preset clustering model.

4. The method according to claim 1, characterized in that performing similarity calculation on the candidate entity pairs that fall into the same sub-partition and selecting the matched entity pairs that satisfy the preset similarity condition specifically comprises:

setting different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and computing the overall similarity of each candidate entity pair as a weighted sum;

if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, taking that candidate entity pair as a matched entity pair.

5. The method according to claim 1, characterized in that missing entity attribute values are supplemented by obtaining them from the network with a crawler or by filling them in manually.

6. The method according to claim 1, characterized in that the graph data index information is the storage address of the triple-format graph data in the data platform, together with its metadata.

7. A data fusion device for a knowledge graph, characterized by including a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, in which:

the data platform is configured with a unified access interface;

the data preprocessing module is used to process data from different data sources and convert it to triple format, store it on the data platform through the unified access interface, and receive the graph data index information returned by the data platform;

the entity partitioning module divides the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module;

the entity matching module is used to perform similarity calculation on the candidate entity pairs that the entity partitioning module has placed in the same sub-partition, and to select the matched entity pairs that satisfy a preset similarity condition;

the entity fusion module is used to supplement and/or replace the entity attribute values of the matched entity pairs selected by the entity matching module, generating a unified entity representation.

8. The device according to claim 7, characterized in that the entity partitioning module includes an equal-value partitioning submodule and/or a clustering partitioning submodule;

the equal-value partitioning submodule is used to perform equal-value partitioning of the entities stored in the data platform by a globally unique partition key generated from entity attributes;

the clustering partitioning submodule divides the entities stored in the data platform on the basis of a preset clustering model;

the entity matching module specifically includes a similarity calculation submodule and a comparison submodule;

the similarity calculation submodule is used to set different weights for the attributes of the entity itself and for the attributes of other entities related to that entity, and to compute the overall similarity of each candidate entity pair as a weighted sum;

the comparison submodule is used to judge whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to take that candidate entity pair as a matched entity pair.

9. The device according to claim 7, characterized in that the device further includes a data processing module and/or an attribute alignment module;

the data processing module is used to process the node entity data and edge entity data in the data platform through the unified access interface, and to return the data processing result and pass it to the next module;

the attribute alignment module is used to align the entities that come from multiple data sources and are stored in the data platform after processing by the data preprocessing module, according to the actual meaning of their attributes.

10. A storage medium, characterized in that the storage medium stores a program for executing the method according to any one of claims 1 to 6.