CN109739939A - Data fusion method and device for a knowledge graph - Google Patents
- Publication number
- CN109739939A (application number CN201811635696.XA)
- Authority
- CN
- China
- Prior art keywords
- entity
- data
- module
- attribute
- data platform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application provides a data fusion method and device for a knowledge graph. The system that executes the method includes a data platform configured with a unified access interface. The method comprises: processing data from different data sources, converting it to triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform; partitioning the entities stored on the data platform into one or more sub-partitions by attribute, according to the graph data index information; performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting the matching entity pairs that satisfy a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation. By these means, the application addresses the problem that existing data fusion technology cannot flexibly adapt to fusing data from different knowledge bases.
Description
Technical field
This application relates to the technical field of knowledge graphs, and in particular to a data fusion method and device for a knowledge graph.
Background art
A knowledge graph is a large semantic network that describes the entities and concepts that exist in the real world and the relationships among them. Nodes represent entities or concepts, and edges represent attributes or relationships. The term is now also used to refer to various large-scale knowledge bases. Here, an entity is something distinguishable that exists independently, such as a particular country, company, or person. An attribute is an intrinsic characteristic of an entity: a country has attributes such as "population" and "area" (as shown in Fig. 4), and a company has attributes such as "name" and "legal representative". A relationship is an association between one entity and another, for example a company being registered in a country, or a person holding a position at a company.
The nodes and edges of a knowledge graph are generally defined as triples (S-P-O, Subject-Property-Object), in forms such as (entity 1 - relationship - entity 2) and (entity - attribute - attribute value). A knowledge graph can therefore be expressed as a set of triples and rendered as a graph (as shown in Fig. 4); at the data-model level, a graph database is used to store and manage the data.
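As a minimal illustration of the triple representation described above, the following Python sketch models a knowledge graph as a set of S-P-O triples. The `Triple` type, the identifiers such as `company:1`, and the sample values are assumptions made for illustration; they do not come from the application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One S-P-O statement: subject, property (attribute or relationship), object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical sample data, for illustration only.
graph = {
    Triple("company:1", "name", "Alibaba"),             # entity - attribute - value
    Triple("company:1", "registered_in", "country:CN"), # entity - relationship - entity
    Triple("country:CN", "name", "China"),
}


def attributes_of(graph, subject):
    """Collect every predicate/object pair recorded for one subject."""
    return {t.predicate: t.obj for t in graph if t.subject == subject}
```

Calling `attributes_of(graph, "company:1")` gathers the edges that leave one node, which is the view of an entity that the later partitioning and matching stages work with.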
Knowledge in the real world comes from a wide range of sources and varies greatly in quality. Knowledge from different data sources suffers from problems such as duplicated knowledge and missing hierarchical structure in the knowledge base. In addition, different data sources may represent the same entity differently. For example, a company entity in Baidu Baike has the name attribute 'Alibaba', while a company entity crawled from Google search has the name attribute 'alibaba'. The two may refer to the same real-world entity, so they need to be merged through their attributes and extended relationships, producing a unique entity node in the knowledge graph, resolving ambiguity, and yielding a high-quality knowledge base.
Existing data fusion schemes generally comprise key steps such as partition indexing, similarity calculation, and entity fusion. In a concrete implementation, however, the partitioning algorithm, similarity matching algorithm, and entity alignment algorithm are selected according to the characteristics of the data sources and the knowledge base, and are integrated into one complete system. When the range of data sources or knowledge bases changes, the data fusion system has to be rebuilt to meet the new requirements.
Summary of the invention
The application provides a data fusion method and device for a knowledge graph, to solve the problem that existing data fusion technology cannot flexibly adapt to fusing data from different knowledge bases.
The application discloses a data fusion method for a knowledge graph. The system executing the method includes a data platform configured with a unified access interface. The method comprises: processing data from different data sources, converting it to triple format, storing it on the data platform through the unified access interface, and receiving the graph data index information returned by the data platform; partitioning the entities stored on the data platform into one or more sub-partitions by attribute, according to the graph data index information; performing similarity calculation on candidate entity pairs that fall into the same sub-partition, and selecting the matching entity pairs that satisfy a preset similarity condition; and supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation.
Preferably, before the step of partitioning the entities stored on the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further includes: aligning the entities from multiple data sources, which have been converted to triple format and stored on the data platform, according to the actual meaning of their attributes.
Preferably, the sub-partitions are divided either by equality on a globally unique partition key generated from entity attributes, or by a preset clustering model.
Preferably, performing similarity calculation on candidate entity pairs in the same sub-partition and selecting the matching entity pairs that satisfy the preset similarity condition specifically comprises: assigning different weights to the attributes of the entity itself and to the attributes of other entities related to it, and computing the overall similarity of a candidate entity pair as a weighted sum; if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, that candidate pair is taken as a matching entity pair.
Preferably, missing entity attribute values are supplemented by retrieving them from the network with a crawler or by filling them in manually.
Preferably, the graph data index information is the storage address, on the data platform, of the graph data in triple format, together with its metadata.
The application also discloses a data fusion device for a knowledge graph, including a data platform, a data preprocessing module, an entity partitioning module, an entity matching module, and an entity fusion module, wherein: the data platform is configured with a unified access interface; the data preprocessing module processes data from different data sources, converts it to triple format, stores it on the data platform through the unified access interface, and receives the graph data index information returned by the data platform; the entity partitioning module partitions the entities stored on the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module; the entity matching module performs similarity calculation on the candidate entity pairs that the entity partitioning module placed in the same sub-partition, and selects the matching entity pairs that satisfy a preset similarity condition; and the entity fusion module supplements and/or replaces the entity attribute values of the matching entity pairs selected by the entity matching module, generating a unified entity representation.
Preferably, the entity partitioning module includes an equality partitioning submodule and/or a clustering partitioning submodule. The equality partitioning submodule divides the entities stored on the data platform by equality on a globally unique partition key generated from entity attributes. The clustering partitioning submodule divides the entities stored on the data platform based on a preset clustering model.
Preferably, the entity matching module specifically includes a similarity calculation submodule and a comparison submodule. The similarity calculation submodule assigns different weights to the attributes of the entity itself and to the attributes of other entities related to it, and computes the overall similarity of a candidate entity pair as a weighted sum. The comparison submodule judges whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, takes that candidate pair as a matching entity pair.
Preferably, the device further includes a data processing module and/or an attribute alignment module. The data processing module processes node entity data and edge entity data on the data platform through the unified access interface, and returns the processing result to be passed to the next module. The attribute alignment module aligns the entities from multiple data sources, stored on the data platform after processing by the data preprocessing module, according to the actual meaning of their attributes.
The application also discloses a storage medium on which a program for executing the above method is recorded.
Compared with the prior art, the application has the following advantages:
In the preferred embodiments of the application, the stages have upstream and downstream dependencies on the pipeline, but different stages are constrained only by data format and are decoupled from one another through the unified interface provided by the data platform, so they can be developed independently. The algorithm of each stage can be replaced flexibly, and by implementing custom stages, new stages can be inserted between existing ones to accommodate custom requirements. In addition, the application places no limitation on the architecture of the data platform; for example, the Hadoop Distributed File System or a cloud computing framework can be used, which makes it easy to scale computing and storage resources as data volume grows.
Brief description of the drawings
The drawings serve only to illustrate preferred embodiments and are not to be considered limiting of the application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flow diagram of a first embodiment of the knowledge graph data fusion method of the application;
Fig. 2 is a flow diagram of a second embodiment of the knowledge graph data fusion method of the application;
Fig. 3 is a structural diagram of an embodiment of the knowledge graph data fusion device of the application;
Fig. 4 is a schematic diagram of the graph data model of a knowledge graph.
Detailed description of embodiments
To make the above objects, features, and advantages of the application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.
In the description of the application, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. "A plurality" means two or more, unless specifically defined otherwise. The terms "include", "comprise", and similar terms are open-ended, that is, "including but not limited to". "Based on" means "based at least in part on". "An embodiment" means "at least one embodiment"; "another embodiment" means "at least one other embodiment". Definitions of other terms are given in the description below.
Fig. 1 shows the flow of the first embodiment of the knowledge graph data fusion method of the application. The system executing this method embodiment is provided with a data platform that supplies the runtime environment and computing resources for each stage, and the stages interact through the platform's unified access interface. In a concrete implementation, the data platform can be built on the Hadoop Distributed File System, a cloud computing framework (such as Amazon AWS EMR), or another framework; the application places no limitation on this. The method embodiment includes the following stages:
1. Data preprocessing stage (InputStage): data from multiple homogeneous or heterogeneous data sources (such as structured data A and unstructured data B) is processed into a common format of entities and their attributes (SPO format), to serve as input for the subsequent stages.
By configuring the information and data model of each data source, data is extracted from the source, cleaned, and transformed, then stored on the data platform in a unified data format. For a relational database source, for example, configuring the connection information, the entity types and entity tables, and the relationship types and relationship tables is enough to extract the required SPO data. For a graph database, nodes (entity - attribute - attribute value) and edges (entity - relationship - entity) are naturally in SPO structure.
Some of the configuration parameters of the data preprocessing stage are shown in the following table.
In a concrete implementation, different data sources can be preprocessed through a custom stage (CustomInputStage), with an interface of the following form:
The stage reads the configuration defined in the table above, reads the remote data, then parses and stores it. For an unstructured data source, for example, it can call machine-learning interfaces, network interfaces, and the like to perform knowledge extraction, save the result as triples, and return the address and metadata of the saved data.
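The configuration table and interface listing for this stage do not appear in this text. As a purely hypothetical sketch of what the stage does, the following Python code turns rows of a relational entity table into SPO triples, stores them on a stand-in platform, and returns the address and metadata of the saved data. `rows_to_triples`, `InMemoryPlatform`, the `mem://` address scheme, and the sample rows are all assumptions, not the interface defined by the application.

```python
def rows_to_triples(entity_type, id_column, rows):
    """Turn relational rows into (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = f"{entity_type}:{row[id_column]}"
        for column, value in row.items():
            if column == id_column or value is None:
                continue  # skip the key column and missing values
            triples.append((subject, column, value))
    return triples


class InMemoryPlatform:
    """Stand-in for the data platform's unified access interface (assumed)."""

    def __init__(self):
        self._store = {}

    def save(self, name, triples):
        """Store triples and return index information: address plus metadata."""
        address = f"mem://{name}"
        self._store[address] = list(triples)
        return {"address": address, "metadata": {"format": "SPO", "count": len(triples)}}

    def load(self, address):
        return self._store[address]


# Hypothetical rows from a relational "company" table.
rows = [
    {"id": 1, "name": "Alibaba", "country": "CN"},
    {"id": 2, "name": "Example Corp", "country": None},
]
platform = InMemoryPlatform()
index = platform.save("companies", rows_to_triples("company", "id", rows))
```

The returned `index` plays the role of the graph data index information: later stages need only the address and metadata, not the source database.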
2. Entity partitioning stage (BlockingStage): entities from multiple data sources are assigned to different sub-partitions (blocks) according to their attributes, to reduce the number of candidate matching pairs.
For data sources S and T that need to be matched, suppose S contains m entities and T contains n entities; the number of pairs to check is then m*n. In big-data scenarios a comparison of this scale is practically infeasible, so the number of pairs to be matched must be reduced.
In a concrete implementation, entities from the two data sources that cannot possibly match are assigned in advance to different data partitions. This greatly reduces the data scale within each partition, and the partitions can be processed in parallel.
For example, for company entities in S and T that need to be matched, entities registered in different countries are generally unlikely to be the same company in the real world, so the data can be divided by the company's country attribute into more than 220 (country) data partitions. Each partition can be divided further into sub-partitions by the same or similar attributes; for example, companies in the 'United States' partition can be assigned to new partitions by identical 'state' attribute. The total number of pairs to match is then the sum over all data partitions, and since all partitions can be computed in parallel in subsequent steps, the overall matching time is reduced considerably.
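The reduction from m*n comparisons to the sum over partitions can be checked with a small calculation. The sketch below counts candidate pairs with and without partitioning by a key; the function names and the sample entities are illustrative assumptions.

```python
from collections import Counter


def pairs_without_blocking(source_s, source_t):
    """Every entity in S is compared with every entity in T: m * n pairs."""
    return len(source_s) * len(source_t)


def pairs_with_blocking(source_s, source_t, key):
    """Only entities that share a partition key are compared."""
    s_counts = Counter(key(e) for e in source_s)
    t_counts = Counter(key(e) for e in source_t)
    # A Counter returns 0 for keys that are absent from T.
    return sum(n * t_counts[k] for k, n in s_counts.items())


# Hypothetical company entities from two sources.
source_s = [
    {"name": "A", "country": "US"},
    {"name": "B", "country": "CN"},
    {"name": "C", "country": "CN"},
]
source_t = [
    {"name": "a", "country": "US"},
    {"name": "b", "country": "CN"},
    {"name": "d", "country": "DE"},
]

total = pairs_without_blocking(source_s, source_t)
blocked = pairs_with_blocking(source_s, source_t, key=lambda e: e["country"])
```

In this toy data, partitioning by country leaves one pair in the US partition and two in the CN partition, against nine pairs for the full cross product.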
Some of the configuration parameters of the entity partitioning stage are shown in the following table.
In addition, the partitioning of the entity partitioning stage (BlockingStage) can be extended with a custom partitioning algorithm, for example through an interface of the following form:
A globally unique partition key (block key) can be generated from the partition that the current entity is in and the attribute used for the next level of partitioning, so that the data is assigned to the next partition. When the number of possible matching entity pairs in a partition reaches a minimum, or the total number of partitions reaches a maximum, the partition is not divided further.
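One way to picture the partition key described above is as a path that grows by one attribute per level. The following Python sketch is a simplified assumption, not the application's interface: `block_key` and `partition` are hypothetical names, and the stopping rule uses a minimum entity count per partition as a stand-in for the pair-count and partition-count limits mentioned in the text.

```python
from collections import defaultdict


def block_key(parent_key, attribute, value):
    """Extend the parent partition's key with the next attribute and its value."""
    return f"{parent_key}/{attribute}={value}"


def partition(entities, attributes, parent_key="root", min_size=2):
    """Recursively split entities by equal attribute values.

    Splitting stops when no attributes remain or the partition is already
    small (min_size is an assumed, simplified stopping rule).
    """
    if not attributes or len(entities) <= min_size:
        return {parent_key: entities}
    attribute, rest = attributes[0], attributes[1:]
    groups = defaultdict(list)
    for entity in entities:
        groups[block_key(parent_key, attribute, entity.get(attribute))].append(entity)
    blocks = {}
    for key, members in groups.items():
        blocks.update(partition(members, rest, key, min_size))
    return blocks


# Hypothetical company entities.
companies = [
    {"name": "A", "country": "US", "state": "CA"},
    {"name": "B", "country": "US", "state": "NY"},
    {"name": "C", "country": "US", "state": "CA"},
    {"name": "D", "country": "CN", "state": None},
]
blocks = partition(companies, ["country", "state"])
```

Because each key embeds its parent's key, two entities share a key only if they agree on every attribute used along the path, which keeps keys unique across levels.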
For a clustering-based partitioning algorithm, a trained clustering model can be used, with an interface of the following form:
The clustering model can predict directly on the current entity and assign it to a class; the number of partitions then equals the number of classes in the clustering model. Partitions can of course be divided further on top of the clustering.
3. Entity matching stage (MatchStage): for candidate entity pairs in the same partition, different weights can be assigned to the attributes of the entity itself and to the attributes of entities related to it, and the overall similarity of the candidate pair is computed as a weighted sum. Candidate pairs whose similarity exceeds a given threshold are selected as matching entity pairs.
Note that this stage is designed to allow rules based on strong associations to determine a match directly. For company data in two data sources, for instance, if both companies are listed and have the same stock code, they can be matched directly, which skips the similarity calculation and reduces the computational complexity of the matching stage.
When a validation data set is provided, the results produced by the matching algorithm can be compared against it to verify the algorithm's accuracy. By adjusting the attributes, weight parameters, and similarity threshold and comparing the results of multiple runs, accuracy can be improved step by step. For example, suppose two company entities are compared by a weighted combination of name similarity and stock-code similarity. If the name is expressed in different languages in different data sources, the name similarity will be low, so its weight should be turned down and the relative weight of stock-code similarity set higher.
The entity matching algorithm of the application can thus be iterated over multiple rounds of parameter adjustment to improve the accuracy of the matching result.
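The weighted-sum matching and the strong-association shortcut can be sketched as follows. This is a minimal illustration under stated assumptions: the string similarity uses Python's `difflib.SequenceMatcher`, which is a choice made here and not the measure specified by the application, and the attribute names, weights, and threshold are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overall_similarity(left, right, weights):
    """Weighted sum of per-attribute similarities, normalised by total weight."""
    total = sum(weights.values())
    score = sum(
        weight * text_similarity(left.get(attr), right.get(attr))
        for attr, weight in weights.items()
    )
    return score / total


def is_match(left, right, weights, threshold):
    """Apply the strong rule first, then fall back to the weighted similarity."""
    code = left.get("stock_code")
    if code and code == right.get("stock_code"):
        return True  # identical stock codes: match directly, skip similarity
    return overall_similarity(left, right, weights) > threshold


# Hypothetical weights and entities.
weights = {"name": 0.7, "country": 0.3}
listed_a = {"name": "Alibaba Group", "stock_code": "BABA"}
listed_b = {"name": "阿里巴巴", "stock_code": "BABA"}
plain_a = {"name": "Alibaba", "country": "CN"}
plain_b = {"name": "alibaba", "country": "CN"}
other = {"name": "Zebra Holdings", "country": "DE"}
```

Tuning then amounts to rerunning `is_match` over a validation set with different `weights` and `threshold` values and comparing the outcomes, as the text describes.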
Some of the configuration parameters of the entity matching stage (MatchStage) are shown in the following table.
With a custom entity matching algorithm, two entities can be compared to determine whether they refer to the same piece of knowledge. The interface has the following form:
In the example above, a machine-learning binary classification model trained in advance takes the vector of per-attribute similarities of the two entities as input and infers the probability that they can be classified as the same entity (1 if they are).
Finally, the matched entity pairs are output to the result set.
4. Entity fusion stage (MergeStage): for data in different data sources that actually refers to the same entity, the entity attribute values are supplemented, replaced, and standardized according to a fusion algorithm, finally generating a unified entity representation.
A custom fusion algorithm is generally required, with an interface of the following form:
Different business rules can be combined during data fusion; for example, a name may be given multiple aliases, while email addresses, postal addresses, and the like may use a standardized format. Missing attribute data can be filled in by a crawler or manually, building high-quality data that facilitates applications such as knowledge graph search and analysis.
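The fusion interface itself does not appear in this text. As a hedged sketch of the business rules just described, the following Python code keeps every distinct name as an alias, fills a missing value from the other source, and standardizes an email address. `merge_entities`, `normalize_email`, and the sample entities are assumptions for illustration only.

```python
def normalize_email(value):
    """Example standardisation rule: trim whitespace and lower-case."""
    return value.strip().lower()


def merge_entities(primary, secondary, multi_valued=("name",), normalizers=None):
    """Merge two matched entities into one unified representation.

    Multi-valued attributes keep every distinct value (aliases); for other
    attributes the primary value wins and the secondary fills any gap.
    """
    normalizers = normalizers or {}
    merged = {}
    for attr in sorted(set(primary) | set(secondary)):
        a, b = primary.get(attr), secondary.get(attr)
        if attr in multi_valued:
            merged[attr] = sorted({v for v in (a, b) if v})
            continue
        value = a if a not in (None, "") else b  # supplement a missing value
        if value not in (None, "") and attr in normalizers:
            value = normalizers[attr](value)     # standardise the format
        merged[attr] = value
    return merged


# Hypothetical matched entity pair.
left = {"name": "Alibaba", "email": " IR@Example.com ", "country": None}
right = {"name": "alibaba", "country": "CN"}
merged = merge_entities(left, right, normalizers={"email": normalize_email})
```

Attributes that remain `None` after merging are the ones a crawler or a manual step would still need to fill.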
In further embodiments, in addition to the stages defined above, stages with other functions (such as a data processing stage) can be orchestrated into the pipeline. An interface of the following form can be used:
The data to be processed is passed in through the input configuration parameter; after processing, the result is written to output and passed to the next stage, thereby extending the system's functionality.
By the above means, the application implements a general pipeline (Pipeline) for entity fusion in big-data scenarios. The pipeline consists of multiple stages (Stage). Each stage can be extended flexibly through configuration, and custom stages (CustomStage) can be orchestrated into the pipeline to suit different application scenarios. Except for the data preprocessing stage (InputStage), which has only an output configuration, every stage has an input configuration. The input configuration can specify the entity lists, relationship lists, data addresses, and related metadata (schema, including table names, column names, and so on) from different data sources that the stage needs in order to run. Each stage reads its input data, runs its algorithm, writes the result to the data platform, and exports all data addresses and metadata through output. Stages can therefore be chained through input and output, or run individually with explicitly specified input parameters.
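The chaining of stages through input and output can be sketched in a few lines. In the hypothetical Python code below, a stage is any function that takes the platform and the previous stage's output and returns its own output (an address plus metadata); the platform is modelled as a plain dictionary. None of these names come from the application.

```python
def input_stage(platform, _inputs):
    """First stage: has no input, writes triples and reports where they are."""
    platform["raw"] = [
        ("company:1", "name", "Alibaba"),
        ("company:2", "name", "alibaba"),
    ]
    return {"address": "raw", "metadata": {"count": 2}}


def lowercase_stage(platform, inputs):
    """Custom stage: reads the address it is given and writes a new data set."""
    triples = platform[inputs["address"]]
    platform["lowercased"] = [(s, p, o.lower()) for s, p, o in triples]
    return {"address": "lowercased", "metadata": {"count": len(triples)}}


def run_pipeline(stages, platform):
    """Run stages in order, feeding each stage's output to the next as input."""
    output = None
    for stage in stages:
        output = stage(platform, output)
    return output


platform = {}
result = run_pipeline([input_stage, lowercase_stage], platform)

# A stage can also be run on its own by specifying its input explicitly.
rerun = lowercase_stage(platform, {"address": "raw"})
```

Since stages exchange only addresses and metadata, a stage can be swapped or inserted without touching its neighbours, as long as the data format it reads and writes stays the same.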
Fig. 2 shows the flow of the second embodiment of the knowledge graph data fusion method of the application. It differs from the first method embodiment in that an attribute alignment stage (Attribute Matching) is added between the data preprocessing stage and the entity partitioning stage. This stage aligns the entities from multiple data sources, stored on the data platform after preprocessing, according to the actual meaning of their attributes; for example, the "address" field of data source A is aligned with the "Address" field of data source B. In the subsequent partitioning and matching stages, aligned fields are treated as fields with the same meaning.
In a concrete implementation, the actual meaning of entity attributes can be set manually, or it can be provided by an attribute-meaning lookup table configured in the system; the application places no limitation on this.
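An attribute-meaning lookup table of the kind mentioned above can be as simple as a mapping from (data source, field name) to a canonical attribute name. The table contents, the canonical names, and the `align` helper below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical lookup table: (data source, source field) -> canonical attribute.
ALIGNMENT = {
    ("A", "address"): "address",
    ("B", "Address"): "address",
    ("B", "CompanyName"): "name",
}


def align(source, triples, table=ALIGNMENT):
    """Rewrite each predicate to its canonical name; unknown fields pass through."""
    return [(s, table.get((source, p), p), o) for s, p, o in triples]


aligned = align(
    "B",
    [("company:1", "Address", "Hangzhou"), ("company:1", "founded", "1999")],
)
```

After alignment, the partitioning and matching stages can refer to `address` without knowing which source a triple came from.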
The application also discloses a storage medium on which a program for executing the above method is recorded. The storage medium includes any mechanism configured to store or transmit information in a form readable by a machine (for example, a computer). For example, storage media include read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash storage media, and electrical, optical, acoustic, or other forms of propagated signals (for example, carrier waves, infrared signals, digital signals).
Fig. 3 shows a structural block diagram of an embodiment of the knowledge graph data fusion device of the application, which includes a data platform 10, a data preprocessing module 11, an entity partitioning module 12, an entity matching module 13, and an entity fusion module 14, wherein:
The data platform 10 is configured with a unified access interface and provides computing and storage services for the other modules. The application places no limitation on the architecture of the data platform; to make it easy to scale computing and storage resources as data volume grows, the Hadoop Distributed File System or a cloud computing framework can be used.
The data preprocessing module 11 processes data from different data sources, converts it to triple (S-P-O) format, stores it on the data platform 10 through the unified access interface, and receives the graph data index information returned by the data platform 10. The graph data index information can be the storage address, on the data platform 10, of the graph data in triple format, together with its metadata.
The entity partitioning module 12 partitions the entities stored on the data platform 10 into one or more sub-partitions by attribute, through the unified access interface and according to the graph data index information. In a concrete implementation, the entity partitioning module 12 may include an equality partitioning submodule that divides the entities stored on the data platform by equality on a globally unique partition key generated from entity attributes, a clustering partitioning submodule that divides the entities stored on the data platform based on a preset clustering model, and/or submodules for other partitioning methods.
The entity matching module 13 performs similarity calculation on candidate entity pairs that fall into the same sub-partition and selects the matching entity pairs that satisfy a preset similarity condition.
The entity fusion module 14 supplements and/or replaces the entity attribute values of the matching entity pairs, generating a unified entity representation.
The functional modules of the device embodiment have upstream and downstream dependencies on the pipeline, but different modules are constrained only by data format and are decoupled from one another through the unified interface provided by the data platform, so they can be developed independently. The algorithm of each module can be replaced flexibly, and by implementing custom stages, new modules can be inserted between existing ones to accommodate custom requirements. For example, to improve adaptability to a variety of data sources and the accuracy of the subsequent entity partitioning, matching, and fusion, an attribute alignment module 15 can be inserted between the data preprocessing module 11 and the entity partitioning module 12. It aligns the entities from different data sources, stored on the data platform 10 after processing by the data preprocessing module 11, according to the actual meaning of their attributes; for example, the "address" field of data source A is aligned with the "Address" field of data source B. In the subsequent partitioning and matching stages, aligned fields are treated as fields with the same meaning.
In a further preferred embodiment, the entity matching module 13 can specifically include a similarity calculation submodule and a comparison submodule. The similarity calculation submodule assigns different weights to the attributes of the entity itself and to the attributes of other entities related to it, and computes the overall similarity of a candidate entity pair as a weighted sum. The comparison submodule judges whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, takes that candidate pair as a matching entity pair.
In another preferred embodiment, the device can further include a data processing module, which processes node entity data and edge entity data on the data platform through the unified access interface and returns the processing result to be passed to the next module.
The data processing module can be implemented in the following form:
The data to be processed is passed in through the input configuration parameter; after processing, the result is written to output and passed to the functional module of the next stage, thereby extending the device's functionality.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Because the device embodiments of the present application are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
1. A data fusion method for a knowledge graph, characterized in that the system executing the method includes a data platform configured with a unified access interface, the method comprising:
processing data from different data sources, converting it into triple format, storing it in the data platform through the unified access interface, and receiving the graph data index information returned by the data platform;
dividing, according to the graph data index information, the entities stored in the data platform into one or more sub-partitions by attribute;
performing similarity calculation on the candidate entity pairs divided into the same sub-partition, and screening out the matching entity pairs that meet a preset similarity condition;
supplementing and/or replacing the entity attribute values of the matching entity pairs to generate a unified entity representation.
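The final step of claim 1, supplementing and/or replacing attribute values of a matched pair, might look like the following sketch. The function name `fuse_pair`, the `prefer_secondary` parameter and the sample values are assumptions; the claim does not state which source wins when values conflict.

```python
def fuse_pair(primary, secondary, prefer_secondary=()):
    """Merge two matched entities into one unified attribute dictionary.

    - supplement: attributes that are missing (absent or None) in `primary`
      are filled from `secondary`;
    - replace: attributes listed in `prefer_secondary` take the value from
      `secondary` whenever it has one.
    """
    fused = dict(primary)
    for attr, value in secondary.items():
        if value is None:
            continue  # nothing to supplement or replace with
        if fused.get(attr) is None or attr in prefer_secondary:
            fused[attr] = value
    return fused


# Illustrative matched pair from two data sources.
a = {"name": "Acme Ltd", "address": None, "phone": "555-0100"}
b = {"name": "ACME Limited", "address": "1 Main St", "phone": "555-0199"}
unified = fuse_pair(a, b, prefer_secondary={"phone"})
```

Here the missing address is supplemented from `b`, the phone number is replaced because the caller declared `b` more trustworthy for that attribute, and the name from `a` is kept.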
2. The method according to claim 1, characterized in that, before the step of dividing the entities stored in the data platform into one or more sub-partitions by attribute according to the graph data index information, the method further comprises: aligning, according to the physical meaning of their attributes, the entities from multiple data sources that have been converted into triple format and stored in the data platform.
3. The method according to claim 1, characterized in that the sub-partitions are divided either by equivalence division on a globally unique partition key generated from entity attributes, or by division based on a preset clustering model.
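The equivalence division of claim 3 can be sketched as a simple blocking step: entities with the same partition key land in the same sub-partition, and only pairs inside a sub-partition become candidates. The key definition used here (normalised city plus the first letter of the name) is a hypothetical choice; the clustering-based alternative is not shown.

```python
from collections import defaultdict
from itertools import combinations


def partition_key(entity):
    """Hypothetical key: lower-cased city plus the first letter of the name."""
    return (entity["city"].strip().lower(), entity["name"].strip().lower()[:1])


def partition(entities):
    """Group entities into sub-partitions that share the same partition key."""
    parts = defaultdict(list)
    for entity in entities:
        parts[partition_key(entity)].append(entity)
    return parts


def candidate_pairs(parts):
    """Yield every pair of entities that share a sub-partition."""
    for members in parts.values():
        yield from combinations(members, 2)


entities = [
    {"id": 1, "name": "Acme Ltd", "city": "Shanghai"},
    {"id": 2, "name": "ACME Limited", "city": "shanghai "},
    {"id": 3, "name": "Zebra Corp", "city": "Shanghai"},
    {"id": 4, "name": "Acme Ltd", "city": "Beijing"},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(partition(entities))]
```

Only entities 1 and 2 share a key, so a single candidate pair is produced instead of the six pairs a full pairwise comparison of four entities would need.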
4. The method according to claim 1, characterized in that performing similarity calculation on the candidate entity pairs divided into the same sub-partition and screening out the matching entity pairs that meet the preset similarity condition specifically comprises:
setting different weights for the attributes of the entity itself and for the attributes of other entities related to the entity, and calculating the overall similarity of each candidate entity pair as a weighted sum;
if the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold, taking the candidate entity pair as a matching entity pair.
5. The method according to claim 1, characterized in that missing entity attribute values are supplemented by obtaining them from the network with a crawler or by filling them in manually.
6. The method according to claim 1, characterized in that the graph data index information is the storage address of the triple-format graph data in the data platform together with its metadata.
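To make the notion of graph data index information more concrete, the sketch below converts a flat record into triples, stores them, and returns a storage address plus metadata. Everything here is an assumption for illustration: the `GraphIndexInfo` fields, the `mem://` address scheme and the `InMemoryPlatform` class are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class GraphIndexInfo:
    """Hypothetical shape of the index information returned by the platform."""
    storage_address: str
    metadata: dict = field(default_factory=dict)


def to_triples(entity_id, record):
    """Turn one flat record into (subject, predicate, object) triples."""
    return [
        (entity_id, attr, value)
        for attr, value in record.items()
        if value is not None
    ]


class InMemoryPlatform:
    """Stand-in for a data platform with a single (unified) access interface."""

    def __init__(self):
        self._graphs = {}

    def store(self, source, triples):
        address = f"mem://{source}/{len(self._graphs)}"
        self._graphs[address] = list(triples)
        metadata = {"source": source, "triple_count": len(triples)}
        return GraphIndexInfo(address, metadata)

    def load(self, index):
        return self._graphs[index.storage_address]


platform = InMemoryPlatform()
record = {"name": "Acme Ltd", "address": "1 Main St", "phone": None}
triples = to_triples("A:1", record)
index = platform.store("A", triples)
```

Later stages such as partitioning would then receive only `index` and use it to locate the stored triples, rather than being handed the data directly.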
7. A data fusion device for a knowledge graph, characterized by comprising a data platform, a data preprocessing module, an entity division module, an entity matching module and an entity fusion module, wherein:
the data platform is configured with a unified access interface;
the data preprocessing module is used to process data from different data sources, convert it into triple format, store it in the data platform through the unified access interface, and receive the graph data index information returned by the data platform;
the entity division module is used to divide the entities stored in the data platform into one or more sub-partitions by attribute, according to the graph data index information output by the data preprocessing module;
the entity matching module is used to perform similarity calculation on the candidate entity pairs that the entity division module has divided into the same sub-partition, and to screen out the matching entity pairs that meet a preset similarity condition;
the entity fusion module is used to supplement and/or replace the entity attribute values of the matching entity pairs screened out by the entity matching module, generating a unified entity representation.
8. The device according to claim 7, characterized in that the entity division module includes an equivalence partitioning submodule and/or a clustering partitioning submodule;
the equivalence partitioning submodule is used to perform equivalence division on the entities stored in the data platform according to a globally unique partition key generated from entity attributes;
the clustering partitioning submodule divides the entities stored in the data platform based on a preset clustering model;
the entity matching module specifically includes a similarity calculation submodule and a comparison submodule;
the similarity calculation submodule is used to set different weights for the attributes of the entity itself and for the attributes of other entities related to the entity, and to calculate the overall similarity of each candidate entity pair as a weighted sum;
the comparison submodule is used to judge whether the overall similarity of a candidate entity pair in the same sub-partition exceeds a preset similarity threshold and, if so, to take the candidate entity pair as a matching entity pair.
9. The device according to claim 7, characterized in that the device further includes a data processing module and/or an attribute alignment module;
the data processing module is used to process the node entity data and edge entity data in the data platform through the unified access interface, and to pass the returned processing result to the next module;
the attribute alignment module is used to align, according to the physical meaning of their attributes, the entities from multiple data sources that have been processed by the data preprocessing module and stored in the data platform.
10. A storage medium, characterized in that the storage medium stores a program for executing the method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811635696.XA CN109739939A (en) | 2018-12-29 | 2018-12-29 | The data fusion method and device of knowledge mapping |
PCT/CN2019/124552 WO2020135048A1 (en) | 2018-12-29 | 2019-12-11 | Data merging method and apparatus for knowledge graph |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739939A true CN109739939A (en) | 2019-05-10 |
Family
ID=66362378
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109739939A (en) |
WO (1) | WO2020135048A1 (en) |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427415A (en) * | 2019-08-02 | 2019-11-08 | 泰康保险集团股份有限公司 | Knowledge share method, device, system media and electronic equipment |
CN110532304A (en) * | 2019-09-06 | 2019-12-03 | 京东城市(北京)数字科技有限公司 | Data processing method and device, computer readable storage medium and electronic equipment |
CN110580294A (en) * | 2019-09-11 | 2019-12-17 | 腾讯科技(深圳)有限公司 | Entity fusion method, device, equipment and storage medium |
CN110598072A (en) * | 2019-09-24 | 2019-12-20 | 恩亿科(北京)数据科技有限公司 | Feature data aggregation method and device |
CN110704635A (en) * | 2019-09-16 | 2020-01-17 | 金色熊猫有限公司 | Conversion method and device for ternary group data in knowledge graph |
CN110826316A (en) * | 2019-11-06 | 2020-02-21 | 北京交通大学 | Method for identifying sensitive information applied to referee document |
CN110929105A (en) * | 2019-11-28 | 2020-03-27 | 杭州云徙科技有限公司 | User ID (identity) association method based on big data technology |
CN111026874A (en) * | 2019-11-22 | 2020-04-17 | 海信集团有限公司 | Data processing method and server of knowledge graph |
CN111125376A (en) * | 2019-12-23 | 2020-05-08 | 秒针信息技术有限公司 | Knowledge graph generation method and device, data processing equipment and storage medium |
WO2020114022A1 (en) * | 2018-12-04 | 2020-06-11 | 平安科技(深圳)有限公司 | Knowledge base alignment method and apparatus, computer device and storage medium |
CN111291196A (en) * | 2020-01-22 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Method and device for improving knowledge graph and method and device for processing data |
WO2020135048A1 (en) * | 2018-12-29 | 2020-07-02 | 颖投信息科技(上海)有限公司 | Data merging method and apparatus for knowledge graph |
CN111444351A (en) * | 2020-03-24 | 2020-07-24 | 清华苏州环境创新研究院 | Method and device for constructing knowledge graph in industrial process field |
CN111475653A (en) * | 2019-12-30 | 2020-07-31 | 北京国双科技有限公司 | Method and device for constructing knowledge graph in oil and gas exploration and development field |
CN111522803A (en) * | 2020-04-14 | 2020-08-11 | 北京仁科互动网络技术有限公司 | Tenant interaction method and device of software service platform and electronic equipment |
CN111563133A (en) * | 2020-05-06 | 2020-08-21 | 支付宝(杭州)信息技术有限公司 | Method and system for data fusion based on entity relationship |
CN111597239A (en) * | 2020-04-10 | 2020-08-28 | 中科驭数(北京)科技有限公司 | Data alignment method and device |
CN112182330A (en) * | 2020-09-23 | 2021-01-05 | 创新奇智(成都)科技有限公司 | Knowledge graph construction method and device, electronic equipment and computer storage medium |
WO2021082100A1 (en) * | 2019-10-30 | 2021-05-06 | 平安科技(深圳)有限公司 | Method and apparatus for aligning entities of knowledge graph, device, and storage medium |
CN112906826A (en) * | 2021-03-30 | 2021-06-04 | 平安科技(深圳)有限公司 | Multi-dimension-based knowledge graph fusion method and device and computer equipment |
CN113297213A (en) * | 2021-04-29 | 2021-08-24 | 军事科学院系统工程研究院网络信息研究所 | Dynamic multi-attribute matching method for entity object |
CN113392227A (en) * | 2021-05-31 | 2021-09-14 | 交控科技股份有限公司 | Metadata knowledge map engine system facing rail transit field |
CN113760995A (en) * | 2021-09-09 | 2021-12-07 | 上海明略人工智能(集团)有限公司 | Entity linking method, system, equipment and storage medium |
CN113901264A (en) * | 2021-11-12 | 2022-01-07 | 央视频融媒体发展有限公司 | Method and system for matching periodic entities among movie and television attribute data sources |
CN113934866A (en) * | 2021-12-17 | 2022-01-14 | 鲁班(北京)电子商务科技有限公司 | Commodity entity matching method and device based on set similarity |
CN114282073A (en) * | 2022-03-02 | 2022-04-05 | 支付宝(杭州)信息技术有限公司 | Data storage method and device and data reading method and device |
CN114861818A (en) * | 2022-05-25 | 2022-08-05 | 平安普惠企业管理有限公司 | Main data matching method, device, equipment and storage medium based on artificial intelligence |
CN114896363A (en) * | 2022-04-19 | 2022-08-12 | 北京月新时代科技股份有限公司 | Data management method, device, equipment and medium |
CN115577318A (en) * | 2022-09-30 | 2023-01-06 | 北京大数据先进技术研究院 | Data fusion evaluation method, system, equipment and storage medium based on semi-physical object |
CN117556058A (en) * | 2024-01-11 | 2024-02-13 | 安徽大学 | Knowledge graph enhanced network embedded author name disambiguation method and device |
CN117725555A (en) * | 2024-02-08 | 2024-03-19 | 暗物智能科技(广州)有限公司 | Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112699252B (en) * | 2021-03-25 | 2021-07-23 | 成都数联铭品科技有限公司 | Processing method of attribute data applied to knowledge graph and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107145523A (en) * | 2017-04-12 | 2017-09-08 | 浙江大学 | Large-scale Heterogeneous Knowledge storehouse alignment schemes based on Iterative matching |
US20180082183A1 (en) * | 2011-02-22 | 2018-03-22 | Thomson Reuters Global Resources | Machine learning-based relationship association and related discovery and search engines |
CN108647318A (en) * | 2018-05-10 | 2018-10-12 | 北京航空航天大学 | A kind of knowledge fusion method based on multi-source data |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2874073A1 (en) * | 2013-11-18 | 2015-05-20 | Fujitsu Limited | System, apparatus, program and method for data aggregation |
CN105956015A (en) * | 2016-04-22 | 2016-09-21 | 四川中软科技有限公司 | Service platform integration method based on big data |
CN107545046B (en) * | 2017-08-17 | 2021-05-25 | 北京奇安信科技有限公司 | Fusion method and device for multi-source heterogeneous data |
CN107958086A (en) * | 2017-12-18 | 2018-04-24 | 北京睿力科技有限公司 | The multi-source heterogeneous database data for solving data semantic Heterogeneity integrates method |
CN109033129B (en) * | 2018-06-04 | 2021-08-03 | 桂林电子科技大学 | Multi-source information fusion knowledge graph representation learning method based on self-adaptive weight |
CN109739939A (en) * | 2018-12-29 | 2019-05-10 | 颖投信息科技(上海)有限公司 | The data fusion method and device of knowledge mapping |
Non-Patent Citations (1)
Title |
---|
《图书情报工作》杂志社编: "《面向MOOC的图书馆嵌入式服务创新》", 北京:海洋出版社, pages: 154 - 155 * |
Also Published As
Publication number | Publication date |
---|---|
WO2020135048A1 (en) | 2020-07-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190510 |