[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

WO2017137000A1 - 对描述同一实体的不同实例进行合并的方法、装置及设备 - Google Patents

对描述同一实体的不同实例进行合并的方法、装置及设备 Download PDF

Info

Publication number
WO2017137000A1
WO2017137000A1 PCT/CN2017/072995 CN2017072995W WO2017137000A1 WO 2017137000 A1 WO2017137000 A1 WO 2017137000A1 CN 2017072995 W CN2017072995 W CN 2017072995W WO 2017137000 A1 WO2017137000 A1 WO 2017137000A1
Authority
WO
WIPO (PCT)
Prior art keywords
instance
instances
nodes
connection
relationship
Prior art date
Application number
PCT/CN2017/072995
Other languages
English (en)
French (fr)
Inventor
杨扬
穆冠宇
华能威
张伟
吴嘉
Original Assignee
广州神马移动信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州神马移动信息科技有限公司 filed Critical 广州神马移动信息科技有限公司
Priority to US16/046,166 priority Critical patent/US11544578B2/en
Publication of WO2017137000A1 publication Critical patent/WO2017137000A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/048Fuzzy inferencing

Definitions

  • the present invention relates to the field of computer technology, and more particularly to a method, apparatus and apparatus for merging different instances of the same entity.
  • the knowledge map is intended to describe the various entities or concepts that exist in the real world.
  • Each entity or concept in the knowledge map is identified by a globally unique ID, called its identifier.
  • Each attribute-value pair (AVP) is used to characterize the intrinsic properties of an entity, and the relationship is used to connect two entities to characterize the association between them.
  • the common fusion method is mainly to integrate the instance pairs whose attribute similarity exceeds the threshold by calculating the attribute similarity between different instances.
  • this fusion method can identify different instances describing the same entity to a certain extent, since this fusion method only uses the attribute similarity as the standard of the fusion instance, it makes the attribute fuzzy matching rule used in the fusion process. It must be set as much as possible to effectively identify different instances of the same entity for fusion, but this is difficult to implement in practical applications, so it is easy to identify instance pairs that represent the same entity as different instances, for knowledge maps. Build data that brings redundancy.
  • the technical problem to be solved by the present invention is to provide a method, device and device for merging different instances of the same entity, which can more fully identify instance pairs describing the same entity.
  • a computing device comprising: a memory for storing a connection graph including a plurality of instances, wherein different nodes in the connection graph represent different instances, and a connection between the nodes represents a node corresponding to An instance relationship between instances; a processor coupled to the memory, the processor being capable of obtaining a connection graph from the memory, the processor configured to: identify different instances describing the same entity in the connection graph based on the instance relationship, identify The nodes corresponding to the instance are merged, and the connection diagram is updated; in the updated connection diagram, the instance pairs of the unexisting instance relationship are identified, and the connection for connecting the corresponding node to the instance pair is added; The steps of updating the connection diagram based on the instance relationship and adding the connection operation in the updated connection diagram until the specified condition is met.
  • the apparatus of the present invention combines the equivalent instances of the plurality of instances to be determined using a connection diagram.
  • the instance relationship is utilized, and based on the merged connection graph, the instance relationship is expanded, and then the steps of merging and augmenting are performed iteratively, so that the equivalent instances existing in the connection graph can be fully exploited.
  • an apparatus for merging different instances describing the same entity comprising: an obtaining module, configured to acquire a connection graph including a plurality of instances, wherein different nodes in the connection graph represent different
  • the connection between the nodes represents an instance relationship between the instances corresponding to the node
  • the merging module is configured to identify different instances of the same entity in the connection diagram based on the instance relationship, and perform the node corresponding to the identified instance.
  • the expansion module is configured to identify the unrecognized instance pairs of the existing instance relationship in the updated connection diagram, and add a connection line for connecting the corresponding pair of the instance pairs;
  • the merge module and the expansion module are caused to iteratively perform the operation of updating the connection graph and the operation of adding the connection until the specified condition is met.
  • the expansion module mentioned in the foregoing apparatus may include: an association calculation module, configured to calculate, for any node in the updated connection diagram, an instance corresponding to the node and the section The degree of association between the instances corresponding to the nodes connected by the N nodes, where N is greater than or equal to 1; the first identifying module is configured to identify the instance pairs corresponding to the two nodes whose relevance degree reaches the predetermined relevance threshold For instance pairs that have instance relationships, add a connection between the two nodes. Due to the merging of the equivalent instances, the instance relationships in the merged join graph will also change. At this point, you can find the instance pair with the instance relationship by calculating the degree of association between the nodes.
  • an association calculation module configured to calculate, for any node in the updated connection diagram, an instance corresponding to the node and the section The degree of association between the instances corresponding to the nodes connected by the N nodes, where N is greater than or equal to 1
  • the first identifying module is configured to identify the instance pairs corresponding to the two nodes whose relevance
  • the specified condition described in the foregoing apparatus may be set such that the number of instance pairs of the undiscovered existence instance relationship identified by the expansion module in the updated connection diagram is zero.
  • the merging module mentioned in the foregoing apparatus may include: a grouping module, configured to group multiple instances; and a similarity calculating module, configured to calculate, for each group, any two instances in the group based on the instance relationship.
  • the similarity between the two; the second identification module is configured to identify the instance pair whose similarity reaches the predetermined similarity threshold as an instance pair describing the same entity.
  • the similarity calculation module can calculate the similarity Sim between the two instances according to the following formula:
  • C i is an instance set having an instance relationship with instance i
  • C j is an instance set having an instance relationship with instance j
  • Jac ij is an instance relationship similarity between instances i and j
  • Uniq is an instance unique measure
  • Cnt sourceA i is the number of instances of the same name of instance i in source A
  • Cnt sourceB j is the number of instances of the same name of instance j in source B.
  • the acquiring module mentioned in the foregoing apparatus may further include: the attribute similarity calculating module may calculate the attribute similarity between the instances corresponding to any two nodes in the connection graph; and the second combining module may perform the attribute The nodes corresponding to the two instances whose similarity exceeds the predetermined attribute similarity threshold are merged into one node.
  • a method of merging different instances describing the same entity comprising: obtaining a connection graph comprising a plurality of instances, wherein the connection graph Different nodes in the middle represent different instances, and the connection between the nodes represents an instance relationship between the instances corresponding to the nodes; based on the instance relationship, different instances in the connection graph describing the same entity are identified, and the nodes corresponding to the identified instances are identified.
  • connection graph Merging and updating the connection graph; identifying unrecognized instance pairs with instance relationships in the updated connection graph, and adding connections for connecting the corresponding pairs of the instance pairs; iteratively executing the connection graph based on the instance relationship update Steps and steps to add wires to the updated connection diagram until the specified conditions are met.
  • the step of identifying an instance pair of the undiscovered existence instance relationship in the updated connection diagram described in the foregoing method may include: calculating, for the node in the updated connection diagram, the corresponding node The degree of association between the instance and the instance corresponding to the node connected by the node through the N nodes, where N is greater than or equal to 1; the instance pair corresponding to the two nodes whose relevance degree reaches the predetermined relevance threshold is identified as being present An instance pair of instance relationships adds a connection between the two nodes.
  • the specified condition described in the above method may be set such that the number of instance pairs of the undiscovered existence instance relationship identified in the updated connection diagram is zero.
  • the step of identifying different instances in the connection diagram describing the same entity in the connection diagram may include: grouping multiple instances; for each group, calculating any two groups in the group based on the instance relationship. The degree of similarity between instances; an instance pair that achieves a similarity to a predetermined similarity threshold is identified as an instance pair describing the same entity.
  • the similarity Sim between the two instances can be calculated according to the following formula:
  • C i is an instance set having an instance relationship with instance i
  • C j is an instance set having an instance relationship with instance j
  • Jac ij is an instance relationship similarity between instances i and j
  • Uniq is an instance unique measure
  • Cnt sourceA i is the number of instances of the same name of instance i in source A
  • Cnt sourceB j is the number of instances of the same name of instance j in source B.
  • the step of acquiring the connection graph including the multiple instances mentioned in the foregoing method may further include: calculating an attribute similarity between the instances corresponding to any two nodes in the connection graph; and comparing the attribute similarity to the predetermined one.
  • the nodes corresponding to the two instances of the attribute similarity threshold are combined into one node.
  • the method, device and device for merging different instances of the same entity in the present invention merge the equivalent instances in the plurality of instances by means of a connection diagram, wherein the instances existing in the connection diagram are utilized in the process of merging Relationship, and augment the instance relationship based on the merged connection graph, and then further discover the equivalent instances existing in the connection graph based on the extended instance relationship, and so on, and iteratively perform the steps of the foregoing merge and expansion, so that the scheme based on the present invention An instance pair describing the same entity can be more fully identified.
  • Fig. 1 shows a schematic view of a connection diagram of the present invention.
  • FIG. 2 is a block diagram showing the structure of a computing device in accordance with an embodiment of the present invention.
  • FIG. 3 shows a functional block diagram of an apparatus for merging different instances of the same entity, in accordance with an embodiment of the present invention.
  • FIG. 4 shows a functional block diagram of an apparatus for merging different instances of the same entity, in accordance with another embodiment of the present invention.
  • FIG. 5 shows a schematic flow diagram of a method of merging different instances of the same entity, in accordance with an embodiment of the present invention.
  • FIG. 6 shows a schematic flow chart of sub-steps that can be included in step S110 of FIG. 5.
  • FIG. 7 shows a schematic flow chart of sub-steps that can be included in step S120 of FIG. 5.
  • FIG. 8 shows a schematic flow chart of sub-steps that can be included in step S130 in FIG.
  • Entity A unit of knowledge in a knowledge map with a uniquely identified ID identifier.
  • Example Data from various sources used in the process of building entities in a knowledge map.
  • Instance relationship The relationship existing between instances. For different data sources, the relationship here can be attribute relationship, reference relationship, link relationship and other relationships.
  • Baidu Encyclopedia refers to the various terms in Baidu Encyclopedia.
  • the entry "Li Ning” in Baidu Encyclopedia is a polysemous word. Li Ning, who refers to the famous gymnast, also Li Ning, who refers to the magician. Here, Li Ning, who refers to the magician, and Li Ning, who refers to the gymnast, are examples of the same name. Under the entry of Li Ning, a famous gymnast, there are still entries such as “Olympic Champion” and “Gold Medal”. Here, we can think that “Li Ning” has an instance relationship with “Olympic Champion” and “Gold Medal”. Baidu Encyclopedia refers to the "Li Ning” of the gymnasts and the "Gymnastics Prince" in the Sogou Encyclopedia.
  • the present invention primarily proposes a solution for identifying equivalent instances in numerous instances.
  • the scheme mainly identifies the equivalent instances based on the connection graph and continuously updates the connection graph to identify more equivalent instances.
  • connection diagram including a plurality of instances may be constructed first.
  • nodes in the connection diagram represent instances, and connections between nodes represent instance relationships.
  • the existing equivalence instances may be identified according to the instance relationships existing in the connection graph, and the nodes corresponding to the identified equivalence instances are merged.
  • the instance relationships not found in the connection graph may be found based on a certain identification rule, and the connection graph is updated according to the found instance relationship. Then, the steps of finding the equivalent instance based on the instance relationship and the step of finding the undiscovered instance relationship based on the merged connection graph are repeated, until the specified condition is satisfied.
  • the specified condition here may be that a new instance relationship cannot be found or a new equivalent instance cannot be found or the repeated steps are reached a certain number of times, and of course other specified conditions.
  • the connection diagram may be updated after all the found equivalent instances in the connection diagram are merged.
  • the solution of the present invention can be implemented as a computing device as shown in FIG. 2.
  • the computing device can be configured to include a memory 1 and a processor 2.
  • the memory 1 can store a connection map containing a plurality of instances.
  • the processor 2 is connected to the memory 1, and the connection map can be acquired from the memory 1, and the operation of implementing the relevant steps in the above scheme can be performed.
  • the processor 2 may be, for example, a central processing unit CPU, a microprocessor MCU, or the like
  • the memory 1 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a nonvolatile memory such as a hard disk, and the like.
  • computing devices may also include interface devices, communication devices, display devices, input devices, speakers, microphones, etc., but These components are not relevant to the present invention and are therefore omitted herein.
  • the physical implementation of the computing device is not limited.
  • the computing device may be a server, such as a blade server, or a computer or a computer-like electronic device such as a notebook computer, a tablet computer, or the like.
  • the solution of the invention can also be implemented as a device comprising a plurality of functional modules.
  • the functions of the processor 2 shown in FIG. 2 can be implemented by corresponding functional modules in the device.
  • an apparatus for merging different instances of the same entity of the present invention may include an acquisition module 21, a merge module 22, an expansion module 23, and an iteration module 24.
  • the obtaining module 21, the merging module 22, the expanding module 23, and the iterative module 24 can perform the implementation described above.
  • the acquisition module 21 can acquire a connection map.
  • the merging module 22 may identify an equivalent instance existing in the connection graph based on the instance relationship existing in the connection graph, and merge the nodes corresponding to the identified equivalent instance.
  • the expansion module 23 can identify instance relationships that are not found in the connection diagram.
  • the iteration module 24 may cause the merge module 22 and the expansion module 23 to iteratively perform the corresponding operations until the specified conditions are met.
  • the acquisition module 21 may include an attribute similarity calculation module 211 and a second merge module 212.
  • the merging module 22 may include a grouping module 221, a similarity calculating module 222, and a second identifying module 223.
  • the expansion module 23 may include an association degree calculation module 231 and a first identification module 232.
  • the functions of the obtaining module 21, the merging module 22, and the expansion module 23 may be implemented by corresponding sub-modules included therein, and will not be specifically described herein.
  • FIG. 5 to 8 show in detail a flowchart for carrying out the solution of the present invention.
  • the steps shown in FIG. 5 to FIG. 8 can be implemented by the corresponding functional modules in the above-mentioned processor or device.
  • the working flow of the solution of the present invention will be described in detail below with reference to FIG. 5 to FIG.
  • step S110 a connection diagram including a plurality of instances is acquired by the processor 2 or the acquisition module 21.
  • the step of obtaining the connection map described herein may be to acquire a previously constructed connection diagram.
  • the connection map may be constructed in advance based on a plurality of instances and then stored in the memory, which is acquired by the processor 2 or the acquisition module 21 from the memory when processing is required.
  • connection map may be constructed according to a plurality of instance data to be determined and an instance relationship existing in the instance data.
  • a connection graph may be constructed according to a plurality of instance data to be determined and an instance relationship existing in the instance data.
  • the constructed connection diagram it can be stored in the memory.
  • the processor 2 or the acquisition module 21 can obtain the connection diagram from the memory.
  • the constructed connection diagram can be directly sent to the processor 2 or Obtain module 21.
  • an equivalent instance existing in the connection graph may be identified based on a certain identification rule, and the node corresponding to the equivalent instance may be merged.
  • the identification rule referred to herein may be a manner of identifying the attribute similarity as shown in FIG. 6.
  • step S1110 the attribute similarity between the instances corresponding to any two nodes in the connection graph is calculated by the processor 2 or by the attribute similarity calculation module 211 in the acquisition module 21.
  • step S1120 the nodes corresponding to the instances whose attribute similarities exceed the predetermined attribute similarity threshold are merged into one node by the processor 2 or by the second merge module 212 in the acquisition module 21.
  • step S110, step S1120 the step of merging the nodes in the connection diagram (step S110, step S1120) in step S110 is an alternative of the present invention, so that the existing existence of the connection diagram can be initially found based on the existing calculation manner. Valuation instances and merge them to reduce the complexity of subsequent steps.
  • step S120 may be performed, by the processor 2 or by the merging module 21, based on the instance relationship, identifying different instances (ie, equivalent instances) describing the same entity in the connection graph, The nodes corresponding to the identified equivalent instances are merged and the connection graph is updated.
  • an instance having an instance relationship with the current instance may be involved in the process of calculating the similarity, and then the instance pair whose similarity exceeds the threshold is identified as an equivalent instance.
  • Figure 7 illustrates a specific implementation for identifying an equivalent instance based on an instance relationship.
  • step S1210 a plurality of instances in the connection map are grouped by the processor 2 or by the grouping module 221 in the merging module 22.
  • the similarity between any two instances within the group may be calculated by the processor 2 or by the similarity calculation module 222 in the grouping module 22 based on the instance relationship.
  • the similarity Sim between the two instances can be calculated according to the following formula:
  • C i is an instance set having an instance relationship with instance i
  • C j is an instance set having an instance relationship with instance j
  • Jac ij is an instance relationship similarity between instances i and j
  • Uniq is an instance unique measure
  • Cnt sourceA i is the number of instances of the same name of instance i in source A
  • Cnt sourceB j is the number of entities of the same name in instance B in source B.
  • the above formula can have different forms of deformation. Taking the instance data source as the encyclopedia entry, the similarity Sim between two instances from different encyclopedias can be calculated based on the following formula:
  • is a weighting coefficient
  • C iout number of instances is determined to be an example of a chain i
  • C jout instance number is determined to be an example of a chain of j
  • C iin of i is to be determined
  • Example of links The number, C jin is the number of instances to be determined by instance j
  • Jac out is the similarity of the instance to be determined by instance i, j
  • Jac in is the entity to which the instance i, j is to be determined.
  • Uniq is the uniqueness measure of the instance, Cnt sourceA, i is the number of instances of the same name of the instance i to be determined in source A, Cnt sourceB, j is the same name entity of the instance j to be determined in source B number.
  • the above calculation formula can be implemented in parallel on a distributed computing platform such as SPARK, and achieves the purpose of large-scale parallel graph calculation.
  • a distributed computing platform such as SPARK
  • the above calculation formula can be implemented in parallel on a distributed computing platform such as SPARK, and achieves the purpose of large-scale parallel graph calculation.
  • SPARK a distributed computing platform
  • step S1230 an instance pair whose similarity reaches a predetermined similarity threshold is identified by the processor 2 or by the second identification module 223 in the grouping module 22 as an equivalent instance.
  • an equivalent instance existing in the connection graph can be identified based on the instance relationship.
  • step S130 may be executed, and the processor 2 or the expansion module 23 identifies an instance pair of the undiscovered existence instance relationship in the updated connection diagram, and adds the same Connect the connected node to the corresponding node.
  • connection diagram after step S120 due to the merging of the equivalent instances, the instance relationship in the merged connection graph will also change.
  • a certain identification rule can be used to identify the instance pairs of the existing existing instance relationships in the connection graph.
  • Figure 8 illustrates a specific embodiment for identifying an instance relationship not found in the connection diagram.
  • step S1310 for any node in the updated connection diagram, the processor 2 or the association degree calculation module 231 in the expansion module 23 calculates an instance corresponding to the node and passes the node.
  • the degree of association between instances corresponding to the nodes to which the N nodes are connected, N being greater than or equal to 1.
  • step S1320 the instance pair corresponding to the two nodes whose relevance degree reaches the predetermined relevance degree threshold is identified by the processor 2 or the first identification module 232 in the expansion module 23 as an instance pair in which the instance relationship exists, and the connection is added.
  • the connection between the nodes is added.
  • node D and node L in FIG. 1 node D and node L are connected by two nodes, node A and node E, so that the degree of similarity between node A and node E can be analyzed. To determine if there is a degree of association between nodes D and E.
  • step S140 For the connection diagram of the instance relationship extended in step S130, step S140 may be performed.
  • the processor 2 or the iteration module 24 may determine whether the specified condition is met. If the specified condition is not met, the process returns to step S120.
  • the steps of S120, S130, and S140 are repeatedly executed.
  • the merged connection diagram is output until the specified conditions are met.
  • the specified condition in step S140 may be to repeatedly perform steps S120, S130, and S140. The number of times has reached a certain value. It is also possible that in the process of repeatedly performing steps S120, S130, and S140, in step S120, a new equivalent instance cannot be found in the connection diagram of the expanded instance relationship (as a preference, step S120 may be recognized multiple times in succession. To the new equivalent example). It is also possible that during the execution of S130, a new instance relationship cannot be found. Preferably, a new instance relationship may not be found as a specified condition during execution of S130.
  • the method, device and device for merging different instances of the same entity of the present invention combine the equivalent instances of the plurality of instances by means of a connection diagram.
  • the instance relationship existing in the connection graph is utilized, and the instance relationship is expanded based on the merged connection graph, and then the equivalent instance existing in the connection graph is further discovered based on the extended instance relationship, and so on. Iteratively performs the above steps of merging and augmenting, so that the connection graph can be parallelized and propagated, and the scheme based on the present invention can more fully dig out the equivalent instances.
  • the method according to the invention can also be implemented as a computer program comprising computer program code instructions for performing the various steps defined above in the above method of the invention.
  • the method according to the invention may also be embodied as a computer program product comprising a computer readable medium on which is stored a computer for performing the above-described functions defined in the above method of the invention program.
  • the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
  • each block of the flowchart or block diagram can represent a module, a program segment, or a portion of code that includes one or more of the Executable instructions.
  • the functions noted in the blocks may also occur in a different order than the ones in the drawings. For example, two consecutive blocks may be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts can be implemented by a dedicated hardware-based system that performs the specified functions or operations. Now, it can be implemented by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

一种对描述同一实体的不同实例进行合并的方法、装置及设备。所述方法包括:获取包含多个实例的连接图(S110),其中,连接图中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;基于实例关系,识别出连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新连接图(S120);在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接实例对所对应的节点的连线(S130);迭代执行基于实例关系更新连接图的步骤和在更新后的连接图中增添连线的步骤,直到满足指定条件(S140)。能够较为充分地识别出描述同一实体的不同实例。

Description

对描述同一实体的不同实例进行合并的方法、装置及设备 技术领域
本发明涉及计算机技术领域,更具体地,涉及一种对描述同一实体的不同实例进行合并的方法、装置及设备。
背景技术
知识图谱旨在描述真实世界中存在的各种实体或概念。知识图谱中的每个实体或概念用一个全局唯一确定的ID来标识,称为它们的标识符(identifier)。每个属性-值对(attribute-value pair,又称AVP)用来刻画实体的内在特性,而关系(relation)用来连接两个实体,刻画它们之间的关联。
在知识图谱的构建过程中,需要用到不同来源的数据来构建图谱中的实体及关系,例如,为了使得构建的知识图谱可以更加全面,可以用来自百度百科、维基百科、搜狗百科等多种百科类站点来源的数据来构建知识图谱中的实体及关系。而实体在不同来源数据中往往会存在差异化、表述不同的实例。直接使用未融合的实例数据将给知识图谱带来冗余和错误信息,因此对描述相同实体的不同实例进行融合是知识图谱构建中一个重要的任务和步骤。
目前常见的融合方法主要是通过计算不同实例间的属性相似度,将属性相似度超过阈值的实例对进行融合。这种融合方法虽然在一定程度上也能识别出描述同一实体的不同实例,但是由于这种融合方法仅以属性相似度作为融合实例的标准,使得其对融合过程中所使用的属性模糊匹配规则必须尽可能设置完善,才能有效识别同一实体的不同实例以进行融合,但这在实际应用中是很难实现的,因此很容易将表述同一实体的实例对识别为不同的实例,对知识图谱的构建带来冗余的数据。
由此,需要一种可以较为充分地识别出描述同一实体的不同实例的方案。
发明内容
本发明主要解决的技术问题是提供一种对描述同一实体的不同实例进行合并的方法、装置及设备,其能够较为充分地识别出描述同一实体的实例对。
根据本发明的一个方面,提供了一种计算设备,包括:存储器,用于存储包含多个实例的连接图,其中,连接图中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;处理器,与存储器相连接,处理器能够从存储器获取连接图,该处理器配置为:基于实例关系,识别出连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新连接图;在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接实例对所对应的节点的连线;迭代执行基于实例关系更新连接图的步骤和在更新后的连接图中增添连线的操作,直到满足指定条件。
由此,本发明的设备采用连接图的方式对多个待判定实例中的等价实例进行合并。而在合并的过程中又利用了实例关系,并基于合并后的连接图,扩充实例关系,然后迭代执行上述合并、扩充的步骤,使得可以较为充分地挖掘出连接图中存在的等价实例。
根据本发明的另一个方面,提供了一种对描述同一实体的不同实例进行合并的装置,包括:获取模块,用于获取包含多个实例的连接图,其中,连接图中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;合并模块,用于基于实例关系,识别出连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新连接图;扩充模块,用于在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接实例对所对应的节点的连线;迭代模块,用于使得合并模块和扩充模块迭代执行更新连接图的操作和增添连线的操作,直到满足指定条件。
可选地,上述装置中述及的扩充模块可以包括:关联度计算模块,用于对于更新后的连接图中的任一节点,计算该节点所对应的实例和与该节 点通过N个节点进行连接的节点所对应的实例之间的关联度,其中N大于等于1;第一识别模块,用于将关联度达到预定关联度阈值的两个节点所对应的实例对识别为存在实例关系的实例对,并增添连接这两个节点之间的连线。由于等价实例的合并,合并后的连接图中的实例关系也会发生一定的变化。此时,可以通过计算节点间的关联度,来发现存在实例关系的实例对。
可选地,上述装置中述及的指定条件可以设定为,扩充模块在更新后的连接图中识别出的未发现的存在实例关系的实例对的数目为零。
可选地,上述装置中述及的合并模块可以包括:分组模块,用于对多个实例进行分组;相似度计算模块,用于针对每个分组,基于实例关系计算组内任意两个实例之间的相似度;第二识别模块,用于将相似度达到预定相似度阈值的实例对识别为描述同一实体的实例对。
可选地,对于来自不同来源的两个实例,相似度计算模块可以根据以下公式计算这两个实例之间的相似度Sim:
Sim=Jacij/Uniq
Figure PCTCN2017072995-appb-000001
Uniq=Log(Max(CntsourceA,i,CntsourceB,j)+1)
其中,Ci为与实例i具有实例关系的实例集合,Cj为与实例j具有实例关系的实例集合,Jacij为实例i、j之间的实例关系相似度,Uniq为实例的唯一性度量,CntsourceA,i为实例i在来源A中的同名实例的个数、CntsourceB,j为实例j在来源B中的同名实例的个数。
可选地,上述装置中述及的获取模块还可以包括:属性相似度计算模块可以计算连接图中任意两个节点所对应的实例之间的属性相似度;和第二合并模块,可以将属性相似度超过预定属性相似度阈值的两个实例所对应的节点合并为一个节点。
根据本发明的另一个方面,提供了一种对描述同一实体的不同实例进行合并的方法,该方法包括:获取包含多个实例的连接图,其中,连接图 中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;基于实例关系,识别出连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新连接图;在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接实例对所对应的节点的连线;迭代执行基于实例关系更新连接图的步骤和在更新后的连接图中增添连线的步骤,直到满足指定条件。
可选地,上述方法中述及的在更新后的连接图中识别出未发现的存在实例关系的实例对的步骤可以包括:对于更新后的连接图中的任一节点,计算该节点所对应的实例和与该节点通过N个节点进行连接的节点所对应的实例之间的关联度,其中N大于等于1;将关联度达到预定关联度阈值的两个节点所对应的实例对识别为存在实例关系的实例对,增添连接这两个节点之间的连线。
可选地,上述方法中述及的指定条件可以设定为,在更新后的连接图中识别出的未发现的存在实例关系的实例对的数目为零。
可选地,上述方法中述及的基于实例关系,识别出连接图中描述同一实体的不同实例的步骤可以包括:对多个实例进行分组;针对每个分组,基于实例关系计算组内任意两个实例之间的相似度;将相似度达到预定相似度阈值的实例对识别为描述同一实体的实例对。
可选地,对于来自不同来源的两个实例,可以根据以下公式计算这两个实例之间的相似度Sim:
Sim=Jacij/Uniq
Figure PCTCN2017072995-appb-000002
Uniq=Log(Max(CntsourceA,i,CntsourceB,j)+1)
其中,Ci为与实例i具有实例关系的实例集合,Cj为与实例j具有实例关系的实例集合,Jacij为实例i、j之间的实例关系相似度,Uniq为实例的唯一性度量,CntsourceA,i为实例i在来源A中的同名实例的个数、 CntsourceB,j为实例j在来源B中的同名实例的个数。
可选地,上述方法中述及的获取包含多个实例的连接图的步骤还可以包括:计算连接图中任意两个节点所对应的实例之间的属性相似度;和将属性相似度超过预定属性相似度阈值的两个实例所对应的节点合并为一个节点。
本发明的对描述同一实体的不同实例进行合并的方法、装置及设备采用连接图的方式对多个实例中的等价实例进行合并,其中,在合并的过程中利用了连接图中存在的实例关系,并基于合并后的连接图扩充实例关系,然后再基于扩充的实例关系进一步发现连接图中存在的等价实例,以此类推,迭代执行上述合并、扩充的步骤,使得基于本发明的方案可以较为充分地识别出描述同一实体的实例对。
通过以下参照附图对本发明的示例性实施例的详细描述,本发明的其它特征及其优点将会变得清楚。
附图说明
被结合在说明书中并构成说明书的一部分的附图示出了本发明的实施例,并且连同其说明一起用于解释本发明的原理。
图1示出了本发明述及的连接图的示意图。
图2示出了根据本发明一实施例的计算设备的结构示意图。
图3示出了根据本发明一实施例的对描述同一实体的不同实例进行合并的装置的功能模块示意图。
图4示出了根据本发明另一实施例的对描述同一实体的不同实例进行合并的装置的功能模块示意图。
图5示出了根据本发明一实施例的对描述同一实体的不同实例进行合并的方法的示意性流程图。
图6示出了图5中的步骤S110可以包括的子步骤的示意性流程图。
图7示出了图5中的步骤S120可以包括的子步骤的示意性流程图。
图8示出了图5中的步骤S130可以包括的子步骤的示意性流程图。
具体实施方式
现在将参照附图来详细描述本发明的各种示例性实施例。应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本发明的范围。
以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本发明及其应用或使用的任何限制。
在这里示出和讨论的所有例子中,任何具体值应被解释为仅仅是示例性的,而不是作为限制。因此,示例性实施例的其它例子可以具有不同的值。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。
在介绍本发明前,首先对本发明涉及的几个概念做以简要说明。
实体:知识图谱中的知识单元,具有唯一确定的ID标识。
实例:在构建知识图谱中的实体的过程中用到的各种来源的数据。
实例关系:实例间存在的关系,对于不同的数据来源,这里的关系可以是属性关系、引用关系、链接关系等多种关系。
同名实例:名称相同,但描述的实体(内容)不同的实例。
等价实例:描述同一实体(内容)的实例。
举例来说,百度百科中的各种词条就是实例。百度百科中的词条“李宁”是一个多义词,有指代著名体操运动员的李宁,也有指代魔术师的李宁。这里,指代魔术师的李宁和指代体操运动员的李宁就是一个同名实例。在指代著名体操运动员的李宁的词条下,还存在着“奥运冠军”、“金牌”等词条,这里,我们就可以认为“李宁”和“奥运冠军”、“金牌”存在实例关系。而百度百科中指代体操运动员的“李宁”和搜狗百科中的“体操王子”就属于等价实例。
本发明主要提出了一种在众多实例中识别等价实例的方案。该方案主要基于连接图的方式识别出等价实例,并不断对连接图进行更新,以识别出更多的等价实例。
具体地说,可以首先构建包含多个实例的连接图,如图1所示,连接图中的节点表示实例,节点间的连线表示实例关系。对于连接图中的多个实例,可以根据连接图中存在的实例关系,识别出存在的等价实例,对识别出的等价实例所对应的节点进行合并。其中,在根据实例关系对连接图中的节点进行合并后,可以基于一定的识别规则,找出连接图中未发现的实例关系,根据所找到的实例关系更新连接图。然后重复执行上述基于实例关系找到等价实例的步骤和基于合并后的连接图,寻找未发现的实例关系的步骤,直到满足指定条件。
这里的指定条件可以是找不到新的实例关系或找不到新的等价实例或重复步骤达到一定次数,当然还可以是其它指定条件。另外,对本方案中的对等价实例进行合并以及更新连接图的步骤来说,可以是在对连接图中的所发现的等价实例全部合并后再更新连接图。
本发明的方案可以实现为一种如图2所示的计算设备。该计算设备可以配置为包括存储器1和处理器2。存储器1可以存储包含多个实例的连接图。处理器2与存储器1连接,可以从存储器1获取连接图,并可以执行实现上述方案中的相关步骤的操作。
具体地,处理器2例如可以是中央处理器CPU、微处理器MCU等,存储器1例如包括ROM(只读存储器)、RAM(随机存取存储器)、诸如硬盘的非易失性存储器等。并且,应当理解的是,尽管在图2中仅示出了计算设备包括存储器1和处理器2,但是计算设备还可以包括接口装置、通信装置、显示装置、输入装置、扬声器、麦克风等,但是,这些部件与本发明无关,故在此省略。在本发明中,并不限制计算设备的实体实施形式,计算设备可以是服务器,例如刀片服务器,也可以是计算机或者类似计算机的电子设备,例如笔记本电脑、平板电脑等。
本发明的方案还可以实现为一种包含多个功能模块的装置。其中,图2示出的处理器2的功能可以由该装置中相应的功能模块实现。
参见图3,本发明的对描述同一实体的不同实例进行合并的装置可以包括获取模块21、合并模块22、扩充模块23以及迭代模块24。其中,获取模块21、合并模块22、扩充模块23以及迭代模块24可以执行实现上述 方案中的相应步骤的操作。简单地说,获取模块21可以获取连接图。合并模块22可以基于连接图中存在的实例关系识别出连接图中存在的等价实例,并对识别到的等价实例所对应的节点进行合并。扩充模块23可以识别出连接图中未发现的实例关系。迭代模块24可以使得合并模块22和扩充模块23迭代执行相应的操作,直到满足指定条件。
参见图4,获取模块21可以包括属性相似度计算模块211和第二合并模块212。合并模块22可以包括分组模块221、相似度计算模块222以及第二识别模块223。扩充模块23可以包括关联度计算模块231和第一识别模块232。对于图4所示的结构来说,获取模块21、合并模块22及扩充模块23的功能可以由其包括的相应子模块实现,此处暂不做具体描述。
图5至图8详细示出了执行本发明的方案的流程图。其中,图5至图8所示的各个步骤都可以由上文提及的处理器或装置中的相应的功能模块实现,下面结合图5至图8对本发明的方案的工作流程进行详细说明。
参见图5,在步骤S110,由处理器2或者获取模块21,获取包含多个实例的连接图。
这里述及的获取连接图的步骤可以是获取事先构建好的连接图。例如,可以事先根据多个实例构建连接图,然后存储在存储器中,需要处理时,由处理器2或获取模块21从存储器获取。
也可以是根据需要判定的实例数据构建连接图。例如,可以根据待判定的多个实例数据及实例数据中存在的实例关系,构建连接图。对于构建好的连接图,可以将其存储在存储器中,需要处理时,再由处理器2或获取模块21从存储器获取连接图,当然也可以将构建好的连接图直接发送给处理器2或获取模块21。
在执行步骤S110的过程中,还可以基于一定的识别规则识别出连接图中存在的等价实例,并合并等价实例所对应的节点。这里述及的识别规则可以是如图6所示的属性相似度的识别方式。
如图6所示,在步骤S1110,由处理器2或者由获取模块21中的属性相似度计算模块211计算连接图中任意两个节点所对应的实例间的属性相似度。
在步骤S1120,由处理器2或者由获取模块21中的第二合并模块212将属性相似度超过预定属性相似度阈值的实例所对应的节点合并为一个节点。
应该知道,在步骤S110中对连接图中的节点进行合并的步骤(步骤S110、步骤S1120)是本发明的一个可选方案,这样使得可以基于现有的计算方式初步发现连接图中存在的等价实例,并对其进行合并,以降低后续步骤的复杂度。
返回步骤S110,在执行完步骤S110后,就可以执行步骤S120,由处理器2或者由合并模块21,基于实例关系,识别出连接图中描述同一实体的不同实例(即等价实例),对识别出的等价实例所对应的节点进行合并,并更新连接图。
其中,可以有多种基于实例关系识别连接图中的等价实例的方式。例如,可以在计算实例间的相似度的过程中,将与当前实例存在实例关系的实例参与到相似度的计算的过程中,然后将相似度超过阈值的实例对识别为等价实例。
图7示出了一种基于实例关系识别出等价实例的具体实施方式。
如图7所示,在步骤S1210,由处理器2或者由合并模块22中的分组模块221,对连接图中的多个实例进行分组。
其中,可以有多种分组方式,如可以根据名称进行分组,还可以根据属性值进行。当然根据具体情况,还有其它分组方式,此处不再赘述。
在步骤S1220,针对每个分组,可以由处理器2或者由分组模块22中的相似度计算模块222,基于实例关系计算组内任意两个实例之间的相似度。
其中,对于来自不同数据来源的两个实例来说,可以根据下述公式计算这两个实例间的相似度Sim:
Sim=Jacij/Uniq
Figure PCTCN2017072995-appb-000003
Uniq=Log(Max(CntsourceA,i,CntsourceB,j)+1)
其中,Ci为与实例i具有实例关系的实例集合,Cj为与实例j具有实例关系的实例集合,Jacij为实例i、j之间的实例关系相似度,Uniq为实例的唯一性度量,CntsourceA,i为实例i在来源A中的同名实例的个数、CntsourceB,j为实例j在来源B中的同名实体的个数。
其中,对于不同来源的实例数据来说,上述公式可以有不同形式的变形。以实例数据来源为百科词条来说,可以基于下列公式计算来自不同百科的两个实例间的相似度Sim:
Sim=(α×Jacout+(1-α)×Jacin)/Uniq
Figure PCTCN2017072995-appb-000004
Uniq=Log(Max(CntsourceA,i,CntsourceB,j)+1)
其中,α为权重系数,Ciout为待判定实例i链出的实例的个数,Cjout为待判定实例j链出的实例的个数,Ciin为待判定实例i被链入的实例的个数,Cjin为待判定实例j被链入的实例的个数,Jacout为待判定实例i、j链出的实例的相似度,Jacin为待判定实例i、j被链入的实体的相似度,Uniq为实例的唯一性度量,CntsourceA,i为待判定实例i在来源A中的同名实例的个数、CntsourceB,j为待判定实例j在来源B中的同名实体的个数。
以百度百科和搜狗百科为例对上述变形公式加以说明。以百度百科中的词条“李宁”和搜狗百科中的词条“李宁”来说。在百度百科中,词条“李宁”具有60个同名实例,在搜狗百科中,词条“李宁”具有52个同名实例。而对于表示体操运动员的“李宁”,该词条在百度百科中存在着“奥运冠军”、“金牌”、“自由体操”等内链词条,这些词条与“李宁”就存在实例关系,这些词条就可以看成是词条“李宁”的链出的词条(实例)。而词条“体操王子”下存在词条“李宁”,此时,“体操王子”就是“李宁”被链入的词条(实例),词条“体操王子”与词条“李宁”也存在实例关系。此时,基于上述变形公式就可以计算出百度百科中的词条“李宁”和搜狗百科中的词条“李宁”之间的相似度。
其中,上述计算公式可以在分布式计算平台如SPARK上并行实现,达到大规模并行化图计算的目的。另外,应该知道,对于其它来源的实例数据来说,还可以有其它基于实例关系计算相似度的方式,此处不再赘述。
在步骤S1230,由处理器2或者由分组模块22中的第二识别模块223将相似度达到预定相似度阈值的实例对识别为等价实例。由此,就可以基于实例关系识别出连接图中存在的等价实例。
下面返回步骤S120,在执行完步骤S120后,就可以执行步骤S130,由处理器2或者由扩充模块23,在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接实例对所对应的节点的连线。
对于执行步骤S120后的连接图,由于等价实例的合并,合并后的连接图中的实例关系也会发生一定的变化。此时,可以使用一定的识别规则识别出连接图中新增的存在实例关系的实例对。
图8示出了一种识别出连接图中未发现的实例关系的具体实施方式。
如图8所示,在步骤S1310,对于更新后的连接图中的任一节点,由处理器2或者由扩充模块23中的关联度计算模块231计算该节点所对应的实例和与该节点通过N个节点进行连接的节点所对应的实例之间的关联度,N大于等于1。
在步骤S1320,由处理器2或者由扩充模块23中的第一识别模块232将关联度达到预定关联度阈值的两个节点所对应的实例对识别为存在实例关系的实例对,增添连接这两个节点之间的连线。
其中,可以有多种计算关联度的方式。例如,对于图1中的节点D和节点L来说,节点D和节点L通过节点A、节点E两个节点进行连接,这样就可以通过分析节点A和节点E之间的相似度的大小,来判断节点D和E之间是否存在关联度。
下面返回步骤S130,对于经过步骤S130扩充后实例关系的连接图,可以执行步骤S140,由处理器2或者可以由迭代模块24判断是否满足指定条件,在不满足指定条件的情况下,返回步骤S120,重复执行S120、S130、S140的步骤。直至满足指定条件,输出合并后的连接图。
其中,步骤S140中的指定条件可以是重复执行步骤S120、S130、S140 的次数达到一定值。也可以是在重复执行步骤S120、S130、S140的过程中,在步骤S120,在扩充后实例关系的连接图中找不到新的等价实例(作为优选,可以是步骤S120连续多次识别不到新的等价实例)。还可以是在S130的执行过程中,找不到新的实例关系。作为优选,可以将在S130的执行过程中,找不到新的实例关系作为指定条件。
至此,参考附图详细描述了根据本发明的对描述同一实体的不同实例进行合并的方法、装置及设备。通过上述描述可知,本发明的对描述同一实体的不同实例进行合并的方法、装置及设备采用连接图的方式对多个实例中的等价实例进行合并。其中,在合并的过程中利用了连接图中存在的实例关系,并基于合并后的连接图扩充实例关系,然后再基于扩充的实例关系进一步发现连接图中存在的等价实例,以此类推,迭代执行上述合并、扩充的步骤,使得连接图可以并行化传播,并使得基于本发明的方案可以更充分地挖掘出等价实例。
此外,根据本发明的方法还可以实现为一种计算机程序,该计算机程序包括用于执行本发明的上述方法中限定的上述各步骤的计算机程序代码指令。或者,根据本发明的方法还可以实现为一种计算机程序产品,该计算机程序产品包括计算机可读介质,在该计算机可读介质上存储有用于执行本发明的上述方法中限定的上述功能的计算机程序。本领域技术人员还将明白的是,结合这里的公开所描述的各种示例性逻辑块、模块、电路和算法步骤可以被实现为电子硬件、计算机软件或两者的组合。
附图中的流程图和框图显示了根据本发明的多个实施例的系统和方法的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或代码的一部分,所述模块、程序段或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标记的功能也可以以不同于附图中所标记的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实 现,或者可以用专用硬件与计算机指令的组合来实现。
以上已经描述了本发明的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。

Claims (13)

  1. 一种计算设备,包括:
    存储器,用于存储包含多个实例的连接图,其中,所述连接图中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;以及
    处理器,与所述存储器相连接,所述处理器能够从所述存储器获取所述连接图,该处理器配置为:
    基于所述实例关系,识别出所述连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新所述连接图;
    在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接所述实例对所对应的节点的连线;
    迭代执行所述基于实例关系更新连接图的步骤和所述在更新后的连接图中增添连线的操作,直到满足指定条件。
  2. 一种对描述同一实体的不同实例进行合并的装置,包括:
    获取模块,用于获取包含多个实例的连接图,其中,所述连接图中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;
    合并模块,用于基于所述实例关系,识别出所述连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新所述连接图;
    扩充模块,用于在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接所述实例对所对应的节点的连线;
    迭代模块,用于使得所述合并模块和所述扩充模块迭代执行更新所述连接图的操作和增添连线的操作,直到满足指定条件。
  3. 根据权利要求2所述的装置,其中,所述扩充模块包括:
    关联度计算模块,用于对于更新后的连接图中的任一节点,计算该节点所对应的实例和与该节点通过N个节点进行连接的节点所对应的实例之间的关联度,其中N大于等于1;
    第一识别模块,用于将所述关联度达到预定关联度阈值的两个节点所对应的实例对识别为存在实例关系的实例对,并增添连接这两个节点之间的连线。
  4. 根据权利要求2或3所述的装置,其中,所述指定条件被设定为,
    所述扩充模块在更新后的连接图中识别出的未发现的存在实例关系的实例对的数目为零。
  5. 根据权利要求2-4中任意一项所述的装置,其中,所述合并模块包括:
    分组模块,用于对所述多个实例进行分组;
    相似度计算模块,用于针对每个分组,基于实例关系计算组内任意两个实例之间的相似度;
    第二识别模块,用于将相似度达到预定相似度阈值的实例对识别为描述同一实体的实例对。
  6. 根据权利要求2-5中任意一项所述的装置,其中,对于来自不同来源的两个实例,所述相似度计算模块根据以下公式计算这两个实例之间的相似度Sim:
    Sim=Jacij/Uniq
    Figure PCTCN2017072995-appb-100001
    Uniq=Log(Max(CntsourceA,i,CntsourceB,j)+1)
    其中,Ci为与实例i具有实例关系的实例集合,Cj为与实例j具有实例关系的实例集合,Jacij为实例i、j之间的实例关系相似度,Uniq为实 例的唯一性度量,CntsourceA,i为实例i在来源A中的同名实例的个数、CntsourceB,j为实例j在来源B中的同名实例的个数。
  7. 根据权利要求2-6中任意一项所述的装置,其中,所述获取模块还包括:
    属性相似度计算模块,用于计算连接图中任意两个节点所对应的实例之间的属性相似度;和
    第二合并模块,用于将所述属性相似度超过预定属性相似度阈值的两个实例所对应的节点合并为一个节点。
  8. 一种对描述同一实体的不同实例进行合并的方法,包括:
    获取包含多个实例的连接图,其中,所述连接图中的不同节点表示不同实例,节点间的连线表示节点所对应的实例之间的实例关系;
    基于所述实例关系,识别出所述连接图中描述同一实体的不同实例,对识别出的实例所对应的节点进行合并,并更新所述连接图;
    在更新后的连接图中识别出未发现的存在实例关系的实例对,并增添用以连接所述实例对所对应的节点的连线;
    迭代执行所述基于所述实例关系更新所述连接图的步骤和所述在更新后的连接图中增添连线的步骤,直到满足指定条件。
  9. 根据权利要求8所述的方法,其中,所述在更新后的连接图中识别出未发现的存在实例关系的实例对的步骤包括:
    对于更新后的连接图中的任一节点,计算该节点所对应的实例和与该节点通过N个节点进行连接的节点所对应的实例之间的关联度,其中N大于等于1;
    将所述关联度达到预定关联度阈值的两个节点所对应的实例对识别为存在实例关系的实例对,增添连接这两个节点之间的连线。
  10. 根据权利要求8或9所述的方法,其中,所述指定条件被设定为,
    在更新后的连接图中识别出的未发现的存在实例关系的实例对的数目为零。
  11. 根据权利要求8-10中任意一项所述的方法,其中,所述基于实例关系,识别出连接图中描述同一实体的不同实例的步骤包括:
    对所述多个实例进行分组;
    针对每个分组,基于所述实例关系计算组内任意两个实例之间的相似度;
    将相似度达到预定相似度阈值的实例对识别为描述同一实体的实例对。
  12. 根据权利要求8-11中任意一项所述的方法,其中,对于来自不同来源的两个实例,根据以下公式计算这两个实例之间的相似度Sim:
    Sim=Jacij/Uniq
    Figure PCTCN2017072995-appb-100002
    Uniq=Log(Max(CntsourceA,i,CntsourceB,j)+1)
    其中,Ci为与实例i具有实例关系的实例集合,Cj为与实例j具有实例关系的实例集合,Jacij为实例i、j之间的实例关系相似度,Uniq为实例的唯一性度量,CntsourceA,i为实例i在来源A中的同名实例的个数、CntsourceB,j为实例j在来源B中的同名实例的个数。
  13. 根据权利要求8-12中任意一项所述的方法,其中,所述获取包含多个实例的连接图的步骤还包括:
    计算连接图中任意两个节点所对应的实例之间的属性相似度;和
    将所述属性相似度超过预定属性相似度阈值的两个实例所对应的节点合并为一个节点。
PCT/CN2017/072995 2016-02-14 2017-02-06 对描述同一实体的不同实例进行合并的方法、装置及设备 WO2017137000A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/046,166 US11544578B2 (en) 2016-02-14 2017-02-06 Method, device and equipment for fusing different instances describing same entity

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610084741.1 2016-02-14
CN201610084741.1A CN105786980B (zh) 2016-02-14 2016-02-14 对描述同一实体的不同实例进行合并的方法、装置及设备

Publications (1)

Publication Number Publication Date
WO2017137000A1 true WO2017137000A1 (zh) 2017-08-17

Family

ID=56402221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/072995 WO2017137000A1 (zh) 2016-02-14 2017-02-06 对描述同一实体的不同实例进行合并的方法、装置及设备

Country Status (3)

Country Link
US (1) US11544578B2 (zh)
CN (1) CN105786980B (zh)
WO (1) WO2017137000A1 (zh)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786980B (zh) * 2016-02-14 2019-12-20 广州神马移动信息科技有限公司 对描述同一实体的不同实例进行合并的方法、装置及设备
US10614162B2 (en) * 2016-05-27 2020-04-07 Ricoh Company, Ltd. Apparatus, system, and method of assisting information sharing, and recording medium
CN108205564B (zh) * 2016-12-19 2021-04-09 北大方正集团有限公司 知识体系构建方法及系统
CN109145178A (zh) * 2017-06-16 2019-01-04 阿里巴巴集团控股有限公司 一种关系图处理方法及装置
CN108038183B (zh) * 2017-12-08 2020-11-24 北京百度网讯科技有限公司 结构化实体收录方法、装置、服务器和存储介质
CN109492027B (zh) * 2018-11-05 2022-02-08 南京邮电大学 一种基于弱可信数据的跨社群潜在人物关系分析方法
CN109597856B (zh) * 2018-12-05 2020-12-25 北京知道创宇信息技术股份有限公司 一种数据处理方法、装置、电子设备及存储介质
CN109558468B (zh) * 2018-12-13 2022-04-01 北京百度网讯科技有限公司 资源的处理方法、装置、设备和存储介质
CN110298445A (zh) * 2019-05-30 2019-10-01 合肥阿拉丁智能科技有限公司 深度学习自主运行方法
CN110427436B (zh) * 2019-07-31 2022-03-22 北京百度网讯科技有限公司 实体相似度计算的方法及装置
US11275777B2 (en) 2019-08-22 2022-03-15 International Business Machines Corporation Methods and systems for generating timelines for entities
CN110674360B (zh) * 2019-09-27 2023-03-31 厦门美亚亿安信息科技有限公司 一种用于数据的溯源方法和系统
CN112579770A (zh) * 2019-09-30 2021-03-30 北京国双科技有限公司 知识图谱的生成方法,装置,存储介质及设备
CN111191715A (zh) * 2019-12-27 2020-05-22 深圳市商汤科技有限公司 图像处理方法及装置、电子设备和存储介质
CN111597351A (zh) * 2020-05-14 2020-08-28 上海德拓信息技术股份有限公司 可视化文档图谱构建方法
CN113705236B (zh) * 2021-04-02 2024-06-11 腾讯科技(深圳)有限公司 实体比较方法、装置、设备及计算机可读存储介质
CN113160956A (zh) * 2021-04-21 2021-07-23 复旦大学附属中山医院 一种基于多身份数据融合的患者管理方法和系统
CN115203436B (zh) * 2022-07-15 2023-12-15 国网江苏省电力有限公司信息通信分公司 一种基于有向图数据融合的电力知识图谱构建方法和装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192926A1 (en) * 2004-02-18 2005-09-01 International Business Machines Corporation Hierarchical visualization of a semantic network
US20090307213A1 (en) * 2008-05-07 2009-12-10 Xiaotie Deng Suffix Tree Similarity Measure for Document Clustering
CN101667201A (zh) * 2009-09-18 2010-03-10 浙江大学 基于树合并的Deep Web查询接口集成方法
CN101714142A (zh) * 2008-10-06 2010-05-26 易搜比控股公司 文件群集的合并方法
US20150081656A1 (en) * 2013-09-13 2015-03-19 Sap Ag Provision of search refinement suggestions based on multiple queries
CN105045863A (zh) * 2015-07-13 2015-11-11 苏州大学张家港工业技术研究院 一种用于实体匹配的方法及系统
CN105786980A (zh) * 2016-02-14 2016-07-20 广州神马移动信息科技有限公司 对描述同一实体的不同实例进行合并的方法、装置及设备

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689526B2 (en) * 2007-01-25 2010-03-30 Fair Isaac Corporation Knowledge base with clustered data objects
TW201013426A (en) * 2008-09-19 2010-04-01 Esobi Inc Combination method for document clusters
US8335754B2 (en) * 2009-03-06 2012-12-18 Tagged, Inc. Representing a document using a semantic structure
US9461876B2 (en) * 2012-08-29 2016-10-04 Loci System and method for fuzzy concept mapping, voting ontology crowd sourcing, and technology prediction
US20150169758A1 (en) * 2013-12-17 2015-06-18 Luigi ASSOM Multi-partite graph database
CN103955531B (zh) * 2014-05-12 2017-06-30 南京提坦信息科技有限公司 基于命名实体库的在线知识地图
CN104866625B (zh) 2015-06-15 2018-08-17 苏州大学张家港工业技术研究院 一种用于实体匹配的方法及系统
CN105069039B (zh) * 2015-07-22 2018-05-18 山东大学 一种基于spark平台的内存迭代的重叠社区并行发现方法
US10671668B2 (en) * 2016-07-11 2020-06-02 Hewlett Packard Enterprise Development Lp Inferring graph topologies

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050192926A1 (en) * 2004-02-18 2005-09-01 International Business Machines Corporation Hierarchical visualization of a semantic network
US20090307213A1 (en) * 2008-05-07 2009-12-10 Xiaotie Deng Suffix Tree Similarity Measure for Document Clustering
CN101714142A (zh) * 2008-10-06 2010-05-26 易搜比控股公司 文件群集的合并方法
CN101667201A (zh) * 2009-09-18 2010-03-10 浙江大学 基于树合并的Deep Web查询接口集成方法
US20150081656A1 (en) * 2013-09-13 2015-03-19 Sap Ag Provision of search refinement suggestions based on multiple queries
CN105045863A (zh) * 2015-07-13 2015-11-11 苏州大学张家港工业技术研究院 一种用于实体匹配的方法及系统
CN105786980A (zh) * 2016-02-14 2016-07-20 广州神马移动信息科技有限公司 对描述同一实体的不同实例进行合并的方法、装置及设备

Also Published As

Publication number Publication date
CN105786980A (zh) 2016-07-20
US20190005392A1 (en) 2019-01-03
CN105786980B (zh) 2019-12-20
US11544578B2 (en) 2023-01-03

Similar Documents

Publication Publication Date Title
WO2017137000A1 (zh) 对描述同一实体的不同实例进行合并的方法、装置及设备
US8694979B2 (en) Efficient egonet computation in a weighted directed graph
US10430255B2 (en) Application program interface mashup generation
Zhang et al. Attributed network alignment: Problem definitions and fast solutions
Mao et al. On mixed memberships and symmetric nonnegative matrix factorizations
JP6608972B2 (ja) ソーシャルネットワークに基づいてグループを探索する方法、デバイス、サーバ及び記憶媒体
CN109697641A (zh) 计算商品相似度的方法和装置
CN110162637B (zh) 信息图谱构建方法、装置及设备
WO2019019385A1 (zh) 跨平台数据匹配方法、装置、计算机设备和存储介质
Xiao et al. Towards confidence interval estimation in truth discovery
US10191786B2 (en) Application program interface mashup generation
Park et al. On the power of gradual network alignment using dual-perception similarities
Zhang et al. Characterization and edge sign prediction in signed networks
US8965910B2 (en) Apparatus and method of searching for instance path based on ontology schema
WO2019127300A1 (zh) 数据存储方法、电子设备及存储介质
Kim Friend recommendation with a target user in social networking services
CN106844338B (zh) 基于属性间依赖关系的网络表格的实体列的检测方法
JPWO2018135515A1 (ja) 情報処理装置、ニューラルネットワークの設計方法及びプログラム
CN114912458A (zh) 一种情感分析方法、装置和计算机可读介质
US11409773B2 (en) Selection device, selection method, and non-transitory computer readable storage medium
JP2014228975A (ja) 検索装置、検索方法および検索プログラム
CN110968668B (zh) 一种基于超网络的网络舆情主题相似度计算方法及装置
Patra et al. Motif discovery in biological network using expansion tree
JP6005583B2 (ja) 検索装置、検索方法および検索プログラム
CN114238576A (zh) 数据匹配方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17749891

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.05.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17749891

Country of ref document: EP

Kind code of ref document: A1