CN111931485B

CN111931485B - A Multimodal Heterogeneous Associated Entity Recognition Method Based on Cross-Network Representation Learning

Info

Publication number: CN111931485B
Application number: CN202010806775.3A
Authority: CN
Inventors: 周小平
Original assignee: Beijing University of Civil Engineering and Architecture
Current assignee: Beijing University of Civil Engineering and Architecture
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2021-03-23
Anticipated expiration: 2040-08-12
Also published as: CN111931485A

Abstract

The invention provides a multimodal heterogeneous associated entity recognition method based on cross-network representation learning. The method includes the following. Two multimodal heterogeneous information networks A and B are given, described by the four-tuples (E_A, R_A, T_A, C_A) and (E_B, R_B, T_B, C_B), where E_A and E_B are entity sets, R_A and R_B are entity relation sets, T_A and T_B are entity type sets, and C_A and C_B are entity relation type sets. For two entities E_Ai ∈ E_A and E_Bj ∈ E_B, the multimodal relationship transition probability M_ij is established by an iterative method based on the set of random walk paths between E_Ai and E_Bj, and the multimodal heterogeneous feature vectors of E_Ai and E_Bj are obtained by learning an objective function through M_ij. When E_Ai and E_Bj are judged to have multimodal heterogeneous consistency, attribute consistency and environmental consistency at the same time, E_Ai and E_Bj are determined to be associated entities. The invention fully analyzes the multimodal heterogeneous characteristics of the multimodal heterogeneous information network, and forms a formal description method for the multimodal heterogeneous information network together with a multimodal heterogeneous associated entity recognition model and method based on cross-network representation learning.

Description

Multimodal heterogeneous associated entity identification method based on cross-network representation learning

Technical Field

The invention relates to the technical field of associated entity identification for multimodal heterogeneous information networks, and in particular to a multimodal heterogeneous associated entity identification method based on cross-network representation learning.

Background

The multimodal heterogeneous information network considered here is the building information model (Building Information Model/Modeling, BIM), a digital expression of the physical and functional characteristics of building facilities. It aims to provide reliable shared knowledge resources for decision-making and cooperation among the different participants over the whole life cycle of a building, and it has become an important part of the modernization of the building industry and the construction of smart cities in China.

Associated entity identification for multimodal heterogeneous information networks aims to find the data entities in different networks that refer to the same real-world object. Accurate and comprehensive identification of associated entities enables the organic integration of dispersed and isolated multimodal heterogeneous information networks. It is the key to whole-process integrated application of these networks and to whole-life-cycle data sharing in engineering construction projects; it addresses the 'information fault' and 'information island' problems in the digitization of current construction projects; and it provides reliable and complete infrastructure big data support for engineering construction projects and the full-cycle management of smart cities. At present, most associated entity identification methods for multimodal heterogeneous information networks are based on manual labeling, geometric attribute matching or text attribute modeling. A few studies consider the entity relationships of these networks but ignore the multimodal characteristics of those relationships.

Identifying associated entities in multimodal heterogeneous information networks is a cross-domain, cross-discipline task. It is key to the whole-process integrated application of these networks and to whole-life-cycle data fusion and sharing, and it is an important component of domain-oriented theories and methods for building big data value and knowledge discovery. The method enriches and improves theories and methods in data mining such as associated entity recognition and network representation learning, promotes the application and innovation of frontier computer science theories and methods in building science, opens up a new line of basic research on multimodal heterogeneous information networks, and opens a new research direction at the intersection of computer science, architecture and civil engineering; it therefore has important theoretical value. The research results can support the major national need for the modernization, transformation and upgrading of the building industry and can serve the construction and 'full-cycle management' of smart cities, smart infrastructure, smart people and the like, bringing great economic and social benefits.

The importance of identifying associated entities in multimodal heterogeneous information networks has attracted extensive attention from scholars at home and abroad. Cross-network representation learning, which embeds different networks as continuous low-dimensional vectors in the same space, has been a research hotspot in machine learning in recent years. Many universities and research institutions at home and abroad are studying associated entity identification for multimodal heterogeneous information networks and cross-network representation learning, and their results appear in top journals and conferences in computer science and its interdisciplinary applications. Multimodal heterogeneous information network associated entity identification is thus at the frontier of current interdisciplinary research across computer science, architecture and civil engineering.

A multimodal heterogeneous information network associated entity is an entity that refers to the same real-world object in different multimodal heterogeneous information networks. In general, a multimodal heterogeneous information network can be expressed as the four-tuple (E, R, T, C), where E and R are the heterogeneous entity set and the inter-entity multimodal relationship set, respectively, and T and C are the type sets of E and R, respectively. Given two multimodal heterogeneous information networks (E_A, R_A, T_A, C_A) and (E_B, R_B, T_B, C_B), if E_Ai ∈ E_A and E_Bj ∈ E_B refer to the same object in the real world, E_Ai and E_Bj are called associated entities, denoted E_Ai = E_Bj; otherwise E_Ai ≠ E_Bj. FIG. 1 is a schematic diagram of the identification of multimodal heterogeneous information network associated entities: the identification uses the data features of the two networks to determine whether E_Ai ∈ E_A and E_Bj ∈ E_B are associated entities, i.e., whether E_Ai = E_Bj.
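
As a minimal sketch of the four-tuple description above, the following Python class stores the entities, relationships and the two type sets of one network. The class name `HeteroNetwork`, its fields and methods, and the sample identifiers and type labels are illustrative assumptions, not part of the method as filed.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroNetwork:
    """One multimodal heterogeneous information network (E, R, T, C)."""
    # entity id -> entity type (covers E and T)
    entity_type: dict = field(default_factory=dict)
    # (source id, target id, relation type) triples (covers R and C)
    relations: list = field(default_factory=list)

    def add_entity(self, eid, etype):
        self.entity_type[eid] = etype

    def add_relation(self, src, dst, rtype):
        # Both endpoints must already be known entities.
        if src not in self.entity_type or dst not in self.entity_type:
            raise KeyError("unknown entity in relation")
        self.relations.append((src, dst, rtype))

    @property
    def entity_types(self):
        """T: the set of entity types present in the network."""
        return set(self.entity_type.values())

    @property
    def relation_types(self):
        """C: the set of relation types present in the network."""
        return {rtype for _, _, rtype in self.relations}

    def neighbors(self, eid):
        """Neighbor set of an entity, treating relations as undirected."""
        result = set()
        for src, dst, _ in self.relations:
            if src == eid:
                result.add(dst)
            elif dst == eid:
                result.add(src)
        return result


# Hypothetical toy network A with two entities and two relation modes.
net_a = HeteroNetwork()
net_a.add_entity("A1", "roof")
net_a.add_entity("A2", "wall")
net_a.add_relation("A1", "A2", "connection")
net_a.add_relation("A2", "A1", "reference")
```

Two entities may be linked by several relations of different types, which is why the relations are kept as a list of triples rather than as a single edge per pair.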

IFC (Industry Foundation Classes) is the currently recognized international standard for multimodal heterogeneous information networks and is widely used by enterprises across the construction industry. At present, almost all multimodal heterogeneous information network software supports the IFC format, and most research on multimodal heterogeneous information networks, for example in building construction, is based on the IFC standard. Under the IFC standard, a multimodal heterogeneous information network exhibits multimodal heterogeneous characteristics and massive entity characteristics.

① Multimodal heterogeneous characteristics

The heterogeneous characteristic means that multimodal heterogeneous information network entities are of many types and that entities of different types have different attributes. IFC currently defines 653 different entities, and the number keeps growing with practical demand and with each iteration of the IFC version. The attributes of an entity can be divided into semi-structured text attributes, which describe the basic information of the entity, and unstructured geometric attributes, which describe its three-dimensional shape. In the IFC standard, only entities that inherit from the IFCProduct class can have geometric attributes. The roof objects in FIG. 1 all inherit from the IFCProduct class and contain both geometric and text attributes. The text attributes of entities in different multimodal heterogeneous information networks suffer from non-uniform fields, missing values, redundancy, inaccuracy, inconsistency and similar problems. As a result, associated entity identification methods based on text attributes have poor identification quality (low recall and accuracy) and cannot meet the requirements of whole-process integrated application.

The multimodal characteristic means that several relationships of potentially different modes exist between any two multimodal heterogeneous information network entities. IFC currently defines 19 different relationship types in 5 major classes, including reference, containment, decomposition, connection, inheritance and the like. These different forms of multimodal relationship description pose challenges for the formal description and mathematical expression of multimodal heterogeneous information networks. The multimodal characteristic also means that entities depend on one another in different forms, showing strong dependence. Introducing entity relationships is an effective way to improve the poor identification quality of text-attribute-based associated entity identification; however, existing methods ignore the multimodal characteristics of the relationships in multimodal heterogeneous information networks.

The multimodal heterogeneous characteristics are an important manifestation of the complexity of multimodal heterogeneous information networks. Current research mostly starts from the attributes of the entities, and the multimodal characteristics of the networks have been little studied. If these characteristics can be explored in depth and a formal description method for multimodal heterogeneous information networks can be established from the perspective of complex networks, this will promote the innovative application of theories and methods such as graph theory, network science, graph learning and big data to multimodal heterogeneous information networks, open up a new line of basic and applied research, and lay a model foundation for associated entity identification, parallel computing and the like.

② Massive entity characteristics

IFC is a description format for multimodal heterogeneous information networks in which information is highly compressed, and a single IFC file can contain millions or even tens of millions of entities. In general, the multimodal heterogeneous information network of an actual engineering project consists of several IFC files from different disciplines. According to statistics, the multimodal heterogeneous information network of a three-story building at the design stage can reach 50 GB. A multimodal heterogeneous information network therefore contains a massive number of entities.

In the prior art, most research methods for multimodal heterogeneous information networks target only networks of small volume. Some scholars have paid attention to the massive entities and big data characteristics of these networks and have studied lightweight visualization as well as distributed big data storage and management frameworks for domain-oriented applications. Parallel computing, which distributes computing tasks over multiple processing units, is an effective way to improve the capacity and efficiency of processing multimodal heterogeneous information networks. A few studies have made initial explorations of parallel computing methods for these networks; however, those methods ignore the imbalance of entity attributes, are difficult to apply to arbitrary multimodal heterogeneous information networks, and cannot meet the requirements of whole-life-cycle parallel processing. The strong dependence among entities also makes it difficult to apply existing parallel computing frameworks directly. Because the topic spans disciplines, there is little research on parallel computing methods for multimodal heterogeneous information networks, which limits the ability of associated entity identification methods to process large-volume networks quickly.

(1) Research status of multimodal heterogeneous information network associated entity identification

Identifying associated entities by UUID (Universally Unique Identifier) is the simplest and most accurate method; however, different multimodal heterogeneous information network tools maintain different UUIDs, and even different versions of the same tool generate different UUIDs. At present, most associated entity identification methods for multimodal heterogeneous information networks are based on manual labeling, geometric attribute matching or text attribute modeling.

Identification based on manual labeling depends on the quality of the change relation model and on the accuracy of the manually labeled changes; the manual workload is heavy and error-prone. Associated entity identification based on geometric attribute matching can detect three-dimensional similarities and differences between two models; however, it only identifies differences in geometric shape, is difficult to apply to entities with complex relationships such as reference and inheritance, and cannot identify entities that have no geometric shape. To address the problems of manual labeling, some studies propose associated entity identification models based on the text attributes of entities; however, entities of the same type usually have similar text attributes. For example, in FIG. 1 the text attributes of several window entities of the same type are mostly identical or similar. This similarity among same-type entities limits the applicability of text-attribute methods. A few studies convert the reference relationships between entities into an RDF (Resource Description Framework) graph and a reference hierarchy in order to improve the quality of text-attribute-based identification. These methods, too, ignore the complex relationships and geometric attribute characteristics of multimodal heterogeneous information networks.

Jointly using the attributes of the entities and the multimodal heterogeneous characteristics of the network is an effective way to improve the quality of associated entity identification, but there is little research on this in the prior art. On the one hand, the field lacks formal description methods for multimodal heterogeneous information networks that capture multimodal heterogeneous characteristics, so existing associated entity identification is limited to attribute features such as text. On the other hand, existing data mining theories and methods have difficulty extracting the multimodal heterogeneous features of massive entities from different networks into the same feature space.

(2) Research status of cross-network representation learning oriented to associated entity identification

Network representation learning, also known as network or graph embedding, has been one of the research hotspots and frontiers of machine learning in recent years. Because network representation learning can represent and infer in vector space, more and more scholars are extending it from a single network to multiple networks, exploring cross-network representation learning models and their application to associated user identification in social networks, knowledge graph alignment and so on. Most studies of associated user identification in social networks build a homogeneous single-mode network with users as nodes and user relationships as edges, and then build a cross-network representation learning model and method using graph neural networks, deep active learning and the like. Some scholars have noticed the heterogeneous entities in social networks and build heterogeneous networks with heterogeneous entities as nodes and their relationships as edges. Wang et al. extract user interests from user content, build a heterogeneous network with users and interests as nodes, and then propose a cross-network representation learning model for user features. Zhou et al. build a heterogeneous network with entities such as users, locations, posts and pictures in a social network as nodes and the relationships between them as edges, build a cross-network representation learning model by designing meta-paths, and complete the identification of associated users. Ye et al. use a graph convolutional network to build a cross-network representation learning model for edge and node features given prior associated entities.

Scalability indicates whether a cross-network representation learning method can handle large amounts of data. Existing cross-network representation learning methods that have been experimentally verified on data sets at the million scale and above draw on the distributed learning capability of the word vector (Word2Vec) model. Heterogeneous network representation learning based on the word vector model often requires carefully designed meta-paths [30], while the design of meta-paths relies on domain knowledge, and its complexity increases dramatically with the number of network entity types and modal relationships. This is also why there are few studies on cross-network distributed representation learning oriented to multimodal heterogeneous features. If a domain-independent distributed representation learning model across multimodal heterogeneous networks can be designed, the dependence of existing heterogeneous network distributed representation learning on meta-path design can be eliminated, and the model would also apply to single-mode and/or homogeneous networks in any field, making it general.

(3) Research status of multimodal heterogeneous associated entity identification

Data mining for multimodal heterogeneous features has become a research frontier; however, most studies focus on data mining tasks within a single data set, such as network embedding and personalized recommendation. Some studies have made preliminary explorations of associated entity identification for multimodal or heterogeneous networks without prior knowledge. In the field of social networks, Zhang et al. propose an unsupervised heterogeneous network associated entity identification method for two types of heterogeneous entities, users and locations. In the traffic field, Nassar et al. propose an ISORank-based associated entity identification method for multimodal homogeneous networks. In bioinformatics, Gu et al. extend homogeneous network associated entity identification to heterogeneous networks using graph coloring. In e-commerce, Zhu et al. use graph summarization and other methods to identify heterogeneous entities such as manufacturers and commodities. In the knowledge base field, Shen et al. regard the multimodal heterogeneous information network as a domain knowledge base and explore entity linking between unstructured domain text and that knowledge base. Multimodal heterogeneous associated entities have attracted researchers in many fields; however, most existing research is still designed for networks that are either multimodal or heterogeneous. There is little research on multimodal heterogeneous associated entity identification without prior knowledge, and many studies ignore the massive entity characteristics of multimodal heterogeneous networks.

Multimodal heterogeneous associated entity recognition is also similar or related to research on language translation in natural language processing, entity alignment in knowledge bases, database record linkage, entity matching, named recognition in information retrieval, associated user recognition in social networks, bipartite graph matching, homogeneous network alignment in bioinformatics, and the like. However, these methods have certain limitations when applied to associated entity identification in multimodal heterogeneous information networks, specifically:

First, modeling of multimodal heterogeneous characteristics is absent. Most existing methods design associated entity identification models and methods for single-mode or homogeneous scenarios in specific fields and do not incorporate multimodal and/or heterogeneous characteristics, so their identification quality cannot meet the requirements of multimodal heterogeneous information network associated entity identification for whole-process integrated application.

Second, the computing capability for massive entities is insufficient. Parallel and distributed algorithms in big data environments remain a common problem for associated entity identification in all fields. Many associated entity identification methods cannot process massive data and therefore cannot be applied directly to multimodal heterogeneous information networks with massive entities.

Third, the dependence on prior associated entities is strong. Most methods rely on prior associated entities to build supervised or semi-supervised associated entity recognition models, and their recognition quality depends on the quality and quantity of those prior associated entities. Moreover, prior associated entities are difficult to label, and the manual workload is heavy. This also limits the applicability of such methods to multimodal heterogeneous information networks.

In summary, associated entity identification research in the prior art focuses mainly on single-mode homogeneous environments, many methods require prior associated entities, and only a few studies have paid attention to multimodal or heterogeneous characteristics in the data and made preliminary explorations. Associated entity identification oriented to multimodal heterogeneous characteristics is an important trend in current research. In theory, the research results can be generalized to existing single-mode and/or homogeneous environments, so the method is more general. In application, the results can be used for multimodal heterogeneous information networks and also for data in other fields, such as social networks, traffic networks, biological information, e-commerce systems and knowledge graphs.

At present, the prior art contains no associated entity identification method for multimodal heterogeneous information networks that is oriented to multimodal heterogeneous characteristics.

Disclosure of Invention

The embodiments of the invention provide a multimodal heterogeneous information network associated entity identification method oriented to multimodal heterogeneous characteristics, so as to overcome the problems in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme.

A multimodal heterogeneous associated entity identification method based on cross-network representation learning comprises the following steps:

two multimodal heterogeneous information networks (E_A, R_A, T_A, C_A) and (E_B, R_B, T_B, C_B) are given, where E_A and E_B are entity sets, R_A and R_B are entity relationship sets, T_A and T_B are entity type sets, and C_A and C_B are entity relationship type sets; let E_Ai ∈ E_A and E_Bj ∈ E_B be two entities;

based on the set of random walk paths between entities E_Ai and E_Bj, the multimodal relationship transition probability M_ij between E_Ai and E_Bj is established by an iterative method, and the multimodal heterogeneous feature vectors of entities E_Ai and E_Bj are obtained by learning an objective function through the multimodal relationship transition probability M_ij;

according to the multimodal heterogeneous feature vectors of entities E_Ai and E_Bj, it is judged whether the two entities E_Ai and E_Bj have multimodal heterogeneous consistency, and it is also judged whether they have attribute consistency and environment consistency; when the two entities have multimodal heterogeneous consistency, attribute consistency and environment consistency at the same time, E_Ai and E_Bj are determined to be associated entities.

Preferably, establishing the multimodal relationship transition probability M_ij between entities E_Ai and E_Bj by an iterative method based on the set of random walk paths between E_Ai and E_Bj comprises the following steps:

assuming there are |C| relationships of different modes in the multimodal heterogeneous information network, the multimodal relationship transition matrix is represented by a |C| × |C| matrix M, where M_ij represents the transition probability from relationship type C_i to C_j in the multimodal heterogeneous network;

in a random walk, if the previous node E_i moved to the current node E_j through relationship C_x, the probability p(E_k | E_i, E_j, C_x, M) of moving to the next node E_k is calculated by the following method:

where W_ij is the weight between entities E_i and E_j, C_ij is the type of the relationship (E_i, E_j), N_i is the set of neighbor nodes of entity E_i, and W_ij = |N_i ∩ N_j| / |N_i ∪ N_j|; if d_ij is the distance between entities E_i and E_j, then:

a set of random walk paths P = {P₁, P₂, P₃, …} and the corresponding multimodal transition paths T = {T₁, T₂, T₃, …} are acquired by random walks according to formula (2), wherein

a vector e_i of dimension |P| is used to represent the features of relationship type C_i in the random walk set P, where e_ij represents the number of occurrences of C_i in P_j;

the similarity of relationship types C_i and C_j is calculated according to the Pearson correlation coefficient, i.e.

the multimodal relationship transition probability is updated using a Sigmoid function

Initially, the matrix M is set to an all-ones matrix or a random matrix; a random walk path set P is acquired according to M using formula (2), and M is updated according to formula (5); this process is iterated until M converges, completing the construction of the multimodal relationship transition matrix M.
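
The update half of this iteration can be sketched as follows: count how often each relationship type occurs in each sampled transition path, compare types by Pearson correlation, and pass the result through a Sigmoid. This is a minimal sketch under stated assumptions: the exact update formula is not reproduced in this text, so `M_ij = sigmoid(pearson(e_i, e_j))` is an assumed form, the function names are hypothetical, and the random walk sampling of formula (2) is left out.

```python
import math


def jaccard_weight(neighbors_i, neighbors_j):
    """W_ij = |N_i ∩ N_j| / |N_i ∪ N_j| for two neighbor sets."""
    union = neighbors_i | neighbors_j
    return len(neighbors_i & neighbors_j) / len(union) if union else 0.0


def count_vectors(transition_paths, rel_types):
    """e_i[j] = number of occurrences of relation type C_i in the j-th path."""
    return {c: [path.count(c) for path in transition_paths] for c in rel_types}


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0


def update_transition_matrix(transition_paths, rel_types):
    """One update of M (assumed form): M_ij = sigmoid(pearson(e_i, e_j))."""
    e = count_vectors(transition_paths, rel_types)
    return {
        ci: {cj: 1.0 / (1.0 + math.exp(-pearson(e[ci], e[cj]))) for cj in rel_types}
        for ci in rel_types
    }


# Hypothetical transition paths: each is the sequence of relation types on one walk.
paths = [
    ["containment", "connection", "containment"],
    ["reference", "reference"],
    ["containment", "connection"],
]
M = update_transition_matrix(paths, ["containment", "connection", "reference"])
```

In this toy input, "containment" and "connection" co-occur on the same walks, so their entry in `M` ends up above 0.5, while "containment" and "reference" never co-occur and end up below 0.5. A full run would alternate sampling new walks under the current `M` with this update until `M` stops changing.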

Preferably, obtaining the multimodal heterogeneous feature vectors of entities E_Ai and E_Bj by learning an objective function through the multimodal relationship transition matrix M comprises:

taking entities E_Ai and E_Bj each as a node, establishing a cross-network distributed representation learning model and algorithm using the Skip-Gram model in Word2Vec, and setting the target optimization function of the Skip-Gram model in cross-network distributed representation learning oriented to multimodal heterogeneous characteristics as follows:

where θ is the parameter to be solved and N_t(v) is the set of context nodes of type t among the neighbor nodes of node v; if V_t is the set of nodes of type t in the two networks, then:

where X_v is the multimodal heterogeneous feature vector of node v;

the multimodal heterogeneous feature vectors X_Ai and X_Bj of entities E_Ai and E_Bj are obtained by solving formula (10).
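
To illustrate what a type-aware Skip-Gram objective of this kind can look like, the sketch below evaluates a softmax that is normalized only over nodes of the same type as the context node, and sums the log-probabilities over all (node, context) pairs. The softmax form, the function names and the toy vectors are assumptions made for illustration; the objective as filed is not reproduced in this text, and no training loop is shown.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def type_softmax(X, v, c, nodes_of_type):
    """p(c | v), normalized only over nodes sharing c's type (the set V_t)."""
    scores = {u: dot(X[u], X[v]) for u in nodes_of_type}
    top = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[c] - top) / norm


def log_likelihood(X, contexts, node_type):
    """Sum of log p(c | v) over all (node, context) pairs from both networks."""
    by_type = {}
    for node, t in node_type.items():
        by_type.setdefault(t, []).append(node)
    total = 0.0
    for v, ctx in contexts.items():
        for c in ctx:
            total += math.log(type_softmax(X, v, c, by_type[node_type[c]]))
    return total


# Hypothetical 2-dimensional feature vectors X_v for nodes of networks A and B.
X = {"A1": [1.0, 0.0], "B1": [0.9, 0.1], "A2": [0.0, 1.0], "B2": [0.1, 0.9]}
node_type = {"A1": "roof", "B1": "roof", "A2": "wall", "B2": "wall"}
contexts = {"A1": ["A2"], "B1": ["B2"]}  # context nodes taken from random walks
objective = log_likelihood(X, contexts, node_type)
```

Learning would adjust the vectors in `X` to increase `objective`; because nodes of both networks share one vector space, similar entities from A and B end up close together.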

Preferably, judging whether the two entities E_Ai and E_Bj have multimodal heterogeneous consistency according to the multimodal heterogeneous feature vectors of entities E_Ai and E_Bj comprises:

judging, according to the multimodal heterogeneous feature vectors of entities E_Ai and E_Bj, whether the type T_Ai of entity E_Ai and the type T_Bj of entity E_Bj are the same; if so, the type relationship identification degree H_ij between the two entities E_Ai and E_Bj is equal to 1; otherwise, it is equal to 0;

when the types of the two entities E_Ai and E_Bj are the same, the multimodal heterogeneous similarity R_ij between entities E_Ai and E_Bj is calculated as follows:

where X_Ai and X_Bj are the multimodal heterogeneous feature vectors of the two entities E_Ai and E_Bj obtained by the solution, and all R_ij constitute the multimodal heterogeneous feature similarity matrix R between entity sets E_A and E_B.
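
The type gate H_ij and the similarity matrix R can be sketched as below. The similarity formula itself is not reproduced in this text, so cosine similarity between the two feature vectors is an assumption here, as are the function names and the toy data.

```python
import math


def type_identity(type_a, type_b):
    """H_ij: 1 when both entities have the same type, otherwise 0."""
    return 1 if type_a == type_b else 0


def cosine(x, y):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)


def hetero_similarity_matrix(vec_a, type_a, vec_b, type_b):
    """R as a nested dict R[i][j]; pairs of different type are set to 0.0."""
    return {
        i: {
            j: cosine(vec_a[i], vec_b[j]) if type_identity(type_a[i], type_b[j]) else 0.0
            for j in vec_b
        }
        for i in vec_a
    }


# Hypothetical learned vectors and types for one entity of A and two of B.
vec_a, type_a = {"A1": [1.0, 0.0]}, {"A1": "roof"}
vec_b = {"B1": [1.0, 0.0], "B2": [1.0, 0.0]}
type_b = {"B1": "roof", "B2": "wall"}
R = hetero_similarity_matrix(vec_a, type_a, vec_b, type_b)
```

Here `R["A1"]["B2"]` is 0.0 even though the vectors coincide, because the types differ and the similarity is only defined for same-type pairs.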

Preferably, judging whether the two entities E_Ai and E_Bj have attribute consistency comprises:

the attributes of entities E_Ai and E_Bj comprise text attributes and geometric attributes, where the text attributes are short texts; a semantic feature vector model of the entity attributes is analyzed and established using a short-text word vector method, and the text attribute feature similarity between E_Ai and E_Bj is calculated by cosine similarity or Euclidean distance;

the text attribute feature similarity and the geometric attribute feature similarity between E_Ai and E_Bj are fused to form the attribute consistency feature similarity P_ij between E_Ai and E_Bj, and all P_ij constitute the attribute consistency feature similarity matrix P between entity sets E_A and E_B.
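
A minimal sketch of the fusion step follows. It makes three assumptions that the text does not specify: bag-of-words counts stand in for short-text word vectors, geometric similarity is measured on bounding-box dimensions, and the two similarities are fused by a weighted average with a hypothetical weight `alpha`.

```python
import math
from collections import Counter


def text_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words counts (stand-in for word vectors)."""
    count_a = Counter(text_a.lower().split())
    count_b = Counter(text_b.lower().split())
    num = sum(count_a[w] * count_b[w] for w in count_a)
    den = math.sqrt(sum(v * v for v in count_a.values())) * math.sqrt(
        sum(v * v for v in count_b.values())
    )
    return num / den if den else 0.0


def geometry_similarity(box_a, box_b):
    """Assumed measure: mean ratio of corresponding bounding-box dimensions."""
    ratios = [min(a, b) / max(a, b) for a, b in zip(box_a, box_b) if max(a, b) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def attribute_similarity(text_a, box_a, text_b, box_b, alpha=0.5):
    """P_ij as a weighted average; alpha is a hypothetical fusion weight."""
    return alpha * text_similarity(text_a, text_b) + (1 - alpha) * geometry_similarity(
        box_a, box_b
    )


# Hypothetical entities: text attribute plus (length, width, height) in metres.
same = attribute_similarity("Roof slab 200mm", (10.0, 8.0, 0.2), "roof slab 200mm", (10.0, 8.0, 0.2))
diff = attribute_similarity("Roof slab 200mm", (10.0, 8.0, 0.2), "window frame", (5.0, 8.0, 0.2))
```

Entities without geometry could be handled by setting `alpha` to 1 so that only the text attributes contribute.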

Preferably, judging whether the two entities E_Ai and E_Bj have environment consistency comprises:

if Z is the set of associated entities of the two networks, the environment consistency feature similarity Y_ij between entities E_Ai and E_Bj is calculated as follows:

where I_Ai = N_Ai ∩ Z and I_Bj = N_Bj ∩ Z; in the initial stage,

as the iterative process continues, there will be more and more associated entities in Z; all Y_ij constitute the environment consistency feature similarity matrix Y between entity sets E_A and E_B.

Preferably, determining that E_Ai and E_Bj are associated entities when the two entities simultaneously have multimodal heterogeneous consistency, attribute consistency and environmental consistency includes:

combining the type relationship identification degree H_ij, the multimodal heterogeneous similarity R_ij, the environmental consistency feature similarity Y_ij and the attribute consistency feature similarity P_ij between E_Ai and E_Bj to obtain the similarity value S_ij between E_Ai and E_Bj:

S_ij＝sim(E_Ai,E_Bj)＝H_ij·R_ij·Y_ij·P_ij

The similarity values between all entities in E_A and E_B constitute the similarity matrix S between the entities of E_A and E_B; the unassociated entity pair E_Ai and E_Bj with the largest similarity value in S is selected as an associated entity pair, provided that S_ij > τ, where τ is a preset similarity threshold;

when a new associated entity set ΔZ is identified, the associated entity set Z is updated as Z = Z ∪ ΔZ, Y and S are updated, and new associated entities are identified again; when no associated entity meeting the requirement can be identified, the iteration ends and the identified associated entity set Z is output.

As can be seen from the technical solutions provided above, the embodiments of the present invention start from the important requirements of whole-process integrated application and whole-life-cycle data sharing of multimodal heterogeneous information networks, and take the identification of associated entities of such networks under massive numbers of entities as the research target. On the basis of a full analysis of the multimodal heterogeneous characteristics of these networks, the embodiments focus on a formal description method for complex multimodal heterogeneous information networks, a domain-independent distributed representation learning model and method across multimodal heterogeneous networks, a parallel computing method for multimodal heterogeneous information networks, and an associated entity identification model and algorithm that combine attribute characteristics with multimodal heterogeneous characteristics, with experimental verification on massive data.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram illustrating identification of an associated entity of a multimode heterogeneous information network in the prior art;

Fig. 2 is a structural diagram of the overall implementation framework of a multimodal heterogeneous information network associated entity identification method oriented to multimodal heterogeneous characteristics according to an embodiment of the present invention;

Fig. 3 is a framework diagram of a cross-network node multimodal relationship feature representation learning method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a random walk that considers multimodal relationships according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a geometric attribute similarity calculation process according to an embodiment of the present invention;

Fig. 6 is a flowchart of iterative associated entity identification according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.

The embodiment of the invention addresses the building industry's urgent need for whole-process integrated application of multimodal heterogeneous information networks and whole-life-cycle data sharing of construction projects. Drawing on the frontier theory of Network Representation Learning, and using the cooperative coupling of key technologies from computer science and from building and civil engineering as the means, it establishes a multimodal heterogeneous information network associated entity identification model and method based on cross-network representation learning.

The invention jointly considers the text and geometric attribute characteristics, the multimodal heterogeneous characteristics and the massive number of entities of a multimodal heterogeneous information network, and uses the theory and methods of network representation learning to study an associated entity identification model and method based on cross-network representation learning. The overall implementation framework of the associated entity identification method oriented to multimodal heterogeneous characteristics is shown in Fig. 2.

The invention first studies a formal description method for complex multimodal heterogeneous information networks, converting the multimodal heterogeneous information network into a multimodal heterogeneous network from the perspective of complex networks, which provides the model basis for associated entity identification, parallel computing methods and the like. To handle the multimodal heterogeneous characteristics and the massive number of entities, a multimodal relationship transfer model, a cross-network random walk model and a word-vector-based cross-network distributed representation learning model are established, so that the multimodal heterogeneous features of nodes in different networks are embedded into low-dimensional continuous vectors in the same space, laying the foundation for the multimodal heterogeneous consistency calculation. For massive numbers of entities, a multimodal heterogeneous consistency model, an environmental consistency model and an attribute consistency model are established; by jointly considering the attribute characteristics and the multimodal heterogeneous characteristics of the network, they improve the quality of associated entity identification and ensure the applicability of the identification model. Finally, extensive experimental verification is carried out on actual engineering data, so that the results can serve the whole-process integrated application and whole-life-cycle data sharing of multimodal heterogeneous information networks.

(1) Multi-mode heterogeneous feature analysis and formalization description method of multi-mode heterogeneous information network based on IFC

The invention combines the IFC data standard and analyzes the multimodal heterogeneous characteristics of the multimodal heterogeneous information network from two aspects: entity attribute characteristics and relationship characteristics. For attribute characteristics, the invention uses literature research and inductive summarization to summarize the common attributes of entities and the characteristics specific to each entity, laying a foundation for the subsequent extraction of attribute feature vectors. For relationship characteristics, the invention summarizes the types and characteristics of existing relationship modes and, on that basis, establishes entity relationship graphs under the different modes; it then applies data analysis to a large number of multimodal heterogeneous information networks from actual engineering projects to analyze the structural characteristics and similarities of each modal relationship graph, including density, degree distribution and radius, which provides necessary support for the theoretical analysis and improvement of the subsequent algorithms.

Then, based on these results, the invention draws on the formal description of social networks in complex network theory to study a formal description method for multimodal heterogeneous information networks. In general, a multimodal heterogeneous information network is composed of entities and entity relationships, where E is the entity set, R is the entity relationship set, T is the entity type set, and C is the entity relationship type set. Any entity E_i in the network carries attribute characteristics, the specific attributes following the IFC data standard. Between any two entities E_i and E_j there may be relationships of several different modes; R_ij denotes the set of all relationships between E_i and E_j. Any entity relationship R_ijk ∈ R_ij can be described as R_ijk = {E_i, E_j, C_k}, meaning that E_i depends on E_j through a relationship of type C_k ∈ C. Accordingly, the set of all entities on which E_i depends through relationship type C_k can be described. The type of entity E_i is denoted T_i or T(E_i).

After this formal description, the invention converts the multimodal heterogeneous information network model into a multimodal heterogeneous network, in which entities are also called nodes and relationships are also called edges. |·| denotes the number of elements of a set. When |R_ij| ≤ 1, the multimodal heterogeneous information network degenerates to a heterogeneous information network; when |T| = 1, it degenerates to a homogeneous network. The subject of the invention is therefore more general than homogeneous and/or single-mode information networks. On this basis, the formal description method is further developed to provide a basic mathematical model for the subsequent associated entity identification model, the parallel computing algorithm and the like, and a model basis for other research on multimodal heterogeneous information networks.
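The formal description above can be illustrated with a small data structure. This is a minimal sketch, not the patent's implementation: the class name `MultimodalNetwork`, its method names, and the example entity and relationship names are all hypothetical.

```python
from dataclasses import dataclass, field
from collections import defaultdict


@dataclass
class MultimodalNetwork:
    """Entities with types, plus typed relationships R_ijk = (E_i, E_j, C_k)."""
    entity_type: dict = field(default_factory=dict)  # entity id -> type T(E_i)
    relations: set = field(default_factory=set)      # {(E_i, E_j, C_k)}

    def add_entity(self, e, t):
        self.entity_type[e] = t

    def add_relation(self, ei, ej, ck):
        self.relations.add((ei, ej, ck))

    def relation_set(self, ei, ej):
        """R_ij: all relationship types linking E_i to E_j."""
        return {c for (a, b, c) in self.relations if a == ei and b == ej}

    def is_multimodal(self):
        """True when some ordered pair has more than one relationship (|R_ij| > 1)."""
        counts = defaultdict(int)
        for a, b, _ in self.relations:
            counts[(a, b)] += 1
        return any(n > 1 for n in counts.values())

    def is_homogeneous(self):
        """True when |T| = 1 and |C| = 1."""
        types = set(self.entity_type.values())
        modes = {c for (_, _, c) in self.relations}
        return len(types) == 1 and len(modes) == 1
```

Storing relationships as triples keeps several modes between the same pair of entities distinct, which is the property that separates a multimodal network from a plain heterogeneous one.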

(2) Domain-independent cross-multimode heterogeneous network distributed representation learning method

Fig. 3 is a framework diagram of a cross-network node multimodal relationship feature representation learning method according to an embodiment of the present invention. Cross-network representation learning aims to embed the network features of nodes from different networks into the same low-dimensional continuous space, and is one of the effective ways to calculate the similarity of node network structures across networks. The mainstream approach to distributed representation learning on heterogeneous networks introduces meta-paths to extend homogeneous-network methods (such as DeepWalk, LINE and node2vec) to heterogeneous networks, as in metapath2vec. On the one hand, meta-path-based methods require sufficient domain knowledge to design reasonable meta-paths and are therefore not general; on the other hand, they consider only heterogeneous nodes and do not fully consider multimodal relationships. Furthermore, as the numbers of node types and relationship modes in the network increase, designing meta-paths becomes extremely complex.

Some studies explore cross-network distributed representation learning given a certain set of known associated nodes; however, such methods usually require a certain number of associated nodes and are not applicable to multimodal heterogeneous networks. Considering the massive number of entities in multimodal heterogeneous information networks, the invention studies a domain-independent distributed representation learning method across multimodal heterogeneous networks based on the word vector model, laying a foundation for the associated entity identification model under multimodal heterogeneous networks. As shown in Fig. 4, the cross-network representation learning model contains three parts: a multimodal relationship transfer model, a cross-network random walk model, and a word-vector-based cross-network distributed representation learning model.

① Multimodal relationship transfer model

The multimodal relationship transfer model establishes the transition probabilities between relationship modes in a multimodal heterogeneous network, which removes the dependence on domain knowledge and the lack of generality of existing meta-path-based methods. Given relationships of |C| different modes in a multimodal heterogeneous information network, the multimodal relationship transition matrix can be represented by a |C| × |C| matrix M, where M_ij is the transition probability from relationship type C_i to relationship type C_j in the network.

Fig. 4 is a schematic diagram of a random walk that considers multimodal relationships according to an embodiment of the present invention. In a random walk, if the walk moved from the previous node E_i to the current node E_j through relationship C_x (as shown in Fig. 4), the probability of moving to the next node E_k is:

where W_ij is the weight between entities E_i and E_j, C_ij is the type of the relationship (E_i, E_j), and N_i is the set of neighbors of entity E_i. W_ij can be set according to the actual situation; the invention calculates it as W_ij = |N_i ∩ N_j| / |N_i ∪ N_j|. If d_ij is the distance between entities E_i and E_j, then:

Formula (2) considers not only the weights between nodes but also the transition probabilities between relationship modes, which facilitates embedding the multimodal relationship features into low-dimensional continuous vectors.
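The walk step can be sketched as follows. The weight W_ij = |N_i ∩ N_j| / |N_i ∪ N_j| is taken from the text. The body of formula (2) is not available here, so the sketch assumes that the probability of moving from E_j to E_k is proportional to W_jk multiplied by the transition probability M[C_x][C_jk]. That proportionality, the uniform fallback and the function names are assumptions for illustration only.

```python
def jaccard_weight(neighbors, i, j):
    """W_ij = |N_i ∩ N_j| / |N_i ∪ N_j| (0 when both neighbor sets are empty)."""
    union = neighbors[i] | neighbors[j]
    return len(neighbors[i] & neighbors[j]) / len(union) if union else 0.0


def step_probabilities(neighbors, rel_type, M, j, x):
    """Distribution over next nodes k, given current node j reached via relationship type x.

    ASSUMPTION: P(k) is proportional to W_jk * M[x][C_jk]; the exact formula (2)
    of the patent is not reproduced in the available text.
    neighbors: dict node -> set of neighbor nodes
    rel_type:  dict (j, k) -> relationship type C_jk
    M:         dict-of-dicts transition matrix between relationship types
    """
    scores = {
        k: jaccard_weight(neighbors, j, k) * M[x][rel_type[(j, k)]]
        for k in neighbors[j]
    }
    total = sum(scores.values())
    if not scores:
        return {}
    if total == 0:
        # Hypothetical fallback: walk uniformly when every score is zero.
        return {k: 1.0 / len(scores) for k in scores}
    return {k: s / total for k, s in scores.items()}
```

A next node can then be drawn with `random.choices(list(p), weights=list(p.values()))` from the returned dictionary `p`.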

Given the matrix M, random walks according to formula (2) yield a set of random walk paths P = {P₁, P₂, P₃, …} and the corresponding multimodal transition paths T = {T₁, T₂, T₃, …}, where

A |P|-dimensional vector e_i can then represent the features of relationship type C_i in the random walk set P, where e_ij is the number of occurrences of C_i in P_j.

On this basis, the invention calculates the similarity of relationship types C_i and C_j by the Pearson correlation coefficient, i.e.

The multimodal relationship transition matrix is then updated using the Sigmoid function:

Initially, M may be set to an all-ones matrix or a random matrix. A set of random walk paths P is obtained from M by formula (2), and M is updated by formula (5). This process is iterated until M converges, which completes the construction of the multimodal relationship transition matrix.
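One update of the transition matrix can be sketched with NumPy. The occurrence vectors e_i and the use of the Pearson correlation and the Sigmoid function come from the text. The exact update formula is not available, so the sketch assumes M_ij = sigmoid(pearson(e_i, e_j)); that form and the function names are assumptions.

```python
import numpy as np


def relation_counts(type_paths, n_types):
    """e[i, j] = number of times relationship type i occurs in transition path j."""
    e = np.zeros((n_types, len(type_paths)))
    for j, path in enumerate(type_paths):
        for c in path:
            e[c, j] += 1
    return e


def update_transition_matrix(type_paths, n_types):
    """One update of M from the sampled transition paths.

    ASSUMPTION: M_ij = sigmoid(pearson(e_i, e_j)); the patent's own update
    formula is not reproduced in the available text.
    """
    e = relation_counts(type_paths, n_types)
    centered = e - e.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    denom = np.outer(norms, norms)
    # Pearson correlation; rows with zero variance get correlation 0.
    corr = np.divide(centered @ centered.T, denom,
                     out=np.zeros_like(denom), where=denom > 0)
    return 1.0 / (1.0 + np.exp(-corr))
```

In a full loop, this update would alternate with path sampling until successive matrices differ by less than a tolerance; the stopping tolerance is a further hypothetical parameter.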

On the basis, the invention theoretically demonstrates the convergence of the M iteration process and forms a corresponding algorithm.

② Cross-network random walk model

The multi-modal relationship transfer model solves the problem of random walk in a single model considering multi-modal relationships. The cross-network random walk model connects the nodes and relations of different networks in series on a path, which is the key for mapping the node relation characteristics of different networks to the same low-dimensional continuous space.

Given two entities E_Ai ∈ E_A and E_Bj ∈ E_B in two multimodal heterogeneous information networks, the invention defines their structural similarity as follows:

If |N_Ai| denotes the number of neighbors of E_Ai, and |E_A| and |R_A| denote the number of entities and the number of relationships in the corresponding network, then

In the initial state, the cross-network random walk model randomly selects a node E_Ai in one multimodal heterogeneous network as the starting node of the random walk. A cross-network random walk path is then formed by the following rules:

a. draw a random probability; if it is smaller than a specified threshold ε, keep walking in the current multimodal heterogeneous network; otherwise, walk to the other multimodal heterogeneous network;

b. when the walk stays in the current multimodal heterogeneous network, select the next node with the probability of formula (2);

c. when the walk switches to the other multimodal heterogeneous network, if the current node has a known associated node, that associated node is the next node of the random walk; otherwise, the probability of walking from E_Ai to the next node E_Bj is:

where h(E_Ai, E_Bj) is the attribute similarity of E_Ai and E_Bj, calculated as shown in formula (16).

By the above method, a set of sample paths S may be formed that may be used for distributed representation learning across network nodes.
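Rules a to c can be sketched as a single walk generator. The within-network step and the cross-network jump distribution are passed in as callbacks, because their formulas are not available here; the function and parameter names are hypothetical.

```python
import random


def cross_network_walk(start, length, eps, same_net_step, known_links,
                       cross_probs, rng=random):
    """Generate one cross-network walk of `length` nodes following rules a to c.

    same_net_step(node) -> next node inside the node's own network (rule b)
    known_links: dict node -> known associated node in the other network (rule c)
    cross_probs(node)   -> dict candidate node in the other network -> probability,
                           used when no association is known (rule c)
    """
    path = [start]
    node = start
    for _ in range(length - 1):
        if rng.random() < eps:
            # Rule a/b: stay in the current network.
            node = same_net_step(node)
        elif node in known_links:
            # Rule c: jump through a known association.
            node = known_links[node]
        else:
            # Rule c: jump by sampling from the cross-network distribution.
            candidates = cross_probs(node)
            nodes = list(candidates)
            weights = [candidates[n] for n in nodes]
            node = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(node)
    return path
```

Repeated calls from different starting nodes produce the sample path set that the representation learning step consumes.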

③ Distributed representation learning model and algorithm for node multimodal relationship features

The word vector model (Word2Vec) learns from text and represents the semantic information of words as word vectors, so that semantically similar words lie close together in the embedding space. Considering the massive number of entities in multimodal heterogeneous information networks, the invention borrows the Skip-Gram model of Word2Vec to establish a cross-network distributed representation learning model and algorithm. In a single homogeneous network (all nodes are of the same type and all relationships are of the same type, i.e., |T| = 1 and |C| = 1), the objective function of the Skip-Gram model is:

where θ is the parameter to be solved.

Considering the multimodal heterogeneous characteristics of the network, formula (9) can be extended to cross-network distributed representation learning oriented to multimodal heterogeneous features, and the objective function becomes:

where N_t(v) is the set of context nodes of type t among the neighbors of node v. If V_t is the set of nodes of type t in the two networks, then

where X_v is the multimodal heterogeneous feature vector of node v.

Solving formula (10) yields the multimodal heterogeneous feature vectors X_Ai and X_Bj of the entities E_Ai and E_Bj, which are used in the subsequent calculation of multimodal heterogeneous consistency. Formula (10) accounts for the multimodal characteristics of the network through the multimodal relationship transition matrix M and for its heterogeneous characteristics through T. The feature vectors learned by formula (10) therefore embed the multimodal heterogeneous features of the network.

Because the network contains a large number of nodes, solving formula (10) is computationally expensive. The invention adopts negative sampling to reduce the complexity of model training, so that the objective function becomes:

where σ(·) is the sigmoid function and NEG is the number of negative samples. X is then trained by stochastic gradient descent to obtain the multimodal heterogeneous feature vector of each node. Many studies have verified that the Skip-Gram model with negative sampling is applicable to node feature representation learning on networks with tens of millions of nodes or more; it can therefore be used to extract the multimodal heterogeneous features of the massive entities of a multimodal heterogeneous information network.
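For orientation, the objectives described in this subsection match the standard forms used by heterogeneous Skip-Gram models such as metapath2vec++. The block below writes those standard forms with the symbols defined above (θ, N_t(v), V_t, X_v, σ, NEG). They are given as an assumption about the general shape of the objectives, not as the patent's exact formulas; the noise distribution P_t(u) in the last expression is a symbol introduced here.

```latex
% Skip-Gram on a single homogeneous network (assumed shape of formula (9))
\arg\max_{\theta} \sum_{v \in V} \sum_{c \in N(v)} \log p(c \mid v; \theta)

% Extension to typed context nodes (assumed shape of formula (10))
\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta),
\qquad
p(c_t \mid v; \theta) = \frac{\exp(X_{c_t} \cdot X_v)}{\sum_{u \in V_t} \exp(X_u \cdot X_v)}

% Negative-sampling objective for one pair (v, c_t)
\log \sigma(X_{c_t} \cdot X_v)
  + \sum_{m=1}^{NEG} \mathbb{E}_{u^m \sim P_t(u)} \left[ \log \sigma(-X_{u^m} \cdot X_v) \right]
```

The softmax denominator runs over all nodes of type t in both networks, which is the costly term that negative sampling replaces with NEG sampled nodes.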

The invention aims to design a cross-network distributed representation learning algorithm according to the model, and theoretically discuss the complexity of the algorithm, the influence of the hyper-parameter on the model and the like.

(4) Associated entity recognition model and method integrating attribute characteristics and relationship characteristics

In order to improve the quality of associated entity identification without prior knowledge, the invention holds that an entity depends on its surrounding "environment" and can be identified from that "environment". The basic idea of associated entity identification in the invention is therefore: if E_Ai and E_Bj are associated entities, i.e. E_Ai = E_Bj, then E_Ai and E_Bj should satisfy the following conditions:

a. Multimodal heterogeneous consistency. E_Ai and E_Bj are of the same type or inherit from the same type, and E_Ai and E_Bj have similar multimodal heterogeneous features;

b. Attribute consistency. E_Ai and E_Bj should have similar text and geometric attribute features;

c. Environmental consistency. E_Ai and E_Bj have a similar "environment"; that is, most of the entities in N_Ai and N_Bj are also associated entities.

Let the |E_A| × |E_B| matrix S denote the similarity matrix of the entities of the networks M_A and M_B. When the types of two entities E_Ai and E_Bj are different, their similarity is directly set to 0, i.e. S_ij = 0, and there is no need to compute the multimodal heterogeneous consistency, environmental consistency and attribute consistency of E_Ai and E_Bj. If the |E_A| × |E_B| matrix H denotes the type relationship matrix of M_A and M_B, then

H_ij = 1 if T(E_Ai) = T(E_Bj), and H_ij = 0 otherwise.

① Multimodal heterogeneous consistency model

After the multimodal heterogeneous features of the nodes of two different multimodal heterogeneous networks are embedded into low-dimensional continuous vectors in the same space, the cosine similarity of the feature vectors X_Ai and X_Bj of two nodes E_Ai and E_Bj can be calculated to form the multimodal heterogeneous consistency model. That is,

R_ij = (X_Ai · X_Bj) / (‖X_Ai‖ ‖X_Bj‖)

where the |E_A| × |E_B| matrix R is the multimodal heterogeneous feature similarity matrix of the entities of the two networks. X_Ai and X_Bj are the multimodal heterogeneous feature vectors of E_Ai and E_Bj obtained by the solution above, and all R_ij constitute the multimodal heterogeneous feature similarity matrix R between the entity sets E_A and E_B.
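The type relationship matrix H and the cosine similarity matrix R can both be computed in a vectorized way. This is a minimal sketch assuming the entity types are given as lists of labels and the feature vectors as the rows of two NumPy arrays with no all-zero rows; the function names are hypothetical.

```python
import numpy as np


def type_matrix(types_a, types_b):
    """H[i, j] = 1 if entity i of network A and entity j of network B share a type, else 0."""
    a = np.asarray(types_a)[:, None]
    b = np.asarray(types_b)[None, :]
    return (a == b).astype(float)


def cosine_matrix(xa, xb):
    """R[i, j] = cosine similarity between feature vectors xa[i] and xb[j].

    Assumes no row is the zero vector.
    """
    xa = xa / np.linalg.norm(xa, axis=1, keepdims=True)
    xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    return xa @ xb.T
```

Multiplying `type_matrix(...) * cosine_matrix(...)` element-wise zeroes out the similarity of entity pairs whose types differ.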

② Environmental consistency model

If Z is the set of associated entities in the two multimodal heterogeneous networks, the environmental consistency of two nodes E_Ai and E_Bj can be calculated using the Jaccard similarity. Namely:

where I_Ai = N_Ai ∩ Z and I_Bj = N_Bj ∩ Z. Without a priori associated entities, initially,

The invention designs an iterative algorithm to mine the associated entities, so Z contains more and more associated entities as the iteration proceeds. All Y_ij constitute the environmental consistency feature similarity matrix Y between the entity sets E_A and E_B.
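One way to make the Jaccard comparison concrete is sketched below. Since I_Ai and I_Bj live in different networks, the sketch maps the associated neighbors of E_Ai into network B through the pairs in Z before comparing them with the associated neighbors of E_Bj. That mapping, the representation of Z as a set of (a, b) pairs, and the value returned when neither entity has an associated neighbor (`empty_value`) are all assumptions; the patent's own formula and initial condition are not available here.

```python
def environment_similarity(n_a, n_b, Z, empty_value=1.0):
    """Jaccard-style environmental similarity Y_ij for one entity pair.

    n_a, n_b: neighbor sets N_Ai and N_Bj
    Z:        set of (a, b) pairs of already associated entities
    empty_value: hypothetical value used when neither side has associated neighbors
    """
    a_to_b = dict(Z)
    matched_b = set(a_to_b.values())
    image_a = {a_to_b[a] for a in n_a if a in a_to_b}  # I_Ai mapped into network B
    i_b = n_b & matched_b                              # I_Bj
    union = image_a | i_b
    if not union:
        return empty_value
    return len(image_a & i_b) / len(union)
```

For example, with Z = {(a1, b1), (a2, b2), (a3, b9)}, neighbors {a1, a2, a3, a4} and {b1, b2, b3} share two of three associated neighbors, giving a similarity of 2/3.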

③ Attribute consistency model

The attributes of a multimodal heterogeneous information network take two forms: text attributes and geometric attributes. The invention establishes a similarity model for each.

a. Text attribute feature model. The text attributes of multimodal heterogeneous information network entities are mostly short texts. The invention uses a short-text word vector method to establish a semantic feature vector model of entity attributes, and then calculates the text attribute feature similarity between entities E_Ai and E_Bj by cosine similarity or Euclidean distance, forming an n_A × n_B attribute feature similarity matrix P_P.

b. Geometric attribute feature model. IFC supports many different geometric model types. Specifically, IFC uses Curve2D, GeometricSet and GeometricCurveSet to describe models composed of basic primitives such as points, lines and surfaces, uses SurfaceModel to describe surface models, and uses SolidModel to describe solid models; SolidModel can be further subdivided into SweptSolid, Brep, CSG, Clipping, AdvancedSweptSolid and other types. The variety and complex references of IFC geometric descriptions make calculating the geometric attribute similarity of multimodal heterogeneous information networks very challenging.

Fig. 5 is a schematic diagram of the geometric attribute similarity calculation process provided in an embodiment of the present invention. The process is as follows: first, the invention makes full use of earlier results on lightweight visualization of multimodal heterogeneous information networks to convert each geometric model type into Brep; then, Brep is converted into a Delaunay triangulation, and similarity is calculated on the triangulation. For the similarity of triangulations, the invention uses shape distribution similarity. On this basis, the similarity matrix P_G of the geometric attributes of all entities in the two multimodal heterogeneous information networks is finally formed.
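Shape distribution similarity on a triangle mesh can be sketched with the common D2 descriptor, a histogram of distances between random surface points. The text does not say which shape function is used, so D2, the scale normalization by the mean distance, the histogram range and the L1-based similarity are assumptions. The mesh is taken as already triangulated, with `vertices` a float array of shape (n, 3) and `triangles` an integer index array of shape (m, 3).

```python
import numpy as np


def sample_surface_points(vertices, triangles, n, rng):
    """Sample n points uniformly over the mesh surface (area-weighted triangles)."""
    tri = vertices[triangles]                                  # (m, 3, 3)
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(tri), size=n, p=areas / areas.sum())
    s = np.sqrt(rng.random(n))[:, None]
    r = rng.random(n)[:, None]
    t = tri[idx]
    return (1 - s) * t[:, 0] + s * (1 - r) * t[:, 1] + s * r * t[:, 2]


def d2_histogram(vertices, triangles, n_pairs=2000, bins=32, seed=0):
    """Normalized histogram of distances between random surface point pairs (D2)."""
    rng = np.random.default_rng(seed)
    p = sample_surface_points(vertices, triangles, n_pairs, rng)
    q = sample_surface_points(vertices, triangles, n_pairs, rng)
    d = np.linalg.norm(p - q, axis=1)
    d = d / d.mean()                       # makes the descriptor scale-invariant
    hist, _ = np.histogram(d, bins=bins, range=(0.0, 3.0))
    return hist / hist.sum()


def shape_similarity(h1, h2):
    """1 minus half the L1 distance between two normalized histograms, in [0, 1]."""
    return 1.0 - 0.5 * np.abs(h1 - h2).sum()
```

Because distances are divided by their mean, a uniformly scaled copy of a mesh yields essentially the same histogram, while a stretched copy does not.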

In associated entity identification for multimodal heterogeneous information networks, two entities need not have high similarity on all attributes to be associated; when the similarity of the text attributes and/or the geometric attributes is high, the two entities are associated entities with a certain probability. The invention therefore uses a Logit regression model to fuse the text attribute similarity and the geometric attribute similarity into the entity attribute similarity matrix P of the multimodal heterogeneous information networks,

and all P_ij constitute the attribute consistency feature similarity matrix P between the entity sets E_A and E_B.
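The fusion step can be sketched as an element-wise logistic function of the two similarity matrices. The linear form inside the sigmoid and the default weights `w_text`, `w_geom` and `bias` are hypothetical; in practice such coefficients would be fitted on labeled entity pairs.

```python
import numpy as np


def fuse_attribute_similarity(p_text, p_geom, w_text=4.0, w_geom=4.0, bias=-3.0):
    """P = sigmoid(bias + w_text * P_P + w_geom * P_G), computed element-wise.

    The weights and bias are illustrative placeholders, not values from the patent.
    """
    z = bias + w_text * np.asarray(p_text) + w_geom * np.asarray(p_geom)
    return 1.0 / (1.0 + np.exp(-z))
```

With these placeholder weights, a pair with high text similarity and low geometric similarity still receives a fused score above 0.5, which reflects the statement that a high similarity in either attribute can indicate association.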

④ Associated entity identification method

To improve the accuracy of associated entity identification, the invention integrates the multimodal heterogeneous characteristics of the multimodal heterogeneous information network, considers multimodal heterogeneous consistency, attribute consistency and environmental consistency together, and designs an iterative associated entity identification algorithm. Fig. 6 is a flowchart of the iterative associated entity identification. First, the matrices H, R, Y and P are calculated based on research contents (2) and (3) and on the multimodal heterogeneous consistency, attribute consistency and environmental consistency models. Then, for two entities E_Ai and E_Bj, the similarity is calculated as:

S_ij＝sim(E_Ai,E_Bj)＝H_ij·R_ij·Y_ij·P_ij。 (18)

The algorithm selects the unassociated entity pair E_Ai and E_Bj with the largest similarity value in S as an associated entity pair, provided that S_ij > τ, where τ is a preset similarity threshold. When a new associated entity set ΔZ is identified, the associated entity set Z is updated as Z = Z ∪ ΔZ. Y and S are then updated and new associated entities are identified again. When no associated entity meeting the requirement can be identified, the iteration ends and the identified associated entity set Z is output. In each iteration ΔZ affects only part of Y, so not all values of Y and S need to be updated in each iteration, which keeps the associated entity identification method efficient.
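The iteration can be sketched as a greedy loop over S = H · R · Y · P. The sketch assumes that each entity is associated with at most one entity of the other network and recomputes Y and S in full on every pass, which is simpler but slower than the partial update described above. The callback `compute_Y` and the function name are hypothetical.

```python
import numpy as np


def identify_associated_entities(H, R, P, compute_Y, tau):
    """Greedily collect index pairs (i, j) with the largest S_ij > tau.

    H, R, P:   |E_A| x |E_B| matrices
    compute_Y: hypothetical callback mapping the current set Z of (i, j) pairs
               to the environmental consistency matrix Y
    ASSUMPTION: one-to-one association; full recomputation of Y and S per pass.
    """
    Z = set()
    free_a = np.ones(H.shape[0], dtype=bool)
    free_b = np.ones(H.shape[1], dtype=bool)
    while True:
        S = H * R * compute_Y(Z) * P
        # Exclude entities that already belong to an associated pair.
        S = np.where(free_a[:, None] & free_b[None, :], S, -np.inf)
        i, j = np.unravel_index(np.argmax(S), S.shape)
        if S[i, j] <= tau:
            return Z
        Z.add((int(i), int(j)))
        free_a[i] = False
        free_b[j] = False
```

The loop always terminates, because each accepted pair removes one row and one column from further consideration.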

The invention aims to design a corresponding algorithm on the basis of the above, and theoretically discuss the influence of algorithm complexity and hyperparameters on the associated entity recognition model.

In summary, in the embodiments of the present invention, starting from the important requirements of the full-process integrated application and the full-life cycle data sharing of the multi-mode heterogeneous information network, the identification of the multi-mode heterogeneous information network associated entity under a massive entity is taken as a research target, and on the basis of fully analyzing the multi-mode heterogeneous characteristics of the multi-mode heterogeneous information network, a formal description method of the complex multi-mode heterogeneous information network, a domain-independent learning model and method of distributed representation across the multi-mode heterogeneous network, and an associated entity identification model and algorithm of the comprehensive attribute characteristics and the multi-mode heterogeneous characteristics are mainly researched, and experimental verification is performed on massive data.

The invention forms a formal description method for multimodal heterogeneous information networks and a multimodal heterogeneous associated entity identification model and method based on cross-network representation learning. It enriches and improves the theory and methods of network representation learning and associated entity identification in the field of data mining, and of multimodal heterogeneous information networks in the field of building informatization; it promotes the cross-fusion of computer science with the building and civil engineering disciplines and has important theoretical value. The research results promote the whole-process integrated application and whole-life-cycle data sharing of multimodal heterogeneous information networks, improve the big data application capability and management decision-making level of the building industry and its enterprises, serve the major national need for the modernization and upgrading of the building industry, and support big data construction and "whole-cycle management" for smart cities, smart infrastructures, smart people and the like, with great economic and social benefits.

Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, magnetic disk or optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in some parts of the embodiments.

The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant points can be found in the corresponding description of the method embodiments. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A multimodal heterogeneous associated entity recognition method based on cross-network representation learning, characterized by comprising:

Given two multimodal heterogeneous information networks, network A and network B, where E _A and E _B are entity sets, R _A and R _B are entity relation sets, T _A and T _B are entity type sets, and C _A and C _B are entity relation type sets, let E _Ai ∈ E _A and E _Bj ∈ E _B be two entities;

Based on the set of random walk paths between the entities E _Ai and E _Bj , the multimodal relationship transition probability M _ij between E _Ai and E _Bj is established by an iterative method, and the multimodal heterogeneous feature vectors of the entities E _Ai and E _Bj are obtained by learning an objective function through the multimodal relationship transition probability M _ij ;

According to the multimodal heterogeneous feature vectors of the entities E _Ai and E _Bj , it is determined whether the two entities E _Ai and E _Bj have multimodal heterogeneous consistency, and it is further determined whether the two entities E _Ai and E _Bj have attribute consistency and environmental consistency; when the two entities E _Ai and E _Bj have multimodal heterogeneous consistency, attribute consistency and environmental consistency at the same time, E _Ai and E _Bj are determined to be associated entities;

The attributes of the entities E _Ai and E _Bj include text attributes and geometric attributes, the text attributes being short texts; a short-text word vector method is used to analyze and establish an entity attribute semantic feature vector model, and the cosine similarity or Euclidean distance method is used to calculate the text attribute feature similarity between the entities E _Ai and E _Bj ;

The text attribute feature similarity and the geometric attribute feature similarity between the entities E _Ai and E _Bj are integrated to form the attribute consistency feature similarity P _ij between the entities E _Ai and E _Bj , and all P _ij are combined into the attribute consistency feature similarity matrix P between the entity sets E _A and E _B .
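As an illustration of the text attribute comparison recited in claim 1, the following minimal Python sketch averages the word vectors of a short text attribute and compares two entities by cosine similarity. The toy embedding table, the function names, and the weighted mean used to integrate text and geometric similarity are assumptions made for illustration only; the claim does not specify how the two similarities are integrated.

```python
import math

# Hypothetical toy word vectors; a real system would load trained embeddings.
WORD_VECTORS = {
    "steel": [0.9, 0.1, 0.0],
    "beam": [0.8, 0.2, 0.1],
    "girder": [0.7, 0.3, 0.1],
    "window": [0.0, 0.9, 0.4],
}
DIM = 3


def text_vector(text):
    """Average the word vectors of the known words in a short text."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_similarity(text_a, text_b, geometric_similarity, text_weight=0.5):
    """Assumed integration rule: weighted mean of text and geometric similarity."""
    text_sim = cosine_similarity(text_vector(text_a), text_vector(text_b))
    return text_weight * text_sim + (1.0 - text_weight) * geometric_similarity
```

Under these assumptions, `attribute_similarity("steel beam", "steel girder", 1.0)` scores higher than `attribute_similarity("steel beam", "window", 1.0)`, because the averaged word vectors of the first pair point in nearly the same direction.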

2. The method according to claim 1, characterized in that establishing, by an iterative method, the multimodal relationship transition probability M _ij between the entities E _Ai and E _Bj based on the set of random walk paths between the entities E _Ai and E _Bj comprises:

Assuming that there are |C| different modal relationships in the multimodal heterogeneous information network, a |C|×|C| matrix M is used to represent the multimodal relationship transition matrix, where M _ij represents the transition probability from relationship type C _i to relationship type C _j in the multimodal heterogeneous network;

In a random walk, if the previous node E _i has moved to the current node E _j through the relationship C _x , the probability p(E _k |E _i ,E _j ,C _x ,M) of moving to the next node E _k is calculated as:

where W _ij is the weight between the entities E _i and E _j , C _ij is the type of the relationship (E _i , E _j ), N _i is the set of neighbor nodes of the entity E _i , and W _ij =|N _i ∩N _j |/|N _i ∪N _j |; if d _ij is the distance between the entities E _i and E _j , then:

According to formula (2), random walk is used to obtain a set of random walk paths P={P ₁ , P ₂ , P ₃ ,...} and their corresponding multimodal relation transfer paths T={T ₁ , T ₂ , T ₃ ,…}, where

A |P|-dimensional vector e _i is used to represent the characteristics of relation type C _i in the random walk set P, where e _ij represents the number of times C _i appears in P _j ;

The similarity of relationship types C _i and C _j is calculated according to the Pearson correlation coefficient, namely

The multimodal relationship transition probabilities are updated using the sigmoid function:

Initially, the M _ij matrix is set to an all-ones matrix or a random matrix. According to M _ij , formula (2) is used to obtain the random walk path set P, and M _ij is updated according to formula (5); this process is iterated until M _ij converges, which completes the construction of the multimodal relationship transition matrix M _ij .
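The iterative construction recited in claim 2 can be sketched in Python as follows. Formulas (2) and (5) are not reproduced in this text, so the sketch makes two labeled assumptions: the random walk sampler is passed in as a hypothetical callable `sample_relation_paths`, and the update is taken to be the sigmoid of the Pearson correlation between the count vectors of two relation types. All function names are illustrative.

```python
import math


def relation_count_vectors(relation_paths, num_types):
    """e[i][k] = number of times relation type i occurs in path k."""
    e = [[0] * len(relation_paths) for _ in range(num_types)]
    for k, path in enumerate(relation_paths):
        for relation_type in path:
            e[relation_type][k] += 1
    return e


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        return 0.0  # undefined for a constant vector; treated as 0 here
    return cov / (dev_x * dev_y)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def update_transition_matrix(relation_paths, num_types):
    """Assumed update rule: M[i][j] = sigmoid(pearson(e_i, e_j))."""
    e = relation_count_vectors(relation_paths, num_types)
    return [[sigmoid(pearson(e[i], e[j])) for j in range(num_types)]
            for i in range(num_types)]


def build_transition_matrix(sample_relation_paths, num_types, tol=1e-6, max_iter=100):
    """Alternate path sampling and matrix updates until M stops changing."""
    m = [[1.0] * num_types for _ in range(num_types)]  # all-ones initialisation
    for _ in range(max_iter):
        paths = sample_relation_paths(m)  # hypothetical stand-in for formula (2)
        new_m = update_transition_matrix(paths, num_types)
        delta = max(abs(new_m[i][j] - m[i][j])
                    for i in range(num_types) for j in range(num_types))
        m = new_m
        if delta < tol:
            break
    return m
```

Here each relation path is a list of integer relation type indices, for example `[[0, 0, 1], [0, 1, 1], [2, 2, 2]]`. The convergence test on the largest entry change and the `max_iter` cap are design choices of this sketch, not requirements stated in the claim.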

3. The method according to claim 2, characterized in that obtaining the multimodal heterogeneous feature vectors of the entities E _Ai and E _Bj by learning an objective function through the multimodal relationship transition matrix M _ij comprises:

Taking each of the entities E _Ai and E _Bj as a node, the Skip-Gram model in Word2Vec is used to establish a cross-network distributed representation learning model and algorithm, and the objective optimization function of the Skip-Gram model for cross-network distributed representation learning oriented to multimodal heterogeneous features is set as:

where θ is the parameter to be solved and N _t (v) is the set of context nodes of type t among the adjacent nodes of node v; if V _t is the set of nodes of type t in the two networks, then:

where X _v is the multimodal heterogeneous feature vector of node v;

The multimodal heterogeneous feature vectors X _Ai and X _Bj of the entities E _Ai and E _Bj are obtained by solving formula (10).
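The formulas of claim 3 are not reproduced in this text. The Python sketch below therefore assumes a type-restricted Skip-Gram softmax of the kind used in metapath2vec++, in which the probability of a context node of type t given node v is normalised over V _t only, and the objective sums the log-probabilities over all typed context nodes. The function names, the dictionary-based data layout, and the example type labels are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def type_softmax_probability(embeddings, nodes_of_type, context, center):
    """p(context | center), normalised over the nodes that share the context's type."""
    scores = [dot(embeddings[u], embeddings[center]) for u in nodes_of_type]
    top = max(scores)  # subtract the maximum score for numerical stability
    denominator = sum(math.exp(s - top) for s in scores)
    numerator = math.exp(dot(embeddings[context], embeddings[center]) - top)
    return numerator / denominator


def log_likelihood(embeddings, node_types, typed_contexts):
    """Sum of log p(c | v) over every node v and each of its typed context nodes c.

    typed_contexts maps a center node to {node type: [context nodes of that type]}.
    """
    nodes_by_type = {}
    for node, node_type in node_types.items():
        nodes_by_type.setdefault(node_type, []).append(node)
    total = 0.0
    for center, contexts_by_type in typed_contexts.items():
        for node_type, contexts in contexts_by_type.items():
            for context in contexts:
                total += math.log(type_softmax_probability(
                    embeddings, nodes_by_type[node_type], context, center))
    return total
```

Training would adjust the embedding vectors to maximise `log_likelihood`, for instance by stochastic gradient ascent with negative sampling; the optimiser is omitted here because the claim does not describe it.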

4. The method according to claim 3, characterized in that determining, according to the multimodal heterogeneous feature vectors of the entities E _Ai and E _Bj , whether the two entities E _Ai and E _Bj have multimodal heterogeneous consistency comprises:

According to the multimodal heterogeneous feature vectors of the entities E _Ai and E _Bj , determine whether the type T _Ai of the entity E _Ai and the type T _Bj of the entity E _Bj are the same; if they are the same, the type relationship recognition degree H _ij between the two entities E _Ai and E _Bj is equal to 1; otherwise, the type relationship recognition degree H _ij between the two entities E _Ai and E _Bj is equal to 0;

When the two entities E _Ai and E _Bj are of the same type, the calculation method of the multimodal heterogeneous similarity R _ij between the entities E _Ai and E _Bj is:

where X _Ai and X _Bj are the multimodal heterogeneous feature vectors of the two entities E _Ai and E _Bj obtained by the solution, and all R _ij are combined into the multimodal heterogeneous feature similarity matrix R between the entity sets E _A and E _B .

5. The method according to claim 3, characterized in that determining whether the two entities E _Ai and E _Bj have environmental consistency comprises:

Let Z be the set of associated entities between the two multimodal heterogeneous information networks;

Then the calculation method of the environmental consistency feature similarity Y _ij between the entities E _Ai and E _Bj is:

Among them, I _Ai =N _Ai ∩Z, I _Bj =N _Bj ∩Z, initially,

As the iterative process continues, more and more associated entities are added to Z; all Y _ij are combined into the environmental consistency feature similarity matrix Y between the entity sets E _A and E _B .

6. The method according to claim 5, characterized in that determining E _Ai and E _Bj to be associated entities when the two entities E _Ai and E _Bj have multimodal heterogeneous consistency, attribute consistency and environmental consistency at the same time comprises:

The similarity value S _ij between the entities E _Ai and E _Bj is obtained by combining the type relationship recognition degree H _ij , the multimodal heterogeneous similarity R _ij , the environmental consistency feature similarity Y _ij and the attribute consistency feature similarity P _ij between the entities E _Ai and E _Bj :

S _ij =sim(E _Ai ,E _Bj )=H _ij ·R _ij ·Y _ij ·P _ij

Based on the similarity values between all entities in E _A and E _B , the similarity matrix S between E _A and E _B is formed, and the not-yet-associated entity pair E _Ai and E _Bj with the largest similarity value in S is selected as an associated entity pair, subject to S _ij >τ, where τ is the set similarity threshold;

When a new associated entity set ΔZ is identified, the associated entity set Z is updated as Z=Z∪ΔZ, Y and S are updated, and new associated entities are identified again. When no associated entity meeting the requirements can be identified, the iteration ends and the identified associated entity set Z is output.
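The iterative identification recited in claim 6 can be sketched in Python as follows. The sketch assumes that one pair is added to Z per round and that each entity is associated at most once; neither restriction is stated explicitly in the claim. Because the formula for Y _ij is not reproduced in this text, the environmental similarity is supplied by a hypothetical callable `environment_similarity` that returns the matrix Y for the current set Z. All names are illustrative.

```python
def identify_associated_entities(H, R, P, environment_similarity, tau):
    """Greedy loop: add the best unmatched pair with S_ij > tau until none remains.

    H, R and P are matrices (lists of rows) indexed as [i][j] for E_Ai and E_Bj.
    environment_similarity(Z) returns the matrix Y for the current set Z.
    """
    n_a, n_b = len(H), len(H[0])
    Z = set()
    matched_a, matched_b = set(), set()
    while True:
        Y = environment_similarity(Z)  # Y is refreshed whenever Z has grown
        best_pair, best_score = None, tau
        for i in range(n_a):
            if i in matched_a:
                continue
            for j in range(n_b):
                if j in matched_b:
                    continue
                # S_ij = H_ij * R_ij * Y_ij * P_ij, as given in the claim
                score = H[i][j] * R[i][j] * Y[i][j] * P[i][j]
                if score > best_score:
                    best_pair, best_score = (i, j), score
        if best_pair is None:
            return Z  # no remaining pair exceeds the threshold
        Z.add(best_pair)
        matched_a.add(best_pair[0])
        matched_b.add(best_pair[1])
```

A small usage example under these assumptions: with `H = [[1, 0], [1, 1]]`, `R = [[0.9, 0.8], [0.4, 0.7]]`, `P` and `Y` all ones, and `tau = 0.5`, the loop first associates pair (0, 0) with score 0.9 and then pair (1, 1) with score 0.7, after which no unmatched pair remains.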