CN111881290A

CN111881290A - Distribution network multi-source grid entity fusion method based on weighted semantic similarity

Info

Publication number: CN111881290A
Application number: CN202010555531.2A
Authority: CN
Inventors: 秦丹丹; 郑高峰; 刘丽; 李龙跃; 王鑫; 张淑娟; 赵龙; 汪玉; 高博; 徐斌; 李金中; 王潇; 孙伟; 李博; 卞真旭; 仇茹嘉; 钱光超; 邵珺伟
Original assignee: State Grid Corp of China SGCC; Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd; State Grid Anhui Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd; State Grid Anhui Electric Power Co Ltd
Priority date: 2020-06-17
Filing date: 2020-06-17
Publication date: 2020-11-03

Abstract

The invention discloses a distribution network multi-source grid entity fusion method based on weighted semantic similarity, comprising the following steps. Step one: extract knowledge from the multi-source grid to obtain a plurality of heterogeneous ontologies. Step two: search for the relationships among the heterogeneous ontologies, establish the corresponding mappings, and fuse the heterogeneous ontologies to form a plurality of knowledge graph ontology models. Step three: fuse the knowledge graph ontology models using a weighting algorithm. Step four: obtain the fused result. The final distribution network grid is generated through these steps.

Description

Distribution network multi-source grid entity fusion method based on weighted semantic similarity

Technical Field

The invention relates to a distribution network multi-source grid entity fusion method based on weighted semantic similarity.

Background

Because power business systems lack overall planning and design and a mechanism for horizontal communication, the functional processes of the business systems are isolated from one another, basic data is entered at multiple points, and data standards are not uniform. As a result, the weakness of power supply enterprises in managing cross-functional and cross-department business processes has become prominent. With a distribution network multi-source grid entity fusion technique based on weighted semantic similarity, a semantics-based data fusion model is built on top of the original data storage model and data barriers are hidden at the application layer, forming an integrated data resource system that spans departments, specialties and fields. This promotes data collection, fusion and sharing and strengthens the enterprise's data service capability, which in turn raises the level of data analysis applications and the value of big data, improves management and service, and provides strong support for developing value-added services.

Disclosure of Invention

To address the multi-point entry of basic data, non-uniform data standards, and the weak management of cross-functional and cross-department business processes in power supply enterprises, the invention adopts the following data processing method. The specific implementation steps are as follows:

Step one: knowledge extraction.

Knowledge extraction consists of three parts:

1. Entity extraction

Entity extraction identifies and extracts entities from information sources and is the most basic and critical part of information extraction.

Methods of entity extraction are generally divided into three types:

1.1 Rule- and dictionary-based method: when the text domain and the semantic unit types are well defined, a rule- and dictionary-based method is mainly used; for example, predefined rules extract distribution network entities, place names, organization names, specific times, faults and other entities from text.

1.2 Statistical machine learning-based method: supervised learning algorithms are used to extract named entities. The performance of a purely supervised learning algorithm is limited by its training set, and its precision and recall are not ideal. To overcome this limitation, the supervised learning algorithm is combined with rules.

1.3 Open-domain extraction method: an unsupervised open-domain clustering algorithm, whose basic idea is to identify named entities in search logs based on the semantic features of known entities and then cluster them.
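
The rule- and dictionary-based approach of 1.1 can be illustrated with a minimal sketch. The dictionary entries, the regular expressions and the function name extract_entities are all assumptions made for illustration; they are not taken from the invention.

```python
import re

# Hypothetical dictionary of known entity names, keyed by entity type.
ENTITY_DICT = {
    "organization": ["State Grid Anhui Electric Power"],
    "place": ["Hefei", "Wuhu"],
}

# Hypothetical rules: regular expressions for entity types that have a
# regular surface form (feeder names, timestamps).
ENTITY_RULES = {
    "feeder": re.compile(r"\b10kV [A-Z][a-z]+ Line \d+\b"),
    "time": re.compile(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}\b"),
}


def extract_entities(text):
    """Return a sorted list of (entity_type, surface_string) pairs found in text."""
    found = set()
    # Dictionary lookup: exact substring match against known names.
    for etype, names in ENTITY_DICT.items():
        for name in names:
            if name in text:
                found.add((etype, name))
    # Rule matching: every regular-expression hit becomes an entity.
    for etype, pattern in ENTITY_RULES.items():
        for match in pattern.finditer(text):
            found.add((etype, match.group(0)))
    return sorted(found)
```

A real deployment would need dictionaries and rules written for the actual naming conventions of the source systems.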

2. Relationship extraction

Relationship extraction extracts the relationships between entities from an information source, solving the problem of semantic links between entities. It is generally divided into supervised and semi-supervised extraction.

Supervised learning: the set of relationships in supervised relationship extraction is usually fixed, so extraction can be treated as a simple classification problem. With high-quality labeled data a supervised model is highly accurate, but labeling text data takes a great deal of labor and time, new relationship categories are hard to add, and the model is fragile with limited generalization ability.

Semi-supervised learning: a small amount of labeled information is used as seed templates to extract a large number of new instances from unstructured data, which form new training data. The main method is the Bootstrap algorithm, whose core idea and basic steps are as follows:

(1) Use resampling to draw a certain number of samples (the number can be set freely) from the original samples; sampling with replacement is allowed.

(2) Calculate the statistic T from the drawn samples.

(3) Repeat the above N times (N is typically greater than 1000) to obtain N values of the statistic T.

(4) Calculate the sample variance of the N values of T to obtain the variance of the statistic.
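
The four steps above can be sketched as follows. This is a minimal illustration only; the function name bootstrap_variance, its parameters and the fixed random seed are assumptions, and the statistic is passed in as an ordinary function.

```python
import random
import statistics


def bootstrap_variance(samples, statistic, n_resamples=1000, sample_size=None, seed=0):
    """Estimate the variance of `statistic` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size = sample_size or len(samples)
    values = []
    for _ in range(n_resamples):
        # Step (1): draw `size` samples with replacement.
        resample = [rng.choice(samples) for _ in range(size)]
        # Step (2): compute the statistic T on the resample.
        values.append(statistic(resample))
    # Steps (3)-(4): sample variance of the N collected values of T.
    return statistics.variance(values)
```

For example, bootstrap_variance(data, statistics.mean) estimates the variance of the sample mean of data.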

3. Attribute extraction

Attribute extraction extracts the characteristics and properties of entities from the information source. An attribute of an entity can be regarded as a nominal relationship between the entity and its attribute value, so attribute extraction can also be treated as a relationship extraction problem.

In the invention, the processed data mainly comes from the structured data of the full-service unified data center and is extracted using templates. Because the mapping of entities, attributes and relationships to the source systems is already set when the ontology model is defined, the extraction scripts for the structured data can be written accordingly, and the relational data can be stored in structured form.

Step two: ontology fusion.

The invention combines ontology mapping and ontology integration:

1. Global ontology-local ontology integration

First, the knowledge that is consistent and agreed upon across the different systems is extracted; this is called the global ontology. The knowledge unique to each system is retained and is called a local ontology. A mapping between the global ontology and the local ontologies is then established, so that the knowledge of every business system is covered. The process is: (1) import the ontologies to be mapped; (2) discover the mapping: using natural language processing, compare the similarity between the objects to be mapped, find structural similarity, and search for mappings between the ontologies using techniques such as machine learning.

2. Mapping between local ontologies

Relationships among the local ontologies are found using concept similarity algorithms, string-based methods and language-based methods, and mapping rules between the ontologies are established from these relationships.

3. Reasonable representation of the mapping

Ontology mapping means that, given two ontologies A and B, for each concept in ontology A an attempt is made to find a semantically identical or similar concept in ontology B, and likewise for each concept or node in ontology B. The most important part of mapping is therefore the discovery of semantic associations.

Step three: instance fusion.

Two kinds of algorithm are combined: pairwise entity alignment and collaborative entity alignment. Pairwise entity alignment judges whether two entities correspond to the same physical object, determining their degree of alignment by comparing attributes. Collaborative entity alignment considers that alignments between different entities influence one another, and reaches a globally optimal result by coordinating the matching between different objects, that is, by finding what different entities have in common.

1. Pairwise entity alignment

Pairwise entity alignment is based on a knowledge base, which is a six-tuple consisting of a set of instances, a set of literals, a set of relationships, a set of attributes, relationship facts and attribute facts. Entity alignment computes a value from a specific formula; the value describes the degree of similarity, and the larger it is, the more similar the two entities are. The computation can be described simply as follows: given two knowledge bases and a set of prior alignment data, entity matching is computed under the joint control of optional tuning parameters and a series of related external resources, and the alignment result is obtained.

2. Entity similarity and relationship similarity

An intuitive alignment classification method is to assign a different weight to each matched attribute to reflect its importance to the alignment result, to assign different weights to the attributes of the entity itself and to the attributes of its related entities, and to compute the overall similarity as a weighted sum. A similarity threshold is set, and the match is decided by comparing the overall entity similarity score with the threshold.

3. Feature matching based on similarity functions

A function, called the tokenization function, converts the strings to be matched into a set of substrings, and the weighted similarity is then calculated from a weighted similarity measure.

3.1 Token-based similarity function

A function converts the text strings to be matched into a set of substrings, which are called tokens. Commonly used token-based similarity functions are the Jaccard similarity function and the cosine similarity function.

The similarity function based on the Jaccard coefficient is characterized in that the set intersection operation is order-independent, so the order of different tokens has no influence on the measurement result.

Cosine similarity shares the order independence of token-based similarity functions and, because weights are added, better reflects how similar the tokens are.

3.2 Edit distance-based similarity function

An edit distance-based similarity function treats the text strings to be matched as a whole and uses the minimum cost of the edit operations needed to convert one string into the other as the measure of their similarity. Common edit distance-based similarity functions are those based on the Levenshtein distance, the Smith-Waterman distance, and the Jaro and Jaro-Winkler distances.

Given two strings s₁ and s₂, the Levenshtein distance between them equals the minimum number of insertion, deletion and replacement operations needed to convert s₁ into s₂. A similarity function based on the Levenshtein distance can reduce the sensitivity of similarity matching to errors.
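
The Levenshtein distance defined above can be computed with the standard dynamic-programming recurrence, sketched below. Normalizing the distance by the length of the longer string to obtain a similarity in [0, 1] is an assumption made here for illustration; the invention does not state which normalization it uses.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and replacements turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # replace c1 with c2 (or keep it)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(s1, s2):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / longest
```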

The invention uses weighted similarity calculation to compute the true similarity, that is, the weighted similarity, of the names and attributes of two entities in the Neo4j graph database.

Drawings

FIG. 1 is a similarity technique flow diagram of the present invention;

FIG. 2 is a multi-source grid entity fusion diagram of the present invention;

FIG. 3 is an ontology model composition diagram of the present invention.

Detailed Description

As shown in FIG. 3, the technical problem to be solved by the invention is to fuse the feeder, transformer and grid relationship data of three systems and to construct a grid knowledge graph.

Example 1:

Knowledge is extracted from the multi-source grid to obtain a plurality of heterogeneous ontologies. The specific steps are entity extraction, relationship extraction and attribute extraction.

The specific steps of knowledge extraction are as follows:

1. Knowledge extraction

Knowledge extraction is the first step of knowledge graph construction. The key question is how to automatically extract knowledge from heterogeneous data sources to obtain candidate knowledge units. Knowledge extraction is a technique for automatically extracting structured knowledge such as entities, relationships and entity attributes from semi-structured and unstructured data.

The purpose of knowledge extraction is to extract knowledge from data of different sources and structures and store it in a knowledge graph; it is an important technique for automatically building large-scale knowledge graphs. Entity extraction refers to automatically extracting entities from a data set. Its quality greatly affects the efficiency and quality of subsequent knowledge acquisition, so it is the most basic and critical part of knowledge extraction.

The knowledge extraction is divided into three steps:

Entity extraction. Entity extraction identifies the named units in semi-structured and unstructured data as entities.

Relationship extraction. After the entities are extracted, the associations between them must be extracted from the related data in order to obtain semantic information. Linking the entities (concepts) through these associations forms a mesh-like knowledge structure.

Attribute extraction. The purpose of attribute extraction is to collect the attribute information of a specific entity from different power grid information sources. For example, for a given transformer, information such as the transformer identifier, the city it belongs to and the name of its power supply unit can be obtained from different information sources. Attribute extraction collects this information from the various data sources to give a complete description of the entity's attributes.

In this technique, the processed data mainly comes from the structured data of the full-service unified data center and is extracted using templates. Because the mapping of entities, attributes and relationships to the source systems is set when the ontology model is defined, a structured data extraction script can be written from that mapping, and the relational data is stored in a graph structure.
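
A minimal sketch of such template-driven extraction follows. The template layout, the function name extract and the columns city and feeder_id are hypothetical; only pms_obj_id and pms_tran_name appear in this description. The sketch turns relational rows into graph nodes and edges without depending on any particular graph database.

```python
# Hypothetical mapping template: how one source table maps onto the ontology.
TRANSFORMER_TEMPLATE = {
    "entity_type": "Transformer",
    "id_column": "pms_obj_id",
    # source column -> ontology attribute name
    "attributes": {"pms_tran_name": "name", "city": "city"},
    # (relation name, target entity type, column holding the target id)
    "relations": [("CONNECTED_TO", "Feeder", "feeder_id")],
}


def extract(rows, template):
    """Convert relational rows into graph nodes and edges according to the template."""
    nodes, edges = [], []
    for row in rows:
        node_id = row[template["id_column"]]
        # Attribute extraction: copy mapped columns that have a value.
        props = {
            attr: row[col]
            for col, attr in template["attributes"].items()
            if row.get(col) is not None
        }
        nodes.append({"type": template["entity_type"], "id": node_id, "props": props})
        # Relationship extraction: one edge per populated foreign-key column.
        for rel, target_type, col in template["relations"]:
            if row.get(col) is not None:
                edges.append((node_id, rel, (target_type, row[col])))
    return nodes, edges
```

The resulting nodes and edges could then be written to a graph store; that step is omitted here.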

Example 2

Ontology fusion merges the heterogeneous ontologies obtained in Example 1: a global ontology is obtained first, and the mapping between each local ontology and the global ontology is then found. Finding the mapping takes three steps: first, import the ontologies to be mapped; second, discover the mapping; third, represent the mapping.

The specific steps and methods of ontology fusion are as follows:

Common methods of ontology fusion are ontology integration and ontology mapping. Ontology integration merges several ontologies directly into one large ontology, while ontology mapping looks for mapping rules between ontologies; both eliminate heterogeneity between ontologies.

This technique combines ontology mapping and ontology integration to merge the ontologies built for the three systems into a unified distribution network grid knowledge graph ontology model, which serves as the specification for knowledge storage.

1. Global ontology-local ontology based integration

This method first extracts the knowledge common to the heterogeneous ontologies and builds a global ontology from it. The global ontology describes the knowledge agreed upon across the systems. Each system's ontology also keeps its own unique knowledge and is called a local ontology. Finally, mappings from the global ontology to the knowledge of each local ontology are established, so that all the knowledge in the business systems' ontologies is covered.

2. Ontology mapping

The process of ontology mapping can be mainly divided into three steps:

The first step: import the ontologies to be mapped, ensuring that the ontology components to be mapped are easy to access.

The second step: discover the mapping. Relationships between heterogeneous ontologies are found using concept similarity algorithms, and mapping rules between the ontologies are established from them. To improve the accuracy of the mapping result, this step often requires manual intervention.

The third step: represent the mapping. Once mappings between ontologies are found, they need to be represented in a reasonable form.

The focus of ontology mapping is thus mapping discovery. In light of practical conditions, this technique uses term-based and structure-based ontology mapping. The method starts from the terms of each system's ontology and compares the names, labels or comments of ontology components to find similarities among the heterogeneous ontologies, mainly using string-based and language-based methods.
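
A minimal sketch of string-based mapping discovery between a local ontology and the global ontology is given below. It uses Python's standard difflib.SequenceMatcher as the string similarity measure; that choice, the 0.8 threshold, and the names normalize and discover_mapping are assumptions for illustration, not the measure used by the invention.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Lower-case a term and treat underscores as spaces before comparison."""
    return term.lower().replace("_", " ").strip()


def discover_mapping(local_terms, global_terms, threshold=0.8):
    """Map each local term to its most similar global term, if similar enough."""
    mapping = {}
    for local in local_terms:
        best, best_score = None, 0.0
        for glob in global_terms:
            score = SequenceMatcher(None, normalize(local), normalize(glob)).ratio()
            if score > best_score:
                best, best_score = glob, score
        # Keep only candidates above the threshold; the rest are left
        # for manual review, as the second step above anticipates.
        if best is not None and best_score >= threshold:
            mapping[local] = (best, round(best_score, 3))
    return mapping
```

Terms that fall below the threshold are simply left unmapped, which matches the manual intervention mentioned in the second step.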

Example 3:

Instance fusion means fusing the knowledge graph ontology models using a weighting algorithm. In essence, the entity fusion algorithm judges whether instance data from different knowledge graphs describe the same physical object. Entity fusion is also called entity alignment; this technique mainly studies the alignment of cross-system entities based on information such as entity attributes and entity relationships in each single-system distribution network grid knowledge graph.

The instance fusion process is similar to ontology fusion, but instance fusion is usually a large-scale data processing problem, so time and space complexity must be considered. This technique combines two different algorithms, pairwise entity alignment and collaborative entity alignment. Pairwise entity alignment independently judges whether two entities correspond to the same physical object, determining their degree of alignment by matching features such as entity attributes. Collaborative entity alignment considers that alignments between different entities influence one another, and reaches a globally optimal alignment result by coordinating the matching between different objects.

1. Principle of pairwise entity alignment algorithm

Before describing the specific principles of the algorithm, the definition of the knowledge base is explained first.

A knowledge base is a six-tuple KB = (I, L, R, P, FR, FP), where I, L, R and P are the sets of instances, literals, relationships and attributes, respectively;

FR ⊆ I × R × I is the set of relationship facts, each an SPO triple whose object is an instance;

FP ⊆ I × P × L is the set of attribute facts, each an SPO triple whose object is a literal.

The formalization of entity alignment is defined as:

Align_entity(KB₁, KB₂) = {(e₁, e₂, con) | e₁ ∈ KB₁, e₂ ∈ KB₂, con ∈ [0, 1]}

wherein con is a numerical value describing the similarity of the entities, and the larger con is, the more similar two entities are.

The process of aligning two knowledge base entities can be described simply as: given two knowledge bases and a group of priori aligned data, entity matching calculation is carried out under the common control of optional adjusting parameters and a series of related external resources, and finally an alignment result is obtained.

2. Entity similarity and relationship similarity

The probabilistic model-based alignment method compares entities pairwise by attribute similarity and does not consider the relationships between the matched entities. Entity matching based on attribute similarity scores can be turned into a classification problem. An intuitive entity alignment classification method is to add up the similarity scores of all matched attributes, set a similarity threshold, and compare the total similarity score of the entity pair with the threshold to decide whether the pair matches.

Here e₁, e₂ is the entity pair to be matched and T is the similarity threshold.

One main problem of this method is that it does not reflect the different influence of different attributes on the final similarity. An important solution is to assign a different weight to each matched attribute pair to reflect its importance to the alignment result. Let A and B be the two knowledge bases to be matched, and let e_i and e_j be entities in A and B respectively. Two disjoint sets M and U are defined:

M＝{(e_i，e_j)|e_i＝e_j，e_i∈A，e_j∈B}

U＝{(e_i，e_j)|e_i≠e_j，e_i∈A，e_j∈B}

Define the comparison vector x^* as the vector formed by all matched attributes of the entities to be matched, and the comparison space X as the space of all possible x^*. Define the ratio of two conditional probabilities R = P(x^* ∈ X | M) / P(x^* ∈ X | U); the matching decision is then made from the value of R.
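
A sketch of the decision rule, assuming the classical Fellegi-Sunter form of probabilistic record linkage with an upper threshold T_μ and a lower threshold T_λ (both threshold symbols are assumptions, not taken from this description):

```latex
(e_i, e_j) \in
\begin{cases}
\text{match}, & R \ge T_\mu \\
\text{possible match}, & T_\lambda < R < T_\mu \\
\text{non-match}, & R \le T_\lambda
\end{cases}
```

Pairs falling between the two thresholds would typically be left for manual review.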

Assuming that the attributes in the comparison vector x^* are mutually independent, a weight can be derived for each attribute.

Let a_i and b_i be the i-th attribute values of the entity pair to be matched, let m_i be the probability that the i-th attribute values are equal given that the two entities are the same, and let u_i be the probability that the i-th attribute values are equal given that the two entities are not the same. The weight ω_i of the i-th attribute is calculated from these two probabilities.
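
As a hedged illustration, the sketch below computes attribute weights in the classical Fellegi-Sunter form: ω_i = log(m_i / u_i) when the i-th attribute values agree and ω_i = log((1 - m_i) / (1 - u_i)) when they disagree. Both this form and the use of the natural logarithm are assumptions; the exact weight formula of the invention may differ. The function names are hypothetical.

```python
import math


def attribute_weight(m_i, u_i, values_agree):
    """Fellegi-Sunter style weight of one attribute (assumed form)."""
    if not (0.0 < m_i < 1.0 and 0.0 < u_i < 1.0):
        raise ValueError("m_i and u_i must lie strictly between 0 and 1")
    if values_agree:
        # Agreement is evidence for a match when m_i > u_i.
        return math.log(m_i / u_i)
    # Disagreement is evidence against a match when m_i > u_i.
    return math.log((1.0 - m_i) / (1.0 - u_i))


def match_score(agreements, m, u):
    """Sum the weights over all attributes, valid under the independence assumption."""
    return sum(attribute_weight(mi, ui, a) for a, mi, ui in zip(agreements, m, u))
```

An attribute that almost always agrees for true matches (m_i near 1) but rarely agrees by chance (u_i near 0) receives a large positive weight on agreement.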

The relationships between entities in the knowledge base are significant for entity alignment and can effectively improve matching precision and recall. The simple relationship-based local entity alignment method assigns different weights to the attributes of the entity itself and to the attributes of its related entities, and computes the overall similarity as a weighted sum, which can be expressed formally as:

sim(e₁，e₂)＝αsim_attr(e₁，e₂)+(1-α)sim_NB(e₁，e₂)
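
The formula above can be sketched directly, with sim_attr computed as a normalized weighted sum of per-attribute similarities. The function names, the normalization by total weight and the default α = 0.7 are assumptions for illustration; the neighbor similarity sim_NB is taken as an input.

```python
def weighted_attribute_similarity(attr_sims, weights):
    """Weighted average of per-attribute similarities (each in [0, 1])."""
    total = sum(weights[name] for name in attr_sims)
    if total == 0:
        return 0.0
    return sum(weights[name] * sim for name, sim in attr_sims.items()) / total


def combined_similarity(sim_attr, sim_nb, alpha=0.7):
    """sim(e1, e2) = alpha * sim_attr + (1 - alpha) * sim_NB."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_attr + (1.0 - alpha) * sim_nb
```

A larger α makes the entity's own attributes dominate; a smaller α gives more influence to its related entities.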

3. Feature matching based on similarity functions

(1) Token-based similarity function

A token-based similarity function uses a function to convert the text strings to be matched into a set of substrings. The substrings are called tokens, and the function is called the tokenization function, written token(). Commonly used token-based similarity functions are the Jaccard similarity function and the cosine similarity function.

The Jaccard coefficient equals the ratio of the size of the intersection of two sets to the size of their union and measures how related the two sets are. It is calculated as:

Jaccard(s₁, s₂) = |token(s₁) ∩ token(s₂)| / |token(s₁) ∪ token(s₂)|
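
A minimal sketch of the Jaccard similarity of two strings follows. Tokenizing on whitespace after lower-casing is an assumption made here; Chinese equipment names would more likely need character or n-gram tokens.

```python
def tokenize(s):
    """Assumed tokenization function: lower-case, split on whitespace."""
    return set(s.lower().split())


def jaccard(s1, s2):
    """Size of the token intersection divided by the size of the token union."""
    t1, t2 = tokenize(s1), tokenize(s2)
    if not t1 and not t2:
        return 1.0  # two empty strings are treated as identical
    return len(t1 & t2) / len(t1 | t2)
```

Because the tokens form sets, reordering the words of a name does not change the result.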

Cosine similarity treats the token sets of two text strings as two n-dimensional vectors and evaluates the similarity of the strings by the cosine of the angle between the vectors. The weight w of each token in a vector is typically calculated with the tf-idf model. If the vectors of strings s₁ and s₂ are <w₁₁, w₁₂, …, w₁ₙ> and <w₂₁, w₂₂, …, w₂ₙ>, the cosine similarity of s₁ and s₂ can be expressed as:

cos(s₁, s₂) = Σᵢ w₁ᵢ w₂ᵢ / ( √(Σᵢ w₁ᵢ²) · √(Σᵢ w₂ᵢ²) )

where each sum runs over i = 1, …, n.
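
The tf-idf weighted cosine similarity can be sketched with the standard library alone. The smoothed idf variant log((1 + N) / (1 + df)) + 1 and the whitespace tokenization are assumptions chosen for this illustration; the invention does not specify which tf-idf variant it uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict token -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: count * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return vectors


def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v2.get(tok, 0.0) for tok, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Rare tokens receive larger weights than tokens shared by many names, so a match on a distinctive token counts for more.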

(2) Edit distance based similarity function

Unlike token-based similarity functions, an edit distance-based similarity function treats the text strings to be matched as a whole and uses the minimum cost of the edit operations needed to convert one string into the other as the measure of their similarity. Basic edit operations include insertion, deletion, replacement and transposition. Edit distance-based similarity functions handle error-sensitivity problems such as data entry errors effectively. Common edit distance-based similarity functions are those based on the Levenshtein distance, the Smith-Waterman distance, and the Jaro and Jaro-Winkler distances.

This method yields the similarity distance between two strings. Weighted similarity calculation is then used to compute the true similarity, that is, the weighted similarity, of the names and attributes of two entities in the Neo4j graph database. FIG. 1 summarizes the algorithm.

Example 4:

This technique provides a corresponding knowledge fusion algorithm. The algorithm takes the knowledge graphs of the three systems as input, performs fusion calculation on entities of the same type in the three graphs using the distribution network entity semantic fusion algorithm, and builds a unified distribution network grid knowledge graph.

The fusion result across the different systems is as follows, where:

cms_equip_id: transformer id in the marketing business application system;

cms_tran_name: transformer name in the marketing business application system;

pms_obj_id: transformer id in the equipment (asset) operation and maintenance lean management system;

pms_tran_name: transformer name in the equipment (asset) operation and maintenance lean management system;

gis_oid: transformer id in the geographic information system;

gis_tran_name: transformer name in the geographic information system.

The table is the result obtained by fusing the marketing business application system, the equipment (asset) operation and maintenance lean management system and the geographic information system according to the above fusion method.

In this technique, fusion refers to the semantic fusion of the ontology models of the marketing business application system, the equipment (asset) operation and maintenance lean management system and the geographic information system. FIG. 2 shows the specific fusion steps. Because the ontology models of the three systems are built separately, the same attribute may be described in different ways, so an ontology model fusion function was developed. The fusion function completes the fusion of the three systems' ontology models automatically to a certain extent and lets the user modify the fusion result to improve its accuracy. The fused ontology model is the storage template for the final knowledge graph instance data, and its quality directly affects how well the graph can be applied.

The invention provides a distribution network multi-source grid entity fusion method based on weighted semantic similarity, which improves the efficiency of marketing-distribution integration work and ensures the reliability of entities and relationships. Compared with the traditional approach, automatic matching replaces manual checking. The method also introduces similarity calculation algorithms, natural language processing (NLP) and machine learning algorithms, which improve matching accuracy. In addition, the technique promotes data collection, fusion and sharing and strengthens the enterprise's data service capability, raising the level of data analysis applications and the value of big data. Data can be operated on more conveniently, and the fusion of distribution network multi-source grid entities is improved.

Claims

1. A distribution network multi-source grid entity fusion method based on weighted semantic similarity is characterized by comprising the following steps:

step one: performing knowledge extraction on grids from a plurality of different sources to obtain a plurality of heterogeneous ontologies;

step two: searching the relation among a plurality of heterogeneous ontologies, establishing corresponding mapping, fusing the heterogeneous ontologies to form a plurality of knowledge graph ontology models;

step three: fusing the plurality of knowledge graph ontology models by using a weighting algorithm;

step four: obtaining the fused result.

2. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 1, characterized in that: the knowledge extraction in step one comprises entity extraction, relationship extraction and attribute extraction, and knowledge extraction is performed on the multi-source grid in that order.

3. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 1, characterized in that: the ontology fusion in step two obtains a global ontology by an ontology integration method and then obtains the mapping relationship between each single heterogeneous ontology and the global ontology.

4. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 3, characterized in that: the ontology integration eliminates the heterogeneity among the plurality of heterogeneous ontologies and merges them directly into a global ontology.

5. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 3, characterized in that: the mapping relationship is obtained in three steps: importing the ontologies to be mapped, discovering the mapping, and representing the mapping.

6. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 5, characterized in that: discovering the mapping means discovering the relationship between the global ontology and a single heterogeneous ontology: based on the attributes of the single heterogeneous ontology, the names, labels and comments of the single heterogeneous ontology and the global ontology are compared using string-based and language-based methods to find the similarity between them, thereby obtaining the mapping relationship between the single heterogeneous ontology and the global ontology and, in turn, the knowledge graph ontology model.

7. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 1, characterized in that: the weighting algorithm in step three is a probabilistic model-based alignment method that assigns different weights to the attributes obtained by attribute extraction during knowledge extraction.

8. The distribution network multi-source grid entity fusion method based on weighted semantic similarity according to claim 7, characterized in that: the weights are calculated for each attribute based on the tf-idf model.