
CN110909174B - Knowledge graph-based method for improving entity link in simple question answering - Google Patents


Info

Publication number
CN110909174B
Authority
CN
China
Prior art keywords
entity
vector
word
entities
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911131171.7A
Other languages
Chinese (zh)
Other versions
CN110909174A (en
Inventor
陈凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201911131171.7A priority Critical patent/CN110909174B/en
Publication of CN110909174A publication Critical patent/CN110909174A/en
Application granted granted Critical
Publication of CN110909174B publication Critical patent/CN110909174B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge-graph-based method for improving entity linking in simple question answering, belonging to the technical field of natural language processing. The method comprises: establishing a central server and a question input client; establishing an entity detection module, an entity candidate set module, a knowledge graph retrieval module and an entity matching module in the central server; detecting the question data; establishing an entity candidate set; encoding the question data; performing three-level encoding on the entities in the entity candidate set; and selecting the n entities with the highest matching scores against the question data from the entity candidate set, using a distinctive question encoding scheme.

Description

Knowledge graph-based method for improving entity link in simple question answering
Technical Field
The invention belongs to the technical field of big data, and relates to a knowledge-graph-based method for improving entity linking in simple question answering.
Background
In recent years, more and more open-source knowledge graphs (KGs) containing large numbers of facts have emerged, such as FreeBase, Yago, and DBpedia. Question answering with a knowledge graph as the answer source (KG-QA) has been a hot spot of recent research. There are two main ways to store knowledge graphs: RDF-based storage and graph-database-based storage.
Conventional KG-QA methods can be divided into three major classes. The first is semantic parsing: a largely linguistic approach whose main idea is to convert the natural-language question into a series of formal logic forms, analyze these forms bottom-up to obtain a logic form that expresses the semantics of the whole question, and issue a corresponding query statement (similar to lambda-calculus) against the knowledge base to obtain the answer. The second is information extraction: extract the entity in the question, query it in the knowledge base to obtain a knowledge-base subgraph centered on that entity node, treat each node or edge in the subgraph as a candidate answer, extract information from the question according to rules or templates to obtain a question feature vector, and feed that vector into a classifier that screens the candidate answers, thereby obtaining the final answer. The third is vector modeling, whose idea is close to that of information extraction: obtain candidate answers from the question, map both the question and the candidate answers into distributed representations, and train these representations on training data so that the score (usually a dot product) between the vector representations of the question and the correct answer is as high as possible.
In general, simple knowledge-graph-based question answering (KG-SimpleQA) involves two key subtasks. (1) Entity linking: detect the entities mentioned in the question and link them to the KG. (2) Relation prediction: identify the relation in the knowledge graph that the question asks about the entity. For example, for the question "what language is sketch magazine written in?", it is necessary to find the mention of an entity in the question, "sketch magazine", link it to the corresponding entity "m.03c14nk" in the knowledge graph, and identify the relation asked about that entity: "book/periodical/language".
Entity linking still presents some unsolved problems, namely entity ambiguity and OOV (the entity mention in the question has no corresponding vector representation in the pre-trained word vector model). The entity ambiguity problem means that different entities in the knowledge graph share the same name, which is a huge obstacle to linking the entity in the question to the correct entity in the knowledge graph. In the example above, the entity mentioned in the question is "sketch magazine", but many entities in the knowledge graph bear that name, which creates an entity confusion problem. To address entity confusion and OOV, previous work has proposed several models. Lukovnikov et al. introduce a character-level encoding of each word in the question when vectorizing it, combined with word-level encoding, as the vector representation of the question; this handles the OOV problem well, but because 92.9% of OOV words are entities or parts of entities, character-level encoding loses the semantics of the entities, which is an information loss for entity linking. To address entity confusion, Dai et al. encode the type information of an entity as its vector representation: each dimension of the type vector is 1 or 0, indicating whether the entity carries a particular type, so the dimension of the vector equals the number of entity types in the knowledge graph. This works well for entity confusion but ignores information about the entity itself. Yin et al., when encoding the question for entity linking, concatenate the character-level and word-level codes of each word as the question code, and when encoding an entity, jointly consider the character-level code of the entity name and the word-level code of the entity type.
However, the knowledge graph carries little type information about entities, considering only one level of encoding is insufficient to solve the entity confusion problem, and character-level encoding of the question loses important semantic information.
In recent years, neural network models incorporating an attention mechanism have been proposed. In entity linking, their main task is to make the vectorization of the question better reflect entity-related information, so that the part of the question related to the entity is exploited to the maximum; however, such models are generally complex and still do not handle the entity confusion problem well.
Disclosure of Invention
The invention aims to provide a knowledge-graph-based method for improving entity linking in simple question answering, which solves the OOV problem without losing semantic information and, by considering three levels of entity information, handles entity confusion well.
In order to achieve the purpose, the invention adopts the following technical scheme:
A knowledge-graph-based improved method for entity linking in simple question answering includes the following steps:
step 1: establishing a central server and a question input client, wherein the question input client is used for collecting question data and transmitting it to the central server through the Internet for processing;
establishing an entity detection module, an entity candidate set module, a knowledge graph retrieval module and an entity matching module in a central server;
the knowledge graph retrieval module is used to interface with the open-source knowledge graph KG and to provide retrieval services for it;
step 2: after the central server receives the question data, the entity detection module detects the question data and predicts the subject words of the question in the question data, and the steps are as follows:
step A1: establishing a BILSTM-CRF model for the sequence labeling problem;
step A2: according to the BILSTM-CRF model, assigning one of two labels, "i" and "o", to each word in the question data, where "i" denotes that the corresponding word is part of the question subject word;
step A3: obtaining each question subject word in the question data through the method of the step A1 and the step A2;
step 3: by retrieving the knowledge graph, the knowledge graph retrieval module transmits all entities corresponding to entity names that exactly match the question subject word to the entity candidate set module;
step 4: the entity candidate set module establishes an entity candidate set, and all entities retrieved in step 3 are screened and then stored in the entity candidate set; that is, all entities whose entity names partially match an n-gram of the question subject word are kept in the entity candidate set, where n runs from high to low, and an n-gram is discarded if it is not the question subject word itself and matches more than 50 entities;
step 5: the entity matching module reads the entity candidate set and selects the n entities with the highest matching scores with the question data from it; the steps are as follows:
step C1: encoding the question data at word level: obtaining the vector representation of the question through a pre-trained word vector model, feeding it into a BILSTM, and finally applying max-pooling to the hidden vectors to obtain the final vector representation of the question data, i.e., the vector code of the question data;
step C2: acquiring an entity candidate set, and performing three-level coding on entities in the entity candidate set, wherein the three-level coding comprises performing word-level coding on names of the entities, performing type-level coding on types of the entities and performing word-level coding on the types of the entities;
obtaining type vector codes of type-level entities in the entity candidate set, vector codes of names of word-level entities and vector codes of types of word-level entities;
step C3: calculating the similarity between the vector code of the question data and, respectively, the type vector code of the type-level entity, the vector code of the name of the word-level entity, and the vector code of the type of the word-level entity, and taking the n candidate entities with the highest scores as the predicted entities.
Preferably, when performing step 4, partial matching has the following limitation: the number of words in the entity name in the knowledge graph cannot exceed the number of words in the question subject word by more than one.
Preferably, in executing step C1, when the OOV problem is encountered, word-level encoding is performed on the type of the entity whose name falls outside the pre-trained model's vocabulary.
Preferably, in the step C2, the type-level coding is performed by using a bag-of-words model, i.e. the vector dimension is the number of total entity types in the knowledge graph.
Preferably, in step C3, similarity is calculated between each of the three levels of vector codes of an entity in the entity candidate set and the vector code of the question data, and the average of the three similarities is taken; after the BILSTM is applied to the question data, each word has hidden-layer vectors in two directions, forward and backward, and the hidden-layer vector of each word is obtained by splicing the two.
Preferably, when step C3 is executed, the forward hidden-layer vector of the last word in the question data is concatenated with the backward hidden-layer vector of the first word to serve as the vector code of the question data, so that the encoded information of all words in both directions is utilized; the specific calculation is as follows:
S(q, s) = (1/3) × [ sim(qs(q), et(s)) + sim(qs(q), el(s)) + sim(qs(q), ew(s)) ]
where qs (q) represents the vector code for the problem data, et(s) represents the type vector code for the entity of type-level, el(s) is the vector code for the name of the entity of word-level, and ew(s) is the vector code for the type of the entity of word-level.
The invention relates to a knowledge-graph-based method for improving entity linking in simple question answering. It solves the OOV (out-of-vocabulary) problem without losing semantic information and, by considering three levels of entity information, handles the technical problem of entity confusion well. It adopts a distinctive question encoding scheme: when encoding the question, a word that has no vector representation in the word vector model is encoded by its type, which preserves the word's semantic information while solving the OOV problem. It further provides a three-level entity encoding method to address entity confusion, making full use of the type information and name information of the entity and combining it with the question encoding scheme, thereby effectively solving both the entity confusion and OOV problems.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Fig. 1 shows an improved knowledge-graph-based method for entity linking in simple question answering, which includes the following steps:
step 1: establishing a central server and a question input client, wherein the question input client is used for collecting question data and transmitting it to the central server through the Internet for processing;
establishing an entity detection module, an entity candidate set module, a knowledge graph retrieval module and an entity matching module in a central server;
the knowledge graph retrieval module is used to interface with the open-source knowledge graph KG and to provide retrieval services for it;
step 2: after the central server receives the question data, the entity detection module detects the question data and predicts the subject words of the question in the question data, and the steps are as follows:
step A1: establishing a BILSTM-CRF model for the sequence labeling problem;
step A2: according to the BILSTM-CRF model, assigning one of two labels, "i" and "o", to each word in the question data, where "i" denotes that the corresponding word is part of the question subject word;
step A3: obtaining the subject word of the question through the methods of steps A1 and A2;
in the present embodiment, for example, the question data is "what language is skin map writer in? The term "topic word of the question in the question data is" sketch map ".
This subtask is treated as a sequence labeling problem, which the invention solves by training a BILSTM-CRF model. Two labels, "i" and "o", are used to label each word in the question, "i" indicating that the corresponding word is part of the question subject word. Through this step, the subject word of the question can be obtained for each piece of question data.
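The i/o labeling scheme reduces subject-word detection to span extraction over the tagger's output. A minimal sketch of that decoding step follows; the token and tag sequences are illustrative stand-ins for what a trained BILSTM-CRF would actually emit:

```python
def extract_subject_words(tokens, tags):
    """Collect maximal runs of tokens labeled "i" as question subject words."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "i":
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Labels as a trained tagger might emit them for the sample question.
tokens = "what language is sketch magazine written in".split()
tags = ["o", "o", "o", "i", "i", "o", "o"]
print(extract_subject_words(tokens, tags))  # ['sketch magazine']
```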
step 3: by retrieving the knowledge graph, the knowledge graph retrieval module transmits all entities corresponding to entity names that exactly match the question subject word to the entity candidate set module;
a knowledge graph contains millions of entities, and it is impractical to encode all of them and compare their similarity to the question, so the invention creates a candidate set based on the results of entity detection.
step 4: the entity candidate set module establishes an entity candidate set, and all entities retrieved in step 3 are screened and then stored in the entity candidate set; that is, all entities whose entity names partially match an n-gram of the question subject word are kept in the entity candidate set, where n runs from high to low, and an n-gram is discarded if it is not the question subject word itself and matches more than 50 entities;
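The n-gram screening of step 4 can be sketched as follows. Here `name_index` is a hypothetical in-memory stand-in for the knowledge graph retrieval module, mapping an entity name to the set of entity IDs bearing it, and the one-extra-word length limit from the partial-matching restriction is folded in:

```python
def ngrams(words, n):
    """All contiguous n-grams of a word list, joined back into strings."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_candidate_set(subject_word, name_index, max_matches=50):
    """Keep entities whose names partially match an n-gram of the subject
    word, n from high to low; over-matched non-subject n-grams are dropped."""
    words = subject_word.split()
    candidates = set()
    for n in range(len(words), 0, -1):          # n runs from high to low
        for gram in ngrams(words, n):
            for name, ids in name_index.items():
                # entity name may be at most one word longer than the subject
                if gram in name and len(name.split()) <= len(words) + 1:
                    if gram != subject_word and len(ids) > max_matches:
                        continue                # discard over-matched n-grams
                    candidates |= ids
    return candidates

# Toy index: the real module would query FreeBase instead.
index = {
    "sketch magazine": {"m.03c14nk"},
    "sketch": {"m.0aaa1", "m.0aaa2"},
}
print(sorted(build_candidate_set("sketch magazine", index)))
```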
and 5: the entity matching module reads the entity candidate set and selects n entities with the highest matching scores with the problem data from the entity candidate set, and the steps are as follows:
step C1: encoding the question data at word level: obtaining the vector representation of the question through a pre-trained word vector model, feeding it into a BILSTM, and finally applying max-pooling to the hidden vectors to obtain the final vector representation of the question data, i.e., the vector code of the question data;
word-level encoding is performed on the question data, and for words in the question data that have no corresponding vector in the pre-trained vector model, the corresponding type is matched in the KG (FreeBase is used in this embodiment);
since 88.5% of question subject words that encounter the OOV problem can be matched to only one entity in FreeBase, the influence of the entity confusion problem is small; if entity confusion is nevertheless encountered, the type with the highest frequency in the FreeBase triples of the entity is used as the code of the question subject word;
as shown in fig. 1, for the question data "what language is sketch magazine written in?", the word "sketch" cannot find a corresponding vector in the pre-trained vector model and is part of the question subject word "sketch magazine", so the vector of the type of the subject word "sketch magazine" is used as its vector representation; to do this, "sketch magazine" must first be matched to the unique entity "m.03c14nk" in FreeBase.
Then the word-level information of the type of m.03c14nk is taken: the vectors obtained for its words in the pre-trained word vector model are connected as the vectorized representation of "sketch magazine"; the vectorized representation of each word in the question data is taken as the input of the BILSTM, the hidden vectors obtained in the forward order are connected with those obtained in the backward order, and finally a vector of fixed dimension is obtained through max-pooling.
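The question encoding just described — per-word vectors through a BILSTM, forward and backward hidden states connected per word, then max-pooling down to a fixed dimension — can be sketched with random stand-in hidden states; no trained BILSTM is assumed here, only the shape of its output:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 7, 4                             # 7 words, hidden size 4 per direction
h_fwd = rng.standard_normal((T, H))     # stand-in for forward hidden states
h_bwd = rng.standard_normal((T, H))     # stand-in for backward hidden states

# Connect the two directions per word, then max-pool over the sequence
# to obtain a fixed-dimension question vector regardless of length T.
h = np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2H)
q_vec = h.max(axis=0)                        # element-wise max over time

print(q_vec.shape)  # (8,)
```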
Step C2: acquiring an entity candidate set, and performing three-level coding on entities in the entity candidate set, wherein the three-level coding comprises performing word-level coding on names of the entities, performing type-level coding on types of the entities and performing word-level coding on the types of the entities;
obtaining type vector codes of type-level entities in the entity candidate set, vector codes of names of word-level entities and vector codes of types of word-level entities;
in order to solve the entity confusion problem existing in the entity link, the type information of the entity needs to be utilized, but the type information about the entity in the FreeBase is not rich enough (the types of a plurality of entities in the FreeBase are simplified into common/topic), and the problem cannot be effectively solved by utilizing information of one layer alone, so that the type information of the entity needs to be enriched by utilizing multi-layer coding of the type. The method adopts three-level coding for entity names and entity types.
As shown in fig. 1, for the entity "m.03c14nk" in the entity candidate set, the name "sketch magazine" of the entity is first obtained from its attribute "type.object.name" in FreeBase, and the name is segmented into the sequence {sketch, magazine}; since "sketch" cannot obtain a word vector representation in the pre-trained model, a vector is obtained for it by random initialization; the sequence is used as input to a BILSTM, the output hidden vectors are processed by max-pooling to obtain a vector of fixed dimension, and at this point the word-level code of the entity name is obtained. The entity attributes "type/object/type" and "common/topic/notable_types" are used to obtain the types of the entity, "book/magazine", "book/periodical" and "common/topic"; "/" and "_" are used to tokenize the entity types, yielding the sequence {book, magazine, periodical, common, topic}, and, similarly, a vector representation is obtained from GloVe;
the word-level code of the entity types is then obtained through BILSTM and max-pooling; for the type-level code of the entity types, the vector dimension is fixed to the number of entity types in FreeBase, and the vector features are extracted without model training.
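The type tokenization and the training-free type-level bag-of-words code described above can be sketched as follows; the type inventory here is a toy stand-in for the full FreeBase type list:

```python
import re

def tokenize_types(types):
    """Split FreeBase-style type paths on "/" and "_" into words."""
    words = []
    for t in types:
        words.extend(w for w in re.split(r"[/_]", t) if w)
    return list(dict.fromkeys(words))   # de-duplicate, keeping first-seen order

def type_level_vector(entity_types, type_inventory):
    """Bag-of-words over the KG's type inventory: dimension i is 1 iff the
    entity carries type i — no model training is needed for this level."""
    carried = set(entity_types)
    return [1 if t in carried else 0 for t in type_inventory]

types = ["book/magazine", "book/periodical", "common/topic"]
print(tokenize_types(types))            # ['book', 'magazine', 'periodical', 'common', 'topic']

inventory = ["book/magazine", "book/periodical", "common/topic", "film/film"]
print(type_level_vector(types, inventory))  # [1, 1, 1, 0]
```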
Step C3: and respectively carrying out similarity calculation on the vector code of the problem data and the type vector code of the type-level entity, the vector code of the name of the word-level entity and the vector code of the type of the word-level entity, and taking n candidate entities with the highest scores as predicted entities.
Preferably, when performing step 4, partial matching has the following limitation: the number of words in the entity name in the knowledge graph cannot exceed the number of words in the question subject word by more than one.
Preferably, in executing step C1, when the OOV problem is encountered, word-level encoding is performed on the type of the entity whose name falls outside the pre-trained model's vocabulary.
Preferably, in the step C2, the type-level coding is performed by using a bag-of-words model, i.e. the vector dimension is the number of total entity types in the knowledge graph.
One traditional technical scheme splices the vectors of the three levels into a single entity vector and then computes similarity with the question vector; since the type-level vector of the entity type alone has 500 dimensions, adding the splices of the other levels pushes the entity vector representation to thousands of dimensions, which introduces large error.
Another traditional scheme concatenates the vectors of each word and then uses a pooling operation to obtain a fixed-dimension vector; this loses too much information and performs very poorly. Nor does the invention adopt the scheme of using the three levels of entity vectors to generate a corresponding question vector representation for each level and then computing similarity; that scheme makes the encoding of the question as close as possible to the entity information it contains, but its ability to distinguish candidate entities with very high similarity is poor, because many entities in the candidate set have the same name and some even have very similar type information.
In the invention, the following technical scheme is adopted for improvement instead of the traditional technical scheme:
preferably, in step C3, similarity calculation is performed on the three-level vector codes of the entities in the entity candidate set and the vector code of the question data, and finally an average value is obtained, after BILSTM is performed on the vector codes of the question data, each word has hidden layer vectors in two directions, namely, forward and backward, and the hidden layer vectors of each word are obtained by splicing the hidden layer vectors.
Preferably, when step C3 is executed, the forward hidden-layer vector of the last word in the question data is concatenated with the backward hidden-layer vector of the first word to serve as the vector code of the question data, so that the encoded information of all words in both directions is utilized; the specific calculation is as follows:
S(q, s) = (1/3) × [ sim(qs(q), et(s)) + sim(qs(q), el(s)) + sim(qs(q), ew(s)) ]
where qs (q) represents the vector code for the problem data, et(s) represents the type vector code for the entity of type-level, el(s) is the vector code for the name of the entity of word-level, and ew(s) is the vector code for the type of the entity of word-level.
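Ranking under the averaged three-level similarity described above can be sketched as follows. Cosine similarity and the toy vectors are illustrative assumptions (the patent does not fix the similarity function here), and all codes are taken to live in a common dimension for the comparison:

```python
import math

def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score(q_vec, e_type, e_name, e_word):
    """Average of the question vector's similarity to each of the three
    entity codes: type-level, name word-level, and type word-level."""
    return (cos(q_vec, e_type) + cos(q_vec, e_name) + cos(q_vec, e_word)) / 3.0

# Toy question vector and candidate codes (type-level, name, type word-level).
q = [0.2, 0.9, 0.1]
candidates = {
    "m.03c14nk": ([0.1, 0.8, 0.0], [0.3, 0.9, 0.2], [0.2, 0.7, 0.1]),
    "m.0aaa1":   ([0.9, 0.1, 0.0], [0.8, 0.0, 0.3], [0.7, 0.2, 0.1]),
}
ranked = sorted(candidates, key=lambda s: score(q, *candidates[s]), reverse=True)
print(ranked[0])  # m.03c14nk — all three of its codes point the same way as q
```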
This patent relates to simple question answering (SimpleQA), in which a question can be answered by reasoning over a single fact in the knowledge graph; the invention uses several deep learning models (BILSTM, BiGRU) to complete simple question answering based on the knowledge graph.
The invention relates to a knowledge-graph-based method for improving entity linking in simple question answering. It solves the OOV (out-of-vocabulary) problem without losing semantic information and, by considering three levels of entity information, handles the technical problem of entity confusion well. It adopts a distinctive question encoding scheme: when encoding the question, a word that has no vector representation in the word vector model is encoded by its type, which preserves the word's semantic information while solving the OOV problem. It further provides a three-level entity encoding method to address entity confusion, making full use of the type information and name information of the entity and combining it with the question encoding scheme, thereby effectively solving both the entity confusion and OOV problems.

Claims (6)

1. A knowledge-graph-based improved method for entity linking in simple question answering, characterized in that the method comprises the following steps:
step 1: establishing a central server and a question input client, wherein the question input client is used for collecting question data and transmitting it to the central server through the Internet for processing;
establishing an entity detection module, an entity candidate set module, a knowledge graph retrieval module and an entity matching module in a central server;
the knowledge graph retrieval module is used to interface with the open-source knowledge graph KG and to provide retrieval services for it;
step 2: after the central server receives the question data, the entity detection module detects the question data and predicts the subject words of the question in the question data, and the steps are as follows:
step A1: establishing a BILSTM-CRF model for the sequence labeling problem;
step A2: according to the BILSTM-CRF model, assigning one of two labels, "i" and "o", to each word in the question data, where "i" denotes that the corresponding word is part of the question subject word;
step A3: obtaining question subject words in the question data through the methods of the step A1 and the step A2;
step 3: by retrieving the knowledge graph, the knowledge graph retrieval module transmits all entities corresponding to entity names that exactly match the question subject word to the entity candidate set module;
step 4: the entity candidate set module establishes an entity candidate set, and all entities retrieved in step 3 are screened and then stored in the entity candidate set; that is, all entities whose entity names partially match an n-gram of the question subject word are kept in the entity candidate set, where n runs from high to low, and an n-gram is discarded if it is not the question subject word itself and matches more than 50 entities;
step 5: the entity matching module reads the entity candidate set and selects the n entities with the highest matching scores with the question data from it; the steps are as follows:
step C1: encoding the question data at word level: obtaining the vector representation of the question through a pre-trained word vector model, feeding it into a BILSTM, and finally applying max-pooling to the hidden vectors to obtain the final vector representation of the question data, i.e., the vector code of the question data;
step C2: acquiring an entity candidate set, and performing three-level coding on entities in the entity candidate set, wherein the three-level coding comprises performing word-level coding on names of the entities, performing type-level coding on types of the entities and performing word-level coding on the types of the entities;
obtaining type vector codes of type-level entities in the entity candidate set, vector codes of names of word-level entities and vector codes of types of word-level entities;
step C3: calculating the similarity between the vector code of the question data and, respectively, the type vector code of the type-level entity, the vector code of the name of the word-level entity, and the vector code of the type of the word-level entity, and taking the n candidate entities with the highest scores as the predicted entities.
2. The method of claim 1, wherein in step 4 the partial match is subject to the following restriction: the number of words in an entity name in the knowledge graph may not exceed the number of words in the question topic words by more than one.
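Under the reading that an entity name may be at most one word longer than the question topic words, claim 2's restriction reduces to a one-line check (the function name is an assumption):

```python
def partial_match_allowed(entity_name, topic_words):
    # entity name may be at most one word longer than the topic words
    return len(entity_name.split()) <= len(topic_words.split()) + 1

print(partial_match_allowed("new york city", "new york"))        # -> True
print(partial_match_allowed("new york city area", "new york"))   # -> False
```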
3. The method of claim 1, wherein in step C1, when the out-of-vocabulary (OOV) problem is encountered, i.e. a word lies outside the pre-trained model's vocabulary, word-level encoding of the entity type is performed instead.
4. The method of claim 1, wherein in step C2 the type-level encoding is based on the bag-of-words model, i.e. the vector dimension equals the total number of entity types in the knowledge graph.
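The bag-of-words type-level encoding of claim 4 can be sketched as below: the vector has one dimension per entity type in the knowledge graph, with an entity's types marked 1 (the type sets here are illustrative):

```python
def type_level_encode(entity_types, all_types):
    # one dimension per entity type in the knowledge graph, sorted for stability
    index = {t: i for i, t in enumerate(sorted(all_types))}
    vec = [0] * len(all_types)
    for t in entity_types:
        vec[index[t]] = 1
    return vec

all_types = {"person", "film", "location"}
print(type_level_encode({"film"}, all_types))   # -> [1, 0, 0]
```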
5. The method of claim 1, wherein in step C3 the similarity between each of the three-level vector codes of an entity in the candidate set and the vector code of the question data is computed separately, and the average of these similarities is taken; for the vector code of the question data, after the BiLSTM each word has hidden-layer vectors in both the forward and backward directions, which are concatenated to obtain the hidden-layer vector of that word.
6. The method of claim 5, wherein in step C3 the forward hidden-layer vector of the last word of the question data and the backward hidden vector of the first word are concatenated to form the vector code of the question data, so that the encoded information of all words in both directions is utilized; the specific calculation is as follows:
s(q) = [ sim(qs(q), et(s)) + sim(qs(q), el(s)) + sim(qs(q), ew(s)) ] / 3
where qs(q) denotes the vector code of the question data, et(s) the type-level type vector code of the entity, el(s) the word-level vector code of the entity's name, ew(s) the word-level vector code of the entity's type, and sim(·,·) the similarity function.
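The concatenation described in claim 6 can be sketched as below, assuming the per-word forward and backward BiLSTM hidden vectors are given as two arrays:

```python
import numpy as np

def question_vector(forward_h, backward_h):
    # forward_h, backward_h: (seq_len, hidden_dim) per-direction hidden vectors;
    # concatenate the last word's forward state with the first word's backward state
    return np.concatenate([forward_h[-1], backward_h[0]])

fwd = np.array([[0.1, 0.2], [0.3, 0.4]])
bwd = np.array([[0.5, 0.6], [0.7, 0.8]])
print(question_vector(fwd, bwd))   # -> [0.3 0.4 0.5 0.6]
```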
CN201911131171.7A 2019-11-19 2019-11-19 Knowledge graph-based method for improving entity link in simple question answering Expired - Fee Related CN110909174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911131171.7A CN110909174B (en) 2019-11-19 2019-11-19 Knowledge graph-based method for improving entity link in simple question answering

Publications (2)

Publication Number Publication Date
CN110909174A CN110909174A (en) 2020-03-24
CN110909174B true CN110909174B (en) 2022-01-04

Family

ID=69818090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911131171.7A Expired - Fee Related CN110909174B (en) 2019-11-19 2019-11-19 Knowledge graph-based method for improving entity link in simple question answering

Country Status (1)

Country Link
CN (1) CN110909174B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535970A (en) * 2020-04-22 2021-10-22 阿里巴巴集团控股有限公司 Information processing method and apparatus, electronic device, and computer-readable storage medium
CN111797245B (en) * 2020-07-27 2023-07-25 中国平安人寿保险股份有限公司 Knowledge graph model-based information matching method and related device
CN114691973A (en) * 2020-12-31 2022-07-01 华为技术有限公司 Recommendation method, recommendation network and related equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480125A * 2017-07-05 2017-12-15 Chongqing University of Posts and Telecommunications A knowledge-graph-based relation linking method
CN108875051A * 2018-06-28 2018-11-23 Global Tone Communication Technology Co., Ltd. Automatic knowledge graph construction method and system for massive unstructured text
CN109271524A * 2018-08-02 2019-01-25 Institute of Computing Technology, Chinese Academy of Sciences Entity linking method in a knowledge base question answering system
US10331402B1 * 2017-05-30 2019-06-25 Amazon Technologies, Inc. Search and knowledge base question answering for a voice user interface
CN110298042A * 2019-06-26 2019-10-01 Sichuan Changhong Electric Co., Ltd. Film and television entity recognition method based on BiLSTM-CRF and knowledge graph

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157540B2 (en) * 2016-09-12 2021-10-26 International Business Machines Corporation Search space reduction for knowledge graph querying and interactions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Entity linking based on the co-occurrence graph and entity probability;Alan Eckhardt et al.;《Proceedings of the First International Workshop on Entity Recognition & Disambiguation》;20140630;pp. 37-44 *
Research on a knowledge graph construction method based on multiple data sources;Wu Yunbing et al.;《Journal of Fuzhou University (Natural Science Edition)》;20170630;Vol. 45, No. 3, pp. 329-335 *

Also Published As

Publication number Publication date
CN110909174A (en) 2020-03-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220104