CN114417874B - Chinese named entity recognition method and system based on graph attention network - Google Patents
- Publication number: CN114417874B (application CN202210083152.7A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G06F40/295—Named entity recognition (G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities; G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking)
- G06F40/216—Parsing using statistical methods (G06F40/20—Natural language analysis; G06F40/205—Parsing)
- G06N3/044—Recurrent networks, e.g. Hopfield networks (G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/045—Combinations of networks (G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
Abstract
The invention discloses a Chinese named entity recognition method based on a graph attention network, comprising the following steps: obtaining a Chinese sentence in which named entities are to be recognized; constructing the word-vector set X of the Chinese sentence; and inputting the word-vector set X into a trained graph-attention-network-based Chinese named entity recognition model to obtain the named-entity labels of the sentence. The invention solves two technical problems: in the existing BiLSTM-CRF model, word boundaries and entity boundaries are inconsistent and the model input features are of a single type; and in the existing collaborative graph network model, the conventional graph attention calculation impairs the expressive power of graph attention.
Description
Technical Field
The invention belongs to the technical field of entity recognition, and particularly relates to a Chinese named entity recognition method and system based on a graph attention network.
Background
Named Entity Recognition (NER) is a fundamental problem in natural language processing and the first step of a series of downstream tasks such as relation extraction, knowledge-graph construction, and intent detection. The main goal of NER is to identify entities with specific meanings in unstructured text, chiefly names of people, places, and organizations, proper nouns, and expressions of time, quantity, currency, and proportion.
Early named entity recognition revolved around rule-based and dictionary-based methods, but these are inefficient, costly, and demand extensive expert knowledge; the better-performing NER models today are based on deep learning or statistical learning. Among them, BiLSTM-CRF is a widely used architecture for English NER: it uses word-level representations and takes the word as the basic unit for predicting labels. Chinese named entities are harder to recognize than English ones, which led to the practice of first applying a word-segmentation tool and then running a word-sequence labeling model as in English. In addition, the collaborative graph network model was the first to introduce the graph attention mechanism into NER, integrating lexical knowledge such as self-matched words and nearest-context words into the encoding layer and further improving recognition.
Even after being augmented with a Chinese word-segmentation tool, the traditional BiLSTM-CRF model still suffers from the following problems, which degrade its performance. First, word boundaries are not necessarily entity boundaries: for example, "Beijing Palace Museum" should be a single entity of type location, but a segmentation tool may split it into three words, "Beijing", "Palace", and "Museum". Second, although the introduction of neural networks has greatly advanced Chinese word segmentation, existing segmenters are far from perfect, and the single type of feature the model considers inevitably causes error propagation.
When the collaborative graph network model computes attention with a conventional graph attention network, the consecutive linear operations yield only static attention, which impairs the expressive power of graph attention.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a Chinese named entity recognition method and system based on a graph attention network, aiming to solve the technical problems that, in the existing BiLSTM-CRF model, word boundaries and entity boundaries are inconsistent and the model input features are of a single type, and that the conventional graph attention calculation in the existing collaborative graph network model impairs the expressive power of graph attention.
To achieve the above object, according to one aspect of the present invention, there is provided a Chinese named entity recognition method based on a graph attention network, comprising the following steps:
(1) Obtain a Chinese sentence in which Chinese named entities are to be recognized.
(2) Construct the word-vector set X of the Chinese sentence based on the sentence obtained in step (1).
(3) Input the word-vector set X obtained in step (2) into a trained graph-attention-network-based Chinese named entity recognition model to obtain the Chinese named-entity labels of the sentence.
Preferably, step (2) first represents the Chinese sentence as a character sequence s = {s_1, s_2, …, s_M}, where s_m denotes the m-th character of the sentence and m ∈ [1, M], with M the total number of characters; then, each character of the sequence is represented as a word vector x_m = f(s_m) by looking up a character-embedding matrix, and all word vectors form the word-vector set X of the sentence, where f is a character-embedding lookup table trained with the continuous bag-of-words model (CBOW).
Preferably, the graph-attention-network-based Chinese named entity recognition model in step (3) is obtained through training with the following steps:
(3-1) Acquire a Chinese named entity recognition data set annotated with the BIOES tagging scheme, and map the text of each Chinese sentence in the data set to word vectors to obtain the word-vector set of each sentence.
(3-2) Input the word-vector set of each Chinese sentence obtained in step (3-1) into a bidirectional long short-term memory (BiLSTM) model to obtain preliminary feature vectors, and input the preliminary feature vectors into an improved graph attention network (GAT) model to obtain the final feature vector of the sentence.
(3-3) Input the final feature vector obtained in step (3-2) into a conditional random field model for decoding to obtain the Chinese named-entity labels of the sentence, calculate the loss function of the graph-attention-network-based Chinese named entity recognition model from the labeling result, and train the parameters of the BiLSTM and GAT models to obtain the trained model, which comprises the BiLSTM model and GAT model of step (3-2) and the conditional random field model of step (3-3).
Preferably, step (3-1) comprises the following steps:
(3-1-1) Obtain Chinese named entity recognition data sets from multiple domains, and annotate them with the BIOES tagging scheme to obtain annotated Chinese named entity recognition data sets;
(3-1-2) Construct the word-vector set X of each Chinese sentence in the data set based on the annotated Chinese named entity recognition data set obtained in step (3-1-1).
Preferably, step (3-2) specifically comprises the following sub-steps:
(3-2-1) For each word vector in the word-vector set X of each Chinese sentence in the Chinese named entity recognition data set obtained in step (3-1), model the word vector initially with the BiLSTM model to obtain two different feature representations, forward and backward, and splice the two representations into a context-aware Chinese-sentence feature vector; the feature vectors of all word vectors of the sentence form the Chinese-sentence feature-vector set H = {h_1, h_2, …, h_M}, where m ∈ [1, M] and M is the total number of characters in the sentence;
(3-2-2) Construct the word-character interaction graph G = (V, E) of each Chinese sentence from the word-vector set of each Chinese sentence in the Chinese named entity recognition data set obtained in step (3-1), where V is the node set, comprising all characters of the sentence together with its self-matched words, and E is the edge set, whose edges comprise the connections between characters, the containment relations between characters and self-matched words, and the connections between self-matched words;
(3-2-3) Obtain the word-information fusion correlation-coefficient matrix e of the Chinese sentence from the feature-vector set H obtained in step (3-2-1) and the word-character interaction graph G constructed in step (3-2-2);
(3-2-4) Normalize each element e(h_i, h_j) of the matrix e obtained in step (3-2-3) to obtain the attention coefficient α_ij between every pair of nodes in the word-character interaction graph G;
(3-2-5) With a cardinality-preserving graph attention network calculation method, obtain the feature vector k_i of each node of the word-character interaction graph G from the attention coefficients α_ij obtained in step (3-2-4); the feature vectors k_i of all nodes form the feature-vector set K of the word-character interaction graph G;
(3-2-6) Compute the weighted sum of the feature-vector set K obtained in step (3-2-5) and the Chinese-sentence feature-vector set H obtained in step (3-2-1) to obtain the final feature vector R = W_1 H + W_2 K of the Chinese sentence, where W_1 and W_2 are trainable matrices.
Preferably, the Chinese-sentence feature vector h_m corresponding to the m-th word vector in step (3-2-1) is given by:

h_m = [h_m(fwd) ; h_m(bwd)]

where h_m(fwd) denotes the hidden-layer output of the forward LSTM at time step m, h_m(bwd) denotes the hidden-layer output of the backward LSTM at time step m, h_m is the concatenation of h_m(fwd) and h_m(bwd), and x_m is the m-th word vector of the Chinese sentence.

The element in row i, column j of the word-information fusion correlation-coefficient matrix e in step (3-2-3), i.e. the word-information fusion correlation coefficient e(h_i, h_j) between node i and node j of the word-character interaction graph G, is given by:

e(h_i, h_j) = a^T LeakyReLU(W h_i || W h_j)

where LeakyReLU is the activation function, || denotes vector concatenation, a and W are learnable parameter matrices, and i, j ∈ [1, N], with N the total number of nodes in the word-character interaction graph G.

The attention coefficient α_ij between node i and node j of the graph G in step (3-2-4) is obtained with the softmax normalization function:

α_ij = softmax(e(h_i, h_j))
Preferably, the feature vector k_i of the i-th node of the word-character interaction graph G in step (3-2-5) is calculated with the following formula:

k_i = σ( Σ_{j=1..N} α_ij W h_j + w ⊙ Σ_{j=1..N} W h_j )

where N is the total number of nodes in the word-character interaction graph G, W and w are learnable parameter matrices, ⊙ denotes element-wise multiplication, σ is a nonlinear activation function, and h_j is the feature vector of the j-th node of the word-character interaction graph G.
Preferably, step (3-3) specifically comprises the following sub-steps:
(3-3-1) Decode the final feature vector R of each Chinese sentence in the Chinese named entity recognition data set obtained in step (3-2) with a CRF to obtain the labeling result of the sentence;
(3-3-2) Calculate the loss function of the graph-attention-network-based Chinese named entity recognition model from the labeling result Y of each Chinese sentence obtained in step (3-3-1), and train the model iteratively to obtain the trained model.

Preferably, for the final feature vector R of each Chinese sentence, the entity labeling result obtained after decoding is Y = {y_1, y_2, …, y_M}, with P(y_m | s_m) the probability that the m-th character s_m receives label y_m, where y_m denotes the label of the m-th character of the sentence.

The training process in step (3-3-2) optimizes the model with L2 regularization so as to minimize the negative log-likelihood loss, defined as:

Loss = − Σ_m log P(y_m | s_m) + (γ / 2) ||θ||²

where γ is the L2 regularization parameter, preferably 0.5, and θ denotes the set of all trainable parameters.
According to another aspect of the present invention, there is provided a Chinese named entity recognition system based on a graph attention network, comprising:
a first module for acquiring a Chinese sentence in which Chinese named entities are to be recognized;
a second module for constructing the word-vector set X of the Chinese sentence based on the sentence obtained by the first module; and
a third module for inputting the word-vector set X obtained by the second module into a trained graph-attention-network-based Chinese named entity recognition model to obtain the Chinese named-entity labels of the sentence.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
1. By adopting step (3-1), which uses an annotated Chinese named entity recognition data set, the invention solves the technical problem of the existing BiLSTM-CRF model that word boundaries and entity boundaries are inconsistent.
2. By adopting steps (3-1) and (3-2), which combine the word-segmentation features and the character features of Chinese sentences, the invention solves the technical problem of the existing BiLSTM-CRF model that the input features are of a single type.
3. By adopting step (3-2), which computes graph attention with a cardinality-preserving graph attention network calculation method, the invention solves the problem that the conventional graph attention calculation in the existing collaborative graph network model impairs the expressive power of graph attention.
Drawings
FIG. 1 is a flow chart of a method for identifying Chinese named entities based on a graph attention network according to the present invention;
FIG. 2 is a schematic diagram of the operation of the Chinese named entity recognition model based on the graph attention network of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions, and advantages more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
As shown in fig. 1, the invention provides a Chinese named entity recognition method based on a graph attention network, comprising the following steps:
(1) Obtain a Chinese sentence in which Chinese named entities are to be recognized.
(2) Construct the word-vector set X of the Chinese sentence based on the sentence obtained in step (1).
Specifically, this step first represents the Chinese sentence as a character sequence s = {s_1, s_2, …, s_M}, where s_m denotes the m-th character of the sentence and m ∈ [1, M], with M the total number of characters; then, each character of the sequence is represented as a word vector x_m = f(s_m) by looking up a character-embedding matrix, and all word vectors form the word-vector set X of the sentence, where f is a character-embedding lookup table trained with the Continuous Bag-of-Words model (CBOW).
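The character-to-vector mapping x_m = f(s_m) can be sketched as follows; a small random table stands in for the CBOW-pretrained lookup table f, and the embedding dimension and table contents are purely illustrative:

```python
import numpy as np

# Hypothetical stand-in for the character-embedding lookup table f.
# In the patent f is pretrained with CBOW; here it is a random table.
rng = np.random.default_rng(0)
EMB_DIM = 4
char_table = {ch: rng.standard_normal(EMB_DIM) for ch in "北京故宫博物院"}
UNK = np.zeros(EMB_DIM)  # fallback vector for characters outside the table

def embed_sentence(sentence):
    """Represent a sentence s = {s_1, ..., s_M} as the word-vector set X,
    one vector x_m = f(s_m) per character."""
    return np.stack([char_table.get(ch, UNK) for ch in sentence])

X = embed_sentence("北京故宫")  # shape (M, EMB_DIM) = (4, 4)
```

The set X is then the row-stacked matrix of character vectors, ready to feed into the BiLSTM encoder described below.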
(3) Input the word-vector set X obtained in step (2) into a trained graph-attention-network-based Chinese named entity recognition model (shown in fig. 2) to obtain the Chinese named-entity labels of the sentence.
In step (3), the graph-attention-network-based Chinese named entity recognition model is obtained through training with the following steps:
(3-1) Acquire a Chinese named entity recognition data set annotated with the BIOES (B-begin, I-inside, O-outside, E-end, S-single) tagging scheme, and map the text of each Chinese sentence in the data set to word vectors to obtain the word-vector set of each sentence.
(3-2) Input the word-vector set of each Chinese sentence obtained in step (3-1) into a Bi-directional Long Short-Term Memory (BiLSTM) model to obtain preliminary feature vectors, and input the preliminary feature vectors into an improved Graph Attention Network (GAT) model to obtain the final feature vector of the sentence, which contains richer semantic information.
(3-3) Input the final feature vector obtained in step (3-2) into a conditional random field model for decoding to obtain the Chinese named-entity labels of the sentence, calculate the loss function of the graph-attention-network-based model from the labeling result, and train the parameters of the BiLSTM and GAT models to obtain the trained model, which comprises the BiLSTM model and GAT model of step (3-2) and the conditional random field model of step (3-3).
Preferably, the preprocessing of the Chinese named entity recognition data set described in step (3-1) comprises the following steps:
(3-1-1) Obtain Chinese named entity recognition data sets from multiple domains, and annotate them with the BIOES tagging scheme to obtain annotated Chinese named entity recognition data sets.
The Chinese named entity recognition data sets cover news, social media, and Chinese resumes, and their entity types include GPE (geopolitical entity), LOC (location), PER (person), ORG (organization), CONT (country), and EDU (educational background).
In the BIOES scheme, the first character of an entity is labeled B-X, where X is the entity type; similarly, the last character and the inner characters of the entity are labeled E-X and I-X respectively; S-X indicates that a single character is itself an entity of type X; and the remaining non-entity characters are labeled O.
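The BIOES convention just described can be sketched as a small tagging helper; the function name and the span format (start, inclusive end, type) are illustrative, not part of the patent:

```python
def bioes_tags(length, entities):
    """Convert character-level entity spans (start, end_inclusive, type)
    into a BIOES tag sequence; non-entity characters stay 'O'."""
    tags = ["O"] * length
    for start, end, etype in entities:
        if start == end:                    # single-character entity -> S-X
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"      # first character -> B-X
            tags[end] = f"E-{etype}"        # last character  -> E-X
            for i in range(start + 1, end): # inner characters -> I-X
                tags[i] = f"I-{etype}"
    return tags

# a 7-character sentence tagged as one LOC entity spanning all characters
tags = bioes_tags(7, [(0, 6, "LOC")])
```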
(3-1-2) Construct the word-vector set X of each Chinese sentence in the data set based on the annotated Chinese named entity recognition data set obtained in step (3-1-1).
Specifically, the word-vector set of each Chinese sentence is constructed by representing the sentence as a character sequence s = {s_1, s_2, …, s_M}, where s_m denotes the m-th character, m ∈ [1, M], and M is the total number of characters in the sentence; then each character of the sequence is represented as a word vector x_m = f(s_m) by looking up a character-embedding matrix, and the word vectors of all characters of the sentence form its word-vector set X, where f is a character-embedding lookup table trained with the Continuous Bag-of-Words model (CBOW).
Preferably, step (3-2) specifically comprises the following sub-steps:
(3-2-1) For each word vector in the word-vector set X of each Chinese sentence in the Chinese named entity recognition data set obtained in step (3-1), model the word vector initially with the BiLSTM model to obtain two different feature representations, forward and backward, and splice the two representations into a context-aware Chinese-sentence feature vector; the feature vectors of all word vectors of the sentence form the Chinese-sentence feature-vector set H = {h_1, h_2, …, h_M}, where m ∈ [1, M] and M is the total number of characters in the sentence.
The Chinese-sentence feature vector h_m corresponding to the m-th word vector of the sentence is given by:

h_m = [h_m(fwd) ; h_m(bwd)]

where h_m(fwd) denotes the hidden-layer output of the forward LSTM at time step m, h_m(bwd) denotes the hidden-layer output of the backward LSTM at time step m, h_m is the concatenation of h_m(fwd) and h_m(bwd), and x_m is the m-th word vector of the Chinese sentence.
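The splicing of the forward and backward hidden states into h_m can be sketched as follows; toy constant arrays stand in for the real LSTM hidden-state sequences:

```python
import numpy as np

def bilstm_concat(forward_states, backward_states):
    """h_m = [h_m(fwd) ; h_m(bwd)]: splice the forward and backward LSTM
    hidden states of each position into one context-aware feature vector."""
    return np.concatenate([forward_states, backward_states], axis=-1)

# toy sentence of M = 3 characters with hidden size 2 per direction
fwd = np.ones((3, 2))    # stand-in for forward LSTM outputs
bwd = np.zeros((3, 2))   # stand-in for backward LSTM outputs
H = bilstm_concat(fwd, bwd)  # feature-vector set H, shape (3, 4)
```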
(3-2-2) Construct the word-character interaction graph G = (V, E) of each Chinese sentence from the word-vector set of each Chinese sentence in the Chinese named entity recognition data set obtained in step (3-1).
Here V is the node set, comprising all characters of the sentence together with its self-matched words (i.e. the segmented words of the Chinese sentence; the segmentation of a sentence can be obtained directly from the BIOES-annotated Chinese named entity recognition data set); E is the edge set, whose edges comprise the connections between characters, the containment relations between characters and self-matched words, and the connections between self-matched words.
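A minimal sketch of constructing G = (V, E), assuming a toy lexicon for self-matching; the node-naming scheme is illustrative, and the word-to-word connection edges are omitted here for brevity:

```python
def build_interaction_graph(sentence, lexicon):
    """Build a word-character interaction graph G = (V, E).
    Nodes: every character plus every self-matched lexicon word.
    Edges: adjacent-character links and word->contained-character links
    (word-to-word links, also described in the patent, are omitted here)."""
    words = [(i, j) for i in range(len(sentence))
             for j in range(i + 1, len(sentence) + 1)
             if sentence[i:j] in lexicon]          # self-matched words
    V = [f"c{i}" for i in range(len(sentence))] + \
        [f"w{i}_{j}" for i, j in words]
    E = set()
    for i in range(len(sentence) - 1):             # character chain
        E.add((f"c{i}", f"c{i+1}"))
    for i, j in words:                             # word contains characters
        for k in range(i, j):
            E.add((f"w{i}_{j}", f"c{k}"))
    return V, E

V, E = build_interaction_graph("北京故宫", {"北京", "故宫"})
```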
(3-2-3) Obtain the word-information fusion correlation-coefficient matrix e of the Chinese sentence from the feature-vector set H obtained in step (3-2-1) and the word-character interaction graph G constructed in step (3-2-2).
Specifically, the element in row i, column j of the matrix e, i.e. the word-information fusion correlation coefficient e(h_i, h_j) between node i and node j of the word-character interaction graph G, is given by:

e(h_i, h_j) = a^T LeakyReLU(W h_i || W h_j)

where LeakyReLU is the activation function, || denotes vector concatenation, a and W are learnable parameter matrices, and i, j ∈ [1, N], with N the total number of nodes in the word-character interaction graph G.
(3-2-4) Normalize each element e(h_i, h_j) of the matrix e obtained in step (3-2-3) to obtain the attention coefficient α_ij between every pair of nodes in the word-character interaction graph G.
The attention coefficient α_ij between node i and node j of the graph G is obtained with the softmax normalization function:

α_ij = softmax(e(h_i, h_j))
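Steps (3-2-3) and (3-2-4) together can be sketched as follows, computing e(h_i, h_j) = a^T LeakyReLU(W h_i || W h_j) for every node pair and then normalizing each row with softmax; the random parameters and the fully connected toy graph are assumptions for illustration:

```python
import numpy as np

def attention_coeffs(H, a, W):
    """Correlation coefficients e(h_i, h_j) = a^T LeakyReLU(W h_i || W h_j),
    followed by a row-wise softmax giving the attention coefficients alpha_ij."""
    Wh = H @ W.T                                    # project node features
    N = Wh.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            z = np.concatenate([Wh[i], Wh[j]])      # W h_i || W h_j
            z = np.where(z > 0, z, 0.2 * z)         # LeakyReLU, slope 0.2
            e[i, j] = a @ z
    exp = np.exp(e - e.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
H = rng.standard_normal((3, 4))                     # 3 nodes, feature dim 4
alpha = attention_coeffs(H, rng.standard_normal(8), rng.standard_normal((4, 4)))
```

Each row of `alpha` sums to one, so α_ij can be read as node i's normalized attention over node j.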
(3-2-5) With a cardinality-preserving graph attention network calculation method, obtain the feature vector k_i of each node of the word-character interaction graph G from the attention coefficients α_ij obtained in step (3-2-4); the feature vectors k_i of all nodes form the feature-vector set K of the word-character interaction graph G.
Specifically, the cardinality-preserving graph attention calculation used in this step is described on page 4 of the paper "Improving Attention Mechanism in Graph Neural Networks via Cardinality Preservation" by Shuo Zhang et al.
The feature vector k_i of the i-th node of the word-character interaction graph G is calculated with the following formula:

k_i = σ( Σ_{j=1..N} α_ij W h_j + w ⊙ Σ_{j=1..N} W h_j )

where N is the total number of nodes in the word-character interaction graph G, W and w are learnable parameter matrices, ⊙ denotes element-wise multiplication, σ is a nonlinear activation function, and h_j is the feature vector of the j-th node of the word-character interaction graph G.
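The cardinality-preserving aggregation of step (3-2-5) can be sketched as follows. The added cardinality term w ⊙ Σ_j W h_j follows the cardinality-preservation idea of the cited paper by Shuo Zhang et al.; the exact formula, the fully connected toy graph, and the uniform attention weights used here are assumptions:

```python
import numpy as np

def cardinality_preserving_aggregate(H, alpha, W, w):
    """Attention-weighted sum plus a cardinality term (hedged sketch):
    nodes with different neighbourhood sizes no longer collapse to the
    same aggregated representation."""
    Wh = H @ W.T                    # project node features: W h_j
    attn_sum = alpha @ Wh           # sum_j alpha_ij * W h_j
    card_sum = Wh.sum(axis=0)       # sum_j W h_j (fully connected toy graph)
    return attn_sum + w * card_sum  # element-wise weight w on the cardinality term

rng = np.random.default_rng(3)
H = rng.standard_normal((3, 4))     # 3 nodes, feature dim 4
alpha = np.full((3, 3), 1.0 / 3)    # uniform attention for the toy example
K = cardinality_preserving_aggregate(
    H, alpha, rng.standard_normal((4, 4)), rng.standard_normal(4))
```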
(3-2-6) Compute the weighted sum of the feature-vector set K obtained in step (3-2-5) and the Chinese-sentence feature-vector set H obtained in step (3-2-1) to obtain the final feature vector R = W_1 H + W_2 K of the Chinese sentence, where W_1 and W_2 are trainable matrices.
Specifically, the final feature vector R is a feature vector of the Chinese sentence that contains richer semantic information and serves as the input vector of the conditional random field model.
Preferably, step (3-3) comprises in particular the following sub-steps:
And (3-3-1) decoding the final feature vector R corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-2) by adopting a conditional random field model (Conditional random field, abbreviated as CRF) so as to obtain a labeling result corresponding to the Chinese sentence.
For the final feature vector R corresponding to each Chinese sentence, the decoded entity labeling result of the Chinese sentence is Y = {y_1, y_2, …, y_M}, with probability P(y_m | s_m) that the labeling result is y_m, where y_m denotes the labeling result of the m-th character in the Chinese sentence, s_m denotes the m-th character, and m ∈ [1, M], M being the total number of characters in the Chinese sentence.
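As a hedged sketch, the most-likely labeling Y in step (3-3-1) can be recovered with Viterbi decoding over per-character tag scores; the (M, T) emission matrix projected from R and the tag-transition matrix are assumptions, since the patent does not spell out the CRF internals.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Most-likely tag sequence Y = {y_1, ..., y_M} for one sentence
    under a linear-chain CRF; a minimal sketch of the step-(3-3-1)
    decoder.  emissions: (M, T) per-character tag scores (assumed to be
    projected from R); transitions: (T, T) tag-to-tag scores.
    """
    M, T = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    back = np.zeros((M, T), dtype=int)   # backpointers
    for m in range(1, M):
        # cand[s, t]: best path with tag s at position m-1, tag t at m
        cand = score[:, None] + transitions + emissions[m][None, :]
        back[m] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for m in range(M - 1, 0, -1):        # follow backpointers
        tags.append(int(back[m, tags[-1]]))
    return tags[::-1]
```

With all transition scores zero the decoder reduces to a per-character argmax; the transition matrix is what lets the CRF forbid invalid BIOES sequences such as an I- tag following O.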
(3-3-2) Calculating the loss function of the graph-attention-network-based Chinese named entity recognition model according to the entity labeling result Y of each Chinese sentence obtained in the step (3-3-1), and iteratively training the model to obtain a trained Chinese named entity recognition model based on the graph attention network.
During training, the model is optimized by minimizing the negative log-likelihood loss with L2 regularization; the loss function is defined as L = −Σ log P(Y | S) + (γ/2)·‖θ‖², where γ is the L2 regularization parameter, preferably set to 0.5, and θ denotes the set of all trainable parameters and matrices mentioned in the process above.
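A minimal numeric sketch of the step-(3-3-2) objective, assuming a standard linear-chain CRF whose per-character emission scores are projected from R (a detail the patent leaves open): the gold-path score is compared against the log partition function computed with the forward algorithm, and the L2 penalty is added on top.

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def crf_nll_l2(emissions, transitions, tags, params, gamma=0.5):
    """Negative log-likelihood of one tag sequence under a linear-chain
    CRF, plus the L2 penalty (gamma/2)*||theta||^2.  A sketch of the
    loss in step (3-3-2); the (M, T) emission matrix is assumed to be
    derived from the fused feature vector R.
    """
    M, T = emissions.shape
    # score of the gold path: emissions plus tag-transition scores
    gold = emissions[0, tags[0]] + sum(
        transitions[tags[m - 1], tags[m]] + emissions[m, tags[m]]
        for m in range(1, M))
    # log partition function log Z via the forward algorithm
    alpha = emissions[0].copy()
    for m in range(1, M):
        alpha = np.array([
            logsumexp(alpha + transitions[:, t]) + emissions[m, t]
            for t in range(T)])
    log_Z = logsumexp(alpha)
    nll = log_Z - gold                      # -log P(Y | S)
    l2 = 0.5 * gamma * sum((p ** 2).sum() for p in params)
    return nll + l2
```

With gamma = 0 and zero transitions the CRF factorizes into independent per-position softmaxes, a handy sanity check that the forward recursion is correct.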
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (7)
1. The Chinese named entity recognition method based on the graph attention network is characterized by comprising the following steps of:
(1) Acquiring a Chinese sentence on which Chinese named entity recognition is to be performed;
(2) Constructing a word vector set X corresponding to the Chinese sentence based on the Chinese sentence obtained in the step (1);
(3) Inputting the word vector set X corresponding to the Chinese sentence obtained in the step (2) into a trained Chinese named entity recognition model based on a graph attention network to obtain a Chinese named entity label corresponding to the Chinese sentence; the Chinese named entity recognition model based on the graph attention network in the step (3) is obtained through training the following steps:
(3-1) acquiring a Chinese named entity recognition data set marked by adopting a BIOES marking scheme, and mapping the text of each Chinese sentence in the Chinese named entity recognition data set into a word vector so as to obtain a word vector set corresponding to each Chinese sentence in the Chinese named entity recognition data set; step (3-1) comprises the steps of:
(3-1-1) obtaining Chinese named entity recognition data sets of a plurality of fields, and labeling the Chinese named entity recognition data sets by using BIOES labeling schemes to obtain labeled Chinese named entity recognition data sets;
(3-1-2) constructing a word vector set X corresponding to each Chinese sentence in the Chinese named entity recognition data set based on the labeled Chinese named entity recognition data set obtained in the step (3-1-1);
(3-2) inputting the word vector set corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-1) into a bidirectional long short-term memory BiLSTM model to obtain a preliminary feature vector of the word vector, and inputting the preliminary feature vector of the word vector into an improved graph attention network GAT model to obtain a final feature vector corresponding to the Chinese sentence; the step (3-2) specifically comprises the following substeps:
(3-2-1) initially modeling each word vector in the word vector set X corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-1) by using a BiLSTM model to obtain two different forward and backward feature representations, and splicing the two feature representations to obtain a Chinese sentence feature vector, corresponding to the word vector, containing context features, wherein the Chinese sentence feature vectors corresponding to the word vector set of the Chinese sentence form a Chinese sentence feature vector set H = {h_1, h_2, …, h_M} corresponding to the Chinese sentence, where m ∈ [1, M], M being the total number of characters in the Chinese sentence;
(3-2-2) constructing a word-character interaction diagram G= (V, E) corresponding to each Chinese sentence by utilizing the word vector set corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-1);
V is a node set, wherein the node set comprises all characters and self-matching words in a word vector set corresponding to a Chinese sentence; e is an edge set, wherein the edge comprises a connection relation between characters in a character vector set, a containing relation between the characters and self-matching words and a connection relation between the self-matching words;
(3-2-3) obtaining a word information fusion correlation coefficient matrix e corresponding to the Chinese sentence according to the Chinese sentence feature vector set H corresponding to the Chinese sentence obtained in the step (3-2-1) and the word-character interaction graph G corresponding to the Chinese sentence constructed in the step (3-2-2);
(3-2-4) carrying out normalization processing on each element e(h_i, h_j) in the word information fusion correlation coefficient matrix e corresponding to the Chinese sentence obtained in the step (3-2-3) so as to obtain the attention coefficient α_ij between every two nodes in the word-character interaction graph G corresponding to the Chinese sentence;
(3-2-5) obtaining a feature vector k_i of each node in the word-character interaction graph G corresponding to the Chinese sentence, based on the attention coefficients α_ij between every two nodes in the word-character interaction graph G obtained in the step (3-2-4), by adopting a cardinality-preserving graph attention network calculation method, wherein the feature vectors k_i of all nodes form the feature vector set K corresponding to the word-character interaction graph G corresponding to the Chinese sentence;
(3-2-6) carrying out weighted summation of the feature vector set K corresponding to the word-character interaction graph G obtained in the step (3-2-5) and the Chinese sentence feature vector set H obtained in the step (3-2-1), to obtain the final feature vector R = W_1 H + W_2 K corresponding to the Chinese sentence, wherein W_1 and W_2 are trainable matrices;
And (3-3) inputting the final feature vector corresponding to the Chinese sentence obtained in the step (3-2) into a conditional random field model for decoding to obtain a Chinese named entity label corresponding to the Chinese sentence, calculating a loss function of a Chinese named entity recognition model based on a graph attention network by using a labeling result, and training parameters of a BiLSTM model and a GAT model to obtain a trained Chinese named entity recognition model based on the graph attention network, wherein the trained Chinese named entity recognition model comprises the BiLSTM model in the step (3-2), the GAT model and the conditional random field model in the step (3-3).
2. The method of claim 1, wherein step (2) first represents a Chinese sentence as a character sequence S = {s_1, s_2, …, s_M}, where s_m denotes the m-th character in the Chinese sentence and m ∈ [1, M], M being the total number of characters in the Chinese sentence; then, each character in the character sequence is represented as a word vector x_m = f(s_m) by looking up a character embedding matrix, and all word vectors form the word vector set X corresponding to the Chinese sentence, where f is a character embedding lookup table trained with the continuous bag-of-words model CBOW.
3. The method for identifying Chinese named entities based on graph attention network of claim 2, wherein,
The Chinese sentence feature vector h_m corresponding to the m-th word vector in the Chinese sentence in the step (3-2-1) is given by h_m = [→h_m ; ←h_m], wherein →h_m denotes the hidden-layer output of the forward LSTM at step m, ←h_m denotes the hidden-layer output of the backward LSTM at step m, h_m denotes the concatenation of →h_m and ←h_m, and x_m denotes the m-th word vector in the Chinese sentence;
The element in the i-th row and j-th column of the word information fusion correlation coefficient matrix e in the step (3-2-3), namely the word information fusion correlation coefficient e(h_i, h_j) between node i and node j in the word-character interaction graph G, is given by the following formula:
e(h_i, h_j) = aᵀ LeakyReLU(W h_i ‖ W h_j)
wherein LeakyReLU is an activation function, a and W are both learnable parameter matrices, and i, j ∈ [1, N], N being the total number of nodes in the word-character interaction graph G;
The attention coefficient α_ij between node i and node j in the graph G in the step (3-2-4) is obtained with the softmax normalization function:
α_ij = softmax(e(h_i, h_j)).
4. A graph-attention-network-based Chinese named entity recognition method according to claim 3, wherein the feature vector k_i of the i-th node in the word-character interaction graph G in step (3-2-5) is obtained by the cardinality-preserving attention aggregation, where N denotes the total number of nodes in the word-character interaction graph G, w is a learnable parameter matrix, and k_j denotes the feature vector of the j-th node, weighted by the attention coefficient α_ij.
5. The method for identifying chinese named entities based on graph attention network of claim 4 wherein step (3-3) comprises the sub-steps of:
(3-3-1) decoding the final feature vector R corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-2) by adopting CRF so as to obtain a labeling result corresponding to the Chinese sentence;
(3-3-2) calculating the loss function of the graph-attention-network-based Chinese named entity recognition model according to the entity labeling result Y of each Chinese sentence obtained in the step (3-3-1), and iteratively training the model to obtain a trained Chinese named entity recognition model based on the graph attention network.
6. The method for identifying Chinese named entities based on graph attention network of claim 5, wherein,
For the final feature vector R corresponding to each Chinese sentence, the decoded entity labeling result of the Chinese sentence is Y = {y_1, y_2, …, y_M}, with probability P(y_m | s_m) that the labeling result is y_m, where y_m denotes the labeling result of the m-th character in the Chinese sentence;
The training process of the model in step (3-3-2) optimizes the model by minimizing the negative log-likelihood loss with L2 regularization, the loss function being defined as L = −Σ log P(Y | S) + (γ/2)·‖θ‖², where γ is the L2 regularization parameter, preferably 0.5, and θ denotes the set of all trainable parameters.
7. A graph attention network-based chinese named entity recognition system, comprising:
The first module is used for acquiring a Chinese sentence on which Chinese named entity recognition is to be performed;
the second module is used for constructing a word vector set X corresponding to the Chinese sentence based on the Chinese sentence obtained by the first module;
The third module is used for inputting the word vector set X corresponding to the Chinese sentence obtained by the second module into a trained Chinese named entity recognition model based on the graph attention network so as to obtain a Chinese named entity label corresponding to the Chinese sentence; the Chinese named entity recognition model based on the graph attention network in the third module is obtained through training the following steps:
(3-1) acquiring a Chinese named entity recognition data set marked by adopting a BIOES marking scheme, and mapping the text of each Chinese sentence in the Chinese named entity recognition data set into a word vector so as to obtain a word vector set corresponding to each Chinese sentence in the Chinese named entity recognition data set; step (3-1) comprises the steps of:
(3-1-1) obtaining Chinese named entity recognition data sets of a plurality of fields, and labeling the Chinese named entity recognition data sets by using BIOES labeling schemes to obtain labeled Chinese named entity recognition data sets;
(3-1-2) constructing a word vector set X corresponding to each Chinese sentence in the Chinese named entity recognition data set based on the labeled Chinese named entity recognition data set obtained in the step (3-1-1);
(3-2) inputting the word vector set corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-1) into a bidirectional long short-term memory BiLSTM model to obtain a preliminary feature vector of the word vector, and inputting the preliminary feature vector of the word vector into an improved graph attention network GAT model to obtain a final feature vector corresponding to the Chinese sentence; the step (3-2) specifically comprises the following substeps:
(3-2-1) initially modeling each word vector in the word vector set X corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-1) by using a BiLSTM model to obtain two different forward and backward feature representations, and splicing the two feature representations to obtain a Chinese sentence feature vector, corresponding to the word vector, containing context features, wherein the Chinese sentence feature vectors corresponding to the word vector set of the Chinese sentence form a Chinese sentence feature vector set H = {h_1, h_2, …, h_M} corresponding to the Chinese sentence, where m ∈ [1, M], M being the total number of characters in the Chinese sentence;
(3-2-2) constructing a word-character interaction diagram G= (V, E) corresponding to each Chinese sentence by utilizing the word vector set corresponding to each Chinese sentence in the Chinese named entity recognition data set obtained in the step (3-1);
V is a node set, wherein the node set comprises all characters and self-matching words in a word vector set corresponding to a Chinese sentence; e is an edge set, wherein the edge comprises a connection relation between characters in a character vector set, a containing relation between the characters and self-matching words and a connection relation between the self-matching words;
(3-2-3) obtaining a word information fusion correlation coefficient matrix e corresponding to the Chinese sentence according to the Chinese sentence feature vector set H corresponding to the Chinese sentence obtained in the step (3-2-1) and the word-character interaction graph G corresponding to the Chinese sentence constructed in the step (3-2-2);
(3-2-4) carrying out normalization processing on each element e(h_i, h_j) in the word information fusion correlation coefficient matrix e corresponding to the Chinese sentence obtained in the step (3-2-3) so as to obtain the attention coefficient α_ij between every two nodes in the word-character interaction graph G corresponding to the Chinese sentence;
(3-2-5) obtaining a feature vector k_i of each node in the word-character interaction graph G corresponding to the Chinese sentence, based on the attention coefficients α_ij between every two nodes in the word-character interaction graph G obtained in the step (3-2-4), by adopting a cardinality-preserving graph attention network calculation method, wherein the feature vectors k_i of all nodes form the feature vector set K corresponding to the word-character interaction graph G corresponding to the Chinese sentence;
(3-2-6) carrying out weighted summation of the feature vector set K corresponding to the word-character interaction graph G obtained in the step (3-2-5) and the Chinese sentence feature vector set H obtained in the step (3-2-1), to obtain the final feature vector R = W_1 H + W_2 K corresponding to the Chinese sentence, wherein W_1 and W_2 are trainable matrices;
And (3-3) inputting the final feature vector corresponding to the Chinese sentence obtained in the step (3-2) into a conditional random field model for decoding to obtain a Chinese named entity label corresponding to the Chinese sentence, calculating a loss function of a Chinese named entity recognition model based on a graph attention network by using a labeling result, and training parameters of a BiLSTM model and a GAT model to obtain a trained Chinese named entity recognition model based on the graph attention network, wherein the trained Chinese named entity recognition model comprises the BiLSTM model in the step (3-2), the GAT model and the conditional random field model in the step (3-3).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210083152.7A CN114417874B (en) | 2022-01-25 | 2022-01-25 | Chinese named entity recognition method and system based on graph attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114417874A CN114417874A (en) | 2022-04-29 |
CN114417874B true CN114417874B (en) | 2024-10-15 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115130468B (en) * | 2022-05-06 | 2023-04-07 | 北京安智因生物技术有限公司 | Myocardial infarction entity recognition method based on word fusion representation and graph attention network |
CN117057350B (en) * | 2023-08-07 | 2024-05-10 | 内蒙古大学 | Chinese electronic medical record named entity recognition method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183102A (en) * | 2020-10-15 | 2021-01-05 | 上海明略人工智能(集团)有限公司 | Named entity identification method based on attention mechanism and graph attention network |
CN112711948A (en) * | 2020-12-22 | 2021-04-27 | 北京邮电大学 | Named entity recognition method and device for Chinese sentences |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109871545B (en) * | 2019-04-22 | 2022-08-05 | 京东方科技集团股份有限公司 | Named entity identification method and device |
CN113010683B (en) * | 2020-08-26 | 2022-11-29 | 齐鲁工业大学 | Entity relationship identification method and system based on improved graph attention network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |