CN114004233A

CN114004233A - Remote supervision named entity recognition method based on semi-training and sentence selection

Info

Publication number: CN114004233A
Application number: CN202111644281.0A
Authority: CN
Inventors: 李劲松; 辛然; 田雨; 周天舒; 阮彤; 王凯
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-02-01
Anticipated expiration: 2041-12-30
Also published as: CN114004233B

Abstract

The invention discloses a remote supervision named entity recognition method based on semi-training and sentence selection, which comprises the steps of firstly training a balance F score of a bidirectional cyclic neural network and conditional random field mixed model on an artificial labeling data set to a preset semi-training interval through a semi-training strategy; then, FNN is used as a strategy network in reinforcement learning to select sentences in the remote supervision data set; selecting sentences with confidence degrees larger than a threshold value by adopting soft probability; then, merging the screened sentences and the manual labeling data set to serve as a new training set; and finally, training the two-way circulation neural network and conditional random field hybrid model by using a new training set, and updating the strategy network. The method can effectively improve the performance of the named entity recognition model based on remote supervision.

Description

Remote supervision named entity recognition method based on semi-training and sentence selection

Technical Field

The invention belongs to the technical field of natural language processing, particularly relates to the technical field of named entity recognition, and more particularly relates to a remote supervision named entity recognition method based on semi-training and sentence selection.

Background

Named entity recognition is a fundamental task in the field of natural language processing, which aims at locating named entities from plain text while classifying them in predefined entity classes. It is a subtask of information extraction, and the information extraction has a series of important downstream applications, such as: question-answering system, search engine and knowledge map, etc. Traditional named entity recognition methods such as conditional random fields and hidden markov models require a large number of manually designed features. In recent years, with the development of deep neural networks, manually designed features have become unnecessary. One standard deep learning named entity recognition method uses a bidirectional recurrent neural network (BiRNN) as a feature extractor, while using a Conditional Random Field (CRF) as a decoder.

Although no manually designed features are required, most deep learning models require a large amount of annotation data to train. However, manual high-quality annotation data is often difficult to obtain in large quantities. Especially in some specific fields, only experienced domain experts can label the plain text correctly. However, small amounts of high quality annotation data are relatively easily acquired.

The remote supervision method can conveniently generate a large amount of label data from plain text by utilizing a dictionary or a knowledge base. The remote monitoring method is widely applied to the relation extraction task and achieves good effect. In the remote supervised named entity recognition task, data annotation is typically performed using entities in the dictionary to match corresponding fields in plain text. Due to the limitation of the capacity of the dictionary, the data set generated by the remote supervision method contains a large amount of false negative data. Meanwhile, entity tagging using string matching techniques introduces false positive data. Such incorrectly labeled data can severely impact the performance of the named entity recognition model.

In view of the above, it is desirable to design a new method for identifying a remote supervised named entity to solve the above problems.

Disclosure of Invention

In view of the above, the invention discloses a remote supervision named entity recognition method based on semi-training and sentence selection. Firstly, a reinforcement learning strategy and a high-confidence selection algorithm are combined to solve the problem of noise labeling in remote supervision named entity identification. And secondly, a semi-training strategy is proposed to solve the cold start problem in a reinforcement learning strategy and a high-confidence selection algorithm.

The purpose of the invention is realized by the following technical scheme: a remote supervision named entity recognition method based on semi-training and sentence selection comprises the following steps:

s1, manually labeling a small amount of plain text to form an manually labeled data set

Tagging data sets by hand

The entity field in the dictionary to construct a dictionary;

s2, labeling in plain text by utilizing dictionary and character string matching technology, and generating a remote supervision data set

；

S3, labeling the data set by manual work through a semi-training strategy

Training a two-way cyclic neural network and conditional random field hybrid model until the model is artificially labeled with a data set

The balance F score reaches a preset half training interval;

s4, adopting feed Forward Neural Network (FNN) as strategy network for reinforcement learning, and remotely supervising data set

The selected sentence is used as a data set

；

S5, calculating the soft probability of each sentence by utilizing the output of the bidirectional recurrent neural network and the conditional random field mixed model, and selecting a data set based on the soft probability

The median confidence is greater than the confidence threshold

The selected sentence is compared with the manually labeled data set

Are combined to be used as a new training set

；

S6, utilizing the new training set

For bidirectional circulatory nerveTraining a network and conditional random field hybrid model, and updating a strategy network;

s7, taking the trained bidirectional cyclic neural network and conditional random field mixed model as a named entity recognition model, and performing label prediction on the word blocks token in the unmarked plain text data.

Further, the step S1 specifically includes:

s11, labeling a small amount of plain texts in a sequence labeling mode to generate an artificial labeling data set

；

S12, extracting the artificial annotation data set

All the entity fields in the database are subjected to duplicate removal processing simultaneously;

and S13, storing all non-repeated entity fields in the dictionary.

Further, the step S2 specifically includes:

s21, matching corresponding fields in the plain text by using a character string matching technology based on the dictionary;

s22, labeling the entity fields on the matching in a sequence labeling mode to generate a remote supervision data set

。

Further, in step S2, the string matching technique employs a bidirectional maximum matching algorithm, a forward maximum matching algorithm, or a reverse maximum matching algorithm.

Further, the step S3 specifically includes:

s31, marking data set by using manual work

Training a two-way cyclic neural network and conditional random field hybrid model from an initial state until the model is manually labeled in a data set

Stopping training when the balance F score reaches a preset half training interval;

s32, providing reward values for the strategy network of the reinforcement learning by taking the semi-trained bidirectional cyclic neural network and the conditional random field mixed model as initial models

And environmental state

。

Further, in the step S3, the preset half training interval is 0.85 to 0.95.

Further, the step S4 specifically includes:

s41, adopting FNN as a strategy network for reinforcement learning, wherein the strategy network is a remote supervision data set

Generates an action for each sentence

；

S42 for the first

The sentence, the policy network formally expressed as:

wherein

Is a function of the sigmoid and is,

is a parameter of the policy network;

is a policy network pair

An action of sentence generation;

is the first

The state of each sentence, namely the environmental state of reinforcement learning;

is a parameter of

For a state of

Make an action in a sentence

The probability of (d);

s43, using the selected sentence as data set

。

Further, in step S42, the action space comprises two actions, namely selecting a sentence and discarding the sentence, when the action space is selected

When the sentence is selected, when

Discarding the sentence;

from the first

The sentence is spliced by the output of the bidirectional cyclic neural network layer and the output of the hidden layer in the conditional random field layer after passing through the bidirectional cyclic neural network and conditional random field mixed model.

Further, the step S5 specifically includes:

s51, obtaining the first hidden layer of the conditional random field layer of the two-way circulation neural network and the conditional random field hybrid model

A sentence is first

The individual block token belongs to

Probability value corresponding to class time

Wherein

Is the first

A vector representation of the sentence;

s52, calculating

A sentence is first

Soft probability of individual token

：

Wherein

，

Is a data set

The total number of sentences in (1) is,

is the total number of tokens in each sentence,

is the total number of label categories;

s53, calculating confidence coefficient of each token according to the soft probability

；

S54, calculating the confidence of each sentence according to the barrel principle

：

S55, if it is

The confidence of each sentence is greater than the confidence threshold

Then it is compared with the manually labeled data set

Are combined to be used as a new training set

Wherein

Is an interval of [0,1 ]]Is constant.

Further, in step S6, the policy network updating step includes:

s61 reward value for policy network

，

Formally expressed as:

wherein

Is from a new training set

The set of sentences of one batch taken out,

is a set

The total number of sentences of (a),

is a two-way cyclic neural network and conditional random field hybrid model based on

Vector representation of a sentence

Marking the sentence into

The probability of (d);

s62, strategy network

Updating parameters, wherein the updating mode is formally expressed as:

wherein

Is the learning rate.

The invention has the beneficial effects that:

1. the invention provides a completely different sentence selection method aiming at a naming recognition task of remote supervision, firstly, a strategy network of reinforcement learning is used for selecting sentences which are remotely supervised, and then the sentences are screened based on soft probability, thereby improving the quality of sentence selection. By using the data set selected by the sentence selection strategy, the prediction performance of the trained named entity recognition model is improved to a certain extent.

2. The semi-training strategy provided by the invention effectively solves the problems caused by cold start in a reinforcement learning strategy network and a soft probability screening algorithm, and improves the prediction performance of the finally trained named entity recognition model.

Drawings

The various aspects of the present invention will become more apparent to the reader after reading the detailed description of the invention with reference to the accompanying drawings, which are some examples of the invention and from which others may be derived by those of ordinary skill in the art without inventive faculty.

Fig. 1 is a schematic flowchart of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention.

Fig. 2 is a schematic block diagram of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention.

Fig. 3 is a block diagram of a remote supervised named entity recognition device based on semi-training and sentence selection in accordance with the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 is a schematic flowchart of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention. Fig. 2 is a schematic block diagram of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention. The embodiment of the invention provides a remote supervision named entity recognition method based on semi-training and sentence selection, which specifically comprises the following steps:

Tagging data sets by hand

The entity field in the dictionary to construct a dictionary; this step can be achieved by the following substeps:

；

S12, extracting the artificial annotation data set

All entity fields in (1), proceeding simultaneouslyCarrying out duplicate removal treatment;

and S13, storing all non-repeated entity fields in the dictionary.

(ii) a This step can be achieved by the following substeps:

specifically, the character string matching technology may use a bidirectional maximum matching algorithm, or may use matching algorithms such as forward maximum matching and reverse maximum matching;

Sequence labels include BIO, BIOES, etc.;

s3, labeling the data set by manual work through a semi-training strategy

Training a bidirectional cyclic neural network and conditional random field mixed model (marked as BiRNN + CRF model) until the model is manually marked on a data set

The above balance F score (also called F1 value) reaches the preset half training interval; this step can be achieved by the following substeps:

s31, marking data set by using manual work

Stopping training when the balance F score reaches a preset half training interval; the preset half training interval is preferably 0.85-0.95;

And environmental state

。

The selected sentence is used as a data set

(ii) a This step can be achieved by the following substeps:

Generates an action for each sentence

；

S42 for the first

The sentence, the policy network formally expressed as:

wherein

Is a sigmoid function;

is a parameter of the policy network;

is a policy network pair

An action generated from a sentence, the action space consisting of two actions, the selection of the sentence and the rejection of the sentence, is typically used in a particular application

When is coming into contact with

When the sentence is selected, when

Discarding the sentence;

is the first

The state of each sentence, i.e. the environmental state of reinforcement learning,

from the first

After the sentences pass through a two-way cyclic neural network and conditional random field mixed model, the output of a two-way cyclic neural network layer is spliced with the output of a hidden layer in a conditional random field layer;

is a parameter of

For a state of

Make an action in a sentence

The probability of (d);

s43, using the selected sentence as data set

。

The median confidence is greater than the confidence threshold

The selected sentence is compared with the manually labeled data set

Are combined to be used as a new training set

(ii) a This step can be achieved by the following substeps:

A sentence is first

Individual character block tokenBelong to the first

Probability value corresponding to class time

Wherein

Is the first

A vector representation of the sentence;

s52, calculating

A sentence is first

Soft probability of individual token

：

Wherein

，

Is a data set

The total number of sentences in (1) is,

is the total number of tokens in each sentence, is usually set to a constant value according to the principle of "more truncations and less zero padding",

is the total number of label categories;

；

：

S55, if it is

The confidence of each sentence is greater than the confidence threshold

Then it is compared with the manually labeled data set

Are combined to be used as a new training set

Wherein

Is an interval of [0,1 ]]Is constant.

S6, utilizing the new training set

Training a two-way cyclic neural network and a conditional random field hybrid model, and updating a strategy network, wherein the strategy network updating steps are as follows:

s61 reward value for policy network

，

Formally expressed as:

wherein

Is from a new training set

The set of sentences of one batch taken out,

is a set

The total number of sentences of (a),

Vector representation of a sentence

Marking the sentence into

The probability of (d);

s62, strategy network

Updating parameters, the updating mode is formally expressed as

Wherein

Is the learning rate.

The sentence selection, model training and strategy network update process in steps S4-S6 above may be performed multiple times.

Examples

Applicants conducted experiments on open source e-commerce text data sets. The data set contains 2400 manually labeled samples in total, each sample consisting of one sentence. There are 1200 samples in the training set, 400 samples in the validation set, and 800 samples in the testing set. 927 entities were collected from the data set and 2500 samples labeled by remote surveillance methods were obtained. A100-dimensional word embedding vector is used and is trained from 100 million unlabeled sentences by the word2vec method. Adam and RMSprop are used as optimizers of a strategy network and a named entity recognition model respectively in an experiment, the learning rates of the two optimizers are both 0.001, and the maximum iteration number is 500. The results of the experiment are as follows:

in the table, HATS: the invention relates to a remote supervision named entity recognition method based on semi-training and sentence selection; HATS w/o RL: there is no HATS method using a reinforcement learning strategy.

Corresponding to the embodiment of the remote supervision named entity recognition method based on semi-training and sentence selection, the invention also provides an embodiment of a remote supervision named entity recognition device based on semi-training and sentence selection.

Referring to fig. 3, an embodiment of the present invention provides a semi-training and sentence selection based remote supervised named entity recognition apparatus, which includes a memory and one or more processors, where the memory stores executable codes, and the processors execute the executable codes to implement the semi-training and sentence selection based remote supervised named entity recognition method in the foregoing embodiments.

The embodiments of the present invention based on semi-training and sentence selection remote supervised named entity recognition apparatus can be applied to any data processing capable device, such as a computer or other like device or apparatus. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 3, a hardware structure diagram of any device with data processing capability where the remote supervision named entity recognition apparatus based on semi-training and sentence selection is located in the present invention is shown, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 3, in an embodiment, any device with data processing capability where the apparatus is located may generally include other hardware according to an actual function of the any device with data processing capability, which is not described again.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

The embodiment of the invention also provides a computer readable storage medium, which stores a program, and when the program is executed by a processor, the method for identifying the remote supervision named entity based on semi-training and sentence selection in the embodiment is realized.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.

Hereinbefore, specific embodiments of the present invention are described with reference to the drawings. However, those skilled in the art will appreciate that various modifications and substitutions can be made to the specific embodiments of the present invention without departing from the spirit and scope of the invention. Such modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.

Claims

1. A remote supervision named entity recognition method based on semi-training and sentence selection is characterized by comprising the following steps:

Tagging data sets by hand

The entity field in (1) constructs a dictionary;

；

S3, labeling the data set by manual work through a semi-training strategy

Training the two-way cyclic neural network and conditional random field hybrid model until the two-way cyclic neural network and conditional random field hybrid model are manually marked on the data set

The balance F score reaches a preset half training interval;

s4, adopting feedforward neural network FNN as strategy network for reinforcement learning, and remotely supervising data set

The selected sentence is used as a data set

；

The median confidence is greater than the confidence threshold

The selected sentence is compared with the manual workAnnotating a data set

Are combined to be used as a new training set

；

S6, utilizing the new training set

Training a bidirectional cyclic neural network and a conditional random field hybrid model, and updating a strategy network;

2. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 1, wherein the step S1 specifically comprises:

；

S12, extracting the artificial annotation data set

Carrying out duplicate removal treatment on all the entity fields in the database;

and S13, storing all non-repeated entity fields in the dictionary.

3. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 1, wherein the step S2 specifically comprises:

。

4. The semi-training and sentence selection based remote supervised named entity recognition method of claim 1, wherein in the step S2, the string matching technique employs a two-way maximum matching algorithm, a forward maximum matching algorithm, or a reverse maximum matching algorithm.

5. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 1, wherein the step S3 specifically comprises:

s31, marking data set by using manual work

Training a two-way cyclic neural network and conditional random field hybrid model from an initial state until the two-way cyclic neural network and the conditional random field hybrid model are manually labeled with a data set

And environmental state

。

6. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 1, wherein the preset semi-training interval is 0.85-0.95 in step S3.

7. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 1, wherein the step S4 specifically comprises:

s41, adopting the feedforward neural network FNN as a strategy network for reinforcement learning, wherein the strategy network is a remote supervision data set

Generates an action for each sentence

；

S42 for the first

The sentence, policy network is represented as:

wherein

Is a function of the sigmoid and is,

is a parameter of the policy network;

is a policy network pair

An action of sentence generation;

is the first

The status of the individual sentence;

is a parameter of

For a state of

Make an action in a sentence

The probability of (d);

s43, using the selected sentence as data set

。

8. The method according to claim 7, wherein the action space comprises two actions of selecting sentences and discarding sentences in step S42

When selecting a sentence, when

Discarding sentences in time;

from the first

9. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 1, wherein the step S5 specifically comprises:

A sentence is first

The individual block token belongs to

Probability value corresponding to class time

，

，

Is the total number of label categories,

is the first

A vector representation of the sentence;

s52, calculating

A sentence is first

Soft probability of individual token

；

；

；

S55, if it is

The confidence of each sentence is greater than the confidence threshold

Then will be

Individual sentences and artificially labeled data sets

Are combined to be used as a new training set

。

10. The method for identifying a remotely supervised named entity based on semi-training and sentence selection as claimed in claim 7, wherein the step of updating the policy network in step S6 is as follows:

s61 reward value of strategy network

Expressed as: