
CN114004233A - Remote supervision named entity recognition method based on semi-training and sentence selection - Google Patents

Remote supervision named entity recognition method based on semi-training and sentence selection

Info

Publication number
CN114004233A
CN114004233A (application CN202111644281.0A)
Authority
CN
China
Prior art keywords
training
sentence
data set
semi
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111644281.0A
Other languages
Chinese (zh)
Other versions
CN114004233B (en)
Inventor
李劲松
辛然
田雨
周天舒
阮彤
王凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111644281.0A
Publication of CN114004233A
Application granted
Publication of CN114004233B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a remote supervision named entity recognition method based on semi-training and sentence selection. First, a hybrid bidirectional recurrent neural network and conditional random field model is trained on a manually labeled data set via a semi-training strategy until its balanced F score reaches a preset semi-training interval. Then a feed-forward neural network (FNN) is used as the policy network in reinforcement learning to select sentences from the remote supervision data set, and the sentences whose soft-probability confidence exceeds a threshold are retained. The screened sentences are merged with the manually labeled data set to form a new training set, which is finally used to train the hybrid model and update the policy network. The method effectively improves the performance of named entity recognition models based on remote supervision.

Description

Remote supervision named entity recognition method based on semi-training and sentence selection
Technical Field
The invention belongs to the technical field of natural language processing, in particular to named entity recognition, and more specifically relates to a remote supervision named entity recognition method based on semi-training and sentence selection.
Background
Named entity recognition is a fundamental task in natural language processing that aims to locate named entities in plain text and classify them into predefined entity classes. It is a subtask of information extraction, which in turn has a series of important downstream applications such as question-answering systems, search engines and knowledge graphs. Traditional named entity recognition methods such as conditional random fields and hidden Markov models require a large number of manually designed features. In recent years, with the development of deep neural networks, manually designed features have become unnecessary. One standard deep learning approach to named entity recognition uses a bidirectional recurrent neural network (BiRNN) as the feature extractor and a conditional random field (CRF) as the decoder.
Although no manually designed features are required, most deep learning models need a large amount of annotated data for training. High-quality manual annotations, however, are often difficult to obtain in large quantities; in some specific fields, only experienced domain experts can label plain text correctly. Small amounts of high-quality annotated data, by contrast, are relatively easy to acquire.
The remote supervision method can conveniently generate a large amount of labeled data from plain text using a dictionary or knowledge base. It has been widely applied to the relation extraction task with good results. In the remote supervised named entity recognition task, data are typically annotated by matching entities in the dictionary against corresponding fields in the plain text. Because the dictionary's coverage is limited, the data set generated by remote supervision contains a large amount of false negative data; meanwhile, entity tagging with string matching techniques introduces false positive data. Such incorrectly labeled data severely impacts the performance of the named entity recognition model.
In view of the above, it is desirable to design a new method for identifying a remote supervised named entity to solve the above problems.
Disclosure of Invention
In view of the above, the invention discloses a remote supervision named entity recognition method based on semi-training and sentence selection. First, a reinforcement learning strategy is combined with a high-confidence selection algorithm to address the noisy-labeling problem in remote supervised named entity recognition. Second, a semi-training strategy is proposed to solve the cold start problem in the reinforcement learning strategy and the high-confidence selection algorithm.
The purpose of the invention is realized by the following technical scheme: a remote supervision named entity recognition method based on semi-training and sentence selection comprises the following steps:
S1, manually labeling a small amount of plain text to form a manually labeled data set D_h, and constructing a dictionary from the entity fields in D_h;
S2, labeling plain text using the dictionary and a string matching technique to generate a remote supervision data set D_r;
S3, through a semi-training strategy, training a hybrid bidirectional recurrent neural network and conditional random field model on the manually labeled data set D_h until the balanced F score of the model on D_h reaches a preset semi-training interval;
S4, adopting a feed-forward neural network (FNN) as the policy network for reinforcement learning, and selecting sentences from the remote supervision data set D_r to form a data set D_s;
S5, calculating the soft probability of each sentence from the output of the hybrid model, selecting from D_s the sentences whose confidence is greater than a confidence threshold ε, and merging the selected sentences with the manually labeled data set D_h to form a new training set D_n;
S6, training the hybrid bidirectional recurrent neural network and conditional random field model on the new training set D_n, and updating the policy network;
S7, using the trained hybrid model as the named entity recognition model to predict labels for the tokens (word blocks) in unlabeled plain text data.
Further, the step S1 specifically includes:
S11, labeling a small amount of plain text in sequence labeling form to generate the manually labeled data set D_h;
S12, extracting all entity fields in D_h and deduplicating them;
S13, storing all distinct entity fields in the dictionary.
Further, the step S2 specifically includes:
S21, matching corresponding fields in the plain text with a string matching technique based on the dictionary;
S22, labeling the matched entity fields in sequence labeling form to generate the remote supervision data set D_r.
Further, in step S2, the string matching technique employs a bidirectional maximum matching algorithm, a forward maximum matching algorithm, or a reverse maximum matching algorithm.
Further, the step S3 specifically includes:
S31, training the hybrid bidirectional recurrent neural network and conditional random field model from its initial state on the manually labeled data set D_h, and stopping when the balanced F score of the model on D_h reaches the preset semi-training interval;
S32, using the semi-trained hybrid model as the initial model that provides the reward value R and the environment state s for the reinforcement-learning policy network.
Further, in step S3, the preset semi-training interval is 0.85 to 0.95.
Further, the step S4 specifically includes:
S41, adopting an FNN as the policy network for reinforcement learning, the policy network generating an action a_i for each sentence in the remote supervision data set D_r;
S42, for the i-th sentence, the policy network is formally expressed as:
π_θ(a_i | s_i) = a_i · σ(θ · s_i) + (1 − a_i) · (1 − σ(θ · s_i))
wherein σ is the sigmoid function; θ is the parameter of the policy network; a_i is the action generated by the policy network for the i-th sentence; s_i is the state of the i-th sentence, namely the environment state of reinforcement learning; and π_θ(a_i | s_i) is the probability that the policy network with parameter θ takes action a_i on a sentence in state s_i;
S43, taking the selected sentences as the data set D_s.
Further, in step S42, the action space comprises two actions, namely selecting the sentence and discarding the sentence: when a_i = 1 the sentence is selected, and when a_i = 0 the sentence is discarded; s_i is obtained by feeding the i-th sentence through the hybrid model and concatenating the output of the bidirectional recurrent neural network layer with the output of the hidden layer in the conditional random field layer.
Further, the step S5 specifically includes:
S51, obtaining from the hidden layer of the conditional random field layer of the hybrid model the probability value p_{i,t,c} that the t-th token of the i-th sentence belongs to class c, wherein x_i is the vector representation of the i-th sentence;
S52, calculating the soft probability sp_{i,t,c} of the t-th token of the i-th sentence:
sp_{i,t,c} = exp(p_{i,t,c}) / Σ_{c'=1..C} exp(p_{i,t,c'})
wherein i = 1, …, N, t = 1, …, T and c = 1, …, C; N is the total number of sentences in the data set D_s, T is the total number of tokens in each sentence, and C is the total number of label categories;
S53, calculating the confidence of each token from the soft probability: conf_{i,t} = max_c sp_{i,t,c};
S54, calculating the confidence Conf_i of each sentence according to the cask principle:
Conf_i = min_t conf_{i,t}
S55, if the confidence of the i-th sentence is greater than the confidence threshold ε, merging it with the manually labeled data set D_h to form the new training set D_n, wherein ε is a constant in the interval [0, 1].
Further, in step S6, the policy network updating step includes:
S61, computing the reward value R of the policy network, formally expressed as:
R = (1/|B|) · Σ_{x_i ∈ B} log p(y_i | x_i)
wherein B is the set of sentences of one batch taken from the new training set D_n, |B| is the total number of sentences in B, and p(y_i | x_i) is the probability under the hybrid bidirectional recurrent neural network and conditional random field model of labeling the sentence with vector representation x_i as the label sequence y_i;
S62, updating the parameters of the policy network π_θ, formally expressed as:
θ ← θ + α · R · Σ_i ∇_θ log π_θ(a_i | s_i)
wherein α is the learning rate.
The invention has the beneficial effects that:
1. For the remote supervised named entity recognition task, the invention provides a distinctive sentence selection method: a reinforcement-learning policy network first selects sentences from the remotely supervised data, and the selected sentences are then screened by soft probability, improving the quality of sentence selection. Training on the data set produced by this sentence selection strategy improves the prediction performance of the named entity recognition model.
2. The semi-training strategy proposed by the invention effectively solves the cold start problem in the reinforcement-learning policy network and the soft-probability screening algorithm, further improving the prediction performance of the finally trained named entity recognition model.
Drawings
The various aspects of the present invention will become more apparent to the reader after reading the detailed description with reference to the accompanying drawings, which illustrate some embodiments of the invention and from which those of ordinary skill in the art may derive other drawings without inventive effort.
Fig. 1 is a schematic flowchart of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention.
Fig. 2 is a schematic block diagram of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention.
Fig. 3 is a block diagram of a remote supervised named entity recognition device based on semi-training and sentence selection in accordance with the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic flowchart of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention. Fig. 2 is a schematic block diagram of a remote supervised named entity recognition method based on semi-training and sentence selection according to an embodiment of the present invention. The embodiment of the invention provides a remote supervision named entity recognition method based on semi-training and sentence selection, which specifically comprises the following steps:
s1, manually labeling a small amount of plain text to form an manually labeled data set
Figure 800530DEST_PATH_IMAGE001
Tagging data sets by hand
Figure 872392DEST_PATH_IMAGE001
The entity field in the dictionary to construct a dictionary; this step can be achieved by the following substeps:
s11, labeling a small amount of plain texts in a sequence labeling mode to generate an artificial labeling data set
Figure 515863DEST_PATH_IMAGE001
S12, extracting the artificial annotation data set
Figure 647767DEST_PATH_IMAGE001
All entity fields in (1), proceeding simultaneouslyCarrying out duplicate removal treatment;
and S13, storing all non-repeated entity fields in the dictionary.
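As an illustration of substeps S11-S13, dictionary construction from a BIO-tagged corpus can be sketched as follows. This is a minimal sketch, not the patent's implementation; the toy data set, the entity type `ORG` and the helper names are our own assumptions:

```python
def extract_entities(tokens, tags):
    """Collect entity surface forms from one BIO-tagged sentence."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

def build_dictionary(labeled_dataset):
    """S12/S13: extract all entity fields from D_h and deduplicate them."""
    dictionary = set()
    for tokens, tags in labeled_dataset:
        dictionary.update(extract_entities(tokens, tags))
    return dictionary

# Toy manually labeled data set D_h (two sentences, BIO scheme, one ORG entity).
D_h = [
    (["浙", "江", "实", "验", "室", "成", "立"],
     ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]),
    (["访", "问", "浙", "江", "实", "验", "室"],
     ["O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG"]),
]
print(build_dictionary(D_h))  # duplicates removed: {'浙江实验室'}
```

The set gives the deduplication of S12 for free; in practice the entries would be persisted as the dictionary used in S2.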
S2, labeling plain text with the dictionary and a string matching technique to generate a remote supervision data set D_r. This step can be achieved by the following substeps:
S21, matching corresponding fields in the plain text with a string matching technique based on the dictionary. Specifically, the string matching technique may use a bidirectional maximum matching algorithm, or matching algorithms such as forward maximum matching and reverse maximum matching;
S22, labeling the matched entity fields in sequence labeling form to generate the remote supervision data set D_r. Sequence labeling schemes include BIO, BIOES, etc.
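A forward maximum matching pass over the dictionary, one of the algorithms named in S21, can produce BIO labels roughly as follows. This is an illustrative sketch with a single collapsed entity type `ENT`; the function and variable names are ours, not the patent's:

```python
def remote_label(text, dictionary, max_len=10):
    """S21/S22 sketch: BIO-label plain text by forward maximum matching
    against the entity dictionary (longest dictionary hit wins)."""
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        match = None
        # Try the longest window first, shrinking until a dictionary hit.
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in dictionary:
                match = L
                break
        if match:
            tags[i] = "B-ENT"
            for j in range(i + 1, i + match):
                tags[j] = "I-ENT"
            i += match          # skip past the matched entity
        else:
            i += 1              # no entity starts here
    return tags

dictionary = {"浙江实验室"}
text = "我在浙江实验室工作"
print(remote_label(text, dictionary))
# ['O', 'O', 'B-ENT', 'I-ENT', 'I-ENT', 'I-ENT', 'I-ENT', 'O', 'O']
```

Reverse maximum matching would scan from the end of the string instead, and bidirectional matching would run both and keep the segmentation with fewer single-character fragments.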
s3, labeling the data set by manual work through a semi-training strategy
Figure 15797DEST_PATH_IMAGE001
Training a bidirectional cyclic neural network and conditional random field mixed model (marked as BiRNN + CRF model) until the model is manually marked on a data set
Figure 787444DEST_PATH_IMAGE001
The above balance F score (also called F1 value) reaches the preset half training interval; this step can be achieved by the following substeps:
s31, marking data set by using manual work
Figure 269241DEST_PATH_IMAGE001
Training a two-way cyclic neural network and conditional random field hybrid model from an initial state until the model is manually labeled in a data set
Figure 682905DEST_PATH_IMAGE001
Stopping training when the balance F score reaches a preset half training interval; the preset half training interval is preferably 0.85-0.95;
s32, providing reward values for the strategy network of the reinforcement learning by taking the semi-trained bidirectional cyclic neural network and the conditional random field mixed model as initial models
Figure 300968DEST_PATH_IMAGE036
And environmental state
Figure 509095DEST_PATH_IMAGE046
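The semi-training stopping rule of S31 amounts to a training loop that halts as soon as the F1 score on D_h enters the preset interval, rather than training to convergence. The following sketch shows only the control flow; `train_one_epoch` and `evaluate_f1` are hypothetical stand-ins for the BiRNN+CRF training and evaluation routines, and the dummy model whose F1 rises by 0.1 per epoch exists purely for illustration:

```python
def semi_train(model, data, evaluate_f1, train_one_epoch,
               interval=(0.85, 0.95), max_epochs=500):
    """S31: train from the initial state, stopping as soon as the
    balanced F score on the manually labeled data set D_h falls inside
    the preset semi-training interval."""
    f1 = 0.0
    for epoch in range(max_epochs):
        train_one_epoch(model, data)
        f1 = evaluate_f1(model, data)
        if interval[0] <= f1 <= interval[1]:
            return model, f1, epoch + 1   # semi-trained: stop early
    return model, f1, max_epochs

# Dummy model whose F1 improves by 0.1 per epoch, for illustration only.
state = {"f1": 0.0}
model, f1, epochs = semi_train(
    model=state,
    data=None,
    train_one_epoch=lambda m, d: m.update(f1=m["f1"] + 0.1),
    evaluate_f1=lambda m, d: round(m["f1"], 2),
)
print(f1, epochs)  # stops inside [0.85, 0.95]: 0.9 after 9 epochs
```

Stopping inside the interval leaves the model deliberately under-fitted, which is what lets it supply informative (non-saturated) rewards and states to the policy network in S32.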
S4, adopting a feed-forward neural network (FNN) as the policy network for reinforcement learning, and selecting sentences from the remote supervision data set D_r to form a data set D_s. This step can be achieved by the following substeps:
S41, adopting an FNN as the policy network for reinforcement learning, the policy network generating an action a_i for each sentence in D_r;
S42, for the i-th sentence, the policy network is formally expressed as:
π_θ(a_i | s_i) = a_i · σ(θ · s_i) + (1 − a_i) · (1 − σ(θ · s_i))
wherein σ is the sigmoid function and θ is the parameter of the policy network. a_i is the action generated by the policy network for the i-th sentence; the action space consists of two actions, selecting the sentence and discarding it, and in a specific application a_i ∈ {0, 1} is typically used: when a_i = 1 the sentence is selected, and when a_i = 0 the sentence is discarded. s_i is the state of the i-th sentence, namely the environment state of reinforcement learning, obtained by feeding the i-th sentence through the BiRNN+CRF model and concatenating the output of the bidirectional recurrent neural network layer with the output of the hidden layer in the conditional random field layer. π_θ(a_i | s_i) is the probability that the policy network with parameter θ takes action a_i on a sentence in state s_i;
S43, taking the selected sentences as the data set D_s.
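The selection rule of S41-S43 can be sketched in a few lines of pure Python. We assume, for illustration only, the simplest single-layer form σ(θ·s_i) for the FNN and a greedy (threshold 0.5) selection rule; the real policy network could be deeper and could sample actions stochastically:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def policy_prob(theta, s, a):
    """pi_theta(a|s) = a*sigma(theta.s) + (1-a)*(1-sigma(theta.s))."""
    p_select = 1.0 / (1.0 + math.exp(-dot(theta, s)))
    return a * p_select + (1 - a) * (1 - p_select)

def select_sentences(theta, states):
    """S41/S43: emit an action a_i for each sentence state s_i in D_r and
    keep the sentences with a_i = 1 as the data set D_s (greedy variant)."""
    return [int(policy_prob(theta, s, 1) >= 0.5) for s in states]

theta = [1.0, -1.0]                 # toy policy parameters
states = [[2.0, 0.0],               # sigma(2)  ~ 0.88 -> select
          [0.0, 2.0]]               # sigma(-2) ~ 0.12 -> discard
print(select_sentences(theta, states))  # [1, 0]
```

In the patent's setting each state vector s_i would be the concatenation of the BiRNN layer output and the CRF hidden-layer output for that sentence, so its dimension is fixed by the hybrid model, not chosen freely as here.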
S5, calculating the soft probability of each sentence from the output of the BiRNN+CRF model, selecting from D_s the sentences whose confidence is greater than the confidence threshold ε, and merging the selected sentences with the manually labeled data set D_h to form a new training set D_n. This step can be achieved by the following substeps:
S51, obtaining from the hidden layer of the conditional random field layer of the BiRNN+CRF model the probability value p_{i,t,c} that the t-th token of the i-th sentence belongs to class c, wherein x_i is the vector representation of the i-th sentence;
S52, calculating the soft probability sp_{i,t,c} of the t-th token of the i-th sentence:
sp_{i,t,c} = exp(p_{i,t,c}) / Σ_{c'=1..C} exp(p_{i,t,c'})
wherein i = 1, …, N, t = 1, …, T and c = 1, …, C; N is the total number of sentences in the data set D_s; T is the total number of tokens in each sentence, usually set to a constant value according to the principle of "truncate the long, zero-pad the short"; and C is the total number of label categories;
S53, calculating the confidence of each token from the soft probability: conf_{i,t} = max_c sp_{i,t,c};
S54, calculating the confidence Conf_i of each sentence according to the cask principle:
Conf_i = min_t conf_{i,t}
S55, if the confidence of the i-th sentence is greater than the confidence threshold ε, merging it with the manually labeled data set D_h to form the new training set D_n, wherein ε is a constant in the interval [0, 1].
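The screening in S51-S55 reduces to a softmax over label categories per token, a max over categories per token, and a min over tokens per sentence (the cask principle: a sentence is only as reliable as its least confident token). A pure-Python sketch, with made-up per-token class scores standing in for the CRF hidden-layer values:

```python
import math

def soft_probabilities(scores):
    """S52: softmax over the C label categories for one token."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sentence_confidence(token_scores):
    """S53/S54: token confidence = max soft probability over categories;
    sentence confidence = min over its tokens (cask principle)."""
    return min(max(soft_probabilities(scores)) for scores in token_scores)

def select_high_confidence(dataset, epsilon):
    """S55: keep the sentences whose confidence exceeds the threshold."""
    return [sent for sent, scores in dataset
            if sentence_confidence(scores) > epsilon]

# Two sentences with per-token class scores (C = 2 categories here).
D_s = [
    ("sentence A", [[4.0, 0.0], [3.0, 0.0]]),   # every token confidently labeled
    ("sentence B", [[4.0, 0.0], [0.1, 0.0]]),   # one ambiguous token drags it down
]
print(select_high_confidence(D_s, epsilon=0.9))  # ['sentence A']
```

Using the min rather than the mean is deliberate: averaging would let many confident tokens mask one badly mislabeled token, which is exactly the kind of noise remote supervision introduces.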
S6, training the BiRNN+CRF model on the new training set D_n and updating the policy network. The policy network is updated as follows:
S61, computing the reward value R of the policy network, formally expressed as:
R = (1/|B|) · Σ_{x_i ∈ B} log p(y_i | x_i)
wherein B is the set of sentences of one batch taken from the new training set D_n, |B| is the total number of sentences in B, and p(y_i | x_i) is the probability under the BiRNN+CRF model of labeling the sentence with vector representation x_i as the label sequence y_i;
S62, updating the parameters of the policy network π_θ, formally expressed as:
θ ← θ + α · R · Σ_i ∇_θ log π_θ(a_i | s_i)
wherein α is the learning rate.
S7, using the trained BiRNN+CRF model as the named entity recognition model to predict labels for the tokens (word blocks) in unlabeled plain text data.
The sentence selection, model training and policy network update process in steps S4-S6 above may be performed multiple times.
Examples
Applicants conducted experiments on an open-source e-commerce text data set. The data set contains 2400 manually labeled samples in total, each sample consisting of one sentence: 1200 samples in the training set, 400 in the validation set, and 800 in the test set. 927 entities were collected from the data set, and 2500 samples labeled by the remote supervision method were obtained. 100-dimensional word embedding vectors were used, trained on 1 million unlabeled sentences with the word2vec method. Adam and RMSprop were used as the optimizers of the policy network and the named entity recognition model respectively; both learning rates were 0.001, and the maximum number of iterations was 500. The results of the experiment are as follows:
[Table of experimental results: rendered as an image in the source and not reproduced here.]
In the table, HATS denotes the remote supervision named entity recognition method based on semi-training and sentence selection of the invention; HATS w/o RL denotes the HATS method without the reinforcement learning strategy.
Corresponding to the embodiment of the remote supervision named entity recognition method based on semi-training and sentence selection, the invention also provides an embodiment of a remote supervision named entity recognition device based on semi-training and sentence selection.
Referring to fig. 3, an embodiment of the present invention provides a semi-training and sentence selection based remote supervised named entity recognition apparatus, which includes a memory and one or more processors, where the memory stores executable codes, and the processors execute the executable codes to implement the semi-training and sentence selection based remote supervised named entity recognition method in the foregoing embodiments.
The embodiments of the remote supervised named entity recognition apparatus based on semi-training and sentence selection can be applied to any device with data processing capability, such as a computer or similar device. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking the software implementation as an example, as a logical device, the apparatus is formed by the processor of the device reading the corresponding computer program instructions from non-volatile memory into memory and running them. From a hardware aspect, fig. 3 shows a hardware structure diagram of a device with data processing capability on which the apparatus is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 3, the device may also include other hardware according to its actual function, which is not described again here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the relevant parts of the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the present invention also provides a computer readable storage medium storing a program which, when executed by a processor, implements the remote supervision named entity recognition method based on semi-training and sentence selection in the foregoing embodiments.
The computer readable storage medium may be an internal storage unit of any device with data processing capability described in the foregoing embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
Hereinbefore, specific embodiments of the present invention are described with reference to the drawings. However, those skilled in the art will appreciate that various modifications and substitutions can be made to the specific embodiments of the present invention without departing from the spirit and scope of the invention. Such modifications and substitutions are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (10)

1. A remote supervision named entity recognition method based on semi-training and sentence selection, characterized by comprising the following steps:
S1, manually labeling a small amount of plain text to form a manually labeled data set D_L, and constructing a dictionary from the entity fields in D_L;
S2, labeling plain text by means of the dictionary and a string matching technique to generate a remote supervision data set D_R;
S3, with a semi-training strategy, training a hybrid model of a bidirectional recurrent neural network and a conditional random field on the manually labeled data set D_L until the balanced F score of the hybrid model on D_L reaches a preset semi-training interval;
S4, adopting a feedforward neural network FNN as the policy network for reinforcement learning, selecting sentences from the remote supervision data set D_R, and taking the selected sentences as a data set D_S;
S5, calculating the soft probability of each sentence from the output of the hybrid model, selecting from the data set D_S the sentences whose confidence is greater than a confidence threshold δ, and merging the selected sentences with the manually labeled data set D_L to form a new training set D_N;
S6, training the hybrid model of the bidirectional recurrent neural network and the conditional random field with the new training set D_N, and updating the policy network;
S7, taking the trained hybrid model as the named entity recognition model, and predicting labels for the tokens in unlabeled plain text data.
2. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein step S1 specifically comprises:
S11, labeling a small amount of plain text in a sequence-labeling format to generate the manually labeled data set D_L;
S12, extracting all entity fields in the manually labeled data set D_L and removing duplicates;
S13, storing all the non-repeated entity fields in the dictionary.
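As an illustration of steps S11 to S13, the following is a minimal sketch of the dictionary construction in Python. The BIO tag scheme, the entity type name `DIS`, and character-level tokens are assumptions for illustration; none of these is fixed by the claim.

```python
# Hypothetical sketch of claim 2 (S11-S13): collect the de-duplicated entity
# fields of a small manually labeled data set into a dictionary.

def build_dictionary(labeled_sentences):
    """labeled_sentences: list of (tokens, BIO tags). Returns the set of
    unique entity fields (S12 de-duplication is implicit in the set)."""
    dictionary = set()
    for tokens, tags in labeled_sentences:
        entity = []
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):            # a new entity begins
                if entity:
                    dictionary.add("".join(entity))
                entity = [token]
            elif tag.startswith("I-") and entity:
                entity.append(token)            # the entity continues
            else:
                if entity:                      # the entity just ended
                    dictionary.add("".join(entity))
                entity = []
        if entity:                              # entity at end of sentence
            dictionary.add("".join(entity))
    return dictionary

# Two sentences sharing one entity field; the duplicate is removed (S12).
data = [
    (["慢", "性", "胃", "炎", "患", "者"],
     ["B-DIS", "I-DIS", "I-DIS", "I-DIS", "O", "O"]),
    (["诊", "断", "慢", "性", "胃", "炎"],
     ["O", "O", "B-DIS", "I-DIS", "I-DIS", "I-DIS"]),
]
dictionary = build_dictionary(data)  # {'慢性胃炎'}
```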
3. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein step S2 specifically comprises:
S21, matching corresponding fields in the plain text with a string matching technique based on the dictionary;
S22, labeling the matched entity fields in the sequence-labeling format to generate the remote supervision data set D_R.
4. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein in step S2 the string matching technique employs a bidirectional maximum matching algorithm, a forward maximum matching algorithm, or a backward maximum matching algorithm.
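A forward maximum matching pass, one of the algorithms this claim names, can be sketched as follows. The BIO tag names, the maximum entity length, and character-level matching are assumptions for illustration:

```python
# Hypothetical sketch of claims 3-4: greedily match the longest dictionary
# entry at each position (forward maximum matching) and emit BIO tags (S22).

def forward_max_match(text, dictionary, max_len=6):
    """Label dictionary hits in `text` with B-ENT/I-ENT tags; everything
    else stays O. `max_len` bounds the longest entity tried (assumed)."""
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward length 1.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tags[i] = "B-ENT"
                for k in range(i + 1, j):
                    tags[k] = "I-ENT"
                i = j          # jump past the matched entity
                break
        else:
            i += 1             # no match starting here; advance one token
    return tags

tags = forward_max_match("急性胃炎", {"胃炎"})  # ['O', 'O', 'B-ENT', 'I-ENT']
```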
5. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein step S3 specifically comprises:
S31, training the hybrid model of the bidirectional recurrent neural network and the conditional random field from its initial state on the manually labeled data set D_L, and stopping the training when the balanced F score of the hybrid model on D_L reaches the preset semi-training interval;
S32, taking the semi-trained hybrid model as the initial model, which provides the reward value r and the environment state s for the policy network of reinforcement learning.
6. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein in step S3 the preset semi-training interval is 0.85 to 0.95.
7. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein step S4 specifically comprises:
S41, adopting the feedforward neural network FNN as the policy network for reinforcement learning, the policy network generating an action a_i for each sentence in the remote supervision data set D_R;
S42, for the i-th sentence, expressing the policy network as:
π(a_i | s_i; θ) = a_i·σ(θ·s_i) + (1 − a_i)·(1 − σ(θ·s_i))
wherein σ is the sigmoid function, θ is the parameter of the policy network, a_i is the action generated by the policy network for the i-th sentence, s_i is the state of the i-th sentence, and π(a_i | s_i; θ) is the probability that the policy network with parameter θ takes action a_i for a sentence in state s_i;
S43, taking the selected sentences as the data set D_S.
8. The method according to claim 7, wherein in step S42 the action space comprises two actions, namely selecting the sentence and discarding the sentence: the sentence is selected when a_i = 1 and discarded when a_i = 0; the state s_i is obtained by passing the i-th sentence through the hybrid model of the bidirectional recurrent neural network and the conditional random field and concatenating the output of the bidirectional recurrent neural network layer with the output of the hidden layer in the conditional random field layer.
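The policy of claims 7 and 8 can be sketched as a Bernoulli policy over the two actions. The single linear layer and the random placeholder state vector are assumptions: the claims only fix an FNN with a sigmoid output and a state concatenated from the BiRNN output and the CRF hidden output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def policy_prob(theta, state, action):
    """pi(a_i | s_i; theta): probability of action a_i in state s_i.

    p_select = sigmoid(theta . s_i); the probability of the taken action is
    p_select for a_i = 1 (select the sentence) and 1 - p_select for a_i = 0
    (discard it), i.e. a Bernoulli policy over the two actions.
    """
    p_select = sigmoid(theta @ state)
    return action * p_select + (1 - action) * (1 - p_select)

rng = np.random.default_rng(0)
theta = rng.normal(size=8)    # policy parameters (single linear layer assumed)
state = rng.normal(size=8)    # s_i: placeholder for BiRNN output ++ CRF hidden
p_select = policy_prob(theta, state, 1)
p_discard = policy_prob(theta, state, 0)
```

Since the action space has exactly two actions, the two probabilities sum to one by construction.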
9. The remote supervision named entity recognition method based on semi-training and sentence selection according to claim 1, wherein step S5 specifically comprises:
S51, obtaining, from the hidden layer of the conditional random field layer of the hybrid model, the probability value p_ij^k that the j-th token of the i-th sentence belongs to class k, k = 1, 2, …, K, wherein K is the total number of label categories and x_i is the vector representation of the i-th sentence;
S52, calculating the soft probability sp_ij of the j-th token of the i-th sentence;
S53, calculating the confidence c_ij of each token according to the soft probability;
S54, calculating the confidence C_i of each sentence according to the barrel principle;
S55, if the confidence C_i of the i-th sentence is greater than the confidence threshold δ, merging the i-th sentence with the manually labeled data set D_L to form the new training set D_N.
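Steps S52 to S55 can be sketched as follows, assuming the soft probability of a token is its largest class probability and the barrel principle takes the minimum over tokens (plausible readings of the claim, not stated verbatim; the probability values and the threshold are made up for illustration):

```python
import numpy as np

def sentence_confidence(token_probs):
    """token_probs: array of shape (num_tokens, K), each row the class
    distribution of one token (e.g. CRF marginals, S51). The soft probability
    of a token is taken here as its largest class probability (S52/S53), and
    the sentence confidence as the minimum over tokens (S54): by the barrel
    principle, the least confident token bounds the whole sentence."""
    soft = token_probs.max(axis=1)   # per-token soft probability
    return float(soft.min())         # weakest token = sentence confidence

probs = np.array([
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.90, 0.05, 0.05],
])
conf = sentence_confidence(probs)    # min(0.70, 0.50, 0.90) = 0.50
delta = 0.6                          # confidence threshold (value assumed)
keep_sentence = conf > delta         # S55: False here, sentence is rejected
```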
10. The method according to claim 7, wherein the policy network in step S6 is updated as follows:
S61, expressing the reward value r of the policy network as:
r = (1 / |B|) · Σ_{i∈B} log p(y_i | x_i)
wherein B is the set of sentences of one batch taken from the new training set D_N, and p(y_i | x_i) is the probability that the hybrid model of the bidirectional recurrent neural network and the conditional random field labels the i-th sentence as y_i based on its vector representation x_i;
S62, the policy network π updating its parameter θ in the following way:
θ ← θ + α · r · Σ_{i∈B} ∇_θ log π(a_i | s_i; θ)
wherein α is the learning rate.
CN202111644281.0A 2021-12-30 2021-12-30 Remote supervision named entity recognition method based on semi-training and sentence selection Active CN114004233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111644281.0A CN114004233B (en) 2021-12-30 2021-12-30 Remote supervision named entity recognition method based on semi-training and sentence selection


Publications (2)

Publication Number Publication Date
CN114004233A true CN114004233A (en) 2022-02-01
CN114004233B CN114004233B (en) 2022-05-06

Family

ID=79932356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111644281.0A Active CN114004233B (en) 2021-12-30 2021-12-30 Remote supervision named entity recognition method based on semi-training and sentence selection

Country Status (1)

Country Link
CN (1) CN114004233B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018066B (en) * 2022-05-23 2024-04-09 北京计算机技术及应用研究所 Deep neural network localization training method under side-end mode

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160098645A1 (en) * 2014-10-02 2016-04-07 Microsoft Corporation High-precision limited supervision relationship extractor
CN108845988A (en) * 2018-06-07 2018-11-20 苏州大学 A kind of entity recognition method, device, equipment and computer readable storage medium
CN109255119A (en) * 2018-07-18 2019-01-22 五邑大学 A kind of sentence trunk analysis method and system based on the multitask deep neural network for segmenting and naming Entity recognition
CN109635108A (en) * 2018-11-22 2019-04-16 华东师范大学 A kind of remote supervisory entity relation extraction method based on human-computer interaction
CN110826334A (en) * 2019-11-08 2020-02-21 中山大学 Chinese named entity recognition model based on reinforcement learning and training method thereof
CN110991165A (en) * 2019-12-12 2020-04-10 智器云南京信息科技有限公司 Method and device for extracting character relation in text, computer equipment and storage medium
CN111291195A (en) * 2020-01-21 2020-06-16 腾讯科技(深圳)有限公司 Data processing method, device, terminal and readable storage medium
CN111832294A (en) * 2020-06-24 2020-10-27 平安科技(深圳)有限公司 Method and device for selecting marking data, computer equipment and storage medium
CN111914558A (en) * 2020-07-31 2020-11-10 湖北工业大学 Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN112348113A (en) * 2020-11-27 2021-02-09 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of offline meta reinforcement learning model
CN113591478A (en) * 2021-06-08 2021-11-02 电子科技大学 Remote supervision text entity relation extraction method based on deep reinforcement learning


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ERXIN YU等: "A Two-Level Noise-Tolerant Model for Relation Extraction with Reinforcement Learning", 《2020 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH (ICKG)》 *
刘鑫: "基于弱监督深度学习的中医文本关系抽取研究", 《中国优秀硕士学位论文全文数据库(电子期刊)》 *
杨穗珠等: "远程监督关系抽取综述", 《计算机学报》 *
白龙等: "基于远程监督的关系抽取研究综述", 《中文信息学报》 *


Also Published As

Publication number Publication date
CN114004233B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN110795911B (en) Real-time adding method and device for online text labels and related equipment
Luan et al. Scientific information extraction with semi-supervised neural tagging
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN107526799A (en) A kind of knowledge mapping construction method based on deep learning
CN109697289B (en) Improved active learning method for named entity recognition
CN110377902B (en) Training method and device for descriptive text generation model
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN110532353A (en) Text entities matching process, system, device based on deep learning
CN109461446A (en) Method, device, system and storage medium for identifying user target request
CN108228758A (en) A kind of file classification method and device
CN116662552A (en) Financial text data classification method, device, terminal equipment and medium
CN111414845B (en) Multi-form sentence video positioning method based on space-time diagram inference network
CN114004233B (en) Remote supervision named entity recognition method based on semi-training and sentence selection
CN112699218A (en) Model establishing method and system, paragraph label obtaining method and medium
CN115994224A (en) Phishing URL detection method and system based on pre-training language model
CN116595189A (en) Zero sample relation triplet extraction method and system based on two stages
CN117787283A (en) Small sample fine granularity text named entity classification method based on prototype comparison learning
CN112131879A (en) Relationship extraction system, method and device
CN117272142A (en) Log abnormality detection method and system and electronic equipment
CN113722431B (en) Named entity relationship identification method and device, electronic equipment and storage medium
CN116910251A (en) Text classification method, device, equipment and medium based on BERT model
CN112634869B (en) Command word recognition method, device and computer storage medium
CN115080748A (en) Weak supervision text classification method and device based on noisy label learning
CN116629387B (en) Text processing method and processing system for training under missing condition
CN117932073B (en) Weak supervision text classification method and system based on prompt engineering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant