CN112528653B - Short text entity recognition method and system - Google Patents
- Publication number: CN112528653B (application CN202011398845.2A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present disclosure provides a short text entity recognition method, comprising: acquiring a short text and segmenting it into words; performing word vector training on the segmented short text to generate a word vector sequence; performing part-of-speech feature learning on each word vector in the word vector sequence under an adversarial framework to obtain the part-of-speech feature of each word vector; extracting the local context feature of each word vector in the word vector sequence and the global semantic features among the word vectors; and recognizing entities using the part-of-speech features, the local context features, and the global semantic features.
Description
Technical Field
The present disclosure relates generally to entity recognition, and more particularly to entity recognition for free text.
Background
The task of extracting information from text has a very wide range of application scenarios. Named entity recognition is a basic task in natural language processing and the foundation of many other tasks, such as relation extraction, event extraction, knowledge graph construction, information extraction, question answering, syntactic analysis, and machine translation.
However, the difficulty of entity recognition varies from text to text. For short text, semantic ambiguity and the lack of context information make it difficult for conventional entity recognition models to produce accurate results. Furthermore, the data sparsity of short text greatly degrades the performance of common models.
Thus, there is a need in the art for an efficient short text entity recognition scheme to enable accurate entity recognition for short text.
Disclosure of Invention
In order to solve the technical problems, the present disclosure provides an efficient short text entity recognition scheme.
In an embodiment of the present disclosure, there is provided a short text entity recognition method, comprising: acquiring a short text and segmenting it into words; performing word vector training on the segmented short text to generate a word vector sequence; performing part-of-speech feature learning on each word vector in the word vector sequence under an adversarial framework to obtain the part-of-speech feature of each word vector; extracting the local context feature of each word vector in the word vector sequence and the global semantic features among the word vectors; and recognizing entities using the part-of-speech features, the local context features, and the global semantic features.
In another embodiment of the present disclosure, recognizing entities using the part-of-speech features, the local context features, and the global semantic features further comprises: performing dimension-reduction screening on the word vectors in the word vector sequence using the part-of-speech features; and recognizing entities from the word vectors obtained after the dimension-reduction screening based on the local context features and the global semantic features.
In yet another embodiment of the present disclosure, fuzzy entity matching is performed on the identified entities.
In another embodiment of the present disclosure, the manner in which short text is segmented is selected based on the language and type of the short text.
In yet another embodiment of the present disclosure, recognizing entities from the individual word vectors using the part-of-speech features, the local context features, and the global semantic features is implemented on an adversarial training framework.
In another embodiment of the present disclosure, extracting local context features and global semantic features is implemented through an Attention mechanism.
In one embodiment of the present disclosure, there is provided a short text entity recognition system, including: a word segmentation module that obtains a short text and segments it into words; a word vector generation module that performs word vector training on the segmented short text to generate a word vector sequence; a feature acquisition module that performs part-of-speech feature learning on each word vector in the word vector sequence under an adversarial framework to acquire the part-of-speech feature of each word vector, and extracts the local context feature of each word vector in the word vector sequence and the global semantic features among the word vectors; and an entity recognition module that recognizes entities from the word vectors using the part-of-speech features, the local context features, and the global semantic features.
In another embodiment of the present disclosure, the entity recognition module recognizing entities using the part-of-speech features, the local context features, and the global semantic features further comprises: performing dimension-reduction screening on the word vectors in the word vector sequence using the part-of-speech features; and recognizing entities from the word vectors obtained after the dimension-reduction screening based on the local context features and the global semantic features.
In yet another embodiment of the present disclosure, the short text entity recognition system further includes a fuzzy entity matching module that performs fuzzy entity matching on the recognized entities.
In another embodiment of the present disclosure, the manner in which the word segmentation module segments the short text is selected based on the language and type of the short text.
In yet another embodiment of the present disclosure, the entity recognition module recognizes entities from the individual word vectors using the part-of-speech features, the local context features, and the global semantic features on an adversarial training framework.
In another embodiment of the present disclosure, the feature acquisition module extracts local context features and global semantic features by an Attention mechanism.
In one embodiment of the present disclosure, a computer-readable storage medium is provided having stored thereon instructions that, when executed, cause a machine to perform a method as previously described.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The foregoing summary of the disclosure and the following detailed description will be better understood when read in conjunction with the accompanying drawings. It is to be noted that the drawings are merely examples of the claimed invention. In the drawings, like reference numbers indicate identical or similar elements.
FIG. 1 is a flow chart illustrating a short text entity identification method according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a short text entity recognition framework based on countermeasure learning in accordance with an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating an implementation of a short text entity recognition framework based on countermeasure learning in accordance with an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a process by which transaction remarks are processed by a short text entity recognition framework in accordance with an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a short text entity recognition system according to an embodiment of the present disclosure; and
FIG. 6 is a block diagram illustrating a short text entity recognition system applied to fuzzy entity matching in accordance with an embodiment of the present disclosure.
Detailed Description
In order to make the above objects, features and advantages of the present disclosure more comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein, and thus the present disclosure is not limited to the specific embodiments disclosed below.
Short text in this context refers to text of limited length, including but not limited to search results from a search engine, anchor text, internet chat messages, email subjects, forum comments, merchandise descriptions, picture captions, microblogs, cell phone short messages, document abstracts, transaction remarks, and the like.
Short text is characterized by feature sparsity, real-time arrival, dynamic change, and irregularity. Each piece of short text is brief, typically within 200 words, so it carries very little effective information: the sample features are very sparse while the feature set is high-dimensional, making it difficult to extract accurate, salient features for classification learning.
Information in short text form on the web or in apps is mostly updated in real time, with a very fast refresh rate (e.g., chat messages, microblog posts, comments, transaction remarks), and the volume of text is very large.
Short text is also irregular and full of internet slang, which introduces considerable noise. In Chinese, for example, "94" stands for 就是 ("exactly"), "88" for 拜拜 ("bye-bye"), and 童鞋 ("children's shoes") is used as a homophone for 同学 ("classmates"). Such slang also changes quickly, as with the once-popular 蓝瘦香菇 ("blue thin mushroom", a homophone for "feeling awful, want to cry").
Information extraction derives structured data and specific relationships from unstructured text. Recognizing person names, place names, organization names, proper nouns, and the like in text is called named entity recognition (NER, often simply "entity recognition" in the industry). Briefly, it identifies the boundaries and categories of entities in natural text.
Entity recognition has mature industrial solutions, and conventional schemes work well on long text with a normal semantic structure. They include: early rule-based or dictionary-based schemes; traditional machine learning schemes (e.g., HMM, MEMM, CRF); traditional machine learning combined with deep learning (e.g., RNN-CRF, CNN-CRF, LSTM-CRF); and more recent schemes (e.g., attention mechanisms, transfer learning, distantly supervised learning, and graph algorithms).
However, for short text these solutions perform poorly because of its feature sparsity, real-time arrival, dynamic change, and irregularity, especially for very short text such as transaction remarks, which average only about 20 words. That is, for short text whose semantics are ambiguous or even absent, it is difficult for conventional entity recognition models to produce accurate results. Moreover, the data sparsity of short text produces a long tail of new or rarely used words, which greatly reduces the effectiveness of conventional models.
Thus, there is a need in the art for schemes that enable accurate entity recognition for short text that is semantically ambiguous.
Short text entity recognition methods and systems according to various embodiments of the present disclosure will be described in detail below based on the accompanying drawings.
Fig. 1 is a flow chart illustrating a short text entity identification method 100 according to an embodiment of the present disclosure.
At 102, a short text is obtained and segmented.
Word segmentation decomposes text such as sentences, paragraphs, and articles into data structures whose units are words, to facilitate subsequent processing and analysis. It essentially converts unstructured text into structured data.
The word segmentation method differs for different languages. English has natural spaces as separators, but English words are rich in inflection, so English segmentation has processing steps that Chinese does not, such as lemmatization and stemming. Chinese has no spaces as separators, and the word granularity must be chosen according to the scenario and requirements. In general, the larger the segmentation granularity, the more accurately the meaning is expressed, but the fewer the recalls. This is not described in detail herein.
Thus, after the short text is obtained, the short text may be segmented in a suitable manner. Specifically, the word segmentation method may be selected according to the language employed by the short text and the type of the short text.
In one embodiment, the short text is a transaction remark. English transaction remarks are taken as examples in the embodiments of the present disclosure, such as "Cong Zhou A34059300LDA", "00247886Tuition Fee Name linkun Yue Student Number1336623New York State", "0002 01646674Han Xiao PA150347", "000201725428cheng YAN and Jie Zou and Fei HOU", "My Life 'gifts for Valentine's Day", "thanks from zhanglei". Their structure is generally "document number + person name", "phrase + person name", "short sentence", "long sentence", etc., and using an existing model directly produces a lot of noise.
Those skilled in the art will appreciate that the short text to which the disclosed techniques are applicable may be other types of short text in other languages (e.g., chinese, japanese, korean, french, german, spanish, etc.).
Thus, in one embodiment, the short text may first be preprocessed, including: replacement of special characters and consecutive digits; and splitting of run-together words, e.g., zhanglei -> zhang lei. Rule-based analysis may then be applied, i.e., segment splitting: the text is divided into segments according to bill numbers, certificate numbers, and other numeric information in the actual transaction. The semantics of different segments are independent, while the semantics within one segment are complete, e.g., 0002 01646674Han Xiao PA150347 -> Han Xiao.
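A rough sketch of this preprocessing follows. The `preprocess_remark` name, the "Digit" placeholder, and the exact splitting rule are illustrative assumptions, not the patent's implementation; note this simple rule keeps the "PA" certificate prefix in the segment.

```python
import re

def preprocess_remark(text: str) -> list[str]:
    """Replace runs of digits with a placeholder, then split the
    remark into segments whose semantics are independent."""
    # Consecutive digits (bill/certificate numbers) become "Digit".
    text = re.sub(r"\d+", "Digit", text)
    # Collapse any whitespace left over from the substitution.
    text = re.sub(r"\s+", " ", text).strip()
    # Split at the placeholders; each remaining segment is treated
    # as semantically self-contained.
    return [s.strip() for s in text.split("Digit") if s.strip()]

print(preprocess_remark("0002 01646674Han Xiao PA150347"))
```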
At 104, word vector training is performed on the segmented short text to generate a word vector sequence.
In one embodiment of the present disclosure, a word-embedding technique is employed to perform word vector training on the segmented short text to generate a word vector sequence. The generated word vector sequence contains semantic information and can measure the semantic similarity between words.
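The disclosure trains these vectors with word embedding (e.g., Word2Vec). As an illustrative stand-in only, the toy below builds distributional vectors from window co-occurrence counts; the `cooccurrence_vectors` helper is hypothetical and far simpler than real embedding training, but it shows how words with similar contexts end up with similar vectors.

```python
from collections import defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by its co-occurrence counts within a
    fixed window -- a crude distributional 'embedding'."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = defaultdict(lambda: [0] * len(vocab))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vecs[w][index[s[j]]] += 1
    return vocab, dict(vecs)

# Two toy segmented remarks.
sents = [["tuition", "fee", "name", "linkun", "yue"],
         ["tuition", "fee", "student", "number"]]
vocab, vecs = cooccurrence_vectors(sents)
```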
At 106, part-of-speech feature learning is performed on each word vector in the word vector sequence under an adversarial framework to obtain the part-of-speech feature of each word vector.
The word vector sequence generated at 104 by the word-embedding technique generally encodes word frequency information, but short text often contains many low-frequency words, so the dimension of the word vector sequence becomes excessive, i.e., a data sparsity problem arises.
Thus, to alleviate the data sparsity problem, part-of-speech feature learning is introduced to obtain the part-of-speech features of the word vectors. In one embodiment, a discriminator is introduced to obtain the part-of-speech feature of each word vector in the word vector sequence. Those skilled in the art will appreciate that various part-of-speech tagging techniques may be employed to acquire the part-of-speech features.
For example, words may be divided into nouns, adjectives, verbs, adverbs, prepositions, and the like; the nouns may be further divided into personal names, organizational names, place names, etc. for facilitating subsequent entity recognition. As part-of-speech features, for example, a person name may be denoted nr, an organization name may be denoted nt, and a place name may be denoted ns.
At 108, local contextual features of each word vector in the sequence of word vectors and global semantic features between each word vector are extracted.
The word vector sequence generated at 104 by the word-embedding technique may also suffer from structural ambiguity, including phrase-type ambiguity and phrase-boundary ambiguity. In addition, the part-of-speech features acquired at 106 by part-of-speech tagging may exhibit part-of-speech ambiguity. Structural and part-of-speech ambiguities can be resolved by introducing context constraints.
Thus, to introduce context constraints, it is necessary to extract local context features for each word vector in the sequence of word vectors and global semantic features between each word vector.
In this disclosure, an attention mechanism is introduced to extract the local context features and global semantic features of the word vectors. The attention mechanism strengthens the capture of long-distance dependencies. In one embodiment of the present disclosure, a self-attention mechanism is introduced so that each word carries global semantic information and its long-distance dependencies on all other words.
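The self-attention computation can be sketched in miniature as follows. This is a single, unparameterised head over toy 2-d vectors; a real layer uses learned query/key/value projections and larger dimensions.

```python
import math

def self_attention(X):
    """Each word vector attends to all word vectors in the sentence,
    so every output row mixes in global semantic information."""
    d = len(X[0])
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    out = []
    for q in X:
        scores = [dot(q, k) / math.sqrt(d) for k in X]  # scaled dot products
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        weights = [e / Z for e in exps]                 # softmax over all words
        out.append([sum(w * v[i] for w, v in zip(weights, X))
                    for i in range(d)])
    return out

# Three toy 2-d word vectors; every output depends on all three inputs.
Y = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the softmax weights sum to 1, each output vector is a convex combination of all input vectors, which is exactly the "global semantic information" each word gains.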
At 110, entities are recognized from the word vector sequence using the part-of-speech features, the local context features, and the global semantic features.
With the part-of-speech features, local context features, and global semantic features, entities can be identified from the word vector sequence.
In one embodiment, the word vectors in the word vector sequence are screened for dimension reduction using the part-of-speech features. For example, the word vectors whose part of speech is a noun may be selected. For another example, the word vectors tagged as person name nr or organization name nt may be selected. Such screening reduces the dimension of the word vector set to different degrees.
Then, entities are recognized from the word vectors obtained after the dimension-reduction screening, based on the local context features and the global semantic features. For example, to identify an organization located in New York State, or a branch of an organization headquartered in New York State, the corresponding entity can be recognized from the word vectors screened for the organization-name tag nt, based on the local context feature "New York State" and the global semantic feature "headquartered in New York State".
In another embodiment, the part-of-speech features, the local context features, and the global semantic features can be input directly into the classification model without dimension-reduction screening, and the entities meeting the conditions can be recognized.
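The dimension-reduction screening step reduces to a simple tag filter. The `screen_by_pos` helper and the (word, tag, vector) triple format are illustrative assumptions; in the actual framework the tags come from the discriminator.

```python
def screen_by_pos(tagged_vectors, keep_tags=frozenset({"nr", "nt", "ns"})):
    """Keep only word vectors whose part-of-speech tag matters for
    entity recognition: person name (nr), organization name (nt),
    place name (ns) in the tagging scheme above."""
    return [(w, t, v) for (w, t, v) in tagged_vectors if t in keep_tags]

# Hypothetical tagged vectors for a transaction remark.
tagged = [("thanks", "v", [0.1, 0.2]),
          ("zhang lei", "nr", [0.7, 0.1]),
          ("new york state", "ns", [0.5, 0.4]),
          ("from", "p", [0.0, 0.3])]
kept = screen_by_pos(tagged)
```

Only the nr/ns entries survive, so the downstream recognition step sees a much smaller candidate set.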
Fig. 2 is a diagram illustrating a short text entity recognition framework based on countermeasure learning according to an embodiment of the present disclosure. FIG. 3 is a diagram of one implementation of the short text entity identification framework.
As shown in fig. 2, the short text is preprocessed, then segmented into words, and a word vector sequence is obtained through model training. In an embodiment of the present disclosure, the word vectors are trained using word embedding (e.g., Word2Vec), which is not described in detail herein.
The problem with word vectors obtained this way is that their representation is relatively fixed and does not vary with context. Moreover, although word embedding reduces the dimension of the word vector sequence, the resulting sequence still does not directly support entity recognition, due to the high-dimensional, sparse nature of short text (especially ultra-short text).
This is because, although words with similar semantics are generally expected to lie close together in the embedding space, the word embeddings learned in short text entity recognition tasks are biased: the vectors of high-frequency words and low-frequency words lie in different subregions of the embedding space. Even when a high-frequency word and a low-frequency word are similar in meaning, they are far apart in the embedding space. In addition, short text contains many low-frequency words, which makes the trained word vector sequence ineffective.
Further, to improve the word vector representation of short text and address the data sparsity problem, the present disclosure introduces an adversarial training framework, feeding the word vector sequence into both the entity recognition model and a discriminator. Training the entity recognition model minimizes the task-specific training loss, while the discriminator performs part-of-speech tagging, which rapidly reduces the dimension, greatly alleviates the data sparsity problem, and yields word vectors with better expressive power, i.e., representations containing local context features and global semantic features.
Following the adversarial training principle, the entity recognition model and the discriminator are trained with a min-max objective: the training loss Lt of the entity recognition model is minimized, i.e., its objective function (e.g., the negative log of its probability distribution) is maximized, while the discriminator's misclassification loss Ld for distinguishing the parts of speech of the word vectors (i.e., classifying their parts of speech) is minimized.
As shown in FIG. 3, the entity recognition model adopts a Bi-LSTM + CRF structure, with a self-attention layer added between the Bi-LSTM layer and the CRF layer. For each word, in addition to the local context information of the current word, global semantic information, i.e., the influence of all other words on the current word, must be considered. These two features are then combined as the input to the CRF layer. Specifically, the hidden representation output by the Bi-LSTM layer is fed into the self-attention layer, and the Bi-LSTM representation and the self-attention representation are concatenated to obtain the feature representation. In this way, each word not only keeps its learned local feature information but also selects meaningful semantic information from the whole sentence, or even the whole short text, improving the model's effectiveness. The output of the concat layer is finally input to the CRF layer, which outputs the entity recognition result.
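The concat step above, joining each token's Bi-LSTM hidden state (local context) with its self-attention output (global semantics) before the CRF layer, reduces to per-token vector concatenation. A minimal sketch with toy 2-d states; real hidden sizes differ, and the `concat_features` helper is an assumption for illustration.

```python
def concat_features(bilstm_out, attn_out):
    """Per-token concat: each CRF input row is the token's Bi-LSTM
    hidden state followed by its self-attention representation."""
    return [h + a for h, a in zip(bilstm_out, attn_out)]

# Two tokens: 2-d Bi-LSTM states joined with 2-d attention outputs.
feats = concat_features([[0.1, 0.2], [0.3, 0.4]],
                        [[0.5, 0.6], [0.7, 0.8]])
```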
The discriminator in fig. 3 classifies parts of speech, such as PRP (personal pronoun), IN (preposition or subordinating conjunction), and NNP (singular proper noun) in English, or nr (person name), nt (organization name), and ns (place name) in Chinese. The implementation of the short text entity recognition framework shown in fig. 3 adopts an adversarial training framework that aims to minimize the training loss Lt of the entity recognition model and the misclassification loss Ld of the discriminator.
The adversarial short text entity recognition framework of figs. 2 and 3 essentially reduces the dimension of the word vector sequence through the extracted part-of-speech features, and then gives the word vectors a better, more accurate representation through the extracted local context features and global semantic features, thereby improving the accuracy of entity recognition.
In the embodiment in which an organization located in New York State, or a branch headquartered in New York State, is to be identified, an accurately represented entity can be recognized from the word vectors obtained after dimension-reduction screening for the organization-name tag nt, based on the local context feature "New York State" and the global semantic feature "headquartered in New York State".
Fig. 4 is a diagram illustrating a process by which transaction remarks are processed by a short text entity recognition framework according to an embodiment of the present disclosure.
Fig. 4 is directed to a fuzzy entity matching application after entity recognition. In short text of the transaction-remark type, information such as addresses and company names is generally filled in and uploaded manually, so it is often nonstandard, incomplete, or even erroneous, which causes large deviations when risk scanning is performed. Thus, fuzzy entity matching is required.
Take the transaction remark "00247886Tuition Fee Name linkun Yue Student Number1336623New York State" as an example. The short text is first preprocessed, e.g., consecutive-digit substitution, i.e., "Digit" replaces "00247886" and "1336623". The preprocessed short text is then input into the adversarial entity recognition framework described above, which identifies the entity "per: linkun Yue".
The resulting entity name is then subjected to fuzzy entity matching. First, the input name is segmented, and each word/character is index-matched using string-matching algorithms; several basic matching algorithms are adopted, including prefix-tree matching, dictionary-tree (trie) matching, string-similarity matching (the SimString algorithm), and pronunciation-similarity matching, to obtain the top-k candidates for the name to be checked. Second, similarity is computed from fuzzy features such as nicknames, name variants, and cross-language forms, and the candidates are ranked and output to obtain the final fuzzy entity match.
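One of the string-similarity signals in this matching stage can be sketched as follows. The `name_similarity` and `top_k` helpers are hypothetical: they average an edit-based ratio with character-bigram overlap (a SimString-style n-gram measure); the actual system also uses prefix-tree/trie indexing and pronunciation similarity, which are not shown.

```python
from difflib import SequenceMatcher

def ngrams(s, n=2):
    """Character bigrams of a name, case- and space-insensitive."""
    s = s.lower().replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def name_similarity(a, b):
    """Average of an edit-based ratio and bigram Jaccard overlap."""
    edit = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    ga, gb = ngrams(a), ngrams(b)
    jacc = len(ga & gb) / len(ga | gb) if ga | gb else 0.0
    return (edit + jacc) / 2

def top_k(query, candidates, k=3):
    """Rank candidate names by similarity to the query name."""
    return sorted(candidates, key=lambda c: name_similarity(query, c),
                  reverse=True)[:k]

ranked = top_k("linkun yue", ["Linkun Yue", "Lin Yue", "Jie Zou"])
```

An exact match (up to case and spacing) scores 1.0, near-variants score lower, and unrelated names rank last, which is the behaviour the top-k candidate stage needs.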
According to the technical scheme, risk entities can be accurately identified in the transaction-remark entity scanning scenario; compared with direct keyword-matching scanning, the false hit rate is reduced by half. Those skilled in the art will appreciate that accurate entity recognition is not limited to person names, organization names, and place names, but may also be applied to products, drugs/diseases, materials, songs/movies/short videos, etc. Nor is the application limited to risk identification; it extends to broader scenarios such as information extraction, information retrieval, machine translation, and question-answering systems.
Fig. 5 is a block diagram illustrating a short text entity identification system 500 according to an embodiment of the present disclosure.
The short text entity recognition system 500 includes a word segmentation module 502, a word vector generation module 504, a feature acquisition module 506, and an entity recognition module 508.
The word segmentation module 502 obtains a short text and segments the short text. After the short text is obtained, the short text may be segmented in a suitable manner. Specifically, the word segmentation method may be selected according to the language employed by the short text and the type of the short text.
The word vector generation module 504 performs word vector training on the segmented short text to generate a word vector sequence. In one embodiment of the present disclosure, word-embedding (word-embedding) techniques are employed to perform word vector training on short segmented text to generate a word vector sequence. The generated word vector sequence contains semantic information and can measure semantic similarity between words.
Due to the high-dimensional and sparse nature of short text (especially, hypertext), word-embedded trained word vector sequences are not yet directly identifiable for entities. Feature acquisition module 506 is also required to perform part-of-speech feature learning on each word vector in the sequence of word vectors based on the antagonism framework to acquire part-of-speech features of the word vector, and extract local context features of each word vector in the sequence of word vectors and global semantic features between each word vector.
To improve word vector representations in short text and address the data sparsity problem, in one embodiment of the present disclosure the feature acquisition module 506 introduces an adversarial training (Adversarial Training) framework that feeds the word vector sequence into both an entity recognition model and a discriminator. Training within the entity recognition model minimizes the task-specific training loss, while the discriminator is used for part-of-speech tagging; this quickly reduces dimensionality, greatly alleviates the data sparsity problem, and yields word vectors with better expressive power, i.e., representations that contain both local context features and global semantic features.
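A minimal numeric sketch of the minimax idea, under stated assumptions: a shared linear encoder feeds both a task head (whose loss is minimized) and a POS-discriminator head whose gradient is reversed before reaching the encoder. All shapes, data, and hyperparameters here are hypothetical; the patent's actual model pairs a Bi-LSTM+CRF with the discriminator.

```python
import numpy as np

# Hypothetical toy shapes: 4-dim input, 3-dim shared representation.
W_enc = np.full((4, 3), 0.1)                  # shared encoder
w_task = np.array([0.1, -0.1, 0.1])           # entity-task head
w_disc = np.array([0.05, 0.05, -0.05])        # POS-discriminator head

def train_step(x, y_task, y_pos, lr=0.05, lam=0.1):
    """One adversarial step: each head descends on its own squared error;
    the encoder descends on the task gradient but ASCENDS on the
    discriminator gradient (gradient reversal)."""
    global W_enc, w_task, w_disc
    h = x @ W_enc                               # shared representation
    e_t = h @ w_task - y_task                   # task error
    e_d = h @ w_disc - y_pos                    # discriminator error
    g_enc_task = np.outer(x, 2 * e_t * w_task)  # dL_task / dW_enc
    g_enc_disc = np.outer(x, 2 * e_d * w_disc)  # dL_disc / dW_enc
    W_enc -= lr * (g_enc_task - lam * g_enc_disc)  # reversed disc gradient
    w_task -= lr * 2 * e_t * h
    w_disc -= lr * 2 * e_d * h
    return float(e_t ** 2), float(e_d ** 2)

x = np.array([1.0, 0.5, -0.5, 1.0])
first_task_loss, _ = train_step(x, y_task=1.0, y_pos=0.0)
for _ in range(49):
    last_task_loss, _ = train_step(x, y_task=1.0, y_pos=0.0)
```

The sign flip on `g_enc_disc` is the "very small and very large" (minimax) training: the discriminator improves at POS classification while the encoder is pushed toward representations the discriminator finds useful only through the reversed, scaled term.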
The entity identification module 508 identifies entities from the respective word vectors using the word features, the local context features, and the global semantic features.
In one embodiment, the part-of-speech features are first used to perform dimension-reduction screening on the word vectors in the word vector sequence. Entities are then identified from the screened word vectors based on the local context features and the global semantic features.
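As an illustrative sketch of the dimension-reduction screening step, with hypothetical POS tags and the assumed rule that only noun-class words can carry entities:

```python
# Assumption: only noun-class words are entity candidates; the actual
# screening rule in the patent is learned from the discriminator's tags.
KEEP_POS = {"NN", "NNP"}  # common noun, proper noun

def screen_by_pos(tagged_word_vectors):
    """Drop word vectors whose part of speech cannot carry an entity,
    shrinking the sequence before recognition."""
    return [(word, vec) for word, pos, vec in tagged_word_vectors
            if pos in KEEP_POS]

# Hypothetical tagged word vectors (word, POS tag, vector).
tagged = [
    ("acme",    "NNP", [0.1, 0.2]),
    ("sells",   "VB",  [0.3, 0.1]),
    ("widgets", "NN",  [0.2, 0.4]),
]
candidates = screen_by_pos(tagged)  # only noun-class vectors remain
```

The recognizer then only scores the surviving candidates against the local context and global semantic features, which is what makes the screening a dimensionality (sequence-length) reduction.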
Thus, in this embodiment, the entity recognition module 508 reduces the dimensionality of the word vector sequence using the extracted part-of-speech features, and then gives the word vectors a better and more accurate representation through the extracted local context features and global semantic features, thereby improving the accuracy of entity recognition.
In another embodiment, the part-of-speech features, local context features, and global semantic features can be input directly into a classification model, without dimension-reduction screening, to identify the entities that satisfy the conditions.
Fig. 6 is a block diagram illustrating a short text entity recognition system 600 applied to fuzzy entity matching in accordance with an embodiment of the present disclosure.
In addition to the word segmentation module 502, the word vector generation module 504, the feature acquisition module 506, and the entity recognition module 508, the short text entity recognition system 600 for fuzzy entity matching further includes an entity matching module 610, which operates after entity recognition.
Entities identified by the short text entity recognition system 500 often contain missing, ambiguous, or even erroneous information for a variety of reasons. The entity matching module 610 therefore first segments the input name and index-matches each word/character using string matching algorithms, employing several basic matching algorithms including prefix-tree matching, trie (dictionary-tree) matching, string-similarity matching (the SimString algorithm), and pronunciation-similarity matching, to obtain the top-k candidates for the input name.
Next, the entity matching module 610 computes similarities from nickname, name, and cross-language fuzzy features, then ranks and outputs them to obtain the final fuzzy-entity matching result.
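Of the basic matching algorithms listed, string-similarity matching is the simplest to sketch. The candidate index and names below are hypothetical; a real system combines this with prefix-tree, trie, and pronunciation matching before the final similarity ranking:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def top_k_matches(name, candidates, k=2):
    """Rank candidate entity names by string similarity
    (1 - normalized edit distance) and keep the top k."""
    scored = [(1 - edit_distance(name, c) / max(len(name), len(c)), c)
              for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:k]]

# Hypothetical candidate index of canonical entity names.
index = ["acme corp", "acme corporation", "apex ltd", "zenith inc"]
best = top_k_matches("acme co", index, k=2)
```

The top-k candidates produced this way would then be re-scored with the nickname and cross-language fuzzy features described above before the final ranked output.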
The various steps and modules of the short text entity recognition method and system described above may be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with the invention may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic component, a hardware component, or any combination thereof. A general-purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps and modules described in connection with the invention may be stored on or transmitted as one or more instructions or code on a computer-readable medium. Software modules implementing various operations of the invention may reside in storage media such as RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, cloud storage, and the like. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium, and execute the corresponding program modules to implement the steps of the present invention. Moreover, software-based embodiments may be uploaded, downloaded, or accessed remotely via suitable communication means, including, for example, the internet, the world wide web, an intranet, software applications, cable (including fiber-optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
It is also noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Additionally, the order of the operations may be rearranged.
The disclosed methods, apparatus, and systems should not be limited in any way. Rather, the invention encompasses all novel and non-obvious features and aspects of the various disclosed embodiments (both alone and in various combinations and subcombinations with one another). The disclosed methods, apparatus and systems are not limited to any specific aspect or feature or combination thereof, nor do any of the disclosed embodiments require that any one or more specific advantages be present or that certain or all technical problems be solved.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many modifications may be made by those of ordinary skill in the art without departing from the spirit of the present invention and the scope of the appended claims, and such modifications fall within the scope of the present invention.
Claims (9)
1. A short text entity identification method, comprising:
acquiring a short text and segmenting the short text;
word vector training is carried out on the short text subjected to word segmentation to generate a word vector sequence;
performing part-of-speech feature learning on each word vector in the word vector sequence based on an antagonism framework to obtain part-of-speech features of the word vector;
extracting local context features of each word vector in the word vector sequence and global semantic features among each word vector;
performing dimension reduction screening on word vectors in the word vector sequence by using the part-of-speech features;
identifying an entity from the word vectors obtained after the dimension reduction screening based on the local context features and the global semantic features, and
fuzzy entity matching is performed on the identified entities to obtain entity identification results,
wherein the entity recognition model and the discriminator used by the short text entity recognition method are trained with minimization and maximization (minimax) objectives, wherein the entity recognition model adopts a Bi-LSTM+CRF structure and the discriminator is used for part-of-speech classification.
2. The method of claim 1, wherein the manner in which the short text is segmented is selected based on the language and type of the short text.
3. The method of claim 1, wherein identifying entities from the respective word vectors using the part-of-speech features, the local context features, and the global semantic features is implemented on an adversarial training framework.
4. The method of claim 1, extracting the local context features and the global semantic features is implemented by an Attention mechanism.
5. A short text entity recognition system, comprising:
the word segmentation module is used for obtaining short texts and segmenting the short texts;
the word vector generation module is used for carrying out word vector training on the short text subjected to word segmentation so as to generate a word vector sequence;
the feature acquisition module is used for performing part-of-speech feature learning on each word vector in the word vector sequence based on an antagonism framework to acquire part-of-speech features of the word vectors, and extracting local context features of each word vector in the word vector sequence and global semantic features among each word vector; and
the entity identification module uses the part-of-speech features to perform dimension reduction screening on word vectors in the word vector sequence; identifying an entity from word vectors obtained after dimension reduction screening based on the local context features and the global semantic features; and
a fuzzy entity matching module, which performs fuzzy entity matching on the identified entity to obtain an entity identification result,
wherein the entity recognition model and the discriminator used by the short text entity recognition system are trained with minimization and maximization (minimax) objectives, wherein the entity recognition model adopts a Bi-LSTM+CRF structure and the discriminator is used for part-of-speech classification.
6. The system of claim 5, wherein the word segmentation module is configured to segment the short text in a manner selected based on a language and a type of the short text.
7. The system of claim 5, wherein the entity recognition module identifying entities from the respective word vectors using the part-of-speech features, the local context features, and the global semantic features is implemented on an adversarial training framework.
8. The system of claim 5, wherein the feature extraction module extracts the local context features and the global semantic features by an Attention mechanism.
9. A computer readable storage medium storing instructions that, when executed, cause a machine to perform the method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011398845.2A CN112528653B (en) | 2020-12-02 | 2020-12-02 | Short text entity recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112528653A CN112528653A (en) | 2021-03-19 |
CN112528653B true CN112528653B (en) | 2023-11-28 |
Family
ID=74997287
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8452795B1 (en) * | 2010-01-15 | 2013-05-28 | Google Inc. | Generating query suggestions using class-instance relationships |
CN106598952A (en) * | 2016-12-23 | 2017-04-26 | 大连理工大学 | System for detecting Chinese fuzzy constraint information scope based on convolutional neural network |
CN106844346A (en) * | 2017-02-09 | 2017-06-13 | 北京红马传媒文化发展有限公司 | Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec |
WO2018028077A1 (en) * | 2016-08-11 | 2018-02-15 | 中兴通讯股份有限公司 | Deep learning based method and device for chinese semantics analysis |
CN107977361A (en) * | 2017-12-06 | 2018-05-01 | 哈尔滨工业大学深圳研究生院 | The Chinese clinical treatment entity recognition method represented based on deep semantic information |
CN110442684A (en) * | 2019-08-14 | 2019-11-12 | 山东大学 | A kind of class case recommended method based on content of text |
WO2019229769A1 (en) * | 2018-05-28 | 2019-12-05 | Thottapilly Sanjeev | An auto-disambiguation bot engine for dynamic corpus selection per query |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180232443A1 (en) * | 2017-02-16 | 2018-08-16 | Globality, Inc. | Intelligent matching system with ontology-aided relation extraction |
Non-Patent Citations (4)
Title |
---|
Chinese Entity Recognition with a BERT-BiLSTM-CRF Model; Xie Teng et al.; Computer Systems & Applications; pp. 48-55 *
Named Entity Recognition From Biomedical Texts Using a Fusion Attention-Based BiLSTM-CRF; Wei, H. et al.; IEEE ACCESS; full text *
Selection and Optimization of Column-Store Compression Strategies Based on HBase; Sun Jingchao; Lu Tianliang; Application Research of Computers (No. 5); full text *
A Feature-Fusion Sequence Labeling Model Based on an Attention Mechanism; Wang Xuqiang; Yue Shunmin; Zhang Yahang; Liu Jie; Wang Yang; Yang Qing; Journal of Shandong University of Science and Technology (Natural Science Edition) (No. 5); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40048347; Country of ref document: HK |
GR01 | Patent grant | ||