CN111581970B - Text recognition method, device and storage medium for network context - Google Patents
Text recognition method, device and storage medium for network context
- Publication number: CN111581970B
- Application number: CN202010396183.9A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- window
- short
- vector
- Prior art date: 2020-05-12
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking (natural language analysis)
- G06F40/126: Character encoding (use of codes for handling textual entities)
- G06F40/216: Parsing using statistical methods
- G06F40/30: Semantic analysis
- G06N3/045: Combinations of networks (neural network architecture)
- G06N3/047: Probabilistic or stochastic networks
Abstract
The invention provides a text recognition method, a text recognition device, and a storage medium for a network context. The method comprises the following steps: constructing a style semantic model based on a long text window and a radical-level semantic model based on a short text window; training both models on a corpus of the network context to obtain a Chinese word vector model of the network context; and recognizing the input text of the network context using the Chinese word vector model and outputting a recognition result. The invention uses two different windows during word segmentation: the long window extracts semantic information of the networked style, while the short window extracts semantic features at different granularities. Combining the two in the training stage yields a more accurate word vector representation and thus improves the text recognition rate for the network context.
Description
Technical Field
The invention relates to the technical field of text data processing, and in particular to a text recognition method, device, and storage medium for a network context.
Background
Text vectorization has long been an important research direction in computer science and artificial intelligence, and remains a major challenge in natural language processing. The quality of a text's vector representation directly affects the performance of downstream natural language analysis models. Text was first vectorized with one-hot encoding and later with the Bag-of-Words model. These representations are simple and clear and solve the basic problem of representing text in a computer, but they ignore the semantic correlation between a word and its context as well as the temporal structure of language, splitting apart coherent semantic information, and they suffer from severe vector sparsity and the curse of dimensionality. With the development of deep neural networks, researchers proposed distributed word vector representations, which map each word to a low-dimensional real-valued space; this reduces dimensionality while mining the relevance among words, so the word vectors carry richer semantic information. Mikolov et al. proposed the classical distributed word vector model Word2Vec, which includes two models, CBOW (Continuous Bag of Words) and Skip-gram: the former predicts a word from its context, while the latter predicts the context from a word. Word2Vec learns dense word vectors from co-occurrence information between words and their contexts; the model is simple and practical and is widely used in natural language processing tasks.
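For orientation, the following minimal sketch trains both Word2Vec variants with the gensim library; the toy corpus and the parameter values are illustrative assumptions, not taken from the patent:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of already-segmented tokens.
sentences = [
    ["the", "robot", "cleans", "the", "house"],
    ["the", "robot", "answers", "questions"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 selects Skip-gram (predict the context from a word).
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

print(cbow.wv["robot"][:5])  # first five dimensions of a learned word vector
```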
The main drawbacks of the prior art are as follows:
Social media such as microblogs and forum posts differ from the formal, official register of news text: social text is usually colloquial and networked in style, for example, "a home held by an intelligent robot is really a wrong one". In a social language context, a word may be given a new meaning, or an entirely new network expression may be coined. Word2Vec word vectors trained on a standard corpus (such as an encyclopedia or dictionary) cannot accurately express the network meanings of such words, which strongly degrades analysis tasks on social text.
The CBOW model was designed for English, but it is also applicable to the representation of Chinese word vectors. However, compared with English, the semantic composition of Chinese is more complex: Chinese words are formed from Chinese characters, and the semantics of a character are generally related to the meanings of its constituent radicals. If a CBOW model is used directly to learn Chinese word vectors, the latent semantic information of the characters is ignored, the resulting word vector model generalizes poorly, text recognition is not necessarily effective, and the model converges slowly during training. A new Chinese word vector model is therefore urgently needed to address one or more of these technical defects.
Disclosure of Invention
The present invention proposes the following technical solutions to address one or more technical defects in the prior art.
A method of text recognition of a network context, the method comprising:
a modeling step, constructing a style semantic model based on a long text window and a radical-level semantic model based on a short text window;
a training step, training on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context;
and a recognition step, recognizing the input text of the network context using the Chinese word vector model of the network context and outputting a recognition result.
Furthermore, the corpus sequence obtained by segmenting any corpus $s$ in the corpus is $s = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_N\}$, where $w_t$ is the $t$-th word in the segmented sequence and is taken as the target word to be predicted, $t = 1, \ldots, N$, and $N$ is the total number of words in the corpus sequence. A text window is constructed centered on the target word $w_t$, and the short text window is defined as:

$$\mathrm{window}_s = \{\, w_{t \pm d_s} \mid 1 \le d_s \le \theta \,\}$$

where $d_s$ is the distance from a word in the short text window to the target word $w_t$, $\theta$ is the distance threshold of the short text window, and $\mathrm{window}_s$ is the set of words formed by the context adjacent to the target word $w_t$.

The long text window is defined as

$$\mathrm{window}_l = \{\, w_{t \pm d_l} \mid \theta + 1 \le d_l \le \beta \,\}$$

where $d_l$ is the distance from a word in the long text window to the target word $w_t$, with minimum value $\theta + 1$ and maximum value $\beta$, $\beta \le N$; $\mathrm{window}_l$ consists of the context farther from the target word $w_t$ and does not include the content of the short text window.
Furthermore, the style semantic model is constructed from the long text window as follows: the long text window $\mathrm{window}_l$ is taken as the input of CBOW, and the hidden-layer vector is computed as

$$h_1 = \frac{1}{|\mathrm{window}_l|} \sum_{\theta < |j| \le \beta} v_{w_{t+j}}$$

where $v_{w_{t+j}}$ is the code vector corresponding to the context word $w_{t+j}$ of the target word within the long text window, $\beta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$, and the total length of the long text window is $2\beta$.
Furthermore, the radical-level semantic model is constructed from the short text window as follows:

the radical $r$ of each character is converted, via a character escape dictionary, into a Chinese character $r^*$ with the corresponding semantics, yielding the sequence $x$ of short-text characters and escaped radicals;

a self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, where the self-attention weight $\alpha$ is computed as:

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence of short-text characters and escaped radicals corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form;

the coding vector of each word in the short text window is:

$$v_x = \sum_i \alpha_i v_i$$

where $v_i$ is the coding vector of the $i$-th element of the sequence corresponding to word $x$ in the short text window;

the attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$

where $\theta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$ in the short text window, the total length of the short text window is $2\theta$, and $v_{x_{t+j}}$ is the coding vector corresponding to the context of the $t$-th target word within the short text window.
Further, the training step operates as follows:

an $m$-dimensional vector $v_w$ is randomly generated for each word obtained by segmenting the corpus, and the log-likelihood function of the corpus sequence $s$ is computed:

$$L(s) = \sum_{t=1}^{N} L(w_t)$$

where $L(w_t) = \log p(w_t \mid \mathrm{window}_l, \mathrm{window}_s)$ is the log-likelihood of the target word $w_t$ conditioned on its context,

and the conditional probability of the target word $w_t$ given the corresponding context can be computed by the softmax function:

$$p(w_t \mid \mathrm{window}_l, \mathrm{window}_s) = \frac{\exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_w\right)}$$

where $h_k^{\top}$ denotes the transpose of the $k$-th hidden-layer vector, $k = 1, 2$; $h_1$ is the output vector of the style semantic model's hidden layer, $h_2$ is the output vector of the radical-level semantic model's hidden layer, $v_{w_t}$ is the word vector of the target word, and $v_w$ is the word vector of a candidate word in the vocabulary $V$. Training optimizes the objective function $L(s)$ and updates the model parameters to obtain the final Chinese word vector model $v_w \in \mathbb{R}^m$.
The invention also proposes a device for text recognition of a network context, comprising:
the modeling unit, which constructs a style semantic model based on a long text window and a radical-level semantic model based on a short text window;
the training unit, which trains on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context;
and the recognition unit, which recognizes the input text of the network context using the Chinese word vector model of the network context and outputs a recognition result.
Furthermore, the corpus sequence obtained by segmenting any corpus $s$ in the corpus is $s = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_N\}$, where $w_t$ is the $t$-th word in the segmented sequence and is taken as the target word to be predicted, $t = 1, \ldots, N$, and $N$ is the total number of words in the corpus sequence. A text window is constructed centered on the target word $w_t$, and the short text window is defined as:

$$\mathrm{window}_s = \{\, w_{t \pm d_s} \mid 1 \le d_s \le \theta \,\}$$

where $d_s$ is the distance from a word in the short text window to the target word $w_t$, $\theta$ is the distance threshold of the short text window, and $\mathrm{window}_s$ is the set of words formed by the context adjacent to the target word $w_t$.

The long text window is defined as

$$\mathrm{window}_l = \{\, w_{t \pm d_l} \mid \theta + 1 \le d_l \le \beta \,\}$$

where $d_l$ is the distance from a word in the long text window to the target word $w_t$, with minimum value $\theta + 1$ and maximum value $\beta$, $\beta \le N$; $\mathrm{window}_l$ consists of the context farther from the target word $w_t$ and does not include the content of the short text window.
Furthermore, the style semantic model is constructed from the long text window as follows: the long text window $\mathrm{window}_l$ is taken as the input of CBOW, and the hidden-layer vector is computed as

$$h_1 = \frac{1}{|\mathrm{window}_l|} \sum_{\theta < |j| \le \beta} v_{w_{t+j}}$$

where $v_{w_{t+j}}$ is the code vector corresponding to the context word $w_{t+j}$ of the target word within the long text window, $\beta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$, and the total length of the long text window is $2\beta$.
Furthermore, the radical-level semantic model is constructed from the short text window as follows:

the radical $r$ of each character is converted, via a character escape dictionary, into a Chinese character $r^*$ with the corresponding semantics, yielding the sequence $x$ of short-text characters and escaped radicals;

a self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, where the self-attention weight $\alpha$ is computed as:

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence of short-text characters and escaped radicals corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form;

the coding vector of each word in the short text window is:

$$v_x = \sum_i \alpha_i v_i$$

where $v_i$ is the coding vector of the $i$-th element of the sequence corresponding to word $x$ in the short text window;

the attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$

where $\theta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$ in the short text window, the total length of the short text window is $2\theta$, and $v_{x_{t+j}}$ is the coding vector corresponding to the context of the $t$-th target word within the short text window.
Further, the training unit performs the following operations:

an $m$-dimensional vector $v_w$ is randomly generated for each word obtained by segmenting the corpus, and the log-likelihood function of the corpus sequence $s$ is computed:

$$L(s) = \sum_{t=1}^{N} L(w_t)$$

where $L(w_t) = \log p(w_t \mid \mathrm{window}_l, \mathrm{window}_s)$ is the log-likelihood of the target word $w_t$ conditioned on its context,

and the conditional probability of the target word $w_t$ given the corresponding context can be computed by the softmax function:

$$p(w_t \mid \mathrm{window}_l, \mathrm{window}_s) = \frac{\exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_w\right)}$$

where $h_k^{\top}$ denotes the transpose of the $k$-th hidden-layer vector, $k = 1, 2$; $h_1$ is the output vector of the style semantic model's hidden layer, $h_2$ is the output vector of the radical-level semantic model's hidden layer, $v_{w_t}$ is the word vector of the target word, and $v_w$ is the word vector of a candidate word in the vocabulary $V$. Training optimizes the objective function $L(s)$ and updates the model parameters to obtain the final Chinese word vector model $v_w \in \mathbb{R}^m$.

The invention also proposes a computer-readable storage medium having stored thereon computer program code which, when executed by a computer, performs any of the methods described above.
The technical effects of the invention are as follows. The text recognition method for a network context comprises: a modeling step, constructing a style semantic model based on a long text window and a radical-level semantic model based on a short text window; a training step, training on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context; and a recognition step, recognizing the input text of the network context using this Chinese word vector model and outputting a recognition result. The invention addresses the accuracy of text recognition in a network context. It creatively uses two different windows during word segmentation, a long text window and a short text window ("long" and "short" being relative to each other): the long window extracts semantic information of the networked style, while the short window extracts semantic features at different granularities. Combining the two in the training stage yields a more accurate word vector representation and thus a higher text recognition rate for the network context. Moreover, because learning the text style requires the model to attend to the stylistic tendency of the corpus over a wide range rather than only the context near the target word, the long text window is adopted.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text recognition method for a network context according to an embodiment of the present invention.
Fig. 2 is a block diagram of a network context text recognition apparatus according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows a text recognition method for a network context according to the present invention, which comprises:

a modeling step S101, constructing a style semantic model based on a long text window and a radical-level semantic model based on a short text window;

a training step S102, training on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context;

and a recognition step S103, recognizing the input text of the network context using the Chinese word vector model of the network context and outputting a recognition result.
The invention addresses the accuracy of text recognition in a network context. It creatively uses two different windows during word segmentation, a long text window and a short text window ("long" and "short" being relative to each other): the long window extracts semantic information of the networked style, while the short window extracts semantic features at different granularities. Combining the two in the training stage yields a more accurate word vector representation and thus a higher text recognition rate for the network context, which is one of the important inventive points of the invention.
In one embodiment, the corpus sequence obtained by segmenting any corpus $s$ in the corpus (the corpus comprises at least one such sequence) is $s = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_N\}$, where $w_t$ is the $t$-th word in the segmented sequence and is taken as the target word to be predicted, $t = 1, \ldots, N$, and $N$ is the total number of words in the corpus sequence. A text window is constructed centered on the target word $w_t$, and the short text window is defined as:

$$\mathrm{window}_s = \{\, w_{t \pm d_s} \mid 1 \le d_s \le \theta \,\}$$

where $d_s$ is the distance from a word in the short text window to the target word $w_t$, and $\theta$ is the distance threshold of the short text window (experimental values lie between 1 and 3); $\mathrm{window}_s$ is the set of words formed by the context adjacent to the target word $w_t$.

The long text window is defined as

$$\mathrm{window}_l = \{\, w_{t \pm d_l} \mid \theta + 1 \le d_l \le \beta \,\}$$

where $d_l$ is the distance from a word in the long text window to the target word $w_t$, with minimum value $\theta + 1$ and maximum value $\beta$ (empirical values in experiments lie between 4 and 7), $\beta \le N$; $\mathrm{window}_l$ consists of the context farther from the target word $w_t$ and does not include the content of the short text window.
For example, suppose $\theta = 2$ and $\beta = 6$. For the corpus "today I go to the park together with friends to view the beautiful cherry blossom scene", if the target word $w_t$ is "park", the short text window contains the words "together", "go to", "view", and "cherry blossom", while the long text window contains the more distant words "today", "I", "and", "friend", "beautiful", and "scene".
In the training process, every word in the corpus serves in turn as the target word to be predicted; the long and short text windows slide with a step of 1, traversing each word of each corpus sequence in the corpus.
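The window split described above can be sketched as follows; the function and token list are illustrative assumptions, with the tokens mirroring the cherry-blossom example:

```python
def split_windows(tokens, t, theta, beta):
    """Return (short_window, long_window) around the target index t.

    Short window: words at distance 1..theta from the target word.
    Long window:  words at distance theta+1..beta; by construction it
                  excludes everything already in the short window.
    """
    short_w, long_w = [], []
    for j, word in enumerate(tokens):
        d = abs(j - t)
        if 1 <= d <= theta:
            short_w.append(word)
        elif theta < d <= beta:
            long_w.append(word)
    return short_w, long_w

tokens = ["today", "I", "and", "friend", "together", "go to",
          "park", "view", "cherry blossom", "beautiful", "scene"]
short_w, long_w = split_windows(tokens, tokens.index("park"), theta=2, beta=6)
# short_w -> ['together', 'go to', 'view', 'cherry blossom']
# long_w  -> ['today', 'I', 'and', 'friend', 'beautiful', 'scene']
```

In training, this split is repeated with the target index sliding over every position in the sequence.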
In one embodiment, the style semantic model is constructed from the long text window as follows: the long text window $\mathrm{window}_l$ is taken as the input of CBOW, and the hidden-layer vector is computed as

$$h_1 = \frac{1}{|\mathrm{window}_l|} \sum_{\theta < |j| \le \beta} v_{w_{t+j}}$$

where $v_{w_{t+j}}$ is the code vector corresponding to the context word $w_{t+j}$ of the target word within the long text window, $\beta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$, and the total length of the long text window is $2\beta$.
In order to obtain Chinese word vectors suited to the social-context style, the CBOW model is improved: the style semantic model is based on CBOW but enlarges the context window and ignores the text content near the target word. By weakening nearby context semantics, it improves the model's understanding of the overall text style and generates Chinese word vectors suited to the network context. The invention adopts the long text window so that learning the text style attends to the stylistic tendency of the corpus over a wide range rather than focusing on the context near the target word, which is another important inventive point of the invention.
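A sketch of the hidden-layer computation, treating $h_1$ as the mean of the code vectors of the long-window words (a plain CBOW-style average; the embedding table here is a random placeholder standing in for the learned code vectors):

```python
import numpy as np

def style_hidden_vector(long_window, embed):
    """h1: mean of the code vectors of the words in the long text window."""
    vecs = [embed[w] for w in long_window]  # code vectors v_{w_{t+j}}
    return np.mean(vecs, axis=0)            # hidden-layer vector h1

rng = np.random.default_rng(0)
words = ["today", "I", "and", "friend", "beautiful", "scene"]
embed = {w: rng.normal(size=4) for w in words}  # toy m = 4 dimensions
h1 = style_hidden_vector(words, embed)
```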
In one embodiment, in order to extract multi-level semantic features, the invention introduces radicals to enhance word vector semantic information. The radical-level semantic model takes the short text window $\mathrm{window}_s$, performs weighted fusion of the semantic information of the Chinese characters and radicals within the text window using a self-attention mechanism, and computes word vectors with the CBOW model. The radical-level semantic model is constructed from the short text window as follows:
the radical $r$ of each character is converted, via a character escape dictionary, into a Chinese character $r^*$ with the corresponding semantics, yielding the sequence $x$ of short-text characters and escaped radicals;

a self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, where the self-attention weight $\alpha$ is computed as:

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence of short-text characters and escaped radicals corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form;

the coding vector of each word in the short text window is:

$$v_x = \sum_i \alpha_i v_i$$

where $v_i$ is the coding vector of the $i$-th element of the sequence corresponding to word $x$ in the short text window;

the attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$

where $\theta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$ in the short text window, the total length of the short text window is $2\theta$, and $v_{x_{t+j}}$ is the coding vector corresponding to the context of the $t$-th target word within the short text window.
The radical-level semantic model is illustrated below with a short text window distance threshold $\theta = 1$.

(1) The words $w_{t-1}$ and $w_{t+1}$ in the short text window are split into Chinese characters, giving the short-text character sequence $c = \{c_{t-1}, c_{t+1}\}$. For example, the word "病情" (illness) is split into the two characters "病" (disease) and "情" (condition).

(2) The radical of each Chinese character in the short-text character sequence $c$ is extracted, giving $r_{t-1}$ and $r_{t+1}$. For example, the radicals corresponding to "病" and "情" are "疒" and "忄".

(3) To make the semantic information contained in a radical explicit, the radical $r$ is converted via the character escape dictionary into a Chinese character $r^*$ with the corresponding semantics. For example, "疒" and "忄" escape to characters meaning "disease" and "heart". The character escape dictionary is shown in Table 1 below.

Table 1. Radical character escape table
(4) The short-text characters and the escaped radicals are combined to form the sequence $x$.

(5) A self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, mining the latent semantics of the radicals and enriching the semantic information of the character sequence. The self-attention weight $\alpha$ is computed as

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form.

The coding vector of each word in the short text window is $v_x = \sum_i \alpha_i v_i$.

The attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$

where $\theta$ is the distance threshold between a word in the short text window and the target word.
Among the components that form a character, the radical usually carries clear semantic information that helps in understanding the character's meaning. The invention therefore introduces radicals to enhance word vector semantic information, extracting multi-level semantic features and improving the accuracy of text classification, which is another important inventive point of the invention.
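A sketch of the radical-level fusion for a single word: split the word into characters, append the escaped radical of each character, and fuse the element vectors with dot-product self-attention. The two-entry escape dictionary, the random embeddings, and the use of the sequence mean as the attention query are all illustrative assumptions:

```python
import numpy as np

RADICAL = {"病": "疒", "情": "忄"}   # character -> its radical
ESCAPE = {"疒": "疾", "忄": "心"}    # radical -> character with that meaning

def fuse_word(word, embed):
    """Return the self-attention-fused coding vector v_x for one word."""
    seq = []  # sequence x: the word's characters plus their escaped radicals
    for ch in word:
        seq.append(ch)
        if ch in RADICAL:
            seq.append(ESCAPE[RADICAL[ch]])
    V = np.stack([embed[e] for e in seq])  # element code vectors v_i
    query = V.mean(axis=0)                 # assumed summary vector for x
    scores = V @ query                     # dot-product similarity f
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                    # self-attention weights alpha_i
    return alpha @ V                       # v_x = sum_i alpha_i * v_i

rng = np.random.default_rng(1)
embed = {e: rng.normal(size=8) for e in ["病", "情", "疾", "心"]}
v_x = fuse_word("病情", embed)  # fused vector for the word "illness"
```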
In one embodiment, the operation of the training step is:
An $m$-dimensional vector $v_w$ is randomly generated for each word obtained by segmenting the corpus, and the log-likelihood function of the corpus sequence $s$ is computed:

$$L(s) = \sum_{t=1}^{N} L(w_t)$$

where $L(w_t) = \log p(w_t \mid \mathrm{window}_l, \mathrm{window}_s)$ is the log-likelihood of the target word $w_t$ conditioned on its context,

and the conditional probability of the target word $w_t$ given the corresponding context can be computed by the softmax function:

$$p(w_t \mid \mathrm{window}_l, \mathrm{window}_s) = \frac{\exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_w\right)}$$

where $h_k^{\top}$ denotes the transpose of the $k$-th hidden-layer vector, $k = 1, 2$; $h_1$ is the output vector of the style semantic model's hidden layer, $h_2$ is the output vector of the radical-level semantic model's hidden layer, $v_{w_t}$ is the word vector of the target word, and $v_w$ is the word vector of a candidate word in the vocabulary $V$. Training optimizes the objective function $L(s)$ and updates the model parameters to obtain the final Chinese word vector model $v_w \in \mathbb{R}^m$, i.e., the $m$-dimensional word vectors corresponding to all words $w$ in the corpus.
By training on social corpora and optimizing the objective function $L(s)$, the invention speeds up model training and improves training efficiency; during training it combines the stylistic features of the text with radical features and establishes a radical escape method, thereby improving the text recognition rate, which is another important inventive point of the invention.
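A sketch of the fused objective: the score of a candidate word sums its dot products with both hidden vectors $h_1$ and $h_2$, and $L(w_t)$ is the log of the resulting softmax probability. The three-word vocabulary and random vectors are toy assumptions:

```python
import numpy as np

def log_likelihood(h1, h2, target, out_vecs):
    """log p(w_t | window_l, window_s) under a softmax whose score for a
    candidate word w is h1.v_w + h2.v_w (summed over the two hidden vectors)."""
    scores = {w: h1 @ v + h2 @ v for w, v in out_vecs.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + np.log(sum(np.exp(s - m) for s in scores.values()))
    return scores[target] - log_z  # L(w_t)

rng = np.random.default_rng(2)
out_vecs = {w: rng.normal(size=4) for w in ["park", "view", "scene"]}
h1, h2 = rng.normal(size=4), rng.normal(size=4)
L_wt = log_likelihood(h1, h2, "park", out_vecs)
# Training would sum L(w_t) over every target word to form L(s) and
# maximize it by gradient updates of the word and code vectors.
```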
Fig. 2 shows a network context text recognition apparatus according to the present invention, which includes:
the modeling unit 201, which constructs a style semantic model based on a long text window and a radical-level semantic model based on a short text window;

the training unit 202, which trains on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context;

and the recognition unit 203, which recognizes the input text of the network context using the Chinese word vector model of the network context and outputs a recognition result.
The invention addresses the accuracy of text recognition in a network context. It creatively uses two different windows during word segmentation, a long text window and a short text window ("long" and "short" being relative to each other): the long window extracts semantic information of the networked style, while the short window extracts semantic features at different granularities. Combining the two in the training stage yields a more accurate word vector representation and thus a higher text recognition rate for the network context, which is one of the important inventive points of the invention.
In one embodiment, the corpus sequence obtained by segmenting any corpus $s$ in the corpus (the corpus comprises at least one such sequence) is $s = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_N\}$, where $w_t$ is the $t$-th word in the segmented sequence and is taken as the target word to be predicted, $t = 1, \ldots, N$, and $N$ is the total number of words in the corpus sequence. A text window is constructed centered on the target word $w_t$, and the short text window is defined as:

$$\mathrm{window}_s = \{\, w_{t \pm d_s} \mid 1 \le d_s \le \theta \,\}$$

where $d_s$ is the distance from a word in the short text window to the target word $w_t$, and $\theta$ is the distance threshold of the short text window (experimental values lie between 1 and 3); $\mathrm{window}_s$ is the set of words formed by the context adjacent to the target word $w_t$.

The long text window is defined as

$$\mathrm{window}_l = \{\, w_{t \pm d_l} \mid \theta + 1 \le d_l \le \beta \,\}$$

where $d_l$ is the distance from a word in the long text window to the target word $w_t$, with minimum value $\theta + 1$ and maximum value $\beta$ (empirical values in experiments lie between 4 and 7), $\beta \le N$; $\mathrm{window}_l$ consists of the context farther from the target word $w_t$ and does not include the content of the short text window.
For example, suppose $\theta = 2$ and $\beta = 6$. For the corpus "today I go to the park together with friends to view the beautiful cherry blossom scene", if the target word $w_t$ is "park", the short text window contains the words "together", "go to", "view", and "cherry blossom", while the long text window contains the more distant words "today", "I", "and", "friend", "beautiful", and "scene".
In the training process, every word in the corpus serves in turn as the target word to be predicted; the long and short text windows slide with a step of 1, traversing each word of each corpus sequence in the corpus.
In one embodiment, the style semantic model is constructed from the long text window as follows: the long text window $\mathrm{window}_l$ is taken as the input of CBOW, and the hidden-layer vector is computed as

$$h_1 = \frac{1}{|\mathrm{window}_l|} \sum_{\theta < |j| \le \beta} v_{w_{t+j}}$$

where $v_{w_{t+j}}$ is the code vector corresponding to the context word $w_{t+j}$ of the target word within the long text window, $\beta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$, and the total length of the long text window is $2\beta$.
In order to obtain Chinese word vectors suited to the social-context style, the CBOW model is improved: the style semantic model is based on CBOW but enlarges the context window and ignores the text content near the target word. By weakening nearby context semantics, it improves the model's understanding of the overall text style and generates Chinese word vectors suited to the network context. The invention adopts the long text window so that learning the text style attends to the stylistic tendency of the corpus over a wide range rather than focusing on the context near the target word, which is another important inventive point of the invention.
In one embodiment, in order to extract multi-level semantic features, the invention introduces radicals to enhance word vector semantic information. The radical-level semantic model takes the short text window $\mathrm{window}_s$, performs weighted fusion of the semantic information of the Chinese characters and radicals within the text window using a self-attention mechanism, and computes word vectors with the CBOW model. The radical-level semantic model is constructed from the short text window as follows:

the words $w_{t \pm d_s}$ in the short text window are split into Chinese characters to obtain the short-text character sequence $c$;

the radical $r$ of each character is converted, via a character escape dictionary, into a Chinese character $r^*$ with the corresponding semantics, yielding the sequence $x$ of short-text characters and escaped radicals;

a self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, where the self-attention weight $\alpha$ is computed as:

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence of short-text characters and escaped radicals corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form;

the coding vector of each word in the short text window is:

$$v_x = \sum_i \alpha_i v_i$$

where $v_i$ is the coding vector of the $i$-th element of the sequence corresponding to word $x$ in the short text window;

the attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$

where $\theta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$ in the short text window, the total length of the short text window is $2\theta$, and $v_{x_{t+j}}$ is the coding vector corresponding to the context of the $t$-th target word within the short text window.
The radical-level semantic model is explained below with a short text window distance threshold $\theta = 1$.

(1) The words $w_{t-1}$ and $w_{t+1}$ in the short text window are split into Chinese characters, giving the short-text character sequence $c = \{c_{t-1}, c_{t+1}\}$. For example, the word "病情" (illness) is split into the two characters "病" (disease) and "情" (condition).

(2) The radical of each Chinese character in the short-text character sequence $c$ is extracted, giving $r_{t-1}$ and $r_{t+1}$. For example, the radicals corresponding to "病" and "情" are "疒" and "忄".

(3) To make the semantic information contained in a radical explicit, the radical $r$ is converted via the character escape dictionary into a Chinese character $r^*$ with the corresponding semantics. For example, "疒" and "忄" escape to characters meaning "disease" and "heart". The character escape dictionary is shown in Table 1 above.
(4) The short-text characters and the escaped radicals are combined to form the sequence $x$.

(5) A self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, mining the latent semantics of the radicals and enriching the semantic information of the character sequence. The self-attention weight $\alpha$ is computed as

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form.

The coding vector of each word in the short text window is $v_x = \sum_i \alpha_i v_i$.

The attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$

where $\theta$ is the distance threshold between a word in the short text window and the target word.
Among the components that form a character, the radical usually carries clear semantic information that helps in understanding the character's meaning. The invention therefore introduces radicals to enhance word vector semantic information, extracting multi-level semantic features and improving the accuracy of text classification, which is another important inventive point of the invention.
In one embodiment, the training unit performs the operations of:
An $m$-dimensional vector $v_w$ is randomly generated for each word obtained by segmenting the corpus, and the log-likelihood function of the corpus sequence $s$ is computed:

$$L(s) = \sum_{t=1}^{N} L(w_t)$$

where $L(w_t) = \log p(w_t \mid \mathrm{window}_l, \mathrm{window}_s)$ is the log-likelihood of the target word $w_t$ conditioned on its context,

and the conditional probability of the target word $w_t$ given the corresponding context can be computed by the softmax function:

$$p(w_t \mid \mathrm{window}_l, \mathrm{window}_s) = \frac{\exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_w\right)}$$

where $h_k^{\top}$ denotes the transpose of the $k$-th hidden-layer vector, $k = 1, 2$; $h_1$ is the output vector of the style semantic model's hidden layer, $h_2$ is the output vector of the radical-level semantic model's hidden layer, $v_{w_t}$ is the word vector of the target word, and $v_w$ is the word vector of a candidate word in the vocabulary $V$. Training optimizes the objective function $L(s)$ and updates the model parameters to obtain the final Chinese word vector model $v_w \in \mathbb{R}^m$, i.e., the $m$-dimensional word vectors corresponding to all words $w$ in the corpus.
By training on social corpora and optimizing the objective function $L(s)$, the invention speeds up model training and improves training efficiency; during training it combines the stylistic features of the text with radical features and establishes a radical escape method, thereby improving the text recognition rate, which is another important inventive point of the invention.
For convenience of description, the above devices are described as being divided into various units for separate description. Of course, the functionality of the various elements may be implemented in the same one or more pieces of software and/or hardware in the practice of the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, or the portions that contribute to the prior art, may be embodied in the form of a software product stored on a storage medium such as ROM/RAM, a magnetic disk, or an optical disk, which includes several instructions for enabling a computer device (a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments, or in portions of the embodiments, of the present application.
Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that modifications and equivalents may be made without departing from the spirit and scope of the invention, and any such modifications and equivalents are intended to be covered by the claims.
Claims (5)
1. A method for text recognition of a network context, the method comprising:
a modeling step, constructing a style semantic model based on a long text window and a radical-level semantic model based on a short text window;
a training step, training on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context;
a recognition step, recognizing the input text of the network context using the Chinese word vector model of the network context and outputting a recognition result;
wherein the corpus sequence obtained by segmenting any corpus $s$ is $s = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_N\}$, where $w_t$ is the $t$-th word in the segmented sequence and is taken as the target word to be predicted, $t = 1, \ldots, N$, and $N$ is the total number of words in the corpus sequence; a text window is constructed centered on the target word $w_t$, and the short text window is defined as:

$$\mathrm{window}_s = \{\, w_{t \pm d_s} \mid 1 \le d_s \le \theta \,\}$$

where $d_s$ is the distance from a word in the short text window to the target word $w_t$, $\theta$ is the distance threshold of the short text window, and $\mathrm{window}_s$ is the set of words formed by the context adjacent to the target word $w_t$;

the long text window is defined as

$$\mathrm{window}_l = \{\, w_{t \pm d_l} \mid \theta + 1 \le d_l \le \beta \,\}$$

where $d_l$ is the distance from a word in the long text window to the target word $w_t$, with minimum value $\theta + 1$ and maximum value $\beta$, $\beta \le N$; $\mathrm{window}_l$ consists of the context farther from the target word $w_t$ and does not include the content of the short text window;

the style semantic model is constructed from the long text window as follows: the long text window $\mathrm{window}_l$ is taken as the input of CBOW, and the hidden-layer vector is computed as

$$h_1 = \frac{1}{|\mathrm{window}_l|} \sum_{\theta < |j| \le \beta} v_{w_{t+j}}$$

where $v_{w_{t+j}}$ is the code vector corresponding to the context word $w_{t+j}$ of the target word within the long text window, $\beta$ is the maximum distance between the target word $w_t$ and a context word $w_{t+j}$, and the total length of the long text window is $2\beta$;

the radical-level semantic model is constructed from the short text window as follows:

the radical $r$ of each character is converted, via a character escape dictionary, into a Chinese character $r^*$ with the corresponding semantics, yielding the sequence $x$ of short-text characters and escaped radicals;

a self-attention mechanism performs weighted fusion coding of the Chinese characters and radicals corresponding to each word, where the self-attention weight $\alpha$ is computed as:

$$\alpha_i = \mathrm{softmax}\big(f(x^{\top} x_i)\big)$$

where $x_i$ is the sequence of short-text characters and escaped radicals corresponding to the $i$-th word in the short text window, $i \in \{t \pm d_s \mid 1 \le d_s \le \theta\}$, $x^{\top}$ is the transpose of $x$, and the similarity function $f$ takes the dot-product form;

the coding vector of each word in the short text window is:

$$v_x = \sum_i \alpha_i v_i$$

where $\alpha_i$ is the self-attention weight of the $i$-th element of the sequence corresponding to word $x$ in the short text window, and $v_i$ is the coding vector of that element;

the attention-derived coding vectors $v_x$ are input into CBOW, and the hidden-layer output vector is computed as

$$h_2 = \frac{1}{2\theta} \sum_{1 \le |j| \le \theta} v_{x_{t+j}}$$
2. The method of claim 1, wherein the training step operates as follows:

an $m$-dimensional vector $v_w$ is randomly generated for each word obtained by segmenting the corpus, and the log-likelihood function of the corpus sequence $s$ is computed:

$$L(s) = \sum_{t=1}^{N} L(w_t)$$

where $L(w_t) = \log p(w_t \mid \mathrm{window}_l, \mathrm{window}_s)$ is the log-likelihood of the target word $w_t$ conditioned on its context,

and the conditional probability of the target word $w_t$ given the corresponding context is computed by the softmax function:

$$p(w_t \mid \mathrm{window}_l, \mathrm{window}_s) = \frac{\exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_w\right)}$$

where $h_k^{\top}$ denotes the transpose of the $k$-th hidden-layer vector, $k = 1, 2$; $h_1$ is the output vector of the style semantic model's hidden layer, $h_2$ is the output vector of the radical-level semantic model's hidden layer, $v_{w_t}$ is the word vector of the target word, and $v_w$ is the word vector of a candidate word in the vocabulary $V$; training optimizes the objective function $L(s)$ and updates the model parameters to obtain the final Chinese word vector model $v_w \in \mathbb{R}^m$.
3. An apparatus for network context text recognition, the apparatus comprising:
the modeling unit, configured to construct a style semantic model based on a long text window and a radical-level semantic model based on a short text window;
the training unit, configured to train on a corpus of the network context based on the style semantic model and the radical-level semantic model to obtain a Chinese word vector model of the network context;
the recognition unit, configured to recognize the input text of the network context using the Chinese word vector model of the network context and output a recognition result;
the corpus has a corpus sequence s = { w } obtained after any corpus s is participled 1 ,…w t-1 ,w t ,w t+1 ,…w N In which w t Setting w for the t-th word in the sequence after word segmentation t The target words to be predicted are t =1, \ 8230, and N, N is the total word number in the corpus sequence; with the target word w t Constructing a text window for the center, and defining the text short window as follows:
wherein, d s Representing words in a short window of text to a target word w t Let the distance threshold of the text short window be theta, window s Representing the word by adjacent objects w t A set of words consisting of the context of (a);
define a text long window as
Wherein d is l Representing words in a long window of text to a target word w t The minimum value is theta +1, the maximum value is beta, beta is less than or equal to N, and the window l Representing the target word w by distance t Context components with longer distance and do not include the content in the text short window;
the process of constructing the style semantic model based on the text long window comprises the following steps: window for long text l Computing hidden layer vectors as input to CBOW
In the formula,context w representing a target word within a long window of text t+j Corresponding code vector, beta represents the target word w in the long window of the text t And context w t+j The total length of the text long window is 2 beta;
the process of constructing the radical level semantic model based on the text short window comprises the following steps:
shortening words in text windowDividing into Chinese characters to obtain short text character sequence
Converting the radical r into Chinese character r with corresponding semantic meaning by character escape dictionary * Obtaining the word sequence x after the short text and the escape of the radicals,
and (3) carrying out weighted fusion coding on the Chinese characters and the radicals corresponding to the words by adopting a self-attention mechanism, wherein the calculation formula of the self-attention weight alpha is as follows:
α i =softmax(f(x T x i ))
wherein x is i Representing the short text corresponding to the ith word in the short text window and the word sequence after the escape of the radicals, i belongs to { t +/-d ∈ s |1<d s ≤θ},x T Is x i The similarity calculation function f adopts a dot product form;
the encoding vector of each word in the short text window is:
v x =∑ i α i v i
wherein alpha is i Self-attention weight, v, representing the ith word in the word sequence corresponding to word x in a short window of text i The coded vector represents the ith word in the word sequence corresponding to the word x in the short window of the text;
coding vector v to be derived from attention x Inputting CBOW, calculating output vector of hidden layer
4. The apparatus of claim 3, wherein the training unit performs the following operations:

an $m$-dimensional vector $v_w$ is randomly generated for each word obtained by segmenting the corpus, and the log-likelihood function of the corpus sequence $s$ is computed:

$$L(s) = \sum_{t=1}^{N} L(w_t)$$

where $L(w_t) = \log p(w_t \mid \mathrm{window}_l, \mathrm{window}_s)$ is the log-likelihood of the target word $w_t$ conditioned on its context,

and the conditional probability of the target word $w_t$ given the corresponding context is computed by the softmax function:

$$p(w_t \mid \mathrm{window}_l, \mathrm{window}_s) = \frac{\exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(\sum_{k=1}^{2} h_k^{\top} v_w\right)}$$

where $h_k^{\top}$ denotes the transpose of the $k$-th hidden-layer vector, $k = 1, 2$; $h_1$ is the output vector of the style semantic model's hidden layer, $h_2$ is the output vector of the radical-level semantic model's hidden layer, $v_{w_t}$ is the word vector of the target word, and $v_w$ is the word vector of a candidate word in the vocabulary $V$; training optimizes the objective function $L(s)$ and updates the model parameters to obtain the final Chinese word vector model $v_w \in \mathbb{R}^m$.
5. A computer-readable storage medium, characterized in that the storage medium has stored thereon computer program code which, when executed by a computer, performs the method of any of claims 1-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010396183.9A CN111581970B (en) | 2020-05-12 | 2020-05-12 | Text recognition method, device and storage medium for network context |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010396183.9A CN111581970B (en) | 2020-05-12 | 2020-05-12 | Text recognition method, device and storage medium for network context |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111581970A CN111581970A (en) | 2020-08-25 |
CN111581970B (en) | 2023-01-24
Family
ID=72112139
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010396183.9A Active CN111581970B (en) | 2020-05-12 | 2020-05-12 | Text recognition method, device and storage medium for network context |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111581970B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112765989B (en) * | 2020-11-17 | 2023-05-12 | 中国信息通信研究院 | Variable-length text semantic recognition method based on representation classification network |
CN113190643B (en) * | 2021-04-13 | 2023-02-03 | 安阳师范学院 | Information generation method, terminal device, and computer-readable medium |
CN113449490B (en) * | 2021-06-22 | 2024-01-26 | 上海明略人工智能(集团)有限公司 | Document information summarizing method, system, electronic equipment and medium |
CN113408289B (en) * | 2021-06-29 | 2024-04-16 | 广东工业大学 | Multi-feature fusion supply chain management entity knowledge extraction method and system |
CN114970456B (en) * | 2022-05-26 | 2024-09-24 | 厦门市美亚柏科信息股份有限公司 | Chinese word vector compression method, system and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
CN107305768A (en) * | 2016-04-20 | 2017-10-31 | 上海交通大学 | Easy wrongly written character calibration method in interactive voice |
CN107977361A (en) * | 2017-12-06 | 2018-05-01 | 哈尔滨工业大学深圳研究生院 | The Chinese clinical treatment entity recognition method represented based on deep semantic information |
CN109918677A (en) * | 2019-03-21 | 2019-06-21 | 广东小天才科技有限公司 | English word semantic parsing method and system |
CN111091001A (en) * | 2020-03-20 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for generating word vector of word |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9575965B2 (en) * | 2013-03-13 | 2017-02-21 | Red Hat, Inc. | Translation assessment based on computer-generated subjective translation quality score |
- 2020-05-12: Application CN202010396183.9A filed in China; granted as patent CN111581970B (status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107305768A (en) * | 2016-04-20 | 2017-10-31 | 上海交通大学 | Easy wrongly written character calibration method in interactive voice |
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
CN107977361A (en) * | 2017-12-06 | 2018-05-01 | 哈尔滨工业大学深圳研究生院 | The Chinese clinical treatment entity recognition method represented based on deep semantic information |
CN109918677A (en) * | 2019-03-21 | 2019-06-21 | 广东小天才科技有限公司 | English word semantic parsing method and system |
CN111091001A (en) * | 2020-03-20 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for generating word vector of word |
Non-Patent Citations (2)
Title |
---|
Hu Keqi, "Research on Short-Text Classification Based on Deep Learning," China Masters' Theses Full-text Database, 2018-10-15 *
Li Fenglin et al., "Research Progress on Semantic Representation of Word Vectors," Information Science, No. 05, 2019-05-01 *
Also Published As
Publication number | Publication date |
---|---|
CN111581970A (en) | 2020-08-25 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |