
CN108205524B - Text data processing method and device - Google Patents

Text data processing method and device

Info

Publication number
CN108205524B
CN108205524B (application CN201611180235.9A; published as CN108205524A)
Authority
CN
China
Prior art keywords
features
gram
named entity
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611180235.9A
Other languages
Chinese (zh)
Other versions
CN108205524A (en)
Inventor
高维国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201611180235.9A priority Critical patent/CN108205524B/en
Publication of CN108205524A publication Critical patent/CN108205524A/en
Application granted granted Critical
Publication of CN108205524B publication Critical patent/CN108205524B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text data processing method and device, and relates to the field of data processing. The text data processing method comprises the following steps: segmenting a phrase to be recognized in the text character by character; extracting n-gram features from the phrase to be recognized; determining vector features of the n-gram features from the extracted n-gram features; inputting the extracted n-gram features and the vector features of the n-gram features into a named entity recognition model; and determining whether the phrase to be recognized is a target named entity according to the output of the named entity recognition model. By extracting the n-gram features, and the vector features of those n-gram features, from the character-segmented phrase, the method and the device capture the correlation between adjacent characters in the phrase; the n-gram features and their vector features embody both the literal features and the generalization features of the phrase, so the accuracy of named entity recognition can be improved.

Description

Text data processing method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for processing text data.
Background
A named entity is an entity identified by a name, such as a person's name, an organization name, or a place name. By recognizing named entities, a system can infer the user's search intent, extract attributes of a text, and so on, thereby improving search efficiency and enabling accurate content push to users.
At present, named entity recognition technology mainly recognizes the fixed part and the variable part of a phrase by rule-based methods using preset rules. For example, for named entities of the quantity type, the existing recognition method separates the number from the quantifier: numbers are recognized by matching them in the text, including cardinal numbers (e.g., 100,000), ordinals (e.g., 1st), plain numbers (1.5), percentages (50%), etc., while quantifiers are recognized against a built vocabulary such as kg, g, p, etc.
Because of uncertainty in user input and text content, the formats of named entities are diverse. For example, "500ml" may also be written as "500ML", "five hundred ml", "about 500ml", "around 500ml" or "500ml+", and "2 pm" may also be written as "14:00", "2:00PM" or "two in the afternoon", and so on. Due to these diverse forms, the accuracy of the rule-based recognition method is low.
Disclosure of Invention
One technical problem to be solved by the embodiments of the invention is how to improve the accuracy of named entity recognition.
According to an aspect of the embodiments of the present invention, there is provided a text data processing method, including: segmenting a phrase to be recognized in a text character by character; extracting n-gram features from the phrase to be recognized; determining vector features of the n-gram features from the extracted n-gram features; numerically encoding the extracted n-gram features; inputting the encoded n-gram features and the vector features of the n-gram features into a named entity recognition model; and determining whether the phrase to be recognized is a target named entity according to the output of the named entity recognition model.
In one embodiment, the method further comprises: determining a sentence vector feature of the phrase to be recognized from the word vectors of the individual characters in the phrase to be recognized; inputting the extracted n-gram features and the vector features of the n-gram features into the named entity recognition model then comprises: inputting the extracted n-gram features, the vector features of the n-gram features, and the sentence vector feature into the named entity recognition model.
In one embodiment, the method further comprises: training the named entity recognition model with training data; wherein the training data comprises the n-gram features of the target-named-entity phrases and non-target-named-entity phrases in the training samples and the vector features of the n-gram features extracted from those phrases, or additionally comprises the sentence vector features of the phrases in the training samples.
In one embodiment, obtaining training samples comprises: labeling the target named entities in the training data; replacing some characters of a target named entity in the training data with other characters to obtain a non-target named entity, and labeling it; and taking the labeled target named entities and non-target named entities as the training samples.
In one embodiment, determining the vector features of the n-gram features from the extracted n-gram features comprises: for n-gram features other than uni-grams, determining the vector features of the n-gram features from the word vectors of the individual characters in the n-gram features.
In one embodiment, the word vector of a character is obtained as follows: obtaining a word-vector training corpus that includes the target named entities; segmenting the corpus character by character; inputting the character-segmented corpus into the word2vec algorithm for training; and obtaining the word vector of each character output by the word2vec algorithm.
In one embodiment, phrases are segmented character by character as follows: each run of consecutive digits is segmented as a single independent word.
In one embodiment, the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, or a phrase representing a place, and/or the named entity recognition model is a boosted tree model, a convolutional neural network model, or a recurrent neural network model.
According to a second aspect of the embodiments of the present invention, there is provided a text data processing apparatus, including: a phrase segmentation module for segmenting a phrase to be recognized in a text character by character; an n-gram feature extraction module for extracting n-gram features from the phrase to be recognized; a vector feature generation module for determining vector features of the n-gram features from the extracted n-gram features; an encoding module for numerically encoding the extracted n-gram features; a data to be tested input module for inputting the encoded n-gram features and the vector features of the n-gram features into a named entity recognition model; and a named entity recognition module for determining whether the phrase to be recognized is a target named entity according to the output of the named entity recognition model.
In one embodiment, the apparatus further comprises: a sentence vector feature determination module for determining a sentence vector feature of the phrase to be recognized from the word vectors of the individual characters in the phrase to be recognized; the data to be tested input module is then used to input the extracted n-gram features, the vector features of the n-gram features, and the sentence vector feature into the named entity recognition model.
In one embodiment, the apparatus further comprises a training data generation module for generating training data to train the named entity recognition model; the generated training data comprises the n-gram features of the target-named-entity phrases and non-target-named-entity phrases in the training samples and the vector features of the n-gram features extracted from those phrases, or additionally comprises the sentence vector features of the phrases in the training samples.
In one embodiment, the apparatus further comprises: a target named entity labeling module for labeling the target named entities in the training data; a non-target named entity obtaining module for replacing some characters of a target named entity in the training data with other characters to obtain a non-target named entity and labeling it; and a training sample obtaining module for taking the labeled target named entities and non-target named entities as the training samples.
In one embodiment, the vector feature generation module is further configured to determine, for n-gram features other than uni-grams, the vector features of the n-gram features from the word vectors of the individual characters in the n-gram features.
In one embodiment, the apparatus further comprises: a training corpus obtaining module for obtaining a word-vector training corpus containing the target named entities; a corpus segmentation module for segmenting the word-vector training corpus character by character; a segmented corpus input module for inputting the character-segmented corpus into the word2vec algorithm for training; and a word vector obtaining module for obtaining the word vector of each character output by the word2vec algorithm.
In one embodiment, the phrase segmentation module is further configured to segment each run of consecutive digits into a single independent word.
In one embodiment, the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, or a phrase representing a place, and/or the named entity recognition model is a boosted tree model, a convolutional neural network model, or a recurrent neural network model.
According to a third aspect of the embodiments of the present invention, there is provided a text data processing apparatus, comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the foregoing text data processing methods based on instructions stored in the memory.
By extracting the n-gram features, and the vector features of those n-gram features, from the phrase to be recognized after character-by-character segmentation, the method and the device capture the correlation between adjacent characters in the phrase; the n-gram features and their vector features embody both the literal features and the generalization features of the phrase, so the accuracy of named entity recognition can be improved.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIGS. 1A and 1B are flow diagrams of different embodiments of a text data processing method for identifying a named entity according to the present invention.
Fig. 2A and 2B are flowcharts of different embodiments of a training data generation method of a named entity recognition model according to the present invention.
FIG. 3 is a block diagram of an embodiment of a text data processing apparatus according to the present invention.
Fig. 4 is a block diagram of another embodiment of a text data processing apparatus of the present invention.
FIG. 5 is a block diagram of another embodiment of a text data processing apparatus according to the present invention.
Fig. 6 is a block diagram of still another embodiment of a text data processing apparatus of the present invention.
Fig. 7 is a block diagram of still another embodiment of a text data processing apparatus of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The identification of named entities can be used in a variety of application scenarios.
For example, in the field of search technology, taking e-commerce as an example, an e-commerce website may recognize quantity phrases in product information to learn the size, model, capacity, etc. of a product; when a user searches on the site, it may likewise recognize quantity phrases in the query so as to rank products matching that size, model or capacity first in the results.
As another example, the publication time of a news article on a website is usually standardized and shown in a uniform format at a uniform position near the title, so it is easy to identify. However, the article body may mention the occurrence times of several sub-events contained in the news, and the writing format of those times is not strictly unified; named entity recognition can therefore be used to obtain the occurrence times of the sub-events, which helps organize the news and associate related news content.
As another example, with the growth of the travel industry and the internet, many people write travel notes after a trip and publish them online for others who want to travel to the same destination. A single travel note may contain many place names and organization names at the destination. Using named entity recognition, a website can capture the place names, organization names, times, prices and other information contained in the travel notes, reconstruct the traveler's itinerary, and make it convenient for other users to draw on.
Each of these scenarios is only an example of applying the named entity recognition of the present invention; a person skilled in the art may apply the text data processing method of the present invention to other scenarios as needed, which are not described again here.
FIGS. 1A and 1B are flow diagrams of different embodiments of a text data processing method for recognizing a named entity according to the present invention. As shown in FIGS. 1A and 1B, the method of these embodiments includes:
Step S102: segment the phrase to be recognized in the text character by character.
The text may be, for example, textual content information obtained from a website, database, or the like.
The phrase to be recognized is a phrase that is to be classified as either a target named entity or a non-target named entity. The type of the target named entity can be set as needed: a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, a phrase representing a place, and so on.
For example, the phrase to be recognized "下午2点" ("2 pm") may be segmented character by character into "下/午/2/点".
Some named entities contain a large amount of numerical information, such as phrases representing quantities or times. If, during recognition, the numerical value expressed by the named entity does not matter but only the presence of a number at a specific position does, each run of consecutive digits can be segmented as one independent word while all other characters are segmented individually. For example, the phrase to be recognized "大约500ML" ("about 500ML") may be segmented into "大/约/500/M/L", treating 500 as one independent word. This reduces the number of features of the phrase to be recognized and improves recognition efficiency without affecting accuracy.
In addition, the positions of the numbers after segmentation can be replaced by a uniform token: for example, the segmentation results of "大约500ML" and "1.3M左右" ("around 1.3M") can be "大/约/digit/M/L" and "digit/M/左/右". The features of phrases to be recognized then take a more regular form, reducing recognition complexity.
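As an illustration of this segmentation rule, the following minimal Python sketch splits a phrase character by character while keeping digit runs whole and optionally replacing them with a uniform token (the function name and the "digit" placeholder are illustrative assumptions, not from the patent):

import re

def segment_phrase(phrase, normalize_digits=True):
    """Split a phrase into single characters, keeping each run of
    consecutive digits (e.g. "500" or "1.3") as one independent word."""
    tokens = []
    for piece in re.findall(r"\d+(?:\.\d+)?|.", phrase):
        if piece[0].isdigit():
            # replace every digit run with a uniform placeholder token
            tokens.append("digit" if normalize_digits else piece)
        else:
            tokens.append(piece)
    return tokens

print(segment_phrase("大约500ML"))  # ['大', '约', 'digit', 'M', 'L']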
Step S104: extract n-gram features from the phrase to be recognized.
An n-gram is an n-order grammar model built on the assumption that the occurrence of the nth word depends only on the preceding n-1 words. The n-gram model can therefore reflect the contextual relationship between adjacent words.
An n-gram feature of the phrase to be recognized is a combination of consecutive characters in the phrase, with n denoting the number of characters in the feature. For example, for "大/约/两/点" ("about two o'clock"), the 1-gram features are 大, 约, 两 and 点, and the 2-gram features are 大约, 约两 and 两点. Generally, n-gram features are extracted with the character count within a preset threshold range, for example 1-3.
The content of an n-gram feature is a character or character combination of a set length taken from the phrase to be recognized, so n-gram features reflect the literal features of the phrase.
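A short sketch of this extraction step, under the same assumptions as above (n ranges over a preset threshold such as 1-3):

def ngram_features(tokens, n_min=1, n_max=3):
    """Collect every combination of consecutive tokens whose length
    lies within the preset threshold range [n_min, n_max]."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append("".join(tokens[i:i + n]))
    return feats

print(ngram_features(["大", "约", "两", "点"], 1, 2))
# ['大', '约', '两', '点', '大约', '约两', '两点']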
Step S106: determine the vector features of the n-gram features from the extracted n-gram features.
The vector features of the n-gram features are the n-gram features expressed in vector form. One way to obtain them is to compute the vector feature of each extracted n-gram feature from the word vectors of the individual characters in it; in one embodiment, the mean or a weighted average of those word vectors is used as the vector feature of the n-gram feature. The word vector of each character in an n-gram feature thus lets the vector feature reflect the generalization features of the phrase to be recognized.
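A minimal sketch of this computation, assuming a hypothetical character-to-vector lookup table w2v (the unweighted mean is just one of the two options named above):

import numpy as np

def mean_vector(char_vectors):
    """Vector feature of an n-gram feature: the mean of the word
    vectors of its characters (a weighted average also works)."""
    return np.mean(np.stack(char_vectors), axis=0)

# e.g. the vector feature of the 2-gram feature 约两:
# vec = mean_vector([w2v["约"], w2v["两"]])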
A word vector is the vector corresponding to a character, and it may be pre-computed. For example, the word vector of a character may be obtained as follows: first, obtain a word-vector training corpus containing the target named entities; then, segment the corpus character by character; finally, input the character-segmented corpus into the word2vec algorithm for training, and obtain the word vector of each character from the algorithm's output.
The word2vec algorithm is an efficient tool that represents words as real-valued vectors; drawing on ideas from deep learning, it reduces the processing of text content, after training, to vector operations in a K-dimensional vector space. In the present invention, each token input to word2vec is defined to contain only a single character, so the vectors obtained are per-character word vectors. After word2vec training, the word vectors of characters used in similar ways have small angles between them, reflecting the similarity between characters. Other methods of obtaining word vectors may be adopted as needed and are not described again here.
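A sketch of the character-vector training described above, using the gensim implementation of word2vec (the hyperparameter values are illustrative assumptions, not values from the patent):

from gensim.models import Word2Vec

# Each "sentence" of the corpus is a character-segmented phrase; here
# digit runs are replaced by the uniform "digit" token, so word2vec
# learns one vector per character (plus one for the digit token).
corpus = [
    ["大", "约", "digit", "M", "L"],
    ["大", "约", "两", "点"],
    # ... the rest of the character-segmented training corpus
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1)  # sg=1 selects the skip-gram variant
w2v = model.wv                        # lookup table: character -> vector
print(w2v["约"])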
For example, suppose the word-vector training corpus is product information from an e-commerce website containing expressions such as "brand A 500毫升 high-calcium nutritious milk" and "brand A 500ml high-calcium nutritious milk". Since "毫升" and "ml" are close in meaning, training yields a small angle between the word vectors of "毫" and "m", and likewise between the word vectors of "升" and "l". The vector features of the 2-gram features "毫升" and "ml" are therefore also close. In this way, even for phrases to be recognized that consist of entirely different characters, the vector features of the n-gram features can uncover the underlying association, improving the generalization ability of recognition.
In addition, the vector features of the n-gram features may be generated in other ways; for example, a word vector may be computed directly for the token corresponding to each n-gram feature, and so on, which is not described again here.
A vector feature need not be generated for every n-gram feature. For example, vector features may be determined only for n-gram features other than uni-grams, from the word vectors of the individual characters in those features. Because the vector features of the n-gram features mainly reflect the correlation between characters, omitting the vector features of uni-gram features reduces the number of features and improves recognition efficiency.
Step S108: numerically encode the extracted n-gram features.
Since a named entity recognition model mainly processes numeric features while n-gram features consist of characters, the extracted n-gram features are numerically encoded, i.e., converted into numbers according to a preset rule; for example, a distinct number may be assigned to each distinct n-gram feature. The named entity recognition model can then process the n-gram features.
In one embodiment, one-hot encoding may be used. One-hot encoding, also known as 1-of-N encoding, uses one state-register bit per state, with each state having its own independent bit and exactly one bit active at any time. N-gram features encoded this way are easy for the named entity recognition model to process, and one-hot-encoded data is very sparse, which improves recognition efficiency.
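A sketch of one-hot encoding over a fixed n-gram feature vocabulary (the example vocabulary and the all-zero handling of unseen features are assumptions for illustration):

def one_hot_encode(features, vocabulary):
    """Encode each n-gram feature as a one-hot vector: one register
    bit per known feature, exactly one bit active per encoded feature."""
    index = {f: i for i, f in enumerate(vocabulary)}
    codes = []
    for f in features:
        vec = [0] * len(vocabulary)
        if f in index:  # features outside the vocabulary stay all-zero
            vec[index[f]] = 1
        codes.append(vec)
    return codes

vocab = ["大约", "约两", "两点", "约digit"]
print(one_hot_encode(["约两", "两点"], vocab))  # [[0, 1, 0, 0], [0, 0, 1, 0]]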
In addition, the method may optionally further include step S109; the method including step S109 is shown in FIG. 1B, and the method without it in FIG. 1A.
Step S109: determine the sentence vector feature of the phrase to be recognized from the word vectors of the individual characters in the phrase.
For example, the mean or a weighted average of the word vectors of the characters in the phrase to be recognized may be used as its sentence vector feature, so that the features of the phrase as a whole are fully captured.
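Under the same assumptions as the mean_vector sketch above, the sentence vector feature is the same mean, taken over every character of the phrase rather than over one n-gram feature:

def sentence_vector(tokens, w2v):
    """Sentence vector feature: mean of the word vectors of all
    characters in the phrase to be recognized."""
    return mean_vector([w2v[t] for t in tokens])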
Step S110: when step S109 is not included, input the encoded n-gram features and the vector features of the n-gram features into the named entity recognition model, as shown in FIG. 1A. When step S109 is included, the n-gram features of the phrase to be recognized, the vector features of the n-gram features, and the sentence vector feature may be input into the named entity recognition model together, as shown in FIG. 1B.
The named entity recognition model may be a boosted tree model, a convolutional neural network model, a recurrent neural network model, or the like, and is trained on samples that include both target-named-entity phrases and non-target-named-entity phrases. The boosted tree model may be implemented, for example, with the open-source xgboost library.
The maximum n of the n-grams whose vector features are input into the named entity recognition model may differ from the maximum n of the raw n-gram features. For example, the 1-gram, 2-gram and 3-gram features of the phrase to be recognized may be input together with the vector features of its 2-gram, 3-gram and 4-gram features. The content of the input features can thus be defined flexibly according to the characteristics of the phrases to be recognized and the business requirements.
Step S112: determine whether the phrase to be recognized is the target named entity according to the output of the named entity recognition model.
For example, the decision may be based on the classification result output by the model, or on a numeric value it outputs. Taking a boosted tree model as an example, the sum of the prediction results output by the individual trees can be computed; if the sum falls within the preset range corresponding to the target named entity, the phrase to be recognized is the target named entity.
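A sketch of this decision step with the open-source xgboost library named above (the feature-row assembly and the 0.5 threshold are assumptions; xgboost aggregates the per-tree predictions internally):

import numpy as np
import xgboost as xgb

def is_target_entity(bst, feature_row, threshold=0.5):
    """bst: a trained xgboost Booster; feature_row: the concatenated
    encoded n-gram, vector and sentence features of one phrase."""
    dmat = xgb.DMatrix(np.asarray(feature_row, dtype=float).reshape(1, -1))
    score = bst.predict(dmat)[0]  # aggregate of all trees' outputs
    return score > threshold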
By extracting the n-gram features, and the vector features of those n-gram features, from the character-segmented phrase to be recognized, the correlation between adjacent characters in the phrase is captured; the n-gram features and their vector features embody both the literal features and the generalization features of the phrase, so the accuracy of named entity recognition can be improved.
Typically, the data input at prediction time is composed of the same features as the data input at training time. The training data used to train the named entity recognition model of the present invention may therefore also be generated by the method of the embodiments of FIGS. 1A and 1B. A training data generation method for the named entity recognition model is described below with reference to FIGS. 2A and 2B.
Fig. 2A and 2B are flowcharts of different embodiments of a training data generation method of a named entity recognition model according to the present invention. As shown in fig. 2A and 2B, the method of this embodiment includes:
Step S202: obtain training samples in which the target-named-entity phrases and the non-target-named-entity phrases carry different labels.
The phrases in the training samples may be extracted from an original corpus, which may come from the same application scenario as the phrases to be recognized. For example, when the invention is used to recognize quantity phrases in user search queries on an e-commerce website, the original corpus may consist of product titles from the website, user search logs, and the like.
After the original corpus is obtained, it may be cleaned: for example, removing punctuation marks, spaces, interjections and low-frequency words, and unifying letter case and traditional/simplified characters. Further processing may be applied as the application requires, for example removing words that no user has ever searched for. Phrases are then obtained by word segmentation and placed into the set of training samples.
The non-target-named-entity phrases in the training samples may be extracted from the original corpus or generated randomly.
To further improve the coverage of negative samples, training samples can be obtained as follows: first, label the target named entities in the training data; then, replace some characters of a target named entity in the training data with other characters to obtain a non-target named entity, and label it; finally, take the labeled target named entities and non-target named entities as the training samples. For example, a negative sample like "north ml" may be generated from "5 ml", or "1 center control" from "1 kilogram", by replacing one or two characters with unrelated characters. This improves negative-sample coverage and makes the trained named entity recognition model more accurate.
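A sketch of this negative-sample generation under stated assumptions (which characters to replace, and how many, are chosen at random here; the patent only requires that part of the characters be replaced):

import random

def corrupt_entity(entity_tokens, corpus_chars, n_replace=1):
    """Create a non-target (negative) sample by replacing randomly
    chosen characters of a labeled target entity with random
    characters drawn from the corpus."""
    corrupted = list(entity_tokens)
    for i in random.sample(range(len(corrupted)), k=n_replace):
        corrupted[i] = random.choice(corpus_chars)
    return corrupted

# The original entity is labeled 1; each corrupted copy is labeled 0.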
Step S204: segment the phrases in the training samples character by character.
The segmentation method may refer to step S102.
Step S206: extract n-gram features from the phrases of the training samples.
The n-gram feature extraction method may refer to step S104.
Step S208: determine the vector features of the n-gram features extracted from the phrases of the training samples, from those extracted n-gram features.
The method of generating the vector features of the n-gram features may refer to step S106.
Similar to the recognition process, the training data generation method further includes step S210:
Step S210: numerically encode the n-gram features extracted from the phrases of the training samples.
The numerical encoding method may refer to step S108.
In addition, the method may optionally further include step S211; the method including step S211 is shown in FIG. 2B, and the method without it in FIG. 2A.
Step S211: determine the sentence vector features of the phrases in the training samples from the word vectors of the individual characters in those phrases.
The sentence vector feature generation method may refer to step S109.
Step S212: when step S211 is not included, use the n-gram features extracted from the phrases of the training samples and the vector features of those n-gram features as the training data of the named entity recognition model, as shown in FIG. 2A. When step S211 is included, the n-gram features, their vector features, and the sentence vector features of the phrases in the training samples may together be used as the training data, as shown in FIG. 2B.
By generating the training data in the same way as the input data generated when recognizing a phrase, the resulting model matches the recognition process more closely, which improves recognition accuracy.
After the training data is obtained, the named entity recognition model can be trained. For example, the training data may be used to train the multiple trees of a boosted tree model, obtaining the split nodes of the trees and the weights of their leaf nodes, and thereby the named entity recognition model.
An application example is described below, taking "about 500ml" as an example, to show how the data input at recognition time (or, equivalently, the training data at training time) is generated. In this example, "about" and "digit" each stand for a single token after segmentation.
1. Segment "about 500ml" character by character, keeping the run of consecutive digits as one independent word, to obtain the segmentation result "about/500/m/l"; then replace "500" with the uniform token "digit" to obtain "about/digit/m/l".
2. Extract the n-gram features of "about 500ml" with the character-count threshold range 2-3, obtaining [about/digit, digit/m, m/l, about/digit/m, digit/m/l].
3. Obtain the word vectors of "about", "digit", "m" and "l".
According to the word2vec training result, the angles between the word vectors of the individual digits are small, i.e., the digits' vectors are similar. Therefore, all digits can be mapped to the same word vector: "digit" corresponds to one unique word vector, which may be the mean of the word vectors of the digits 0 through 9, or may be determined by other methods.
If "digit" is not used to replace consecutive digits, the mean of the word vectors of the individual digits in the run may instead be used as the word vector of the corresponding independent word; for example, the mean of the word vectors of "5", "0" and "0" may be used as the word vector of "500".
4. Encode the n-gram features of "about 500ml".
For example, one-hot encoding may convert [about/digit, digit/m, m/l, about/digit/m, digit/m/l] into the form [10000100, 10000100, 01001000, 00100010, 01000010] according to a preset rule. It should be clear to those skilled in the art that these encoding results are only a schematic illustration of the sparse form produced by one-hot encoding of n-gram features; in a specific application, the encoding result is obtained according to a preset correspondence rule.
5. Obtain the vector features of the n-gram features of "about 500ml".
Suppose the word vectors of "about", "digit", "m" and "l" are [1,1,1], [2,2,2], [3,3,3] and [4,4,4] respectively. Taking the mean of the word vectors of the characters in each n-gram feature as its vector feature, the vector features of [about/digit, digit/m, m/l, about/digit/m, digit/m/l] are {[1.5,1.5,1.5], [2.5,2.5,2.5], [3.5,3.5,3.5], [2,2,2], [3,3,3]}.
6. Generate the sentence vector feature of "about 500ml".
From the word vectors in step 5, the sentence vector feature of "about 500ml" is the mean of the four character vectors, i.e., [2.5,2.5,2.5].
7. Input the n-gram features of "about 500ml", the vector features of those n-gram features, and the sentence vector feature into the named entity recognition model, and determine whether the phrase to be recognized is the target named entity according to the model's output.
The three types of features of "about 500ml" may be input directly as one flat list, for example {10000100, 10000100, 01001000, 00100010, 01000010, [1.5,1.5,1.5], [2.5,2.5,2.5], [3.5,3.5,3.5], [2,2,2], [3,3,3], [2.5,2.5,2.5]}; alternatively, each of the three feature types may be grouped as one independent feature and input into the model, for example {[10000100, 10000100, 01001000, 00100010, 01000010], [[1.5,1.5,1.5], [2.5,2.5,2.5], [3.5,3.5,3.5], [2,2,2], [3,3,3]], [2.5,2.5,2.5]}.
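Pulling the sketches above together, one hypothetical end-to-end feature assembly for a single phrase might look as follows (w2v, vocab and bst are the assumed lookup table, feature vocabulary and trained model from the earlier sketches; the flat-list input style, the first option above, is used):

import numpy as np

tokens = segment_phrase("约500ml")        # ['约', 'digit', 'm', 'l']
grams = [tokens[i:i + n]                  # character-count threshold 2-3
         for n in (2, 3) for i in range(len(tokens) - n + 1)]
literal = one_hot_encode(["".join(g) for g in grams], vocab)
vec_feats = [mean_vector([w2v[t] for t in g]) for g in grams]
sent_feat = sentence_vector(tokens, w2v)
row = np.concatenate([np.ravel(literal), np.ravel(vec_feats), sent_feat])
print(is_target_entity(bst, row))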
A text data processing apparatus according to an embodiment of the present invention is described below with reference to fig. 3.
FIG. 3 is a block diagram of an embodiment of a text data processing apparatus according to the present invention. As shown in FIG. 3, the text data processing apparatus of this embodiment includes: the phrase segmentation module 31, configured to segment the phrase to be recognized in the text character by character; the n-gram feature extraction module 32, configured to extract n-gram features from the phrase to be recognized; the vector feature generation module 33, configured to determine the vector features of the n-gram features from the extracted n-gram features; the encoding module 34, configured to numerically encode the extracted n-gram features; the data to be tested input module 35, configured to input the encoded n-gram features and the vector features of the n-gram features into the named entity recognition model; and the named entity recognition module 36, configured to determine whether the phrase to be recognized is the target named entity according to the output of the named entity recognition model.
The vector feature generation module 33 may be further configured to determine, in n-gram features other than the uni-gram, vector features of the n-gram features according to word vectors of respective words in the n-gram features.
The phrase segmentation module 31 is further configured to segment each run of consecutive digits into a single independent word.
The target named entity may be a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, or a phrase representing a place.
The named entity recognition model may be a boosted tree model, a convolutional neural network model, or a recurrent neural network model.
A text data processing apparatus according to another embodiment of the present invention is described below with reference to fig. 4.
Fig. 4 is a block diagram of another embodiment of a text data processing apparatus of the present invention. As shown in fig. 4, the text data processing apparatus of this embodiment further includes: a sentence vector feature determining module 47, configured to determine a sentence vector feature of the phrase to be recognized according to a word vector of each word in the phrase to be recognized; the data to be tested input module 35 is used for inputting the extracted n-gram features, the vector features of the n-gram features and the sentence vector features into the named entity recognition model.
Furthermore, the apparatus may further comprise a training data generation module 48 for generating training data for training the named entity recognition model. Wherein the training data comprises n-gram features of target named entity phrases and non-target named entities in the training sample and vector features of the n-gram features extracted from the phrases of the training sample, or the training data comprises n-gram features of the target named entity phrases and non-target named entities in the training sample, vector features of the n-gram features extracted from the phrases of the training sample and sentence vector features of the phrases in the training sample.
A text data processing apparatus according to still another embodiment of the present invention is described below with reference to fig. 5.
FIG. 5 is a block diagram of another embodiment of a text data processing apparatus according to the present invention. As shown in fig. 5, the text data processing apparatus of this embodiment further includes: a target named entity tagging module 51, configured to tag a target named entity in the training data; a non-target named entity obtaining module 52, configured to replace part of words in the target named entity in the training data with other words to obtain a non-target named entity and perform labeling; a training sample obtaining module 53, configured to use the labeled target named entity and non-target named entity as training samples.
Further, the apparatus may further include: a corpus obtaining module 54, configured to obtain a word vector corpus including a target named entity; a corpus segmentation module 55, configured to segment the word vector training corpus according to characters; the segmentation corpus input module 56 is configured to input the word vector training corpus segmented according to words into a word2vec algorithm for training; a word vector obtaining module 57, configured to obtain a word vector of each word output by the word2vec algorithm.
Fig. 6 is a block diagram of still another embodiment of a text data processing apparatus of the present invention. As shown in fig. 6, the apparatus 600 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute the text data processing method in any one of the foregoing embodiments based on instructions stored in the memory 610.
Memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
Fig. 7 is a block diagram of still another embodiment of a text data processing apparatus of the present invention. As shown in fig. 7, the apparatus 600 of this embodiment, in addition to the memory 610 and the processor 620, may further include an input/output interface 730, a network interface 740, a storage interface 750, and the like. These interfaces 730, 740, 750, as well as the memory 610 and the processor 620, may be connected, for example, by a bus 760. The input/output interface 730 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 740 provides a connection interface for various networking devices. The storage interface 750 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (16)

1. A text data processing method, comprising:
segmenting phrases to be identified in the text according to characters;
extracting n-gram features from the phrase to be recognized;
determining vector features of the n-gram features according to the extracted n-gram features, comprising: for n-gram features other than uni-gram features, determining the vector features of the n-gram features according to the word vectors of the individual characters in the n-gram features;
carrying out numerical encoding on the extracted n-gram features;
determining sentence vector characteristics of the phrase to be recognized according to the word vectors of all the words in the phrase to be recognized;
inputting the coded n-gram features, the vector features of the n-gram features and the sentence vector features into a named entity recognition model;
and determining whether the phrase to be recognized is a target named entity according to the output result of the named entity recognition model, wherein the target named entity is a named entity of a number word type.
2. The method of claim 1, wherein determining vector features of n-gram features from word vectors of respective words of the n-gram features comprises: and taking the mean value or the weighted average value of the word vectors of all the words in the extracted n-gram characteristics as the vector characteristics of the n-gram characteristics.
3. The method of claim 1, further comprising:
training the named entity recognition model through training data;
wherein the training data comprises n-gram features of target named entity phrases and non-target named entities in a training sample and vector features of the n-gram features extracted from the phrases of the training sample, or the training data comprises n-gram features of target named entity phrases and non-target named entities in a training sample, vector features of the n-gram features extracted from the phrases of the training sample and sentence vector features of the phrases in the training sample.
4. The method of claim 3, wherein the training samples are obtained by:
labeling target named entities in the training data;
replacing part of characters in the target named entity in the training data with other characters to obtain a non-target named entity and marking the non-target named entity;
and taking the marked target named entities and non-target named entities as training samples.
5. The method of claim 1, wherein the word vector of the word is obtained by:
obtaining a word vector training corpus comprising a target named entity;
segmenting the word vector training corpus according to characters;
inputting a word vector training corpus segmented according to words into a word2vec algorithm for training;
a word vector for each word output by the word2vec algorithm is obtained.
6. The method according to any of claims 1-4, wherein phrases are segmented character by character using the following method:
each run of consecutive digits is segmented as a single independent word.
7. The method according to any one of claims 1 to 4,
the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing a name of an organization, or a phrase representing a place, and/or,
the named entity recognition model is a boosted tree model, a convolutional neural network model or a recurrent neural network model.
8. A text data processing apparatus, characterized by comprising:
the phrase segmentation module is used for segmenting phrases to be identified in the text according to characters;
the n-gram feature extraction module is used for extracting n-gram features from the phrases to be recognized;
the vector feature generation module is used for determining the vector features of the n-gram features according to the extracted n-gram features, including: for n-gram features other than uni-gram features, determining the vector features of the n-gram features according to the word vectors of the individual characters in the n-gram features;
the encoding module is used for carrying out numerical encoding on the extracted n-gram features;
a sentence vector feature determination module, configured to determine a sentence vector feature of the phrase to be recognized according to a word vector of each word in the phrase to be recognized;
the data to be tested input module is used for inputting the extracted n-gram features, the vector features of the n-gram features and the sentence vector features into the named entity recognition model;
and the named entity recognition module is used for determining whether the phrase to be recognized is a target named entity according to the output result of the named entity recognition model, wherein the target named entity is a named entity of a number word type.
9. The apparatus of claim 8, wherein determining vector features of n-gram features from word vectors of respective words of the n-gram features comprises: and taking the mean value or the weighted average value of the word vectors of all the words in the extracted n-gram characteristics as the vector characteristics of the n-gram characteristics.
10. The apparatus of claim 8, further comprising:
the training data generation module is used for generating training data so as to train the named entity recognition model;
wherein the generated training data comprises n-gram features of target named entity phrases and non-target named entities in a training sample and vector features of the n-gram features extracted from the phrases of the training sample, or the training data comprises n-gram features of the target named entity phrases and non-target named entities in the training sample, vector features of the n-gram features extracted from the phrases of the training sample and sentence vector features of the phrases in the training sample.
11. The apparatus of claim 10, further comprising:
the target named entity marking module is used for marking the target named entities in the training data;
the non-target named entity acquiring module is used for replacing part of words in the target named entity in the training data with other words to obtain a non-target named entity and marking the non-target named entity;
and the training sample obtaining module is used for taking the marked target named entities and the marked non-target named entities as training samples.
12. The apparatus of claim 8, further comprising:
the training corpus obtaining module is used for obtaining a word vector training corpus containing a target named entity;
the corpus segmentation module is used for segmenting the word vector training corpus according to characters;
the segmentation corpus input module is used for inputting a word vector training corpus segmented according to words into a word2vec algorithm for training;
and the word vector obtaining module is used for obtaining the word vector of each word output by the word2vec algorithm.
13. The apparatus according to any one of claims 8-11,
the phrase segmenting module is further configured to segment consecutive digits into an individual word.
14. The apparatus according to any one of claims 8-11,
the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing a name of an organization, or a phrase representing a place, and/or,
the named entity recognition model is a boosted tree model, a convolutional neural network model or a recurrent neural network model.
15. A text data processing apparatus, characterized by comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the text data processing method of any one of claims 1-7 based on instructions stored in the memory.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a text data processing method according to any one of claims 1 to 7.
CN201611180235.9A 2016-12-20 2016-12-20 Text data processing method and device Active CN108205524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611180235.9A CN108205524B (en) 2016-12-20 2016-12-20 Text data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611180235.9A CN108205524B (en) 2016-12-20 2016-12-20 Text data processing method and device

Publications (2)

Publication Number Publication Date
CN108205524A CN108205524A (en) 2018-06-26
CN108205524B true CN108205524B (en) 2022-01-07

Family

ID=62601904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611180235.9A Active CN108205524B (en) 2016-12-20 2016-12-20 Text data processing method and device

Country Status (1)

Country Link
CN (1) CN108205524B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019816B (en) * 2018-08-01 2022-11-25 云知声(上海)智能科技有限公司 Rule extraction method and system in text audit
CN111414757B (en) * 2019-01-04 2023-06-20 阿里巴巴集团控股有限公司 Text recognition method and device
CN110991182B (en) * 2019-12-03 2024-01-19 东软集团股份有限公司 Word segmentation method and device for professional field, storage medium and electronic equipment
CN111400429B (en) * 2020-03-09 2023-06-30 北京奇艺世纪科技有限公司 Text entry searching method, device, system and storage medium
CN113609860B (en) * 2021-08-05 2023-09-19 湖南特能博世科技有限公司 Text segmentation method and device and computer equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877477B1 (en) * 2007-06-28 2009-01-07 주식회사 케이티 Apparatus and method for recognizing the named entity using backoff n-gram features
CN101620669A (en) * 2008-07-01 2010-01-06 邹采荣 Method for synchronously recognizing identities and expressions of human faces
CN103617239A (en) * 2013-11-26 2014-03-05 百度在线网络技术(北京)有限公司 Method and device for identifying named entity and method and device for establishing classification model
CN104298651A (en) * 2014-09-09 2015-01-21 大连理工大学 Biomedicine named entity recognition and protein interactive relationship extracting on-line system based on deep learning
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4791984B2 (en) * 2007-02-27 2011-10-12 株式会社東芝 Apparatus, method and program for processing input voice
CN104199972B (en) * 2013-09-22 2018-08-03 中科嘉速(北京)信息技术有限公司 A kind of name entity relation extraction and construction method based on deep learning
CN104572616B (en) * 2014-12-23 2018-04-24 北京锐安科技有限公司 The definite method and apparatus of Text Orientation
CN105260361B (en) * 2015-10-28 2019-07-19 南京邮电大学 A kind of the trigger word labeling system and method for biomedicine event

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877477B1 (en) * 2007-06-28 2009-01-07 주식회사 케이티 Apparatus and method for recognizing the named entity using backoff n-gram features
CN101620669A (en) * 2008-07-01 2010-01-06 邹采荣 Method for synchronously recognizing identities and expressions of human faces
CN103617239A (en) * 2013-11-26 2014-03-05 百度在线网络技术(北京)有限公司 Method and device for identifying named entity and method and device for establishing classification model
CN104298651A (en) * 2014-09-09 2015-01-21 大连理工大学 Biomedicine named entity recognition and protein interactive relationship extracting on-line system based on deep learning
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bag of Tricks for Efficient Text Classification;Armand Joulin等;《https://arxiv.org/pdf/1607.01759.pdf》;20160809;1-5 *
Enhancing Named Entity Recognition in Twitter Messages Using Entity Linking;Ikuya Yamada等;《Workshop on Noisy User-generated Text》;20150731;136-140 *
Research on New Word Recognition in the Public Health Field Based on a Hybrid Strategy (基于混合策略的公众健康领域新词识别方法研究); Hou Li et al.; Library and Information Service (图书情报工作); 20151205; vol. 59, no. 23; 115-123 *

Also Published As

Publication number Publication date
CN108205524A (en) 2018-06-26

Similar Documents

Publication Publication Date Title
CN109493977B (en) Text data processing method and device, electronic equipment and computer readable medium
CN108205524B (en) Text data processing method and device
CA2969593C (en) Method for text recognition and computer program product
WO2022083094A1 (en) Text semantic recognition method and apparatus, electronic device, and storage medium
CN111858843B (en) Text classification method and device
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN109858010A (en) Field new word identification method, device, computer equipment and storage medium
CN110245348A (en) A kind of intension recognizing method and system
CN113064964A (en) Text classification method, model training method, device, equipment and storage medium
CN110990532A (en) Method and device for processing text
CN114298035A (en) Text recognition desensitization method and system thereof
CN111666766A (en) Data processing method, device and equipment
Moeng et al. Canonical and surface morphological segmentation for nguni languages
CN113821605A (en) Event extraction method
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN110888983A (en) Positive and negative emotion analysis method, terminal device and storage medium
CN111368066A (en) Method, device and computer readable storage medium for acquiring dialogue abstract
CN112560504A (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN113221553A (en) Text processing method, device and equipment and readable storage medium
US8666987B2 (en) Apparatus and method for processing documents to extract expressions and descriptions
CN112307183A (en) Search data identification method and device, electronic equipment and computer storage medium
CN114416923A (en) News entity linking method and system based on rich text characteristics
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code: HK; legal event code: DE; document number: 1256873
GR01 Patent grant