
CN108205524B - Text data processing method and device - Google Patents

Text data processing method and device

Info

Publication number
CN108205524B
CN108205524B (application CN201611180235.9A; published as CN108205524A)
Authority
CN
China
Prior art keywords
features
gram
named entity
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611180235.9A
Other languages
Chinese (zh)
Other versions
CN108205524A (en)
Inventor
高维国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201611180235.9A priority Critical patent/CN108205524B/en
Publication of CN108205524A publication Critical patent/CN108205524A/en
Application granted granted Critical
Publication of CN108205524B publication Critical patent/CN108205524B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text data processing method and device, and relates to the field of data processing. The text data processing method comprises the following steps: segmenting a phrase to be recognized in the text character by character; extracting n-gram features from the phrase to be recognized; determining vector features of the n-gram features from the extracted n-gram features; inputting the extracted n-gram features and the vector features of the n-gram features into a named entity recognition model; and determining whether the phrase to be recognized is a target named entity according to the output of the named entity recognition model. By extracting the n-gram features, and the vector features of those n-gram features, from the character-segmented phrase, the method and the device capture the correlation between adjacent characters in the phrase; the n-gram features and their vector features embody both the literal features and the generalization features of the phrase, so the accuracy of named entity recognition can be improved.

Description

Text data processing method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for processing text data.
Background
A named entity is an entity identified by a name, such as a person's name, an organization name, or a place name. By recognizing named entities, a system can infer the user's search intent, extract attributes of a text, and so on, thereby improving search efficiency and enabling accurate content push to users.
At present, named entity recognition technology mainly recognizes the fixed part and the variable part of a phrase by rule-based methods using preset rules. For example, for named entities of the quantity type, the existing recognition method separates the number from the quantifier: numbers are recognized by matching them in the text, including cardinal numbers (e.g., 100,000), ordinals (e.g., 1st), plain numbers (1.5), percentages (50%), etc., while quantifiers are recognized against a built vocabulary such as kg, g, p, etc.
Because of uncertainty in user input and text content, the formats of named entities are diverse. For example, "500ml" may also be written as "500ML", "five hundred ml", "about 500ml", "around 500ml" or "500ml+", and "2 pm" may also be written as "14:00", "2:00PM" or "two in the afternoon", and so on. Due to these diverse forms, the accuracy of the rule-based recognition method is low.
Disclosure of Invention
One technical problem to be solved by the embodiments of the invention is how to improve the accuracy of named entity recognition.
According to an aspect of the embodiments of the present invention, there is provided a text data processing method, including: segmenting a phrase to be recognized in a text character by character; extracting n-gram features from the phrase to be recognized; determining vector features of the n-gram features from the extracted n-gram features; numerically encoding the extracted n-gram features; inputting the encoded n-gram features and the vector features of the n-gram features into a named entity recognition model; and determining whether the phrase to be recognized is a target named entity according to the output of the named entity recognition model.
In one embodiment, the method further comprises: determining a sentence vector feature of the phrase to be recognized from the word vectors of the individual characters in the phrase to be recognized; inputting the extracted n-gram features and the vector features of the n-gram features into the named entity recognition model then comprises: inputting the extracted n-gram features, the vector features of the n-gram features, and the sentence vector feature into the named entity recognition model.
In one embodiment, the method further comprises: training the named entity recognition model with training data; wherein the training data comprises the n-gram features of the target-named-entity phrases and non-target-named-entity phrases in the training samples and the vector features of the n-gram features extracted from those phrases, or additionally comprises the sentence vector features of the phrases in the training samples.
In one embodiment, obtaining training samples comprises: labeling the target named entities in the training data; replacing some characters of a target named entity in the training data with other characters to obtain a non-target named entity, and labeling it; and taking the labeled target named entities and non-target named entities as the training samples.
In one embodiment, determining the vector features of the n-gram features from the extracted n-gram features comprises: for n-gram features other than uni-grams, determining the vector features of the n-gram features from the word vectors of the individual characters in the n-gram features.
In one embodiment, the word vector of a character is obtained as follows: obtaining a word-vector training corpus that includes the target named entities; segmenting the corpus character by character; inputting the character-segmented corpus into the word2vec algorithm for training; and obtaining the word vector of each character output by the word2vec algorithm.
In one embodiment, phrases are segmented character by character as follows: each run of consecutive digits is segmented as a single independent word.
In one embodiment, the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, or a phrase representing a place, and/or the named entity recognition model is a boosted tree model, a convolutional neural network model, or a recurrent neural network model.
According to a second aspect of the embodiments of the present invention, there is provided a text data processing apparatus, including: a phrase segmentation module for segmenting a phrase to be recognized in a text character by character; an n-gram feature extraction module for extracting n-gram features from the phrase to be recognized; a vector feature generation module for determining vector features of the n-gram features from the extracted n-gram features; an encoding module for numerically encoding the extracted n-gram features; a data to be tested input module for inputting the encoded n-gram features and the vector features of the n-gram features into a named entity recognition model; and a named entity recognition module for determining whether the phrase to be recognized is a target named entity according to the output of the named entity recognition model.
In one embodiment, the apparatus further comprises: a sentence vector feature determination module for determining a sentence vector feature of the phrase to be recognized from the word vectors of the individual characters in the phrase to be recognized; the data to be tested input module is then used to input the extracted n-gram features, the vector features of the n-gram features, and the sentence vector feature into the named entity recognition model.
In one embodiment, the apparatus further comprises a training data generation module for generating training data to train the named entity recognition model; the generated training data comprises the n-gram features of the target-named-entity phrases and non-target-named-entity phrases in the training samples and the vector features of the n-gram features extracted from those phrases, or additionally comprises the sentence vector features of the phrases in the training samples.
In one embodiment, the apparatus further comprises: a target named entity labeling module for labeling the target named entities in the training data; a non-target named entity obtaining module for replacing some characters of a target named entity in the training data with other characters to obtain a non-target named entity and labeling it; and a training sample obtaining module for taking the labeled target named entities and non-target named entities as the training samples.
In one embodiment, the vector feature generation module is further configured to determine, for n-gram features other than uni-grams, the vector features of the n-gram features from the word vectors of the individual characters in the n-gram features.
In one embodiment, the apparatus further comprises: a training corpus obtaining module for obtaining a word-vector training corpus containing the target named entities; a corpus segmentation module for segmenting the word-vector training corpus character by character; a segmented corpus input module for inputting the character-segmented corpus into the word2vec algorithm for training; and a word vector obtaining module for obtaining the word vector of each character output by the word2vec algorithm.
In one embodiment, the phrase segmentation module is further configured to segment each run of consecutive digits into a single independent word.
In one embodiment, the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, or a phrase representing a place, and/or the named entity recognition model is a boosted tree model, a convolutional neural network model, or a recurrent neural network model.
According to a third aspect of the embodiments of the present invention, there is provided a text data processing apparatus, comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the foregoing text data processing methods based on instructions stored in the memory.
By extracting the n-gram features, and the vector features of those n-gram features, from the phrase to be recognized after character-by-character segmentation, the method and the device capture the correlation between adjacent characters in the phrase; the n-gram features and their vector features embody both the literal features and the generalization features of the phrase, so the accuracy of named entity recognition can be improved.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIGS. 1A and 1B are flow diagrams of different embodiments of a text data processing method for identifying a named entity according to the present invention.
Fig. 2A and 2B are flowcharts of different embodiments of a training data generation method of a named entity recognition model according to the present invention.
FIG. 3 is a block diagram of an embodiment of a text data processing apparatus according to the present invention.
Fig. 4 is a block diagram of another embodiment of a text data processing apparatus of the present invention.
FIG. 5 is a block diagram of another embodiment of a text data processing apparatus according to the present invention.
Fig. 6 is a block diagram of still another embodiment of a text data processing apparatus of the present invention.
Fig. 7 is a block diagram of still another embodiment of a text data processing apparatus of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The identification of named entities can be used in a variety of application scenarios.
For example, in the field of search technology, taking e-commerce as an example, an e-commerce website may recognize quantity phrases in product information to learn the size, model, capacity, etc. of a product; when a user searches on the site, it may likewise recognize quantity phrases in the query so as to rank products matching that size, model or capacity first in the results.
As another example, the publication time of a news article on a website is usually standardized and shown in a uniform format at a uniform position near the title, so it is easy to identify. However, the article body may mention the occurrence times of several sub-events contained in the news, and the writing format of those times is not strictly unified; named entity recognition can therefore be used to obtain the occurrence times of the sub-events, which helps organize the news and associate related news content.
As another example, with the growth of the travel industry and the internet, many people write travel notes after a trip and publish them online for others who want to travel to the same destination. A single travel note may contain many place names and organization names at the destination. Using named entity recognition, a website can capture the place names, organization names, times, prices and other information contained in the travel notes, reconstruct the traveler's itinerary, and make it convenient for other users to draw on.
Each of these scenarios is only an example of applying the named entity recognition of the present invention; a person skilled in the art may apply the text data processing method of the present invention to other scenarios as needed, which are not described again here.
FIGS. 1A and 1B are flow diagrams of different embodiments of a text data processing method for recognizing a named entity according to the present invention. As shown in FIGS. 1A and 1B, the method of these embodiments includes:
Step S102: segment the phrase to be recognized in the text character by character.
The text may be, for example, textual content information obtained from a website, database, or the like.
The phrase to be recognized is a phrase that is to be classified as either a target named entity or a non-target named entity. The type of the target named entity can be set as needed: a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, a phrase representing a place, and so on.
For example, the phrase to be recognized "下午2点" ("2 pm") may be segmented character by character into "下/午/2/点".
Some named entities contain a large amount of numerical information, such as phrases representing quantities or times. If, during recognition, the numerical value expressed by the named entity does not matter but only the presence of a number at a specific position does, each run of consecutive digits can be segmented as one independent word while all other characters are segmented individually. For example, the phrase to be recognized "大约500ML" ("about 500ML") may be segmented into "大/约/500/M/L", treating 500 as one independent word. This reduces the number of features of the phrase to be recognized and improves recognition efficiency without affecting accuracy.
In addition, the positions of the numbers after segmentation can be replaced by a uniform token: for example, the segmentation results of "大约500ML" and "1.3M左右" ("around 1.3M") can be "大/约/digit/M/L" and "digit/M/左/右". The features of phrases to be recognized then take a more regular form, reducing recognition complexity.
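As an illustration of this segmentation rule, the following minimal Python sketch splits a phrase character by character while keeping digit runs whole and optionally replacing them with a uniform token (the function name and the "digit" placeholder are illustrative assumptions, not from the patent):

import re

def segment_phrase(phrase, normalize_digits=True):
    """Split a phrase into single characters, keeping each run of
    consecutive digits (e.g. "500" or "1.3") as one independent word."""
    tokens = []
    for piece in re.findall(r"\d+(?:\.\d+)?|.", phrase):
        if piece[0].isdigit():
            # replace every digit run with a uniform placeholder token
            tokens.append("digit" if normalize_digits else piece)
        else:
            tokens.append(piece)
    return tokens

print(segment_phrase("大约500ML"))  # ['大', '约', 'digit', 'M', 'L']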
Step S104: extract n-gram features from the phrase to be recognized.
An n-gram is an n-order grammar model built on the assumption that the occurrence of the nth word depends only on the preceding n-1 words. The n-gram model can therefore reflect the contextual relationship between adjacent words.
An n-gram feature of the phrase to be recognized is a combination of consecutive characters in the phrase, with n denoting the number of characters in the feature. For example, for "大/约/两/点" ("about two o'clock"), the 1-gram features are 大, 约, 两 and 点, and the 2-gram features are 大约, 约两 and 两点. Generally, n-gram features are extracted with the character count within a preset threshold range, for example 1-3.
The content of an n-gram feature is a character or character combination of a set length taken from the phrase to be recognized, so n-gram features reflect the literal features of the phrase.
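A short sketch of this extraction step, under the same assumptions as above (n ranges over a preset threshold such as 1-3):

def ngram_features(tokens, n_min=1, n_max=3):
    """Collect every combination of consecutive tokens whose length
    lies within the preset threshold range [n_min, n_max]."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append("".join(tokens[i:i + n]))
    return feats

print(ngram_features(["大", "约", "两", "点"], 1, 2))
# ['大', '约', '两', '点', '大约', '约两', '两点']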
Step S106: determine the vector features of the n-gram features from the extracted n-gram features.
The vector features of the n-gram features are the n-gram features expressed in vector form. One way to obtain them is to compute the vector feature of each extracted n-gram feature from the word vectors of the individual characters in it; in one embodiment, the mean or a weighted average of those word vectors is used as the vector feature of the n-gram feature. The word vector of each character in an n-gram feature thus lets the vector feature reflect the generalization features of the phrase to be recognized.
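A minimal sketch of this computation, assuming a hypothetical character-to-vector lookup table w2v (the unweighted mean is just one of the two options named above):

import numpy as np

def mean_vector(char_vectors):
    """Vector feature of an n-gram feature: the mean of the word
    vectors of its characters (a weighted average also works)."""
    return np.mean(np.stack(char_vectors), axis=0)

# e.g. the vector feature of the 2-gram feature 约两:
# vec = mean_vector([w2v["约"], w2v["两"]])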
A word vector is the vector corresponding to a character, and it may be pre-computed. For example, the word vector of a character may be obtained as follows: first, obtain a word-vector training corpus containing the target named entities; then, segment the corpus character by character; finally, input the character-segmented corpus into the word2vec algorithm for training, and obtain the word vector of each character from the algorithm's output.
The word2vec algorithm is an efficient tool that represents words as real-valued vectors; drawing on ideas from deep learning, it reduces the processing of text content, after training, to vector operations in a K-dimensional vector space. In the present invention, each token input to word2vec is defined to contain only a single character, so the vectors obtained are per-character word vectors. After word2vec training, the word vectors of characters used in similar ways have small angles between them, reflecting the similarity between characters. Other methods of obtaining word vectors may be adopted as needed and are not described again here.
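A sketch of the character-vector training described above, using the gensim implementation of word2vec (the hyperparameter values are illustrative assumptions, not values from the patent):

from gensim.models import Word2Vec

# Each "sentence" of the corpus is a character-segmented phrase; here
# digit runs are replaced by the uniform "digit" token, so word2vec
# learns one vector per character (plus one for the digit token).
corpus = [
    ["大", "约", "digit", "M", "L"],
    ["大", "约", "两", "点"],
    # ... the rest of the character-segmented training corpus
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1)  # sg=1 selects the skip-gram variant
w2v = model.wv                        # lookup table: character -> vector
print(w2v["约"])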
For example, suppose the word-vector training corpus is product information from an e-commerce website containing expressions such as "brand A 500毫升 high-calcium nutritious milk" and "brand A 500ml high-calcium nutritious milk". Since "毫升" and "ml" are close in meaning, training yields a small angle between the word vectors of "毫" and "m", and likewise between the word vectors of "升" and "l". The vector features of the 2-gram features "毫升" and "ml" are therefore also close. In this way, even for phrases to be recognized that consist of entirely different characters, the vector features of the n-gram features can uncover the underlying association, improving the generalization ability of recognition.
In addition, the vector features of the n-gram features may be generated in other ways; for example, a word vector may be computed directly for the token corresponding to each n-gram feature, and so on, which is not described again here.
A vector feature need not be generated for every n-gram feature. For example, vector features may be determined only for n-gram features other than uni-grams, from the word vectors of the individual characters in those features. Because the vector features of the n-gram features mainly reflect the correlation between characters, omitting the vector features of uni-gram features reduces the number of features and improves recognition efficiency.
Step S108: numerically encode the extracted n-gram features.
Since a named entity recognition model mainly processes numeric features while n-gram features consist of characters, the extracted n-gram features are numerically encoded, i.e., converted into numbers according to a preset rule; for example, a distinct number may be assigned to each distinct n-gram feature. The named entity recognition model can then process the n-gram features.
In one embodiment, one-hot encoding may be used. One-hot encoding, also known as 1-of-N encoding, uses one state-register bit per state, with each state having its own independent bit and exactly one bit active at any time. N-gram features encoded this way are easy for the named entity recognition model to process, and one-hot-encoded data is very sparse, which improves recognition efficiency.
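A sketch of one-hot encoding over a fixed n-gram feature vocabulary (the example vocabulary and the all-zero handling of unseen features are assumptions for illustration):

def one_hot_encode(features, vocabulary):
    """Encode each n-gram feature as a one-hot vector: one register
    bit per known feature, exactly one bit active per encoded feature."""
    index = {f: i for i, f in enumerate(vocabulary)}
    codes = []
    for f in features:
        vec = [0] * len(vocabulary)
        if f in index:  # features outside the vocabulary stay all-zero
            vec[index[f]] = 1
        codes.append(vec)
    return codes

vocab = ["大约", "约两", "两点", "约digit"]
print(one_hot_encode(["约两", "两点"], vocab))  # [[0, 1, 0, 0], [0, 0, 1, 0]]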
In addition, the method may optionally further include step S109; the method including step S109 is shown in FIG. 1B, and the method without it in FIG. 1A.
Step S109: determine the sentence vector feature of the phrase to be recognized from the word vectors of the individual characters in the phrase.
For example, the mean or a weighted average of the word vectors of the characters in the phrase to be recognized may be used as its sentence vector feature, so that the features of the phrase as a whole are fully captured.
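Under the same assumptions as the mean_vector sketch above, the sentence vector feature is the same mean, taken over every character of the phrase rather than over one n-gram feature:

def sentence_vector(tokens, w2v):
    """Sentence vector feature: mean of the word vectors of all
    characters in the phrase to be recognized."""
    return mean_vector([w2v[t] for t in tokens])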
Step S110: when step S109 is not included, input the encoded n-gram features and the vector features of the n-gram features into the named entity recognition model, as shown in FIG. 1A. When step S109 is included, the n-gram features of the phrase to be recognized, the vector features of the n-gram features, and the sentence vector feature may be input into the named entity recognition model together, as shown in FIG. 1B.
The named entity recognition model may be a boosted tree model, a convolutional neural network model, a recurrent neural network model, or the like, and is trained on samples that include both target-named-entity phrases and non-target-named-entity phrases. The boosted tree model may be implemented, for example, with the open-source xgboost library.
The maximum n of the n-grams whose vector features are input into the named entity recognition model may differ from the maximum n of the raw n-gram features. For example, the 1-gram, 2-gram and 3-gram features of the phrase to be recognized may be input together with the vector features of its 2-gram, 3-gram and 4-gram features. The content of the input features can thus be defined flexibly according to the characteristics of the phrases to be recognized and the business requirements.
Step S112: determine whether the phrase to be recognized is the target named entity according to the output of the named entity recognition model.
For example, the decision may be based on the classification result output by the model, or on a numeric value it outputs. Taking a boosted tree model as an example, the sum of the prediction results output by the individual trees can be computed; if the sum falls within the preset range corresponding to the target named entity, the phrase to be recognized is the target named entity.
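A sketch of this decision step with the open-source xgboost library named above (the feature-row assembly and the 0.5 threshold are assumptions; xgboost aggregates the per-tree predictions internally):

import numpy as np
import xgboost as xgb

def is_target_entity(bst, feature_row, threshold=0.5):
    """bst: a trained xgboost Booster; feature_row: the concatenated
    encoded n-gram, vector and sentence features of one phrase."""
    dmat = xgb.DMatrix(np.asarray(feature_row, dtype=float).reshape(1, -1))
    score = bst.predict(dmat)[0]  # aggregate of all trees' outputs
    return score > threshold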
By extracting the n-gram features, and the vector features of those n-gram features, from the character-segmented phrase to be recognized, the correlation between adjacent characters in the phrase is captured; the n-gram features and their vector features embody both the literal features and the generalization features of the phrase, so the accuracy of named entity recognition can be improved.
Typically, the data input at prediction time is composed of the same features as the data input at training time. The training data used to train the named entity recognition model of the present invention may therefore also be generated by the method of the embodiments of FIGS. 1A and 1B. A training data generation method for the named entity recognition model is described below with reference to FIGS. 2A and 2B.
Fig. 2A and 2B are flowcharts of different embodiments of a training data generation method of a named entity recognition model according to the present invention. As shown in fig. 2A and 2B, the method of this embodiment includes:
Step S202: obtain training samples in which the target-named-entity phrases and the non-target-named-entity phrases carry different labels.
The phrases in the training samples may be extracted from an original corpus, which may come from the same application scenario as the phrases to be recognized. For example, when the invention is used to recognize quantity phrases in user search queries on an e-commerce website, the original corpus may consist of product titles from the website, user search logs, and the like.
After the original corpus is obtained, it may be cleaned: for example, removing punctuation marks, spaces, interjections and low-frequency words, and unifying letter case and traditional/simplified characters. Further processing may be applied as the application requires, for example removing words that no user has ever searched for. Phrases are then obtained by word segmentation and placed into the set of training samples.
The non-target-named-entity phrases in the training samples may be extracted from the original corpus or generated randomly.
To further improve the coverage of negative samples, training samples can be obtained as follows: first, label the target named entities in the training data; then, replace some characters of a target named entity in the training data with other characters to obtain a non-target named entity, and label it; finally, take the labeled target named entities and non-target named entities as the training samples. For example, a negative sample like "north ml" may be generated from "5 ml", or "1 center control" from "1 kilogram", by replacing one or two characters with unrelated characters. This improves negative-sample coverage and makes the trained named entity recognition model more accurate.
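A sketch of this negative-sample generation under stated assumptions (which characters to replace, and how many, are chosen at random here; the patent only requires that part of the characters be replaced):

import random

def corrupt_entity(entity_tokens, corpus_chars, n_replace=1):
    """Create a non-target (negative) sample by replacing randomly
    chosen characters of a labeled target entity with random
    characters drawn from the corpus."""
    corrupted = list(entity_tokens)
    for i in random.sample(range(len(corrupted)), k=n_replace):
        corrupted[i] = random.choice(corpus_chars)
    return corrupted

# The original entity is labeled 1; each corrupted copy is labeled 0.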
Step S204: segment the phrases in the training samples character by character.
The segmentation method may refer to step S102.
Step S206: extract n-gram features from the phrases of the training samples.
The n-gram feature extraction method may refer to step S104.
Step S208: determine the vector features of the n-gram features extracted from the phrases of the training samples, from those extracted n-gram features.
The method of generating the vector features of the n-gram features may refer to step S106.
Similar to the recognition process, the training data generation method further includes step S210:
Step S210: numerically encode the n-gram features extracted from the phrases of the training samples.
The numerical encoding method may refer to step S108.
In addition, the method may optionally further include step S211; the method including step S211 is shown in FIG. 2B, and the method without it in FIG. 2A.
Step S211: determine the sentence vector features of the phrases in the training samples from the word vectors of the individual characters in those phrases.
The sentence vector feature generation method may refer to step S109.
Step S212: when step S211 is not included, use the n-gram features extracted from the phrases of the training samples and the vector features of those n-gram features as the training data of the named entity recognition model, as shown in FIG. 2A. When step S211 is included, the n-gram features, their vector features, and the sentence vector features of the phrases in the training samples may together be used as the training data, as shown in FIG. 2B.
By generating the training data in the same way as the input data generated when recognizing a phrase, the resulting model matches the recognition process more closely, which improves recognition accuracy.
After the training data is obtained, the named entity recognition model can be trained. For example, the training data may be used to train the multiple trees of a boosted tree model, obtaining the split nodes of the trees and the weights of their leaf nodes, and thereby the named entity recognition model.
An application example is described below, taking "about 500ml" as an example, to show how the data input at recognition time (or, equivalently, the training data at training time) is generated. In this example, "about" and "digit" each stand for a single token after segmentation.
1. Segment "about 500ml" character by character, keeping the run of consecutive digits as one independent word, to obtain the segmentation result "about/500/m/l"; then replace "500" with the uniform token "digit" to obtain "about/digit/m/l".
2. Extract the n-gram features of "about 500ml" with the character-count threshold range 2-3, obtaining [about/digit, digit/m, m/l, about/digit/m, digit/m/l].
3. Obtain the word vectors of "about", "digit", "m" and "l".
According to the word2vec training result, the angles between the word vectors of the individual digits are small, i.e., the digits' vectors are similar. Therefore, all digits can be mapped to the same word vector: "digit" corresponds to one unique word vector, which may be the mean of the word vectors of the digits 0 through 9, or may be determined by other methods.
If "digit" is not used to replace consecutive digits, the mean of the word vectors of the individual digits in the run may instead be used as the word vector of the corresponding independent word; for example, the mean of the word vectors of "5", "0" and "0" may be used as the word vector of "500".
4. Encode the n-gram features of "about 500ml".
For example, one-hot encoding may convert [about/digit, digit/m, m/l, about/digit/m, digit/m/l] into the form [10000100, 10000100, 01001000, 00100010, 01000010] according to a preset rule. It should be clear to those skilled in the art that these encoding results are only a schematic illustration of the sparse form produced by one-hot encoding of n-gram features; in a specific application, the encoding result is obtained according to a preset correspondence rule.
5. Obtain the vector features of the n-gram features of "about 500ml".
Suppose the word vectors of "about", "digit", "m" and "l" are [1,1,1], [2,2,2], [3,3,3] and [4,4,4] respectively. Taking the mean of the word vectors of the characters in each n-gram feature as its vector feature, the vector features of [about/digit, digit/m, m/l, about/digit/m, digit/m/l] are {[1.5,1.5,1.5], [2.5,2.5,2.5], [3.5,3.5,3.5], [2,2,2], [3,3,3]}.
6. Generate the sentence vector feature of "about 500ml".
From the word vectors in step 5, the sentence vector feature of "about 500ml" is the mean of the four character vectors, i.e., [2.5,2.5,2.5].
7. Input the n-gram features of "about 500ml", the vector features of those n-gram features, and the sentence vector feature into the named entity recognition model, and determine whether the phrase to be recognized is the target named entity according to the model's output.
The three types of features of "about 500ml" may be input directly as one flat list, for example {10000100, 10000100, 01001000, 00100010, 01000010, [1.5,1.5,1.5], [2.5,2.5,2.5], [3.5,3.5,3.5], [2,2,2], [3,3,3], [2.5,2.5,2.5]}; alternatively, each of the three feature types may be grouped as one independent feature and input into the model, for example {[10000100, 10000100, 01001000, 00100010, 01000010], [[1.5,1.5,1.5], [2.5,2.5,2.5], [3.5,3.5,3.5], [2,2,2], [3,3,3]], [2.5,2.5,2.5]}.
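Pulling the sketches above together, one hypothetical end-to-end feature assembly for a single phrase might look as follows (w2v, vocab and bst are the assumed lookup table, feature vocabulary and trained model from the earlier sketches; the flat-list input style, the first option above, is used):

import numpy as np

tokens = segment_phrase("约500ml")        # ['约', 'digit', 'm', 'l']
grams = [tokens[i:i + n]                  # character-count threshold 2-3
         for n in (2, 3) for i in range(len(tokens) - n + 1)]
literal = one_hot_encode(["".join(g) for g in grams], vocab)
vec_feats = [mean_vector([w2v[t] for t in g]) for g in grams]
sent_feat = sentence_vector(tokens, w2v)
row = np.concatenate([np.ravel(literal), np.ravel(vec_feats), sent_feat])
print(is_target_entity(bst, row))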
A text data processing apparatus according to an embodiment of the present invention is described below with reference to fig. 3.
FIG. 3 is a block diagram of an embodiment of a text data processing apparatus according to the present invention. As shown in FIG. 3, the text data processing apparatus of this embodiment includes: the phrase segmentation module 31, configured to segment the phrase to be recognized in the text character by character; the n-gram feature extraction module 32, configured to extract n-gram features from the phrase to be recognized; the vector feature generation module 33, configured to determine the vector features of the n-gram features from the extracted n-gram features; the encoding module 34, configured to numerically encode the extracted n-gram features; the data to be tested input module 35, configured to input the encoded n-gram features and the vector features of the n-gram features into the named entity recognition model; and the named entity recognition module 36, configured to determine whether the phrase to be recognized is the target named entity according to the output of the named entity recognition model.
The vector feature generation module 33 may be further configured to determine, in n-gram features other than the uni-gram, vector features of the n-gram features according to word vectors of respective words in the n-gram features.
The phrase segmentation module 31 is further configured to segment each run of consecutive digits into a single independent word.
The target named entity may be a phrase representing a quantity, a phrase representing a time, a phrase representing an organization name, or a phrase representing a place.
The named entity recognition model may be a boosted tree model, a convolutional neural network model, or a recurrent neural network model.
A text data processing apparatus according to another embodiment of the present invention is described below with reference to fig. 4.
Fig. 4 is a block diagram of another embodiment of a text data processing apparatus of the present invention. As shown in fig. 4, the text data processing apparatus of this embodiment further includes: a sentence vector feature determining module 47, configured to determine a sentence vector feature of the phrase to be recognized according to a word vector of each word in the phrase to be recognized; the data to be tested input module 35 is used for inputting the extracted n-gram features, the vector features of the n-gram features and the sentence vector features into the named entity recognition model.
Furthermore, the apparatus may further comprise a training data generation module 48 for generating training data for training the named entity recognition model. Wherein the training data comprises n-gram features of target named entity phrases and non-target named entities in the training sample and vector features of the n-gram features extracted from the phrases of the training sample, or the training data comprises n-gram features of the target named entity phrases and non-target named entities in the training sample, vector features of the n-gram features extracted from the phrases of the training sample and sentence vector features of the phrases in the training sample.
A text data processing apparatus according to still another embodiment of the present invention is described below with reference to fig. 5.
FIG. 5 is a block diagram of another embodiment of a text data processing apparatus according to the present invention. As shown in fig. 5, the text data processing apparatus of this embodiment further includes: a target named entity tagging module 51, configured to tag a target named entity in the training data; a non-target named entity obtaining module 52, configured to replace part of words in the target named entity in the training data with other words to obtain a non-target named entity and perform labeling; a training sample obtaining module 53, configured to use the labeled target named entity and non-target named entity as training samples.
Further, the apparatus may further include: a corpus obtaining module 54, configured to obtain a word vector corpus including a target named entity; a corpus segmentation module 55, configured to segment the word vector training corpus according to characters; the segmentation corpus input module 56 is configured to input the word vector training corpus segmented according to words into a word2vec algorithm for training; a word vector obtaining module 57, configured to obtain a word vector of each word output by the word2vec algorithm.
Fig. 6 is a block diagram of still another embodiment of a text data processing apparatus of the present invention. As shown in fig. 6, the apparatus 600 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute the text data processing method in any one of the foregoing embodiments based on instructions stored in the memory 610.
Memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
Fig. 7 is a block diagram of still another embodiment of a text data processing apparatus of the present invention. As shown in fig. 7, the apparatus 600 of this embodiment, in addition to the memory 610 and the processor 620, may further include an input/output interface 730, a network interface 740, a storage interface 750, and the like. These interfaces 730, 740, 750, as well as the memory 610 and the processor 620, may be connected, for example, by a bus 760. The input/output interface 730 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 740 provides a connection interface for various networking devices. The storage interface 750 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (16)

1. A text data processing method, comprising:
segmenting phrases to be identified in the text according to characters;
extracting n-gram features from the phrase to be recognized;
determining vector features of the n-gram features according to the extracted n-gram features, comprising: for n-gram features other than uni-gram features, determining the vector features of the n-gram features according to the word vectors of the individual characters in the n-gram features;
carrying out numerical encoding on the extracted n-gram features;
determining sentence vector characteristics of the phrase to be recognized according to the word vectors of all the words in the phrase to be recognized;
inputting the coded n-gram features, the vector features of the n-gram features and the sentence vector features into a named entity recognition model;
and determining whether the phrase to be recognized is a target named entity according to the output result of the named entity recognition model, wherein the target named entity is a named entity of a number word type.
2. The method of claim 1, wherein determining vector features of n-gram features from word vectors of respective words of the n-gram features comprises: and taking the mean value or the weighted average value of the word vectors of all the words in the extracted n-gram characteristics as the vector characteristics of the n-gram characteristics.
3. The method of claim 1, further comprising:
training the named entity recognition model through training data;
wherein the training data comprises n-gram features of target named entity phrases and non-target named entities in a training sample and vector features of the n-gram features extracted from the phrases of the training sample, or the training data comprises n-gram features of target named entity phrases and non-target named entities in a training sample, vector features of the n-gram features extracted from the phrases of the training sample and sentence vector features of the phrases in the training sample.
4. The method of claim 3, wherein the training samples are obtained by:
labeling target named entities in the training data;
replacing part of characters in the target named entity in the training data with other characters to obtain a non-target named entity and marking the non-target named entity;
and taking the marked target named entities and non-target named entities as training samples.
5. The method of claim 1, wherein the word vector of the word is obtained by:
obtaining a word vector training corpus comprising a target named entity;
segmenting the word vector training corpus according to characters;
inputting a word vector training corpus segmented according to words into a word2vec algorithm for training;
a word vector for each word output by the word2vec algorithm is obtained.
6. The method according to any of claims 1-4, wherein phrases are segmented character by character using the following method:
each run of consecutive digits is segmented as a single independent word.
7. The method according to any one of claims 1 to 4,
the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing a name of an organization, or a phrase representing a place, and/or,
the named entity recognition model is a boosted tree model, a convolutional neural network model or a recurrent neural network model.
8. A text data processing apparatus, characterized by comprising:
the phrase segmentation module is used for segmenting phrases to be identified in the text according to characters;
the n-gram feature extraction module is used for extracting n-gram features from the phrases to be recognized;
the vector feature generation module is used for determining the vector features of the n-gram features according to the extracted n-gram features, including: for n-gram features other than uni-gram features, determining the vector features of the n-gram features according to the word vectors of the individual characters in the n-gram features;
the encoding module is used for carrying out numerical encoding on the extracted n-gram features;
a sentence vector feature determination module, configured to determine a sentence vector feature of the phrase to be recognized according to a word vector of each word in the phrase to be recognized;
the data to be tested input module is used for inputting the extracted n-gram features, the vector features of the n-gram features and the sentence vector features into the named entity recognition model;
and the named entity recognition module is used for determining whether the phrase to be recognized is a target named entity according to the output result of the named entity recognition model, wherein the target named entity is a named entity of a number word type.
9. The apparatus of claim 8, wherein determining vector features of n-gram features from word vectors of respective words of the n-gram features comprises: and taking the mean value or the weighted average value of the word vectors of all the words in the extracted n-gram characteristics as the vector characteristics of the n-gram characteristics.
10. The apparatus of claim 8, further comprising:
the training data generation module is used for generating training data so as to train the named entity recognition model;
wherein the generated training data comprises n-gram features of target named entity phrases and non-target named entities in a training sample and vector features of the n-gram features extracted from the phrases of the training sample, or the training data comprises n-gram features of the target named entity phrases and non-target named entities in the training sample, vector features of the n-gram features extracted from the phrases of the training sample and sentence vector features of the phrases in the training sample.
11. The apparatus of claim 10, further comprising:
the target named entity marking module is used for marking the target named entities in the training data;
the non-target named entity acquiring module is used for replacing part of words in the target named entity in the training data with other words to obtain a non-target named entity and marking the non-target named entity;
and the training sample obtaining module is used for taking the marked target named entities and the marked non-target named entities as training samples.
12. The apparatus of claim 8, further comprising:
the training corpus obtaining module is used for obtaining a word vector training corpus containing a target named entity;
the corpus segmentation module is used for segmenting the word vector training corpus according to characters;
the segmentation corpus input module is used for inputting a word vector training corpus segmented according to words into a word2vec algorithm for training;
and the word vector obtaining module is used for obtaining the word vector of each word output by the word2vec algorithm.
13. The apparatus according to any one of claims 8-11,
the phrase segmenting module is further configured to segment consecutive digits into an individual word.
14. The apparatus according to any one of claims 8-11,
the target named entity is a phrase representing a quantity, a phrase representing a time, a phrase representing a name of an organization, or a phrase representing a place, and/or,
the named entity recognition model is a boosted tree model, a convolutional neural network model or a recurrent neural network model.
15. A text data processing apparatus, characterized by comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the text data processing method of any one of claims 1-7 based on instructions stored in the memory.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a text data processing method according to any one of claims 1 to 7.
CN201611180235.9A 2016-12-20 2016-12-20 Text data processing method and device Active CN108205524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611180235.9A CN108205524B (en) 2016-12-20 2016-12-20 Text data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611180235.9A CN108205524B (en) 2016-12-20 2016-12-20 Text data processing method and device

Publications (2)

Publication Number Publication Date
CN108205524A CN108205524A (en) 2018-06-26
CN108205524B true CN108205524B (en) 2022-01-07

Family

ID=62601904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611180235.9A Active CN108205524B (en) 2016-12-20 2016-12-20 Text data processing method and device

Country Status (1)

Country Link
CN (1) CN108205524B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019816B (en) * 2018-08-01 2022-11-25 云知声(上海)智能科技有限公司 Rule extraction method and system in text audit
CN111414757B (en) * 2019-01-04 2023-06-20 阿里巴巴集团控股有限公司 Text recognition method and device
CN110991182B (en) * 2019-12-03 2024-01-19 东软集团股份有限公司 Word segmentation method and device for professional field, storage medium and electronic equipment
CN111400429B (en) * 2020-03-09 2023-06-30 北京奇艺世纪科技有限公司 Text entry searching method, device, system and storage medium
CN113609860B (en) * 2021-08-05 2023-09-19 湖南特能博世科技有限公司 Text segmentation method and device and computer equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877477B1 (en) * 2007-06-28 2009-01-07 주식회사 케이티 Apparatus and method for recognizing the named entity using backoff n-gram features
CN101620669A (en) * 2008-07-01 2010-01-06 邹采荣 Method for synchronously recognizing identities and expressions of human faces
CN103617239A (en) * 2013-11-26 2014-03-05 百度在线网络技术(北京)有限公司 Method and device for identifying named entity and method and device for establishing classification model
CN104298651A (en) * 2014-09-09 2015-01-21 大连理工大学 Biomedicine named entity recognition and protein interactive relationship extracting on-line system based on deep learning
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4791984B2 (en) * 2007-02-27 2011-10-12 株式会社東芝 Apparatus, method and program for processing input voice
CN104199972B (en) * 2013-09-22 2018-08-03 中科嘉速(北京)信息技术有限公司 A kind of name entity relation extraction and construction method based on deep learning
CN104572616B (en) * 2014-12-23 2018-04-24 北京锐安科技有限公司 The definite method and apparatus of Text Orientation
CN105260361B (en) * 2015-10-28 2019-07-19 南京邮电大学 A kind of the trigger word labeling system and method for biomedicine event

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100877477B1 (en) * 2007-06-28 2009-01-07 주식회사 케이티 Apparatus and method for recognizing the named entity using backoff n-gram features
CN101620669A (en) * 2008-07-01 2010-01-06 邹采荣 Method for synchronously recognizing identities and expressions of human faces
CN103617239A (en) * 2013-11-26 2014-03-05 百度在线网络技术(北京)有限公司 Method and device for identifying named entity and method and device for establishing classification model
CN104298651A (en) * 2014-09-09 2015-01-21 大连理工大学 Biomedicine named entity recognition and protein interactive relationship extracting on-line system based on deep learning
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bag of Tricks for Efficient Text Classification;Armand Joulin等;《https://arxiv.org/pdf/1607.01759.pdf》;20160809;1-5 *
Enhancing Named Entity Recognition in Twitter Messages Using Entity Linking;Ikuya Yamada等;《Workshop on Noisy User-generated Text》;20150731;136-140 *
Research on New Word Recognition in the Public Health Field Based on a Hybrid Strategy (基于混合策略的公众健康领域新词识别方法研究); Hou Li et al.; Library and Information Service (图书情报工作); 20151205; vol. 59, no. 23; 115-123 *

Also Published As

Publication number Publication date
CN108205524A (en) 2018-06-26

Similar Documents

Publication Publication Date Title
CN109493977B (en) Text data processing method and device, electronic equipment and computer readable medium
CN108205524B (en) Text data processing method and device
CA2969593C (en) Method for text recognition and computer program product
WO2022083094A1 (en) Text semantic recognition method and apparatus, electronic device, and storage medium
CN111858843B (en) Text classification method and device
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN109858010A (en) Field new word identification method, device, computer equipment and storage medium
CN110245348A (en) A kind of intension recognizing method and system
CN113064964A (en) Text classification method, model training method, device, equipment and storage medium
CN110990532A (en) Method and device for processing text
CN114298035A (en) Text recognition desensitization method and system thereof
CN111666766A (en) Data processing method, device and equipment
Moeng et al. Canonical and surface morphological segmentation for nguni languages
CN113821605A (en) Event extraction method
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN110888983A (en) Positive and negative emotion analysis method, terminal device and storage medium
CN111368066A (en) Method, device and computer readable storage medium for acquiring dialogue abstract
CN112560504A (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
CN113221553A (en) Text processing method, device and equipment and readable storage medium
US8666987B2 (en) Apparatus and method for processing documents to extract expressions and descriptions
CN112307183A (en) Search data identification method and device, electronic equipment and computer storage medium
CN114416923A (en) News entity linking method and system based on rich text characteristics
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code: HK; legal event code: DE; document number: 1256873
GR01 Patent grant