CN114691864A - Text classification model training method and device and text classification method and device
- Publication number: CN114691864A (application CN202011640609.7A)
- Authority: CN (China)
- Prior art keywords: training, initial, model, class, vector
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F16/355: Information retrieval of unstructured textual data; clustering/classification; class or cluster creation or modification
- G06F40/126: Handling natural language data; text processing; use of codes for handling textual entities; character encoding
- G06F40/289: Handling natural language data; natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
Abstract
The application provides a text classification model training method and device and a text classification method and device. The text classification model training method comprises the following steps: constructing a training sample set based on initial keywords and initial corpora; extracting a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, and m < x; training with the first training sample set to obtain a class recognition model; and verifying the class recognition model with the second training sample set, repeating the extraction, training, and verification steps until the class recognition model is determined to meet the verification condition. With this text classification model training method, the text classification model can be trained with only a small amount of accurately labeled data.
Description
Technical Field
The present application relates to the field of text classification technologies, and in particular, to a method and an apparatus for training a text classification model, a method and an apparatus for text classification, a computing device, and a computer-readable storage medium.
Background
Automatic text classification, text classification for short, refers to the automatic classification and labeling of a text set by a computer according to a given classification system or standard. A classifier learns a relation model between document features and document categories from a labeled training document set, then uses the learned relation model to judge the category of each new document.
In the prior art, the main classification models are obtained by training on document sets containing large amounts of accurately hand-labeled data.
However, producing large amounts of accurately hand-labeled data makes training expensive and inefficient. Moreover, the labels used to classify texts are continuously updated; if a model must classify according to new labels, a large amount of manually labeled data for those new labels is needed to train a new text classification model. This increases training cost, and the lack of accurately labeled data further reduces training efficiency.
Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for training a text classification model, a method and an apparatus for classifying a text, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
According to a first aspect of the embodiments of the present application, there is provided a text classification model training method, including:
S1, constructing a training sample set based on the initial keywords and the initial corpora, wherein the training sample set comprises x categories of initial corpora, and each initial corpus corresponds to an initial prediction category label;
S2, extracting a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, and the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, m < x;
S3, training with the first training sample set to obtain a class recognition model;
S4, verifying the class recognition model with the second training sample set, and repeatedly executing steps S2 to S4 until the class recognition model is determined to meet the verification condition.
Optionally, the text classification model training method further includes:
receiving a new label of a new category to be identified, acquiring a labeling corpus of the new label, inputting the labeling corpus of the new label into the category identification model, and training the identification model.
Optionally, the process of training to obtain a class recognition model by using the first training sample set includes:
inputting the initial corpus in the first training sample set into a coding layer of the category identification model to obtain a first training sample vector;
inputting the first training sample vector into a classification layer of the class identification model to obtain a first classification vector;
inputting the first classification vector into a relation construction layer of the class recognition model, obtaining a prediction class of the first classification vector, comparing the prediction class with an initial prediction class label to obtain an error, and performing iterative training on the class recognition model based on the error until a training stop condition is reached.
Optionally, the verifying the class identification model by the second training sample includes:
and inputting the second training sample into a class recognition model obtained by training the first training sample, calculating similarity data of the label obtained by the class recognition model and the sample label, and obtaining the trained class recognition model if the similarity data reaches a specified threshold value.
Optionally, the step of obtaining the markup corpus of the new tag includes:
setting a first keyword of a new label;
expanding the first keywords of the new label by using a pre-training word vector to obtain second keywords of the new label;
acquiring a new corpus by using a second keyword of the new label, and extracting the keyword of the new corpus;
and carrying out similarity calculation on the second keyword of the new label and the keyword of the new corpus to obtain the labeled corpus of the new label.
Optionally, the step of constructing a training sample set based on the initial keywords and the initial corpus includes:
setting an initial prediction type label and an initial keyword corresponding to the initial prediction type label;
expanding the initial keywords by using the pre-training word vectors;
vectorizing and representing all initial keywords and initial corpora;
processing an initial keyword vector, and processing an initial corpus based on the processing of the initial keyword vector to obtain the initial prediction category label corresponding to the initial corpus;
and forming the training sample set by the initial corpus with the initial prediction category label.
Optionally, the process of inputting the labeled corpus of the new label into the category recognition model to train the recognition model includes:
inputting the initial corpus in the labeled corpus of the new label into the coding layer of the category identification model to obtain a sample vector of the new label;
inputting the new label sample vector into a classification layer of the class identification model to obtain a new label classification vector;
inputting the new label classification vector into a relation construction layer of the class identification model, obtaining a prediction class of the new label classification vector, comparing the prediction class with an initial prediction class label to obtain an error, and performing iterative training on the class identification model based on the error until a training stop condition is reached.
Optionally, in the process of expanding the first keyword of the new tag according to the pre-training word vector, when it is detected that an expanded first keyword of the new tag corresponds to more than one category, the expanded first keyword is deleted from all of the corresponding categories.
Optionally, in the process of expanding the initial keyword according to the pre-training word vector, when it is detected that an expanded initial keyword corresponds to more than one category, the expanded initial keyword is deleted from all of the corresponding categories.
According to a second aspect of the embodiments of the present application, there is provided a text classification model training apparatus, including:
the system comprises a construction module, a prediction module and a prediction module, wherein the construction module is configured to construct a training sample set based on initial keywords and initial corpora, the training sample set comprises x types of initial corpora, and each initial corpora corresponds to an initial prediction type label;
an extraction module configured to extract a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, and the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, m < x;
a training module configured to train with the first training sample to obtain a class recognition model;
and the verification module is configured to verify the class identification model by using the second training sample, and repeatedly execute the extraction module, the training module and the verification module until the class identification model is determined to meet the verification condition.
According to a third aspect of the embodiments of the present application, there is provided a text classification method, including:
receiving a text to be classified and performing word segmentation processing to obtain a first word segmentation set;
and inputting the first word segmentation set into a text classification model to obtain the prediction category of the text to be classified, wherein the text classification model is trained according to the text classification model training method shown in FIG. 2.
Optionally, inputting the first word segmentation set into the text classification model to obtain the corresponding text category includes:
inputting the first word segmentation set into a coding layer of the category identification model to obtain a first text vector;
inputting the first text vector into a classification layer of the class identification model to obtain a first classification vector;
and inputting the first classification vector into a relation construction layer of the class identification model to obtain the prediction class of the first classification vector.
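A minimal sketch of this inference path, assuming the jieba segmenter for Chinese word segmentation and a trained classification model exposing a hypothetical predict() method (both are illustrative assumptions, not named by the application):

```python
import jieba


def classify(text, model):
    """Segment the text to be classified, then run the first word
    segmentation set through the trained text classification model."""
    tokens = jieba.lcut(text)     # first word segmentation set
    return model.predict(tokens)  # prediction category of the text
```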
According to a fourth aspect of embodiments of the present application, there is provided a text classification apparatus including:
the processing module is configured to receive a text to be classified and perform word segmentation processing to obtain a first word segmentation set;
and the input module is configured to input the first word segmentation set into a text classification model to obtain a prediction category of a text to be classified, wherein the text classification model is obtained by training according to the text classification model training method shown in fig. 2.
According to a fifth aspect of embodiments of the present application, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the text classification model training method or the text classification steps when executing the instructions.
According to a sixth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer instructions, which when executed by a processor, implement the text classification model training method or the text classification step.
According to a seventh aspect of the embodiments of the present application, there is provided a chip storing computer instructions, which when executed by the chip, implement the text classification model training method or the text classification step.
In the embodiment of the application, a training sample set is constructed based on initial keywords and initial corpora, wherein the training sample set comprises x categories of initial corpora and each initial corpus corresponds to an initial prediction category label; a first training sample set and a second training sample set are extracted from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, and m < x; a class recognition model is trained with the first training sample set; and the class recognition model is verified with the second training sample set, the above steps being repeated until the class recognition model is determined to meet the verification condition. According to the text classification model training method in the embodiment of the application, the model can be trained with only a small amount of accurately labeled data, yielding a text classification model capable of classifying texts, saving manual labeling time, and improving the efficiency of training the classification model.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flowchart of a text classification model training method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a class recognition model of a text classification model training method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a class identification model coding layer of a text classification model training method according to an embodiment of the present application;
FIG. 5 is a flowchart of a text classification model training method applied to training a news text according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a text classification model training apparatus according to an embodiment of the present application;
FIG. 7 is a flowchart of a text classification method provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of this application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if," as used herein, may be interpreted as "responsive to a determination," depending on the context.
First, the noun terms to which one or more embodiments of the present invention relate are explained.
BERT model: BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on left and right context, so that the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art (SOTA) results for a wide range of NLP tasks. The Chinese BERT pre-trained model covers simplified and traditional Chinese characters, with 12 layers, 768 hidden units, 12 attention heads, and 110M parameters.
Pre-training word vectors: the pre-training word vector is a general term of a language model and a characterization learning technology in natural language processing. Conceptually, it refers to embedding a high-dimensional space with dimensions of the number of all words into a continuous vector space with much lower dimensions, each word or phrase being mapped as a vector on the real number domain. The word vector used in the present application may be any existing Chinese word vector, which is not limited in the present application.
K-nearest neighbor algorithm: a non-parametric statistical method for classification and regression. It classifies using a vector space model; the idea is that instances of the same category are highly similar to one another, so the likely category of an instance of unknown category can be estimated by computing its similarity to instances of known categories.
Few-shot learning: after a machine learning model has learned from a large amount of data for certain categories, it needs only a small number of samples to quickly learn a new category.
In the present application, a text classification model training method and apparatus, a computing device and a computer readable storage medium are provided, which are described in detail in the following embodiments one by one.
FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps in the text classification model training method shown in fig. 2. Fig. 2 shows a flowchart of a text classification model training method according to an embodiment of the present application, including step 202 to step 208.
Step 202: and constructing a training sample set based on the initial keywords and the initial corpora, wherein the training sample set comprises x types of initial corpora, and each initial corpora corresponds to an initial prediction type label.
Training a classification text model requires first constructing a training sample set, wherein the training sample set is composed of initial corpora labeled by initial prediction class labels.
The initial keywords may be keywords of the initial category labels. The initial corpus is a pre-prepared corpus corresponding to the initial category label. The initial prediction category label is a label of a manually set prediction initial corpus category. The x classes of initial corpora may be classified into x classes according to the initial prediction class labels.
For example, in a specific embodiment of the present application, 2 labels are set manually: "entertainment" and "military". The keywords "TV play" and "movie" are set for "entertainment", and "airplane" and "tank" for "military". A manually prepared corpus entry is: "Ang Lee is the director of the movie Life of Pi". In this example, the initial prediction category labels are "entertainment" and "military"; the initial corpus is "Ang Lee is the director of the movie Life of Pi"; and the initial keywords are "TV play", "movie", "airplane" and "tank".
The step of constructing the training sample set based on the initial keywords and the initial corpus comprises the following steps:
setting an initial prediction type label and an initial keyword corresponding to the initial prediction type label;
expanding the initial keywords by using pre-training word vectors;
vectorizing and representing all initial keywords and initial corpora;
processing an initial keyword vector, and processing an initial corpus based on the processing of the initial keyword vector to obtain the initial prediction category label corresponding to the initial corpus;
and forming the training sample set by the initial corpus with the initial prediction category label.
Specifically, the initial prediction category label is a classification label set manually, and the classification label is a label used for classifying the initial corpus; after the initial category prediction category labels are manually set, the initial keywords of each classification label are manually set corresponding to each classification label.
And expanding the initial keywords set by the human. The method for expanding the keywords is to use a pre-training word vector to find the approximate words of the keywords as the expanded keywords.
The pre-training word vector refers to a word vector with strong universality pre-trained by using a large amount of corpus data, and the pre-training word vector used in the application can be any existing Chinese word vector, and the application does not limit the pre-training word vector.
In the process of expanding the initial keywords according to the pre-training word vectors, when an expanded initial keyword is detected to correspond to more than one category, it is deleted from all the corresponding categories.
The manually set initial keywords of each initial prediction category label are expanded. During expansion, if an expanded keyword is detected to appear under more than one prediction category label, the keyword is deleted from all of those labels.
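As a concrete illustration, the following sketch expands each label's seed keywords with a pre-trained word-vector model and removes any expanded keyword that lands under more than one label; the gensim-format file name and the top-n limit are illustrative assumptions, since the application allows any existing Chinese word vectors:

```python
from collections import defaultdict

from gensim.models import KeyedVectors

# Hypothetical vector file; any existing Chinese word vectors may be used.
wv = KeyedVectors.load_word2vec_format("zh_word_vectors.txt")


def expand_keywords(seed_keywords_by_label, topn=5):
    """Expand each label's seed keywords with nearest-neighbor words, then
    delete any expanded keyword that appears under more than one label."""
    expanded = {}
    for label, seeds in seed_keywords_by_label.items():
        words = set(seeds)
        for seed in seeds:
            if seed in wv:
                words.update(w for w, _ in wv.most_similar(seed, topn=topn))
        expanded[label] = words

    # Record which labels each keyword belongs to.
    owners = defaultdict(set)
    for label, words in expanded.items():
        for w in words:
            owners[w].add(label)

    # A keyword under more than one label is deleted from all of them.
    for w, labels in owners.items():
        if len(labels) > 1:
            for label in labels:
                expanded[label].discard(w)
    return expanded
```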
All keywords and the initial corpora are encoded with BERT (Bidirectional Encoder Representations from Transformers) to obtain vectorized representations.
The BERT model is a pre-training language model, and model training is carried out by utilizing large-scale unmarked corpora. And inputting a text in the trained model, and outputting a vector with text semantics by the model, namely obtaining the semantic representation of the text.
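A minimal sketch of this vectorization step, assuming the Hugging Face transformers library and the public bert-base-chinese checkpoint (the application itself only requires some pre-trained BERT model):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()


def encode(texts):
    """Encode keywords or corpus sentences into fixed-size semantic vectors
    using the final-layer [CLS] representation."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state[:, 0]  # (batch, 768) [CLS] vectors
```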
Marking the initial corpus by using an initial prediction category label, wherein the marking method comprises the following specific steps: and processing the vector of the keyword by using a K-nearest neighbor algorithm to obtain a model M for processing the corpus vector, and classifying and labeling the initial corpus vector by using the model M to obtain the labeled initial corpus.
The K-nearest neighbor algorithm determines the classification of an input sample point by finding that sample points that are close to the input sample point mostly belong to a certain class.
And constructing a training sample set by using the initial corpus labeled with the prediction category label.
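Continuing the sketch, the model M described above can be realized with a k-nearest-neighbor classifier fitted on the keyword vectors; the helper names and the neighbor count are illustrative:

```python
from sklearn.neighbors import KNeighborsClassifier


def build_labeling_model(keyword_vecs, keyword_labels, n_neighbors=5):
    """Model M: a K-nearest-neighbor classifier fitted on keyword vectors."""
    M = KNeighborsClassifier(n_neighbors=n_neighbors, metric="cosine")
    M.fit(keyword_vecs, keyword_labels)
    return M


def label_initial_corpora(M, corpora, corpus_vecs):
    """Attach an initial prediction category label to each corpus entry,
    yielding the labeled corpora that form the training sample set."""
    labels = M.predict(corpus_vecs)
    return list(zip(corpora, labels))
```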
Step 204: a first training sample set and a second training sample set are extracted from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x classes, and the second training sample set comprises a second sample data set for the same m classes that is disjoint from the first sample data set, m < x.
The first training sample set is used for training the text classification model, and the second training sample set is used for testing the accuracy of the classification text model trained by the first training sample set.
The first sample data set may be constructed from samples in a first training sample set and the second sample data set may be constructed with samples in a second training sample set.
The first training sample set is formed by randomly extracting m of the x classes and, for each extracted class, part of its corresponding corpora as samples; these samples are used to train the text classification model. The second training sample set is formed by extracting, from the same m classes, other corpora different from those in the first training sample set; these samples are used to verify the text classification model.
Wherein x is the total number of classes in the training sample set, and m is a partial class in the x classes, so m is smaller than the value of x.
For example, in an embodiment of the present application, there are 5 sample classes A1, A2, A3, A4 and A5 in the training sample set. The first training sample set extracts 3 of them, A1, A2 and A5, together with the partial corpora c1-c4 corresponding to A1, A2 and A5, as its samples, i.e. the first sample data set. The second training sample set extracts the same sample classes A1, A2 and A5, together with other corpora c5-c8 that differ from those partial corpora, as its samples, i.e. the second sample data set. Here c1-c8 are corpus samples of the three categories A1, A2 and A5.
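A minimal sketch of this episodic extraction; the sampling routine itself is not fixed by the application, and the class count and per-class sample counts are parameters:

```python
import random
from collections import defaultdict


def sample_episode(training_set, m, k_train, k_val):
    """Pick m of the x categories, then draw disjoint first (training)
    and second (verification) sample sets from each picked category."""
    by_class = defaultdict(list)
    for text, label in training_set:
        by_class[label].append(text)

    classes = random.sample(list(by_class), m)
    first_set, second_set = [], []
    for c in classes:
        texts = random.sample(by_class[c], k_train + k_val)
        first_set += [(t, c) for t in texts[:k_train]]
        second_set += [(t, c) for t in texts[k_train:]]  # disjoint from first_set
    return first_set, second_set
```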
Step 206: and training by using the first training sample to obtain a class recognition model.
The class recognition model in the present application may be implemented based on a few-shot learning network, for example an Induction Network.
The first training sample may be a set of samples taken from a set of samples used to train the class recognition model.
The process of training the first training sample set to obtain the class recognition model comprises the following steps:
inputting the initial corpus in the first training sample set into a coding layer of the category identification model to obtain a first training sample vector;
inputting the first training sample vector into a classification layer of the class identification model to obtain a first classification vector;
inputting the first classification vector into a relation construction layer of the class recognition model, obtaining a prediction class of the first classification vector, comparing the prediction class with an initial prediction class label to obtain an error, and performing iterative training on the class recognition model based on the error until a training stop condition is reached.
As shown in the schematic diagram of the class identification model in fig. 3, the class identification model includes an encoding layer, a classification layer, and a relationship building layer. Three layers of the category identification model will be described below based on the use of the category identification model.
After a first training sample set is extracted from a training sample set, inputting initial corpora in the first training sample set into a coding layer of the category identification model to obtain a first training sample vector;
the role of the coding layer is to represent the initial corpus vectorially. In the specific embodiment of this application, the biLSTM + Attention model is used for the coding layer. As shown in the schematic diagram of the coding layer of fig. 4, the embedding layer functions to convert discrete variables into a continuous vector representation; the neural network layer is used for extracting semantic information of the text; the attention layer adds weight values to the information extracted by the neural network layer, and converts the predicted information into vector output.
And inputting the first training sample vector into a classification layer of a class recognition model to obtain a first classification vector.
The classification layer adopts a Capsule Network to convert the first training sample vectors into a class-level vector, namely the first classification vector; the representations of the samples in each category are converted into a class-level representation.
Calculation process of the Capsule Network: perform matrix multiplication on the input sample vectors; apply scalar weighting to the transformed sample vectors; sum the weighted sample vectors; apply a vector-to-vector nonlinearity (squashing) to produce the class-level vector.
The Capsule Network achieves good generalization from less data and copes better with ambiguity, which makes it well suited to training the classification model. The purpose of using the Capsule Network is to characterize class-level vectors dynamically.
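These steps correspond roughly to the following simplified, single-pass sketch of a capsule-style induction step; the exact routing scheme is not spelled out here, so the routing logits are left as an input:

```python
import torch
import torch.nn as nn


def squash(v, dim=-1):
    """Vector-to-vector nonlinearity: keeps direction, maps norm into [0, 1)."""
    sq = (v * v).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * v / torch.sqrt(sq + 1e-9)


class InductionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim, bias=False)  # matrix multiplication on sample vectors

    def forward(self, support, logits):
        # support: (k, dim) sample vectors of one class; logits: (k,) routing scores
        u = self.transform(support)                     # transformed sample vectors
        w = torch.softmax(logits, dim=0).unsqueeze(-1)  # scalar weighting
        return squash((w * u).sum(dim=0))               # weighted sum + squash -> class-level vector
```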
Inputting the first classification vector into a relation construction layer of a class identification model, obtaining the prediction class of the first classification vector, and comparing the prediction class with the initial prediction class label to obtain an error. And iterating the sample classification model for multiple times based on the obtained errors until the classification model meets the training requirement.
The relation construction layer models the relationship between a sample vector and each class-level vector to obtain a relation score between them, and then calculates a loss value from the relation score. If the error is larger than a specified value, the model continues training until the training requirement is met.
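A sketch of such a relation construction layer, scoring a sample vector against each class-level vector; pairing the scores with a sigmoid and a mean-squared-error loss is a common choice in relation-style networks and is an assumption here, not mandated by the text:

```python
import torch
import torch.nn as nn


class RelationLayer(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, query, class_vecs):
        # query: (dim,), class_vecs: (m, dim) -> one relation score per class
        pairs = torch.cat([class_vecs, query.expand_as(class_vecs)], dim=-1)
        return self.score(pairs).squeeze(-1)  # (m,)


# Training-step sketch: compare scores with the initial prediction class label.
#   scores = relation(query_vec, class_vecs)            # (m,)
#   target = one-hot vector of the true class           # (m,)
#   loss = nn.functional.mse_loss(torch.sigmoid(scores), target)
#   loss.backward()  # iterate until the training stop condition is reached
```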
Step 208: and verifying the class identification model by using the second training sample, and repeatedly performing the steps of extracting, training and verifying the class identification model until the class identification model is determined to meet the verification condition.
The second training sample verifies the category identification model and comprises the following steps:
and inputting the second training sample into a class recognition model obtained by training the first training sample, calculating similarity data of the label obtained by the class recognition model and the sample label, and obtaining the trained class recognition model if the similarity data reaches a specified threshold value.
The second training sample is the labeled initial corpus extracted from the set of training samples.
Verifying the class recognition model with the second training sample set means inputting the second training samples into the class recognition model trained on the first training sample set to obtain labeled corpora. The similarity between the labels produced by the class recognition model and the sample labels is then calculated; if the similarity reaches a specified threshold, the trained class recognition model is obtained.
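In its simplest form the verification step measures agreement between the labels produced by the model and the sample labels; the hypothetical model.predict() method and the threshold value below are illustrative assumptions:

```python
def verify(model, second_set, threshold=0.9):
    """Return True when the fraction of matching labels on the second
    training sample set reaches the specified threshold."""
    hits = sum(1 for text, label in second_set if model.predict(text) == label)
    similarity = hits / len(second_set)
    return similarity >= threshold
```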
New text classification labels are also added in the process of classifying text because of the increased classification requirements. The method and the device also provide a solution for the situation of adding the new classified text labels.
Receiving a new label of a new category to be identified, acquiring a labeling corpus of the new label, inputting the labeling corpus of the new label into the category identification model, and training the identification model.
And acquiring the corpus marked with the new label according to the requirement of adding the new label of the text, inputting the corpus marked with the new label as a training sample into the category identification model, and training the identification model to ensure that the identification model can identify the new label in the corpus.
The step of obtaining the labeling corpus of the new label comprises the following steps:
setting a first keyword of a new label;
expanding the first keywords of the new label by using a pre-training word vector to obtain second keywords of the new label;
acquiring a new corpus by using a second keyword of the new tag, and extracting a keyword of the new corpus;
and carrying out similarity calculation on the second keyword of the new label and the keyword of the new corpus to obtain the labeled corpus of the new label.
The first keyword may be a keyword manually set for the new tag.
And in the process of expanding the first keywords of the new label according to the pre-training word vector, deleting the first keywords of the expanded new label in the corresponding categories when detecting that the first keywords of the expanded new label correspond to more than one category.
The expansion of the first keyword expands the manually set keywords of the new tag. During expansion, if a keyword is detected to correspond to more than one category, the keyword is deleted from all the categories in which it appears.
The second keyword may be a keyword related to the new tag expanded from the keyword manually set for the new tag.
And performing similarity calculation on the keywords of the new corpus and the expanded keywords of the new label to obtain a labeled corpus of the new label, and inputting the labeled corpus of the new label into the category identification model to train the category identification model.
For example, in a specific embodiment of the present application, a new text classification label P is added. Firstly, the keywords k1, k2 and k3 of P are set manually, namely, the first keyword is set. And expanding the keywords k1, k2 and k3 of the P by using the pre-training word vector to obtain expanded keywords k4, k5 and k6, namely the second keywords. And acquiring the corpus corresponding to the new label P according to k4, k5 and k6, and extracting keywords e1, e2 and e3 in the corpus corresponding to the new label P. And (5) carrying out similarity calculation on k4, k5, k6, e1, e2 and e3 to obtain the corresponding corpus of the new label P.
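A sketch of this final selection step, reusing the encode() helper from the earlier BERT sketch; cosine similarity and the cutoff value are assumptions, and the tag name "P" follows the example above:

```python
import torch.nn.functional as F


def select_new_label_corpora(encode, expanded_tag_keywords, corpora,
                             corpus_keywords, cutoff=0.8):
    """Keep corpora whose extracted keywords (e1-e3) are similar enough to
    the expanded keywords (k4-k6) of the new tag P."""
    tag_vecs = encode(expanded_tag_keywords)               # (n_tag, dim)
    selected = []
    for corpus, keywords in zip(corpora, corpus_keywords):
        kw_vecs = encode(keywords)                         # (n_kw, dim)
        sims = F.cosine_similarity(kw_vecs.unsqueeze(1),
                                   tag_vecs.unsqueeze(0), dim=-1)
        if sims.max().item() >= cutoff:                    # best keyword match
            selected.append((corpus, "P"))                 # labeled corpus of new tag
    return selected
```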
Inputting the labeled corpus of the new label into the category recognition model to train the recognition model, wherein the process comprises the following steps:
inputting the initial corpus in the labeled corpus of the new label into the coding layer of the category identification model to obtain a sample vector of the new label;
inputting the new label sample vector into a classification layer of the class identification model to obtain a new label classification vector;
inputting the new label classification vector into a relation construction layer of the class identification model, obtaining a prediction class of the new label classification vector, comparing the prediction class with an initial prediction class label to obtain an error, and performing iterative training on the class identification model based on the error until a training stop condition is reached.
And putting the obtained tagged corpus of the new tag into a training sample set to obtain the training sample set with the tagged corpus sample of the new tag. And extracting samples from the training sample set with the new labeled corpus samples for model training.
When training samples are extracted from a training sample set with a new label labeled corpus sample and training samples corresponding to a new category are extracted, the number of the training samples of the new category is more than that of the training samples corresponding to other categories in the training sample set.
And inputting samples extracted from the training sample set with the new label labeled corpus samples into the coding layer of the category identification model to obtain a new label sample vector.
Inputting the new label sample vector into the classification layer of the class identification model to obtain a new label classification vector;
inputting the new label classification vector into a relation construction layer of the class identification model, obtaining a prediction class of the new label classification vector, comparing the prediction class with an initial prediction class label to obtain an error, and performing iterative training on the class identification model based on the error until a training stop condition is reached.
S1, constructing a training sample set based on initial keywords and initial corpora, wherein the training sample set comprises x categories of initial corpora, and each initial corpus corresponds to an initial prediction category label; S2, extracting a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, and the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, m < x; S3, training with the first training sample set to obtain a class recognition model; S4, verifying the class recognition model with the second training sample set, and repeatedly executing steps S2 to S4 until the class recognition model is determined to meet the verification condition. With this text classification model training method, the model can be trained with only a small amount of accurately labeled data, yielding a text classification model capable of classifying texts, saving manual labeling time, and improving labeling efficiency.
Fig. 5 illustrates a text classification model training method according to an embodiment of the present application, which is described by taking a text classification model for training and classifying news text as an example, and includes steps 502 to 508.
Step 502: a training sample set is constructed based on the initial news keywords and the initial news corpora, wherein the training sample set comprises 5 categories of initial news corpora, and each initial news corpus has a corresponding initial prediction news category label.
The initial prediction news category labels for classifying news are set manually: entertainment, military, tourism, recruitment and education. Three keywords are set manually for each tag; for example, the keywords set for travel are: scenic spots, entrance tickets and tour guides. The initial news corpus is news corresponding to the manually set tags; for example, one news item is "multiple scenic spots are open to tourists during the holidays".
Similar keywords of the manually set keywords are found using existing pre-training word vectors, expanding the manually set keywords up to a limit of 5 keywords. For example, the keywords of the manually set travel tag: scenic spots, entrance tickets and tour guides, are expanded based on the pre-training word vectors to obtain 5 expanded keywords: visa, sunstroke prevention, lodging, food and self-driving tour.
During keyword expansion, the expanded keyword "food" is found to exist under both the entertainment tag and the travel tag; in this case the keyword "food" must be deleted from under both tags.
All keywords and initial news corpora are converted into vectorized representations using the BERT model. After the vectorized keywords are obtained, they are processed with the K-nearest neighbor algorithm to obtain a model capable of classifying and labeling corpora. The vectorized initial news corpora are then put into the model for classification and labeling, yielding the classified and labeled initial news corpora.
Step 504: a first training sample set and a second training sample set are extracted from the training sample set, wherein the first training sample set comprises news corpus samples of 3 of the 5 news categories, and the second training sample set comprises news corpus samples of the same 3 news categories that differ from the samples in the first training sample set; the number of extracted categories (3) is less than the number of categories (5) in the sample set.
And constructing a training sample set by using the classified and labeled initial news corpus obtained in the step 502. And extracting a first training sample set and a second training sample set from the constructed training sample set.
The first training sample set includes 5 classes from the training sample set: 3 types randomly drawn in entertainment, military affairs, tourism, recruitment and education: entertainment, military and tourism. Of the 3 categories, 5 news corpus samples are extracted from each category to form a first training sample set. The first training sample set is used to train a classified news text model.
The second training sample set includes 5 classes from the training sample set: 3 categories extracted in entertainment, military, tourism, recruitment and education, which are the same as the categories in the first training sample set: entertainment, military and tourism. And in each of the 3 categories, 5 news corpus samples different from the samples in the first training sample set are extracted to form a second training sample set. The second training sample set is used to verify the accuracy of the classified news text model trained by the first training sample set.
Step 506: and training by using the first training sample to obtain a class recognition model.
And putting the samples in the first training sample set into the model to train the classified text model, namely inputting news corpora under the labels of entertainment, military affairs and tourism into the text classification model to train.
The initial news corpus in the first training sample set is input into the coding layer of the classified text model. The encoding layer converts the news text into a vector to obtain a first training news sample vector;
inputting the first training news sample vector into a classification layer of a classification text model to obtain a first classification news vector;
inputting a first classification news vector into a relation construction layer of a classification text model, obtaining a prediction news category of the first classification vector, comparing the prediction news category with an initial prediction news category label to obtain an error, and performing iterative training on the text classification model based on the error until a training stop condition is reached.
Step 508: and verifying the class identification model by using the second training sample, and repeatedly performing the steps of extracting, training and verifying the class identification model until the class identification model is determined to meet a verification condition.
The news corpora in the second training sample set are taken as texts to be classified and input into the classified text model trained on the first training sample set to obtain labels for the news corpora in the second training sample set. Similarity calculation is performed between the news labels produced by the classified text model and the labels carried by the news corpora. If the calculated similarity does not meet the expected value, training and verification of the classified text model continue according to the above steps until the similarity meets the expected value; training then stops, yielding the trained classified text model.
A new news tag is added according to user requirements: sports. The text classification model needs to be trained again so that it can recognize news corpora of the sports category. The training process is as follows:
Keywords are set manually for the newly added tag sports: football, gymnastics, basketball. Using the pre-training word vectors, the keywords football, gymnastics and basketball are expanded to obtain 5 expanded keywords for sports: Olympics, racing, championship, sports and women's volleyball. The expanded keywords are then used to retrieve the corresponding news corpora. During keyword expansion, if the keyword "sports" is detected to appear under both the sports tag and the entertainment tag, the keyword "sports" is deleted from both tags. Similarity calculation is performed between the keywords of the news corpora and the keywords obtained by expanding the keywords of the new tag, and the news corpora that meet the similarity requirement are used as the related corpora of the newly added sports tag.
And inputting the related linguistic data of the newly added label sports into the original training sample set. And extracting a new training sample set from the training sample set with the new corpus samples and inputting the new training sample set into the classification text model for model training.
The original training sample is concentrated with 5 news categories, and after the related corpora of sports are added, the total number of the news categories is 6. Randomly draw 3 categories from a training sample set with 6 news categories: sports, recruitment and education, 5 news corpus samples of each category in 3 categories are extracted. Where sports are used as new tags, more than 5 news corpus samples need to be drawn.
Inputting the extracted news corpus sample into a coding layer of a text classification model to obtain a sample vector;
inputting the obtained sample vector into a classification layer of a text classification model to obtain a classification vector;
inputting the classification vectors into the relation construction layer, obtaining prediction labels of the classification vectors, comparing the prediction labels with the initially set news labels to calculate error values, and performing iterative training on the classification text model based on the error values until conditions are met to obtain the classification text model capable of identifying sports category texts.
S1, constructing a training sample set based on initial keywords and initial corpora, wherein the training sample set comprises x categories of initial corpora, and each initial corpus corresponds to an initial prediction category label; S2, extracting a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, and the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, m < x; S3, training with the first training sample set to obtain a class recognition model; S4, verifying the class recognition model with the second training sample set, and repeatedly executing steps S2 to S4 until the class recognition model is determined to meet the verification condition. With this text classification model training method, the model can be trained with only a small amount of accurately labeled data, yielding a text classification model capable of classifying texts, saving the time for training the text classification model, and improving model training efficiency.
Corresponding to the above method embodiment, the present application further provides a text classification model training device embodiment, and fig. 6 shows a schematic structural diagram of the text classification model training device according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes:
a constructing module 602, configured to construct a training sample set based on an initial keyword and an initial corpus, where the training sample set includes x categories of initial corpuses, and each initial corpus corresponds to an initial prediction category tag;
an extraction module 604 configured to extract a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set for m of the x categories, and the second training sample set comprises a second sample data set for the same m categories that is disjoint from the first sample data set, m < x;
a training module 606 configured to train with the first training sample to obtain a class recognition model;
a verification module 608 configured to verify the class identification model by using the second training sample, and repeatedly execute the extraction module, the training module, and the verification module until the class identification model is determined to satisfy a verification condition.
Optionally, the apparatus further comprises:
and the receiving module is configured to receive a new label of a new category to be recognized, acquire a labeling corpus of the new label and input the labeling corpus of the new label into the category recognition model to train the recognition model.
Optionally, the training module 606 comprises:
the training submodule is configured to input the initial corpus in the first training sample set into a coding layer of the category identification model to obtain a first training sample vector;
inputting the first training sample vector into a classification layer of the class identification model to obtain a first classification vector;
inputting the first classification vector into a relation construction layer of the class recognition model, obtaining a prediction class of the first classification vector, comparing the prediction class with an initial prediction class label to obtain an error, and performing iterative training on the class recognition model based on the error until a training stop condition is reached.
Optionally, the verification module 608 includes:
and the verification sub-module is configured to input the second training sample into a class recognition model obtained by training the first training sample, calculate similarity data between a label obtained by the class recognition model and a sample label, and obtain a trained class recognition model if the similarity data reaches a specified threshold value.
Optionally, the receiving module includes:
an obtaining submodule configured to obtain the labeled corpus of the new label, wherein obtaining the labeled corpus of the new label comprises the following steps (sketched in the example after this list):
setting a first keyword of a new label;
expanding the first keywords of the new label by using a pre-training word vector to obtain second keywords of the new label;
acquiring a new corpus by using a second keyword of the new label, and extracting the keyword of the new corpus;
and performing similarity calculation between the second keywords of the new label and the keywords of the new corpus to obtain the labeled corpus of the new label.
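A sketch of these four steps under stated assumptions: the pre-training word vectors are available as a plain word-to-array mapping, keywords have already been extracted from each candidate corpus, and all function names and thresholds are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def expand_keywords(first_keywords, vectors, topn=10):
    # Expand the new label's first keywords into second keywords using
    # nearest neighbours in the pre-training word vector space.
    expanded = set(first_keywords)
    for kw in first_keywords:
        if kw not in vectors:
            continue
        ranked = sorted(vectors, key=lambda w: cosine(vectors[kw], vectors[w]),
                        reverse=True)
        expanded.update(ranked[1:topn + 1])   # skip the keyword itself
    return expanded

def label_new_corpus(second_keywords, corpus_keywords, vectors, threshold=0.7):
    # Keep a new corpus as labeled corpus of the new label when its extracted
    # keywords are similar enough to the second keywords.
    labeled = []
    for text, doc_kws in corpus_keywords:     # (corpus text, extracted keywords)
        scores = [cosine(vectors[a], vectors[b])
                  for a in second_keywords for b in doc_kws
                  if a in vectors and b in vectors]
        if scores and max(scores) >= threshold:
            labeled.append(text)
    return labeled
```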
Optionally, the building module 602 includes:
the construction submodule is configured to set an initial prediction category label and an initial keyword corresponding to the initial prediction category label;
expanding the initial keywords by using the pre-training word vectors;
vectorizing all initial keywords and initial corpora;
processing the initial keyword vectors, and matching each initial corpus against the processed initial keyword vectors to obtain the initial prediction category label corresponding to that initial corpus;
and forming the training sample set from the initial corpora carrying initial prediction category labels, as sketched below.
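One way to realize this construction is to give each initial corpus the label of its most similar keyword centroid. The centroid rule is an assumption for illustration, since the application only states that the corpora are processed against the processed keyword vectors.

```python
import numpy as np

def build_training_set(keywords_by_label, corpus_vectors, vectors):
    # keywords_by_label: initial prediction category label -> initial keywords
    # corpus_vectors: list of (initial corpus text, its vector)
    # vectors: pre-training word vectors (assumes each label has at least
    # one in-vocabulary keyword).
    centroids = {label: np.mean([vectors[k] for k in kws if k in vectors], axis=0)
                 for label, kws in keywords_by_label.items()}

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    training_set = []
    for text, vec in corpus_vectors:
        label = max(centroids, key=lambda c: cos(vec, centroids[c]))
        training_set.append((text, label))  # corpus with its initial prediction category label
    return training_set
```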
Optionally, the receiving module includes:
a training submodule configured to input the corpora in the labeled corpus of the new label into the coding layer of the class recognition model to obtain a new label sample vector;
input the new label sample vector into the classification layer of the class recognition model to obtain a new label classification vector;
and input the new label classification vector into the relation construction layer of the class recognition model to obtain the prediction category of the new label classification vector, compare the prediction category with the initial prediction category label to obtain an error, and iteratively train the class recognition model based on the error until a training stop condition is reached.
Optionally, the obtaining submodule is further configured so that, in the process of expanding the first keywords of the new label according to the pre-training word vectors, when an expanded first keyword of the new label is detected to correspond to more than one category, that expanded keyword is deleted from all of the corresponding categories.
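This ambiguity rule, and the analogous rule for initial keywords in the next paragraph, can be sketched as one helper that drops any expanded keyword claimed by more than one category; the data layout is an assumption.

```python
def drop_ambiguous(expanded_by_category):
    # expanded_by_category: category -> set of expanded keywords.
    owners = {}
    for cat, kws in expanded_by_category.items():
        for kw in kws:
            owners.setdefault(kw, set()).add(cat)
    # Keep a keyword only if exactly one category claims it.
    return {cat: {kw for kw in kws if len(owners[kw]) == 1}
            for cat, kws in expanded_by_category.items()}
```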
Optionally, the construction submodule is further configured so that, in the process of expanding the initial keywords according to the pre-training word vectors, when an expanded initial keyword is detected to correspond to more than one category, that expanded keyword is deleted from all of the corresponding categories.

With the text classification model training device of this embodiment of the present application, the model can be trained with only a small amount of accurately labeled data, which saves manual labeling time and improves model training efficiency; when a new label needs to be added, the model can be trained without a large amount of manually labeled data, which saves the cost of training the model.
The above is a schematic scheme of the text classification model training apparatus of this embodiment. It should be noted that the technical solution of the text classification model training apparatus and the technical solution of the text classification model training method belong to the same concept, and details of the technical solution of the text classification model training apparatus, which are not described in detail, can be referred to the description of the technical solution of the text classification model training method.
It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
Processor 120 may perform the text classification method illustrated in fig. 7.
Fig. 7 is a flowchart of a text classification method according to an embodiment of the present application, including steps 702 to 704.
Step 702: receiving the text to be classified and performing word segmentation processing to obtain a first word segmentation set.
The text to be classified may be any text corpus that needs to be classified. The first word segmentation set is the set of words obtained by segmenting the text to be classified. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification.
For example, in one embodiment of the present application, word segmentation is performed on the plain text "Xiaoming arrives in Beijing" (小明到北京). The processing yields the three words "Xiaoming", "arrives" and "Beijing", which form the first word segmentation set.
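A segmenter such as jieba, one common open-source Chinese word segmentation library (not mandated by this application), would produce this set:

```python
import jieba

first_word_set = jieba.lcut("小明到北京")  # segment the plain text
print(first_word_set)  # expected: ['小明', '到', '北京']
```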
Step 704: inputting the first word segmentation set into a text classification model to obtain the prediction category of the text to be classified, wherein the text classification model is trained by the text classification model training method shown in fig. 2.
The text classification model is obtained by training with the text classification model training method described above. The first word segmentation set is input into the trained text classification model as follows:
input the first word segmentation set into the coding layer of the model to obtain a first text vector;
input the first text vector into the classification layer of the model to obtain a first classification vector;
and input the first classification vector into the relation construction layer of the model to obtain the prediction category of the first classification vector.
For example, in an embodiment of the present application, the first word segmentation set composed of the three words "Xiaoming", "arrives" and "Beijing" is input into the trained text classification model to obtain the prediction category of the input text "Xiaoming arrives in Beijing".
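A minimal inference sketch following the coding-classification-relation pipeline above, reusing the ClassRecognitionModel sketched earlier; the token ids are placeholders.

```python
token_ids = torch.tensor([[11, 42, 7]])         # placeholder ids for the three words
with torch.no_grad():
    scores = model(token_ids)                   # coding -> classification -> relation layers
predicted_category = int(scores.argmax(dim=1))  # index of the prediction category
```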
In the text classification method according to an embodiment of the present application, the text to be classified is received and segmented to obtain a first word segmentation set, and the first word segmentation set is input into a text classification model to obtain the prediction category of the text to be classified. With this method, labels of plain text corpora are obtained from the trained model, which saves the time of manually labeling corpora and improves text classification efficiency.
Corresponding to the above method embodiment, the present application further provides a text classification device embodiment, and fig. 8 shows a schematic structural diagram of the text classification device according to an embodiment of the present application. As shown in fig. 8, the apparatus 800 includes:
a processing module 802 configured to receive a text to be classified and perform word segmentation processing to obtain a first word segmentation set;
an input module 804 configured to input the first word segmentation set into a text classification model to obtain the prediction category of the text to be classified, where the text classification model is trained by the text classification model training method shown in fig. 2.
Optionally, the input module 804 includes:
an input submodule configured to input the first word segmentation set into the coding layer of the class recognition model to obtain a first text vector; input the first text vector into the classification layer of the class recognition model to obtain a first classification vector; and input the first classification vector into the relation construction layer of the class recognition model to obtain the prediction category of the first classification vector.
With the text classification device, the input text corpus is segmented into words and the label of the corpus is obtained from the trained text classification model, which saves the time of manually classifying texts and improves text classification efficiency.
The above is a schematic scheme of a text classification apparatus of this embodiment. It should be noted that the technical solution of the text classification device and the technical solution of the text classification model training method belong to the same concept, and details that are not described in detail in the technical solution of the text classification device can be referred to the description of the technical solution of the text classification model training method.
It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
An embodiment of the present application further provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor implements the text classification model training method or the text classification method when executing the instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text classification model training method or the text classification method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the text classification model training method or the text classification method.
An embodiment of the present application further provides a computer-readable storage medium, which stores computer instructions, when executed by a processor, for implementing the text classification model training method or the steps of the text classification method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text classification model training method or the text classification method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text classification model training method or the text classification method.
The embodiment of the application discloses a chip, which stores computer instructions, and the instructions are executed by a processor to realize the steps of the text classification model training method or the text classification method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic diskette, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.
Claims (15)
1. A text classification model training method is characterized by comprising the following steps:
S1, constructing a training sample set based on the initial keywords and the initial corpora, wherein the training sample set comprises initial corpora of x categories and each initial corpus corresponds to an initial prediction category label;
S2, extracting a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set of m of the x categories, and the second training sample set comprises a second sample data set drawn from the m categories and distinct from the first sample data set, m < x;
S3, training with the first training sample set to obtain a class recognition model;
S4, verifying the class recognition model with the second training sample set, and repeatedly executing the steps S2 to S4 until the class recognition model is determined to satisfy the verification condition.
2. The method of claim 1, further comprising:
receiving a new label of a new category to be recognized, acquiring the labeled corpus of the new label, and inputting the labeled corpus of the new label into the class recognition model to train the model.
3. The method of claim 1, wherein training with the first training sample set to obtain a class recognition model comprises:
inputting the initial corpora in the first training sample set into the coding layer of the class recognition model to obtain a first training sample vector;
inputting the first training sample vector into the classification layer of the class recognition model to obtain a first classification vector;
inputting the first classification vector into the relation construction layer of the class recognition model to obtain the prediction category of the first classification vector, comparing the prediction category with the initial prediction category label to obtain an error, and iteratively training the class recognition model based on the error until a training stop condition is reached.
4. The method of claim 1, wherein verifying the class recognition model with the second training sample set comprises:
inputting the second training sample set into the class recognition model trained on the first training sample set, calculating similarity data between the labels output by the class recognition model and the sample labels, and obtaining the trained class recognition model if the similarity data reaches a specified threshold.
5. The method according to claim 2, wherein obtaining the labeled corpus of the new label comprises:
setting a first keyword of a new label;
expanding the first keywords of the new label by using a pre-training word vector to obtain second keywords of the new label;
acquiring a new corpus by using a second keyword of the new label, and extracting the keyword of the new corpus;
and performing similarity calculation between the second keywords of the new label and the keywords of the new corpus to obtain the labeled corpus of the new label.
6. The method of claim 1, wherein the step of constructing the training sample set based on the initial keywords and the initial corpus comprises:
setting an initial prediction type label and an initial keyword corresponding to the initial prediction type label;
expanding the initial keywords by using the pre-training word vectors;
vectorizing all initial keywords and initial corpora;
processing the initial keyword vectors, and matching each initial corpus against the processed initial keyword vectors to obtain the initial prediction category label corresponding to that initial corpus;
and forming the training sample set from the initial corpora carrying initial prediction category labels.
7. The method according to claim 2, wherein inputting the labeled corpus of the new label into the class recognition model to train the model comprises:
inputting the corpora in the labeled corpus of the new label into the coding layer of the class recognition model to obtain a new label sample vector;
inputting the new label sample vector into the classification layer of the class recognition model to obtain a new label classification vector;
inputting the new label classification vector into the relation construction layer of the class recognition model to obtain the prediction category of the new label classification vector, comparing the prediction category with the initial prediction category label to obtain an error, and iteratively training the class recognition model based on the error until a training stop condition is reached.
8. The method according to claim 5, wherein, in the process of expanding the first keywords of the new label according to the pre-training word vectors, when an expanded first keyword of the new label is detected to correspond to more than one category, the expanded first keyword is deleted from all of the corresponding categories.
9. The method of claim 6, wherein, in the process of expanding the initial keywords according to the pre-training word vectors, when an expanded initial keyword is detected to correspond to more than one category, the expanded initial keyword is deleted from all of the corresponding categories.
10. A method of text classification, comprising:
receiving a text to be classified and performing word segmentation processing to obtain a first word segmentation set;
inputting the first word segmentation set into a text classification model to obtain the prediction category of the text to be classified, wherein the text classification model is trained according to the method of any one of claims 1 to 9.
11. The method of claim 10, wherein inputting the first word segmentation set into the text classification model to obtain the prediction category of the text to be classified comprises:
inputting the first word segmentation set into the coding layer of the class recognition model to obtain a first text vector;
inputting the first text vector into the classification layer of the class recognition model to obtain a first classification vector;
and inputting the first classification vector into the relation construction layer of the class recognition model to obtain the prediction category of the first classification vector.
12. A text classification model training apparatus comprising:
a construction module configured to construct a training sample set based on initial keywords and initial corpora, wherein the training sample set comprises initial corpora of x categories and each initial corpus corresponds to an initial prediction category label;
an extraction module configured to extract a first training sample set and a second training sample set from the training sample set, wherein the first training sample set comprises a first sample data set of m of the x categories, and the second training sample set comprises a second sample data set drawn from the m categories and distinct from the first sample data set, m < x;
a training module configured to train with the first training sample set to obtain a class recognition model;
a verification module configured to verify the class recognition model with the second training sample set, the extraction module, the training module and the verification module being executed repeatedly until the class recognition model is determined to satisfy a verification condition.
13. A text classification apparatus comprising:
the processing module is configured to receive a text to be classified and perform word segmentation processing to obtain a first word segmentation set;
an input module configured to input the first word segmentation set into a text classification model to obtain the prediction category of the text to be classified, wherein the text classification model is trained according to the method of any one of claims 1 to 9.
14. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-9 or 10-11 when executing the instructions.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1-9 or 10-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011640609.7A CN114691864A (en) | 2020-12-31 | 2020-12-31 | Text classification model training method and device and text classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011640609.7A CN114691864A (en) | 2020-12-31 | 2020-12-31 | Text classification model training method and device and text classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114691864A true CN114691864A (en) | 2022-07-01 |
Family
ID=82135729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011640609.7A Pending CN114691864A (en) | 2020-12-31 | 2020-12-31 | Text classification model training method and device and text classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114691864A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783604A (en) * | 2018-12-14 | 2019-05-21 | 平安科技(深圳)有限公司 | Information extracting method, device and computer equipment based on a small amount of sample |
CN109960800A (en) * | 2019-03-13 | 2019-07-02 | 安徽省泰岳祥升软件有限公司 | Weakly supervised text classification method and device based on active learning |
CN110580290A (en) * | 2019-09-12 | 2019-12-17 | 北京小米智能科技有限公司 | method and device for optimizing training set for text classification |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN112069329A (en) * | 2020-09-11 | 2020-12-11 | 腾讯科技(深圳)有限公司 | Text corpus processing method, device, equipment and storage medium |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115292498A (en) * | 2022-08-19 | 2022-11-04 | 北京华宇九品科技有限公司 | Document classification method, system, computer equipment and storage medium |
CN116226382A (en) * | 2023-02-28 | 2023-06-06 | 北京数美时代科技有限公司 | Text classification method and device for given keywords, electronic equipment and medium |
CN116226382B (en) * | 2023-02-28 | 2023-08-01 | 北京数美时代科技有限公司 | Text classification method and device for given keywords, electronic equipment and medium |
CN116881464A (en) * | 2023-09-06 | 2023-10-13 | 北京睿企信息科技有限公司 | Method for model training based on newly added label and storage medium |
CN116881464B (en) * | 2023-09-06 | 2023-11-24 | 北京睿企信息科技有限公司 | Method for model training based on newly added label and storage medium |
CN117273174A (en) * | 2023-11-23 | 2023-12-22 | 深圳依时货拉拉科技有限公司 | Training method and device for model and readable storage medium |
CN117273174B (en) * | 2023-11-23 | 2024-06-11 | 深圳依时货拉拉科技有限公司 | Training method and device for model and readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |