
WO2021128721A1 - Method and device for text classification - Google Patents


Info

Publication number
WO2021128721A1
WO2021128721A1 (PCT/CN2020/092099)
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
training
corpus
classified
Prior art date
Application number
PCT/CN2020/092099
Other languages
French (fr)
Chinese (zh)
Inventor
张禄
及洪泉
姚晓明
胡彩娥
丁屹峰
王培祎
马龙飞
陆斯悦
王健
徐蕙
Original Assignee
国网北京市电力公司
国家电网有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国网北京市电力公司, 国家电网有限公司
Publication of WO2021128721A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to the field of text classification, and in particular, to a text classification processing method and device.
  • in the context of the ubiquitous power Internet of Things, the 95598 customer service system, as an important part of its applications, has registered massive amounts of customer information. Work order analysis currently relies mainly on manual statistics, which leads to problems such as insufficient efficiency. Because the volume of customer demand data in the 95598 system is large, manual classification is inefficient and cannot achieve accurate, efficient classification.
  • the embodiments of the present invention provide a text classification processing method and device to at least solve the technical problem of manually classifying text in the prior art.
  • a text classification processing method, including: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; taking the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified together with its corresponding category.
  • before the text to be classified is acquired, the method further includes: using multiple sets of training data to train through machine learning to obtain the model.
  • training to obtain the model through machine learning includes: using a first corpus to perform pre-training to obtain a first model; and using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • using the first corpus to perform pre-training to obtain the first model includes: training through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • a text classification processing device, including: an acquisition module for acquiring the text to be classified; an input module for inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; an output module for taking the output obtained from the model as the category corresponding to the text to be classified; and a saving module for saving the text to be classified together with its corresponding category.
  • a training module configured to use multiple sets of training data to train through machine learning to obtain the model.
  • the training module includes: a first training unit for pre-training with a first corpus to obtain the first model; and a second training unit for iteratively training the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • the first training unit is configured to: train through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • a storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute any one of the above text classification processing methods.
  • a processor configured to run a program, where any one of the text classification processing methods described above is executed when the program runs.
  • the text to be classified is obtained; the text to be classified is input into a model, where the model is obtained through machine learning training using training data; the output obtained from the model is taken as the category corresponding to the text to be classified; and the text to be classified is saved together with its corresponding category. In this way, the model obtained through machine learning training recognizes the category corresponding to the text to be classified and saves it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
  • Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention
  • Fig. 2 is a flowchart of training of a classification model according to an optional embodiment of the present invention
  • Fig. 3 is a schematic diagram of a text classification processing device according to an embodiment of the present invention.
  • an embodiment of a text classification processing method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one here.
  • Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
  • Step S102: Obtain the text to be classified;
  • the above-mentioned text to be classified includes but is not limited to a work order.
  • the text to be classified can be obtained in a variety of ways, for example, using crawling software, manual entry, and so on.
  • obtaining the text to be classified in multiple ways can expand its sources, making the method suitable for a variety of application scenarios.
  • Step S104: Input the text to be classified into the model, where the model is obtained through machine learning training using training data;
  • the text to be classified can be processed through the model.
  • the model is a work order classification model. It should be noted that the above model is obtained through machine learning training using training data, and can realize automatic text classification.
  • Step S106: Take the output obtained from the model as the category corresponding to the text to be classified;
  • for the input text to be classified, the model can output the corresponding category.
  • This model can effectively improve the classification accuracy and improve the efficiency of text classification.
  • Step S108: Save the text to be classified and its corresponding category.
  • the text to be classified and its corresponding category can be saved in a predetermined format, where the predetermined format includes a text attribute and a category attribute: the text to be classified is saved in the location of the text attribute, and its corresponding category is saved in the location of the category attribute. It should be noted that the specific implementation is not limited to the above method.
  • the model obtained by machine learning training can be used to identify the category corresponding to the text to be classified and save it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
  • before the text to be classified is obtained, the method further includes: using multiple sets of training data to train through machine learning to obtain a model.
  • using multiple sets of training data here means using a large amount of training data; therefore, the model obtained through machine learning training on a large amount of training data has a better recognition or prediction effect, which greatly improves classification precision and accuracy.
  • the attention mechanism in the Transformer can be used to replace the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; in other words, this is a sequential process, and each computation must wait for the previous one to finish before it can proceed.
  • the Transformer does not use RNN, and all calculations can be performed in parallel, thereby increasing the speed of training.
  • in an RNN, for the first frame to interact with the tenth frame, its data must be passed through frames 2 to 9 in turn before the computation relating the two is produced.
  • by that point, the data of the first frame may already have been distorted, so neither the speed nor the accuracy of this interaction is guaranteed.
  • in the Transformer, because of self-attention, there is a direct interaction between any two frames, which establishes a direct dependency no matter how far apart the two are; this can improve the accuracy of training.
  • training to obtain a model through machine learning includes: using a first corpus to perform pre-training to obtain a first model; using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • the first model can be pre-trained and iteratively trained through the first corpus and the second corpus to obtain the final model.
  • Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and a category corresponding to the text.
  • using the first corpus to perform pre-training to obtain the first model includes: training through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during training, and the training is used to predict the masked content.
  • the above-mentioned BERT includes a Transformer encoder. When it is used to predict the masked content, all the tokens corresponding to a masked word are masked; at the same time, with the overall masking rate unchanged, the first model can independently predict the token of each masked word.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • the aforementioned work order text may include, but is not limited to, 95598 work orders, where the types of work orders can be divided according to application requirements; for example, different work order types can be divided according to distance, entry time, and work order level.
  • Figure 2 is a flowchart of the training of the classification model according to an optional embodiment of the present invention.
  • when a customer service call is received, the agent manually enters the work order content in two parts: category and text. After the corresponding cleaning and proofreading work is done on the category and the text, the text content enters the already-trained classification model. The prediction of the classification model is then compared with the manually entered category, and evaluation indices for the current model are obtained to assess its performance.
  • the current model performance is used to determine whether it is necessary to use the new comparison results and text content to continue tuning and updating the model. This keeps the model up to date, avoids uncertain model deviations, and gives the model the possibility of continuous use and optimization.
  • automatic text-based classification can be provided for 95598 work orders; real-time monitoring and display of model performance are provided to facilitate model maintenance; the model can be continuously updated and optimized during actual business processes; and it has a certain adaptability to trend changes in text work orders.
  • Fig. 3 is a schematic diagram of the text classification processing device according to an embodiment of the present invention. As shown in Fig. 3, the device includes: an acquisition module 302, an input module 304, an output module 306, and a saving module 308. The device is described in detail below.
  • the obtaining module 302 is used to obtain the text to be classified
  • the input module 304, connected to the above acquisition module 302, is used to input the text to be classified into the model, where the model is obtained through machine learning training using training data;
  • the saving module 308, connected to the above output module 306, is used to save the text to be classified and its corresponding category.
  • the above device can recognize the category corresponding to the text to be classified through the model obtained by machine learning training and save it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
  • the above-mentioned acquisition module 302, input module 304, output module 306, and saving module 308 correspond to steps S102 to S108 in Embodiment 1.
  • the above modules and the corresponding steps implement the same examples and application scenarios, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that, as part of the device, the above modules can be executed in a computer system such as a set of computer-executable instructions.
  • a training module configured to use multiple sets of training data to train through machine learning to obtain a model.
  • using multiple sets of training data here means using a large amount of training data; therefore, the model obtained through machine learning training on a large amount of training data has a better recognition or prediction effect, which greatly improves classification precision and accuracy.
  • the attention mechanism in the Transformer can be used to replace the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; in other words, this is a sequential process, and each computation must wait for the previous one to finish before it can proceed.
  • the Transformer does not use RNN, and all calculations can be performed in parallel, thereby increasing the speed of training.
  • in an RNN, for the first frame to interact with the tenth frame, its data must be passed through frames 2 to 9 in turn before the computation relating the two is produced.
  • by that point, the data of the first frame may already have been distorted, so neither the speed nor the accuracy of this interaction is guaranteed.
  • in the Transformer, because of self-attention, there is a direct interaction between any two frames, which establishes a direct dependency no matter how far apart the two are; this can improve the accuracy of training.
  • the training module includes: a first training unit for pre-training using the first corpus to obtain the first model; and a second training unit for iterative training of the first model using the second corpus to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • the first model can be pre-trained and iteratively trained through the first corpus and the second corpus to obtain the final model.
  • Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and a category corresponding to the text.
  • the first training unit is used to: use the first corpus to train through BERT to obtain the first model, where part of the content of each corpus in the corpus is masked in the training, and the training is used to predict the masked content.
  • the above-mentioned BERT includes a Transformer encoder. When it is used to predict the masked content, all the tokens corresponding to a masked word are masked; at the same time, with the overall masking rate unchanged, the first model can independently predict the token of each masked word.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • the aforementioned work order text may include, but is not limited to, 95598 work orders, where the types of work orders can be divided according to application requirements; for example, different work order types can be divided according to distance, entry time, and work order level.
  • a storage medium includes a stored program, wherein the device where the storage medium is located is controlled to execute any one of the above-mentioned text classification processing methods when the program is running.
  • a processor which is configured to run a program, where any one of the text classification processing methods described above is executed when the program is running.
  • the disclosed technical content can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units may be a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present invention essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • a computer device which can be a personal computer, a server, or a network device, etc.
  • the aforementioned storage media include: USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program code.
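The contrast drawn above between sequential RNN computation and parallel self-attention can be illustrated with a toy numeric sketch. Nothing below comes from the patent itself; it is a minimal, hypothetical illustration in which "frames" are scalars: the RNN must update a hidden state frame by frame, while self-attention links any two frames directly, so each output position can be computed independently (and hence in parallel).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    """Toy single-head self-attention over scalar 'frames'.
    Every output attends directly to every frame, so frame 1 and frame 10
    interact without routing through frames 2..9; each position's output
    is independent of the others and could be computed in parallel."""
    out = []
    for q in seq:
        weights = softmax([q * k for k in seq])  # direct link to every frame
        out.append(sum(w * v for w, v in zip(weights, seq)))
    return out

def rnn(seq):
    """Toy RNN: each step depends on the previous hidden state, so the
    computation is inherently sequential."""
    h = 0.0
    hidden = []
    for x in seq:
        h = math.tanh(0.5 * h + x)  # current step needs the previous h
        hidden.append(h)
    return hidden

frames = [0.1 * i for i in range(10)]
att = self_attention(frames)
hid = rnn(frames)
```

Because each attention output is a convex combination of the input frames, it always stays within their range; the RNN outputs are bounded by `tanh`.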

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for text classification. The method comprises: acquiring text to be classified (S102); inputting the text into a model, wherein the model is obtained through machine learning and training by using training data (S104); using the output acquired from the model as a category corresponding to the text (S106); and storing the text and the category corresponding thereto (S108). The technical problem that text classification in the prior art relies on a manual mode is solved.

Description

Text classification processing method and device

Technical Field

The present invention relates to the field of text classification, and in particular to a text classification processing method and device.

Background

In the context of the ubiquitous power Internet of Things, the 95598 customer service system, as an important part of its applications, has registered massive amounts of customer information. Work order analysis currently relies mainly on manual statistics, which leads to problems such as insufficient efficiency. Because the volume of customer demand data in the 95598 system is large, manual classification is inefficient and cannot achieve accurate, efficient classification.

In view of the above problems, no effective solution has yet been proposed.
Summary of the Invention

The embodiments of the present invention provide a text classification processing method and device, so as to at least solve the technical problem in the prior art of relying on manual work to classify text.
According to one aspect of the embodiments of the present invention, a text classification processing method is provided, including: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; taking the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified together with its corresponding category.
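The claimed method is a four-step pipeline: obtain text, run it through a trained model, take the model output as the category, and save the pair. The sketch below is a hypothetical illustration, not the patent's implementation: `toy_model` is a stand-in for the trained classifier, and the storage is just an in-memory list.

```python
from typing import Callable, Dict, List

def classify_and_save(
    texts: List[str],
    model: Callable[[str], str],
    store: List[Dict[str, str]],
) -> None:
    """Mirror of the claimed steps: obtain text, feed it to a trained model,
    take the model output as the category, and save the (text, category) pair."""
    for text in texts:                                      # step 1: text to be classified
        category = model(text)                              # steps 2-3: model output = category
        store.append({"text": text, "category": category})  # step 4: save both

# Hypothetical stand-in for a model trained by machine learning.
def toy_model(text: str) -> str:
    return "power-outage" if "outage" in text else "billing"

records: List[Dict[str, str]] = []
classify_and_save(["outage on my street", "question about my bill"], toy_model, records)
```

The saved records follow the "text attribute plus category attribute" format mentioned later in the embodiments.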
Optionally, before the text to be classified is obtained, the method further includes: using multiple sets of training data to train through machine learning to obtain the model.
Optionally, training through machine learning to obtain the model includes: using a first corpus to perform pre-training to obtain a first model; and using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
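The two-stage scheme (pre-train on a first corpus, then iteratively fine-tune on labeled pairs from a second corpus) can be sketched in miniature. The functions below are illustrative stand-ins, not the patent's implementation: real pre-training and fine-tuning would update neural network weights, whereas here the "first model" is just token statistics that the fine-tuning stage reuses.

```python
def pretrain(first_corpus):
    """Stage 1: self-supervised pre-training on an unlabeled corpus yields a
    'first model' (here simply token counts standing in for learned weights)."""
    vocab = {}
    for sentence in first_corpus:
        for token in sentence.split():
            vocab[token] = vocab.get(token, 0) + 1
    return vocab

def fine_tune(first_model, second_corpus, epochs=3):
    """Stage 2: iterative training on (text, category) pairs from the second
    corpus, repeated over several epochs, reusing the pre-trained model."""
    weights = {}
    for _ in range(epochs):
        for text, category in second_corpus:
            for token in text.split():
                if token in first_model:  # reuse pre-trained knowledge
                    key = (token, category)
                    weights[key] = weights.get(key, 0) + 1
    return weights

first = pretrain(["power outage downtown", "billing question today"])
model = fine_tune(first, [("power outage", "outage"), ("billing question", "billing")])
```

The point of the sketch is the data flow: the second stage consumes both the first model and the labeled second corpus, matching the claim's structure.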
Optionally, using the first corpus to perform pre-training to obtain the first model includes: training through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
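The masking step described here (hide part of each piece of corpus and train the model to predict what was hidden) can be sketched as follows. This is a hypothetical illustration of the masking procedure only, not BERT itself; the 15% mask rate and the `[MASK]` token are assumptions borrowed from common masked-language-model practice.

```python
import random

def mask_corpus_line(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide part of one piece of corpus, BERT-style: a fraction of tokens is
    replaced by a mask symbol, and the training objective is to predict the
    hidden originals (returned here as position -> original token)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # what the model must predict
        else:
            masked.append(tok)
    if not targets:  # ensure at least one position is masked
        targets[0] = tokens[0]
        masked[0] = mask_token
    return masked, targets

masked, targets = mask_corpus_line("the customer reports a power outage".split())
```

Substituting the targets back at their positions reconstructs the original line, which is exactly the prediction task the pre-training optimizes.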
Optionally, the text includes work order text, and the category includes a type of the work order, where the type includes at least one class.
According to another aspect of the embodiments of the present invention, a text classification processing device is also provided, including: an acquisition module for obtaining the text to be classified; an input module for inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; an output module for taking the output obtained from the model as the category corresponding to the text to be classified; and a saving module for saving the text to be classified together with its corresponding category.

Optionally, the device further includes: a training module for using multiple sets of training data to train through machine learning to obtain the model.

Optionally, the training module includes: a first training unit for pre-training with a first corpus to obtain the first model; and a second training unit for iteratively training the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.

Optionally, the first training unit is configured to: train through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.

Optionally, the text includes work order text, and the category includes a type of the work order, where the type includes at least one class.

According to another aspect of the embodiments of the present invention, a storage medium is also provided, the storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute any one of the above text classification processing methods.

According to another aspect of the embodiments of the present invention, a processor is also provided, the processor being configured to run a program, where any one of the above text classification processing methods is executed when the program runs.

In the embodiments of the present invention, the following approach is adopted: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; taking the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified together with its corresponding category. The model obtained through machine learning training recognizes the category corresponding to the text to be classified and saves it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
Brief Description of the Drawings

The drawings described here are used to provide a further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of the training of a classification model according to an optional embodiment of the present invention;

Fig. 3 is a schematic diagram of a text classification processing device according to an embodiment of the present invention.
Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

It should be noted that the terms "first" and "second" in the specification, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than those illustrated or described here. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
实施例1Example 1
根据本发明实施例,提供了一种文本分类处理方法的实施例,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。According to an embodiment of the present invention, an embodiment of a text classification processing method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and, Although a logical sequence is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than here.
图1是根据本发明实施例的文本分类处理方法的流程图,如图1所示,该方法包括如下步骤:Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain the text to be classified.
The text to be classified includes, but is not limited to, work orders. The text to be classified can be obtained in a variety of ways, for example, by crawling software or by manual entry. In a specific implementation, obtaining the text in multiple ways broadens the sources of text to be classified, so that the method is applicable to a variety of application scenarios.
Step S104: input the text to be classified into a model, where the model is obtained through machine learning training using training data.
When the above model is a classification model, the text to be classified can be processed by the model. Optionally, the model is a work order classification model. It should be noted that, since the above model is obtained through machine learning training using training data, automatic text classification can be realized.
Step S106: take the output obtained from the model as the category corresponding to the text to be classified.
Through the above model, the input text to be classified is mapped to its corresponding category. The model can effectively improve classification accuracy and the efficiency of text classification.
Step S108: save the text to be classified and its corresponding category.
As an optional embodiment, the text to be classified and its corresponding category can be saved in a predetermined format, where the predetermined format includes a text attribute and a category attribute: the text to be classified is saved at the position of the text attribute, and the category corresponding to the text is saved at the position of the category attribute. It should be noted that, in a specific implementation, the saving is not limited to the above manner.
Through the above steps, the model obtained by machine learning training can identify the category corresponding to the text to be classified and save the result, achieving fast and accurate classification. This achieves the technical effect of improving the efficiency of text classification, thereby solving the technical problem in the prior art of relying on manual classification of text.
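As an illustrative sketch only (not the claimed implementation), steps S102 to S108 can be strung together as a small pipeline. The `classify` function is a hypothetical stand-in for the trained model, and the record with `text` and `category` keys mirrors the predetermined format with a text attribute and a category attribute described above:

```python
import json

def classify(text: str) -> str:
    # Hypothetical stand-in for the trained classification model.
    return "billing" if "fee" in text else "other"

def process_text(text: str, store: list) -> dict:
    # Step S102: obtain the text to be classified (passed in here).
    # Steps S104/S106: input the text into the model; its output is the category.
    category = classify(text)
    # Step S108: save the text and its category in the predetermined format,
    # with the text at the text attribute and the category at the category attribute.
    record = {"text": text, "category": category}
    store.append(record)
    return record

store = []
record = process_text("customer asks about an electricity fee bill", store)
print(json.dumps(record))
```

In a deployment, `store.append` would be replaced by a write to a database or file; the shape of the saved record is what matters here.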
Optionally, before obtaining the text to be classified, the method further includes: training with multiple sets of training data through machine learning to obtain the model.
Using multiple sets of training data means using a large amount of training data. A model obtained through machine learning training on a large amount of training data therefore recognizes or predicts better, so that classification precision and accuracy are greatly improved.
As an optional embodiment, in the process of training the model, the attention mechanism of the Transformer can be used in place of the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; that is, training is a sequential process in which each computation can begin only after the previous one has finished. The Transformer does not use an RNN, so all computations can be performed in parallel, thereby increasing training speed.
In addition, in an RNN, if the first frame is to establish a dependency on the tenth frame, the data of the first frame must be passed through the second, third, fourth, fifth, ..., ninth frames in turn before the two can interact, and in the course of this transfer the data of the first frame may already have drifted, so neither the speed nor the accuracy of this interaction is guaranteed. In the Transformer, by contrast, self-attention gives any two frames a direct interaction and thus a direct dependency, no matter how far apart they are, which improves training accuracy.
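The contrast drawn above can be sketched with a minimal scaled dot-product self-attention in NumPy (an illustrative sketch, not the patented implementation): the interaction between every pair of positions is computed in a single matrix product, so frame 1 and frame 10 interact directly, with no step-by-step recurrence. Using identity query/key/value projections is a simplifying assumption; a real Transformer layer uses learned weight matrices.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal scaled dot-product self-attention over a (seq_len, d) sequence."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)            # (seq_len, seq_len): every pair interacts directly
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                        # weighted mix of all positions, computed in parallel

x = np.random.default_rng(0).normal(size=(10, 4))   # 10 "frames", dimension 4
out = self_attention(x)
print(out.shape)   # (10, 4): frame 1 attends to frame 10 without passing through frames 2..9
```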
Optionally, training through machine learning to obtain the model includes: pre-training with a first corpus to obtain a first model; and iteratively training the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, each set of data including a text and the category corresponding to that text.
The first model is pre-trained on the first corpus and then iteratively trained on the second corpus to obtain the final model. Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and the category corresponding to that text. Through these different training stages, the model can be continuously tuned and updated, effectively improving its stability.
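The two-stage scheme can be sketched with a deliberately simplified stand-in (all names and the nearest-centroid classifier are assumptions for illustration; the actual embodiment pre-trains BERT): an unsupervised pass over the first corpus fixes the representation, and supervised training on the labeled (text, category) pairs of the second corpus produces the classifier.

```python
from collections import Counter

# Stage 1 (stand-in for pre-training on the first corpus): build a vocabulary
# from text alone; it fixes the feature space the later stage relies on.
def pretrain_vocab(corpus):
    vocab = sorted({tok for text in corpus for tok in text.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def featurize(text, vocab):
    counts = Counter(tok for tok in text.split() if tok in vocab)
    return [counts[tok] for tok in sorted(vocab, key=vocab.get)]

# Stage 2 (stand-in for iterative training on the labeled second corpus):
# learn one centroid per category from (text, category) pairs.
def train_classifier(labeled, vocab):
    sums, counts = {}, {}
    for text, category in labeled:
        vec = featurize(text, vocab)
        acc = sums.setdefault(category, [0] * len(vocab))
        sums[category] = [a + v for a, v in zip(acc, vec)]
        counts[category] = counts.get(category, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(text, vocab, centroids):
    vec = featurize(text, vocab)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[c]))
    return min(centroids, key=dist)

first_corpus = ["billing question about fee", "outage report in district"]
second_corpus = [("fee question", "billing"), ("power outage", "outage")]
vocab = pretrain_vocab(first_corpus)
centroids = train_classifier(second_corpus, vocab)
print(predict("question about fee", vocab, centroids))
```

The design point is the separation of stages: stage 2 can be rerun on new labeled data to tune the model without redoing stage 1.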
Optionally, pre-training with the first corpus to obtain the first model includes: training with the first corpus through BERT to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
The above BERT includes a Transformer encoder. When predicting the masked content, all tokens corresponding to a masked word are masked. At the same time, with the overall masking rate kept unchanged, the first model can independently predict the token of each masked word.
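The masking step can be sketched as follows under common BERT conventions (the `[MASK]` symbol and the 15% default rate are the usual BERT setup, assumed here rather than specified by this document): when a word is selected for masking, all of its sub-word tokens are masked, and the model is then trained to predict each masked token independently.

```python
import random

MASK = "[MASK]"

def whole_word_mask(tokens, word_ids, mask_rate=0.15, seed=0):
    """Mask every token of each selected word (whole-word masking).

    tokens:   sub-word tokens of one piece of corpus
    word_ids: for each token, the index of the word it belongs to
    Returns the masked sequence and the {position: original token} labels
    the model must predict.
    """
    rng = random.Random(seed)
    words = sorted(set(word_ids))
    n_to_mask = max(1, round(mask_rate * len(words)))  # keep overall rate fixed
    chosen = set(rng.sample(words, n_to_mask))
    masked, labels = [], {}
    for i, (tok, wid) in enumerate(zip(tokens, word_ids)):
        if wid in chosen:                 # mask all tokens of the chosen word
            masked.append(MASK)
            labels[i] = tok               # prediction target for this position
        else:
            masked.append(tok)
    return masked, labels

tokens = ["text", "class", "##ification", "process", "##ing", "method"]
word_ids = [0, 1, 1, 2, 2, 3]
masked, labels = whole_word_mask(tokens, word_ids, mask_rate=0.25)
print(masked, labels)
```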
Optionally, the text includes work order text, and the category includes the type of the work order, where the type includes at least one category.
The above work order text may include, but is not limited to, 95598 work orders, where work order types can be divided according to application requirements; for example, different work order types can be defined according to distance, entry time, work order level, and so on.
An optional implementation of the present invention is described below.
Taking 95598 work orders as an example, Fig. 2 is a flowchart of the training of a classification model according to an optional embodiment of the present invention. As shown in Fig. 2, when a customer service agent takes a call, the agent manually enters the work order content as two parts, a category and a text. After the category and the text have each been cleaned and proofread, the text content is fed into the already trained classification model. The predictions of the classification model are then compared with the manually entered categories to obtain evaluation metrics for the current model, which are used to assess its performance.
At the same time, the current model performance is used to decide whether the new comparison results and text content should be used to continue tuning and updating the model. This ensures the real-time effectiveness of the model, avoids unpredictable model drift, and gives the model the possibility of continuous use and optimization.
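The comparison-and-retraining decision above can be sketched as follows (the accuracy metric and the 0.9 threshold are assumptions for the example; the document does not fix a particular metric or threshold):

```python
def evaluate(predicted, manual):
    """Fraction of work orders where the model agrees with the manual category."""
    assert len(predicted) == len(manual)
    hits = sum(p == m for p, m in zip(predicted, manual))
    return hits / len(manual)

def needs_retraining(accuracy, threshold=0.9):
    # Below the threshold, the new comparison results and text content
    # are fed back to continue tuning and updating the model.
    return accuracy < threshold

predicted = ["billing", "outage", "billing", "complaint"]
manual    = ["billing", "outage", "complaint", "complaint"]
acc = evaluate(predicted, manual)
print(acc, needs_retraining(acc))   # 0.75 True
```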
It should be noted that, in the above implementation, the scheme provides an automatic classification function for 95598 work orders based on text content; provides real-time monitoring and display of model performance, which facilitates model maintenance; gives the model the ability to be continuously updated and optimized, so that it can be tuned continuously in the course of actual business; provides a degree of adaptability to trend changes in text work orders; and defines a way of using the model in actual business processes.
In addition, the above work order classification model not only improves prediction precision but also realizes the work order classification function required by the business.
Embodiment 2
According to another aspect of the embodiments of the present invention, an apparatus embodiment for executing the text classification processing method of Embodiment 1 is also provided. Fig. 3 is a schematic diagram of a text classification processing apparatus according to an embodiment of the present invention. As shown in Fig. 3, the text classification processing apparatus includes: an acquisition module 302, an input module 304, an output module 306, and a saving module 308. The text classification processing apparatus is described in detail below.
The acquisition module 302 is configured to obtain the text to be classified.
The input module 304, connected to the acquisition module 302, is configured to input the text to be classified into a model, where the model is obtained through machine learning training using training data.
The output module 306, connected to the input module 304, is configured to take the output obtained from the model as the category corresponding to the text to be classified.
The saving module 308, connected to the output module 306, is configured to save the text to be classified and its corresponding category.
The above apparatus can identify the category corresponding to the text to be classified through the model obtained by machine learning training and save the result, achieving fast and accurate classification. This achieves the technical effect of improving the efficiency of text classification, thereby solving the technical problem in the prior art of relying on manual classification of text.
It should be noted here that the acquisition module 302, the input module 304, the output module 306, and the saving module 308 correspond to steps S102 to S108 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1. It should be noted that, as part of the apparatus, the above modules can be executed in a computer system such as a set of computer-executable instructions.
Optionally, the apparatus further includes: a training module configured to train with multiple sets of training data through machine learning to obtain the model.
Using multiple sets of training data means using a large amount of training data. A model obtained through machine learning training on a large amount of training data therefore recognizes or predicts better, so that classification precision and accuracy are greatly improved.
As an optional embodiment, in the process of training the model, the attention mechanism of the Transformer can be used in place of the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; that is, training is a sequential process in which each computation can begin only after the previous one has finished. The Transformer does not use an RNN, so all computations can be performed in parallel, thereby increasing training speed.
In addition, in an RNN, if the first frame is to establish a dependency on the tenth frame, the data of the first frame must be passed through the second, third, fourth, fifth, ..., ninth frames in turn before the two can interact, and in the course of this transfer the data of the first frame may already have drifted, so neither the speed nor the accuracy of this interaction is guaranteed. In the Transformer, by contrast, self-attention gives any two frames a direct interaction and thus a direct dependency, no matter how far apart they are, which improves training accuracy.
Optionally, the training module includes: a first training unit configured to pre-train with a first corpus to obtain a first model; and a second training unit configured to iteratively train the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, each set of data including a text and the category corresponding to that text.
The first model is pre-trained on the first corpus and then iteratively trained on the second corpus to obtain the final model. Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and the category corresponding to that text. Through these different training stages, the model can be continuously tuned and updated, effectively improving its stability.
Optionally, the first training unit is configured to: train with the first corpus through BERT to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
The above BERT includes a Transformer encoder. When predicting the masked content, all tokens corresponding to a masked word are masked. At the same time, with the overall masking rate kept unchanged, the first model can independently predict the token of each masked word.
Optionally, the text includes work order text, and the category includes the type of the work order, where the type includes at least one category.
The above work order text may include, but is not limited to, 95598 work orders, where work order types can be divided according to application requirements; for example, different work order types can be defined according to distance, entry time, work order level, and so on.
Embodiment 3
According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium includes a stored program, and the device on which the storage medium resides is controlled to execute any one of the above text classification processing methods when the program runs.
Embodiment 4
According to another aspect of the embodiments of the present invention, a processor is also provided. The processor is configured to run a program, and any one of the above text classification processing methods is executed when the program runs.
The sequence numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own focus. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

  1. A text classification processing method, comprising:
    obtaining text to be classified;
    inputting the text to be classified into a model, wherein the model is obtained through machine learning training using training data;
    taking the output obtained from the model as the category corresponding to the text to be classified; and
    saving the text to be classified and its corresponding category.
  2. The method according to claim 1, wherein, before obtaining the text to be classified, the method further comprises:
    training with multiple sets of training data through machine learning to obtain the model.
  3. The method according to claim 2, wherein training through machine learning to obtain the model comprises:
    pre-training with a first corpus to obtain a first model; and
    iteratively training the first model with a second corpus to obtain the model, wherein the second corpus comprises multiple sets of data, each set of data comprising a text and the category corresponding to that text.
  4. The method according to claim 3, wherein pre-training with the first corpus to obtain the first model comprises:
    training with the first corpus through BERT to obtain the first model, wherein part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  5. The method according to any one of claims 1 to 4, wherein the text comprises work order text, and the category comprises a type of the work order, wherein the type comprises at least one category.
  6. A text classification processing apparatus, comprising:
    an acquisition module, configured to obtain text to be classified;
    an input module, configured to input the text to be classified into a model, wherein the model is obtained through machine learning training using training data;
    an output module, configured to take the output obtained from the model as the category corresponding to the text to be classified; and
    a saving module, configured to save the text to be classified and its corresponding category.
  7. The apparatus according to claim 6, further comprising:
    a training module, configured to train with multiple sets of training data through machine learning to obtain the model.
  8. The apparatus according to claim 7, wherein the training module comprises:
    a first training unit, configured to pre-train with a first corpus to obtain a first model; and
    a second training unit, configured to iteratively train the first model with a second corpus to obtain the model, wherein the second corpus comprises multiple sets of data, each set of data comprising a text and the category corresponding to that text.
  9. The apparatus according to claim 8, wherein the first training unit is configured to:
    train with the first corpus through BERT to obtain the first model, wherein part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  10. The apparatus according to any one of claims 6 to 9, wherein the text comprises work order text, and the category comprises a type of the work order, wherein the type comprises at least one category.
PCT/CN2020/092099 2019-12-25 2020-05-25 Method and device for text classification WO2021128721A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911360673.7 2019-12-25
CN201911360673.7A CN111209394A (en) 2019-12-25 2019-12-25 Text classification processing method and device

Publications (1)

Publication Number Publication Date
WO2021128721A1 true WO2021128721A1 (en) 2021-07-01

Family

ID=70786462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092099 WO2021128721A1 (en) 2019-12-25 2020-05-25 Method and device for text classification

Country Status (2)

Country Link
CN (1) CN111209394A (en)
WO (1) WO2021128721A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861201A (en) * 2020-07-17 2020-10-30 南京汇宁桀信息科技有限公司 Intelligent government affair order dispatching method based on big data classification algorithm
CN112949674A (en) * 2020-08-22 2021-06-11 上海昌投网络科技有限公司 Multi-model fused corpus generation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213860A (en) * 2018-07-26 2019-01-15 中国科学院自动化研究所 Merge the text sentiment classification method and device of user information
CN109670167A (en) * 2018-10-24 2019-04-23 国网浙江省电力有限公司 A kind of electric power customer service work order emotion quantitative analysis method based on Word2Vec
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
US10354203B1 (en) * 2018-01-31 2019-07-16 Sentio Software, Llc Systems and methods for continuous active machine learning with document review quality monitoring
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method


Also Published As

Publication number Publication date
CN111209394A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN109635117B (en) Method and device for recognizing user intention based on knowledge graph
WO2021051517A1 (en) Information retrieval method based on convolutional neural network, and device related thereto
CN109033284A (en) The power information operational system database construction method of knowledge based map
WO2023093116A1 (en) Method and apparatus for determining industrial chain node of enterprise, and terminal and storage medium
US11741094B2 (en) Method and system for identifying core product terms
CN108021651B (en) Network public opinion risk assessment method and device
US20220277005A1 (en) Semantic parsing of natural language query
CN110866799A (en) System and method for monitoring online retail platform using artificial intelligence
CN108108426A (en) Understanding method, device and the electronic equipment that natural language is putd question to
CN112559687B (en) Question identification and query method and device, electronic equipment and storage medium
CN108874783A (en) Power information O&M knowledge model construction method
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
CN112966089A (en) Problem processing method, device, equipment, medium and product based on knowledge base
US20220100967A1 (en) Lifecycle management for customized natural language processing
WO2021128721A1 (en) Method and device for text classification
CN111242710A (en) Business classification processing method and device, service platform and storage medium
CN107480270A (en) A kind of real time individual based on user feedback data stream recommends method and system
CN111694957B (en) Method, equipment and storage medium for classifying problem sheets based on graph neural network
KR20210063882A (en) A method and an apparatus for analyzing marketing information based on knowledge graphs supporting efficient classifying documents processing
KR20210063878A (en) A method and an apparatus for providing chatbot services of analyzing marketing information
CN113553431A (en) User label extraction method, device, equipment and medium
Lo et al. An emperical study on application of big data analytics to automate service desk business process
KR20210063879A (en) Computer program and recording medium for providing chatbot services of analyzing marketing information
WO2023225093A1 (en) System for and a method of graph model generation
CN116090450A (en) Text processing method and computing device

Legal Events

Code Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20905055; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20905055; Country of ref document: EP; Kind code of ref document: A1)