
WO2021128721A1 - Method and device for text classification - Google Patents


Info

Publication number
WO2021128721A1
WO2021128721A1 (PCT/CN2020/092099)
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
training
corpus
classified
Prior art date
Application number
PCT/CN2020/092099
Other languages
French (fr)
Chinese (zh)
Inventor
张禄
及洪泉
姚晓明
胡彩娥
丁屹峰
王培祎
马龙飞
陆斯悦
王健
徐蕙
Original Assignee
国网北京市电力公司
国家电网有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国网北京市电力公司, 国家电网有限公司
Publication of WO2021128721A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to the field of text classification, and in particular, to a text classification processing method and device.
  • in the context of the ubiquitous power Internet of Things, the 95598 customer service system, as an important part of its applications, has registered massive amounts of customer information. Work order analysis currently relies mainly on manual statistics, which leads to problems such as insufficient efficiency. Because the volume of customer demand data in the 95598 system is large, manual classification is inefficient and cannot achieve accurate, efficient classification.
  • the embodiments of the present invention provide a text classification processing method and device to at least solve the technical problem of manually classifying text in the prior art.
  • a text classification processing method, including: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; taking the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified together with its corresponding category.
  • before the text to be classified is acquired, the method further includes: using multiple sets of training data to train through machine learning to obtain the model.
  • training to obtain the model through machine learning includes: using a first corpus to perform pre-training to obtain a first model; and using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • using the first corpus to perform pre-training to obtain the first model includes: training through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • a text classification processing device, including: an acquisition module for acquiring the text to be classified; an input module for inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; an output module for taking the output obtained from the model as the category corresponding to the text to be classified; and a saving module for saving the text to be classified together with its corresponding category.
  • a training module configured to use multiple sets of training data to train through machine learning to obtain the model.
  • the training module includes: a first training unit for pre-training with a first corpus to obtain the first model; and a second training unit for iteratively training the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • the first training unit is configured to: train through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • a storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute any one of the above text classification processing methods.
  • a processor configured to run a program, where any one of the text classification processing methods described above is executed when the program runs.
  • the text to be classified is obtained; the text to be classified is input into a model, where the model is obtained through machine learning training using training data; the output obtained from the model is taken as the category corresponding to the text to be classified; and the text to be classified is saved together with its corresponding category. In this way, the model obtained through machine learning training recognizes the category corresponding to the text to be classified and saves it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
  • Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention
  • Fig. 2 is a flowchart of training of a classification model according to an optional embodiment of the present invention
  • Fig. 3 is a schematic diagram of a text classification processing device according to an embodiment of the present invention.
  • an embodiment of a text classification processing method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one here.
  • Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
  • Step S102: Obtain the text to be classified;
  • the above-mentioned text to be classified includes but is not limited to a work order.
  • the text to be classified can be obtained in a variety of ways, for example, using crawling software, manual entry, and so on.
  • obtaining the text to be classified in multiple ways can expand its sources, making the method suitable for a variety of application scenarios.
  • Step S104: Input the text to be classified into the model, where the model is obtained through machine learning training using training data;
  • the text to be classified can be processed through the model.
  • the model is a work order classification model. It should be noted that the above model is obtained through machine learning training using training data, and can realize automatic text classification.
  • Step S106: Take the output obtained from the model as the category corresponding to the text to be classified;
  • for the input text to be classified, the model can output the corresponding category.
  • This model can effectively improve the classification accuracy and improve the efficiency of text classification.
  • Step S108: Save the text to be classified and its corresponding category.
  • the text to be classified and its corresponding category can be saved in a predetermined format, where the predetermined format includes a text attribute and a category attribute: the text to be classified is saved in the location of the text attribute, and its corresponding category is saved in the location of the category attribute. It should be noted that the specific implementation is not limited to the above method.
  • the model obtained by machine learning training can be used to identify the category corresponding to the text to be classified and save it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
  • before the text to be classified is obtained, the method further includes: using multiple sets of training data to train through machine learning to obtain a model.
  • using multiple sets of training data here means using a large amount of training data; therefore, the model obtained through machine learning training on a large amount of training data has a better recognition or prediction effect, which greatly improves classification precision and accuracy.
  • the attention mechanism in the Transformer can be used to replace the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; in other words, this is a sequential process, and each computation must wait for the previous one to finish before it can proceed.
  • the Transformer does not use RNN, and all calculations can be performed in parallel, thereby increasing the speed of training.
  • in an RNN, for the first frame to interact with the tenth frame, its data must be passed through frames 2 to 9 in turn before the computation relating the two is produced.
  • by that point, the data of the first frame may already have been distorted, so neither the speed nor the accuracy of this interaction is guaranteed.
  • in the Transformer, because of self-attention, there is a direct interaction between any two frames, which establishes a direct dependency no matter how far apart the two are; this can improve the accuracy of training.
  • training to obtain a model through machine learning includes: using a first corpus to perform pre-training to obtain a first model; using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • the first model can be pre-trained and iteratively trained through the first corpus and the second corpus to obtain the final model.
  • Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and a category corresponding to the text.
  • using the first corpus to perform pre-training to obtain the first model includes: training through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during training, and the training is used to predict the masked content.
  • the above-mentioned BERT includes a Transformer encoder. When it is used to predict the masked content, all the tokens corresponding to a masked word are masked; at the same time, with the overall masking rate unchanged, the first model can independently predict the token of each masked word.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • the aforementioned work order text may include, but is not limited to, 95598 work orders, where the types of work orders can be divided according to application requirements; for example, different work order types can be divided according to distance, entry time, and work order level.
  • Figure 2 is a flowchart of the training of the classification model according to an optional embodiment of the present invention.
  • when a customer service call is received, the agent manually enters the work order content in two parts: category and text. After the corresponding cleaning and proofreading work is done on the category and the text, the text content enters the already-trained classification model. The prediction of the classification model is then compared with the manually entered category, and evaluation indices for the current model are obtained to assess its performance.
  • the current model performance is used to determine whether it is necessary to use the new comparison results and text content to continue tuning and updating the model. This keeps the model up to date, avoids uncertain model deviations, and gives the model the possibility of continuous use and optimization.
  • automatic text-based classification can be provided for 95598 work orders; real-time monitoring and display of model performance are provided to facilitate model maintenance; the model can be continuously updated and optimized during actual business processes; and it has a certain adaptability to trend changes in text work orders.
  • Fig. 3 is a schematic diagram of the text classification processing device according to an embodiment of the present invention. As shown in Fig. 3, the device includes: an acquisition module 302, an input module 304, an output module 306, and a saving module 308. The device is described in detail below.
  • the obtaining module 302 is used to obtain the text to be classified
  • the input module 304, connected to the above acquisition module 302, is used to input the text to be classified into the model, where the model is obtained through machine learning training using training data;
  • the saving module 308, connected to the above output module 306, is used to save the text to be classified and its corresponding category.
  • the above device can recognize the category corresponding to the text to be classified through the model obtained by machine learning training and save it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
  • the above-mentioned acquisition module 302, input module 304, output module 306, and saving module 308 correspond to steps S102 to S108 in Embodiment 1.
  • the above modules and the corresponding steps implement the same examples and application scenarios, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that, as part of the device, the above modules can be executed in a computer system such as a set of computer-executable instructions.
  • a training module configured to use multiple sets of training data to train through machine learning to obtain a model.
  • using multiple sets of training data here means using a large amount of training data; therefore, the model obtained through machine learning training on a large amount of training data has a better recognition or prediction effect, which greatly improves classification precision and accuracy.
  • the attention mechanism in the Transformer can be used to replace the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; in other words, this is a sequential process, and each computation must wait for the previous one to finish before it can proceed.
  • the Transformer does not use RNN, and all calculations can be performed in parallel, thereby increasing the speed of training.
  • in an RNN, for the first frame to interact with the tenth frame, its data must be passed through frames 2 to 9 in turn before the computation relating the two is produced.
  • by that point, the data of the first frame may already have been distorted, so neither the speed nor the accuracy of this interaction is guaranteed.
  • in the Transformer, because of self-attention, there is a direct interaction between any two frames, which establishes a direct dependency no matter how far apart the two are; this can improve the accuracy of training.
  • the training module includes: a first training unit for pre-training using the first corpus to obtain the first model; and a second training unit for iterative training of the first model using the second corpus to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
  • the first model can be pre-trained and iteratively trained through the first corpus and the second corpus to obtain the final model.
  • Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and a category corresponding to the text.
  • the first training unit is used to: use the first corpus to train through BERT to obtain the first model, where part of the content of each corpus in the corpus is masked in the training, and the training is used to predict the masked content.
  • the above-mentioned BERT includes a Transformer encoder. When it is used to predict the masked content, all the tokens corresponding to a masked word are masked; at the same time, with the overall masking rate unchanged, the first model can independently predict the token of each masked word.
  • the text includes work order text
  • the category includes a type of the work order, where the type includes at least one class.
  • the aforementioned work order text may include, but is not limited to, 95598 work orders, where the types of work orders can be divided according to application requirements; for example, different work order types can be divided according to distance, entry time, and work order level.
  • a storage medium includes a stored program, wherein the device where the storage medium is located is controlled to execute any one of the above-mentioned text classification processing methods when the program is running.
  • a processor which is configured to run a program, where any one of the text classification processing methods described above is executed when the program is running.
  • the disclosed technical content can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units may be a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present invention essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • a computer device which can be a personal computer, a server, or a network device, etc.
  • the aforementioned storage media include: USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program code.
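The contrast drawn above between sequential RNN computation and parallel self-attention can be illustrated with a toy numeric sketch. Nothing below comes from the patent itself; it is a minimal, hypothetical illustration in which "frames" are scalars: the RNN must update a hidden state frame by frame, while self-attention links any two frames directly, so each output position can be computed independently (and hence in parallel).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    """Toy single-head self-attention over scalar 'frames'.
    Every output attends directly to every frame, so frame 1 and frame 10
    interact without routing through frames 2..9; each position's output
    is independent of the others and could be computed in parallel."""
    out = []
    for q in seq:
        weights = softmax([q * k for k in seq])  # direct link to every frame
        out.append(sum(w * v for w, v in zip(weights, seq)))
    return out

def rnn(seq):
    """Toy RNN: each step depends on the previous hidden state, so the
    computation is inherently sequential."""
    h = 0.0
    hidden = []
    for x in seq:
        h = math.tanh(0.5 * h + x)  # current step needs the previous h
        hidden.append(h)
    return hidden

frames = [0.1 * i for i in range(10)]
att = self_attention(frames)
hid = rnn(frames)
```

Because each attention output is a convex combination of the input frames, it always stays within their range; the RNN outputs are bounded by `tanh`.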

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for text classification. The method comprises: acquiring text to be classified (S102); inputting the text into a model, wherein the model is obtained through machine learning and training by using training data (S104); using the output acquired from the model as a category corresponding to the text (S106); and storing the text and the category corresponding thereto (S108). The technical problem that text classification in the prior art relies on a manual mode is solved.

Description

Text classification processing method and device

Technical Field

The present invention relates to the field of text classification, and in particular to a text classification processing method and device.

Background

In the context of the ubiquitous power Internet of Things, the 95598 customer service system, as an important part of its applications, has registered massive amounts of customer information. Work order analysis currently relies mainly on manual statistics, which leads to problems such as insufficient efficiency. Because the volume of customer demand data in the 95598 system is large, manual classification is inefficient and cannot achieve accurate, efficient classification.

In view of the above problems, no effective solution has yet been proposed.
Summary of the Invention

The embodiments of the present invention provide a text classification processing method and device, so as to at least solve the technical problem in the prior art of relying on manual work to classify text.
According to one aspect of the embodiments of the present invention, a text classification processing method is provided, including: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; taking the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified together with its corresponding category.
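The claimed method is a four-step pipeline: obtain text, run it through a trained model, take the model output as the category, and save the pair. The sketch below is a hypothetical illustration, not the patent's implementation: `toy_model` is a stand-in for the trained classifier, and the storage is just an in-memory list.

```python
from typing import Callable, Dict, List

def classify_and_save(
    texts: List[str],
    model: Callable[[str], str],
    store: List[Dict[str, str]],
) -> None:
    """Mirror of the claimed steps: obtain text, feed it to a trained model,
    take the model output as the category, and save the (text, category) pair."""
    for text in texts:                                      # step 1: text to be classified
        category = model(text)                              # steps 2-3: model output = category
        store.append({"text": text, "category": category})  # step 4: save both

# Hypothetical stand-in for a model trained by machine learning.
def toy_model(text: str) -> str:
    return "power-outage" if "outage" in text else "billing"

records: List[Dict[str, str]] = []
classify_and_save(["outage on my street", "question about my bill"], toy_model, records)
```

The saved records follow the "text attribute plus category attribute" format mentioned later in the embodiments.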
Optionally, before the text to be classified is obtained, the method further includes: using multiple sets of training data to train through machine learning to obtain the model.
Optionally, training through machine learning to obtain the model includes: using a first corpus to perform pre-training to obtain a first model; and using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.
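The two-stage scheme (pre-train on a first corpus, then iteratively fine-tune on labeled pairs from a second corpus) can be sketched in miniature. The functions below are illustrative stand-ins, not the patent's implementation: real pre-training and fine-tuning would update neural network weights, whereas here the "first model" is just token statistics that the fine-tuning stage reuses.

```python
def pretrain(first_corpus):
    """Stage 1: self-supervised pre-training on an unlabeled corpus yields a
    'first model' (here simply token counts standing in for learned weights)."""
    vocab = {}
    for sentence in first_corpus:
        for token in sentence.split():
            vocab[token] = vocab.get(token, 0) + 1
    return vocab

def fine_tune(first_model, second_corpus, epochs=3):
    """Stage 2: iterative training on (text, category) pairs from the second
    corpus, repeated over several epochs, reusing the pre-trained model."""
    weights = {}
    for _ in range(epochs):
        for text, category in second_corpus:
            for token in text.split():
                if token in first_model:  # reuse pre-trained knowledge
                    key = (token, category)
                    weights[key] = weights.get(key, 0) + 1
    return weights

first = pretrain(["power outage downtown", "billing question today"])
model = fine_tune(first, [("power outage", "outage"), ("billing question", "billing")])
```

The point of the sketch is the data flow: the second stage consumes both the first model and the labeled second corpus, matching the claim's structure.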
Optionally, using the first corpus to perform pre-training to obtain the first model includes: training through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
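The masking step described here (hide part of each piece of corpus and train the model to predict what was hidden) can be sketched as follows. This is a hypothetical illustration of the masking procedure only, not BERT itself; the 15% mask rate and the `[MASK]` token are assumptions borrowed from common masked-language-model practice.

```python
import random

def mask_corpus_line(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide part of one piece of corpus, BERT-style: a fraction of tokens is
    replaced by a mask symbol, and the training objective is to predict the
    hidden originals (returned here as position -> original token)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # what the model must predict
        else:
            masked.append(tok)
    if not targets:  # ensure at least one position is masked
        targets[0] = tokens[0]
        masked[0] = mask_token
    return masked, targets

masked, targets = mask_corpus_line("the customer reports a power outage".split())
```

Substituting the targets back at their positions reconstructs the original line, which is exactly the prediction task the pre-training optimizes.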
Optionally, the text includes work order text, and the category includes a type of the work order, where the type includes at least one class.
According to another aspect of the embodiments of the present invention, a text classification processing device is also provided, including: an acquisition module for obtaining the text to be classified; an input module for inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; an output module for taking the output obtained from the model as the category corresponding to the text to be classified; and a saving module for saving the text to be classified together with its corresponding category.

Optionally, the device further includes: a training module for using multiple sets of training data to train through machine learning to obtain the model.

Optionally, the training module includes: a first training unit for pre-training with a first corpus to obtain the first model; and a second training unit for iteratively training the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, and each set of data includes a text and the category corresponding to that text.

Optionally, the first training unit is configured to: train through BERT with the first corpus to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.

Optionally, the text includes work order text, and the category includes a type of the work order, where the type includes at least one class.

According to another aspect of the embodiments of the present invention, a storage medium is also provided, the storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute any one of the above text classification processing methods.

According to another aspect of the embodiments of the present invention, a processor is also provided, the processor being configured to run a program, where any one of the above text classification processing methods is executed when the program runs.

In the embodiments of the present invention, the following approach is adopted: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; taking the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified together with its corresponding category. The model obtained through machine learning training recognizes the category corresponding to the text to be classified and saves it, achieving the purpose of fast and accurate classification, thereby achieving the technical effect of improving the efficiency of text classification and solving the technical problem in the prior art of relying on manual work to classify text.
Brief Description of the Drawings

The drawings described here are used to provide a further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of the training of a classification model according to an optional embodiment of the present invention;

Fig. 3 is a schematic diagram of a text classification processing device according to an embodiment of the present invention.
Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

It should be noted that the terms "first" and "second" in the specification, claims, and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments of the present invention described here can be implemented in an order other than those illustrated or described here. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
实施例1Example 1
根据本发明实施例,提供了一种文本分类处理方法的实施例,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。According to an embodiment of the present invention, an embodiment of a text classification processing method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and, Although a logical sequence is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than here.
图1是根据本发明实施例的文本分类处理方法的流程图,如图1所示,该方法包括如下步骤:Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain the text to be classified.
The text to be classified includes, but is not limited to, work orders. The text to be classified can be obtained in a variety of ways, for example, by crawling software or by manual entry. In a specific implementation, obtaining the text in multiple ways broadens the sources of text to be classified, so that the method is applicable to a variety of application scenarios.
Step S104: input the text to be classified into a model, where the model is obtained through machine learning training using training data.
When the above model is a classification model, the text to be classified can be processed by the model. Optionally, the model is a work order classification model. It should be noted that, since the above model is obtained through machine learning training using training data, automatic text classification can be realized.
Step S106: take the output obtained from the model as the category corresponding to the text to be classified.
Through the above model, the input text to be classified is mapped to its corresponding category. The model can effectively improve classification accuracy and the efficiency of text classification.
Step S108: save the text to be classified and its corresponding category.
As an optional embodiment, the text to be classified and its corresponding category can be saved in a predetermined format, where the predetermined format includes a text attribute and a category attribute: the text to be classified is saved at the position of the text attribute, and the category corresponding to the text is saved at the position of the category attribute. It should be noted that, in a specific implementation, the saving is not limited to the above manner.
Through the above steps, the model obtained by machine learning training can identify the category corresponding to the text to be classified and save the result, achieving fast and accurate classification. This achieves the technical effect of improving the efficiency of text classification, thereby solving the technical problem in the prior art of relying on manual classification of text.
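As an illustrative sketch only (not the claimed implementation), steps S102 to S108 can be strung together as a small pipeline. The `classify` function is a hypothetical stand-in for the trained model, and the record with `text` and `category` keys mirrors the predetermined format with a text attribute and a category attribute described above:

```python
import json

def classify(text: str) -> str:
    # Hypothetical stand-in for the trained classification model.
    return "billing" if "fee" in text else "other"

def process_text(text: str, store: list) -> dict:
    # Step S102: obtain the text to be classified (passed in here).
    # Steps S104/S106: input the text into the model; its output is the category.
    category = classify(text)
    # Step S108: save the text and its category in the predetermined format,
    # with the text at the text attribute and the category at the category attribute.
    record = {"text": text, "category": category}
    store.append(record)
    return record

store = []
record = process_text("customer asks about an electricity fee bill", store)
print(json.dumps(record))
```

In a deployment, `store.append` would be replaced by a write to a database or file; the shape of the saved record is what matters here.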
Optionally, before obtaining the text to be classified, the method further includes: training with multiple sets of training data through machine learning to obtain the model.
Using multiple sets of training data means using a large amount of training data. A model obtained through machine learning training on a large amount of training data therefore recognizes or predicts better, so that classification precision and accuracy are greatly improved.
As an optional embodiment, in the process of training the model, the attention mechanism of the Transformer can be used in place of the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; that is, training is a sequential process in which each computation can begin only after the previous one has finished. The Transformer does not use an RNN, so all computations can be performed in parallel, thereby increasing training speed.
In addition, in an RNN, if the first frame is to establish a dependency on the tenth frame, the data of the first frame must be passed through the second, third, fourth, fifth, ..., ninth frames in turn before the two can interact, and in the course of this transfer the data of the first frame may already have drifted, so neither the speed nor the accuracy of this interaction is guaranteed. In the Transformer, by contrast, self-attention gives any two frames a direct interaction and thus a direct dependency, no matter how far apart they are, which improves training accuracy.
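The contrast drawn above can be sketched with a minimal scaled dot-product self-attention in NumPy (an illustrative sketch, not the patented implementation): the interaction between every pair of positions is computed in a single matrix product, so frame 1 and frame 10 interact directly, with no step-by-step recurrence. Using identity query/key/value projections is a simplifying assumption; a real Transformer layer uses learned weight matrices.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal scaled dot-product self-attention over a (seq_len, d) sequence."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)            # (seq_len, seq_len): every pair interacts directly
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                        # weighted mix of all positions, computed in parallel

x = np.random.default_rng(0).normal(size=(10, 4))   # 10 "frames", dimension 4
out = self_attention(x)
print(out.shape)   # (10, 4): frame 1 attends to frame 10 without passing through frames 2..9
```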
Optionally, training through machine learning to obtain the model includes: pre-training with a first corpus to obtain a first model; and iteratively training the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, each set of data including a text and the category corresponding to that text.
The first model is pre-trained on the first corpus and then iteratively trained on the second corpus to obtain the final model. Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and the category corresponding to that text. Through these different training stages, the model can be continuously tuned and updated, effectively improving its stability.
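The two-stage scheme can be sketched with a deliberately simplified stand-in (all names and the nearest-centroid classifier are assumptions for illustration; the actual embodiment pre-trains BERT): an unsupervised pass over the first corpus fixes the representation, and supervised training on the labeled (text, category) pairs of the second corpus produces the classifier.

```python
from collections import Counter

# Stage 1 (stand-in for pre-training on the first corpus): build a vocabulary
# from text alone; it fixes the feature space the later stage relies on.
def pretrain_vocab(corpus):
    vocab = sorted({tok for text in corpus for tok in text.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def featurize(text, vocab):
    counts = Counter(tok for tok in text.split() if tok in vocab)
    return [counts[tok] for tok in sorted(vocab, key=vocab.get)]

# Stage 2 (stand-in for iterative training on the labeled second corpus):
# learn one centroid per category from (text, category) pairs.
def train_classifier(labeled, vocab):
    sums, counts = {}, {}
    for text, category in labeled:
        vec = featurize(text, vocab)
        acc = sums.setdefault(category, [0] * len(vocab))
        sums[category] = [a + v for a, v in zip(acc, vec)]
        counts[category] = counts.get(category, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(text, vocab, centroids):
    vec = featurize(text, vocab)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[c]))
    return min(centroids, key=dist)

first_corpus = ["billing question about fee", "outage report in district"]
second_corpus = [("fee question", "billing"), ("power outage", "outage")]
vocab = pretrain_vocab(first_corpus)
centroids = train_classifier(second_corpus, vocab)
print(predict("question about fee", vocab, centroids))
```

The design point is the separation of stages: stage 2 can be rerun on new labeled data to tune the model without redoing stage 1.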
Optionally, pre-training with the first corpus to obtain the first model includes: training with the first corpus through BERT to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
The above BERT includes a Transformer encoder. When predicting the masked content, all tokens corresponding to a masked word are masked. At the same time, with the overall masking rate kept unchanged, the first model can independently predict the token of each masked word.
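The masking step can be sketched as follows under common BERT conventions (the `[MASK]` symbol and the 15% default rate are the usual BERT setup, assumed here rather than specified by this document): when a word is selected for masking, all of its sub-word tokens are masked, and the model is then trained to predict each masked token independently.

```python
import random

MASK = "[MASK]"

def whole_word_mask(tokens, word_ids, mask_rate=0.15, seed=0):
    """Mask every token of each selected word (whole-word masking).

    tokens:   sub-word tokens of one piece of corpus
    word_ids: for each token, the index of the word it belongs to
    Returns the masked sequence and the {position: original token} labels
    the model must predict.
    """
    rng = random.Random(seed)
    words = sorted(set(word_ids))
    n_to_mask = max(1, round(mask_rate * len(words)))  # keep overall rate fixed
    chosen = set(rng.sample(words, n_to_mask))
    masked, labels = [], {}
    for i, (tok, wid) in enumerate(zip(tokens, word_ids)):
        if wid in chosen:                 # mask all tokens of the chosen word
            masked.append(MASK)
            labels[i] = tok               # prediction target for this position
        else:
            masked.append(tok)
    return masked, labels

tokens = ["text", "class", "##ification", "process", "##ing", "method"]
word_ids = [0, 1, 1, 2, 2, 3]
masked, labels = whole_word_mask(tokens, word_ids, mask_rate=0.25)
print(masked, labels)
```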
Optionally, the text includes work order text, and the category includes the type of the work order, where the type includes at least one category.
The above work order text may include, but is not limited to, 95598 work orders, where work order types can be divided according to application requirements; for example, different work order types can be defined according to distance, entry time, work order level, and so on.
An optional implementation of the present invention is described below.
Taking 95598 work orders as an example, Fig. 2 is a flowchart of the training of a classification model according to an optional embodiment of the present invention. As shown in Fig. 2, when a customer service agent takes a call, the agent manually enters the work order content as two parts, a category and a text. After the category and the text have each been cleaned and proofread, the text content is fed into the already trained classification model. The predictions of the classification model are then compared with the manually entered categories to obtain evaluation metrics for the current model, which are used to assess its performance.
At the same time, the current model performance is used to decide whether the new comparison results and text content should be used to continue tuning and updating the model. This ensures the real-time effectiveness of the model, avoids unpredictable model drift, and gives the model the possibility of continuous use and optimization.
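The comparison-and-retraining decision above can be sketched as follows (the accuracy metric and the 0.9 threshold are assumptions for the example; the document does not fix a particular metric or threshold):

```python
def evaluate(predicted, manual):
    """Fraction of work orders where the model agrees with the manual category."""
    assert len(predicted) == len(manual)
    hits = sum(p == m for p, m in zip(predicted, manual))
    return hits / len(manual)

def needs_retraining(accuracy, threshold=0.9):
    # Below the threshold, the new comparison results and text content
    # are fed back to continue tuning and updating the model.
    return accuracy < threshold

predicted = ["billing", "outage", "billing", "complaint"]
manual    = ["billing", "outage", "complaint", "complaint"]
acc = evaluate(predicted, manual)
print(acc, needs_retraining(acc))   # 0.75 True
```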
It should be noted that, in the above implementation, the scheme provides an automatic classification function for 95598 work orders based on text content; provides real-time monitoring and display of model performance, which facilitates model maintenance; gives the model the ability to be continuously updated and optimized, so that it can be tuned continuously in the course of actual business; provides a degree of adaptability to trend changes in text work orders; and defines a way of using the model in actual business processes.
In addition, the above work order classification model not only improves prediction precision but also realizes the work order classification function required by the business.
Embodiment 2
According to another aspect of the embodiments of the present invention, an apparatus embodiment for executing the text classification processing method of Embodiment 1 is also provided. Fig. 3 is a schematic diagram of a text classification processing apparatus according to an embodiment of the present invention. As shown in Fig. 3, the text classification processing apparatus includes: an acquisition module 302, an input module 304, an output module 306, and a saving module 308. The text classification processing apparatus is described in detail below.
The acquisition module 302 is configured to obtain the text to be classified.
The input module 304, connected to the acquisition module 302, is configured to input the text to be classified into a model, where the model is obtained through machine learning training using training data.
The output module 306, connected to the input module 304, is configured to take the output obtained from the model as the category corresponding to the text to be classified.
The saving module 308, connected to the output module 306, is configured to save the text to be classified and its corresponding category.
The above apparatus can identify the category corresponding to the text to be classified through the model obtained by machine learning training and save the result, achieving fast and accurate classification. This achieves the technical effect of improving the efficiency of text classification, thereby solving the technical problem in the prior art of relying on manual classification of text.
It should be noted here that the acquisition module 302, the input module 304, the output module 306, and the saving module 308 correspond to steps S102 to S108 in Embodiment 1. The examples and application scenarios implemented by these modules and the corresponding steps are the same, but are not limited to the content disclosed in Embodiment 1. It should be noted that, as part of the apparatus, the above modules can be executed in a computer system such as a set of computer-executable instructions.
Optionally, the apparatus further includes: a training module configured to train with multiple sets of training data through machine learning to obtain the model.
Using multiple sets of training data means using a large amount of training data. A model obtained through machine learning training on a large amount of training data therefore recognizes or predicts better, so that classification precision and accuracy are greatly improved.
As an optional embodiment, in the process of training the model, the attention mechanism of the Transformer can be used in place of the original RNN. When an RNN is trained, the computation at the current step depends on the hidden state of the previous step; that is, training is a sequential process in which each computation can begin only after the previous one has finished. The Transformer does not use an RNN, so all computations can be performed in parallel, thereby increasing training speed.
In addition, in an RNN, if the first frame is to establish a dependency on the tenth frame, the data of the first frame must be passed through the second, third, fourth, fifth, ..., ninth frames in turn before the two can interact, and in the course of this transfer the data of the first frame may already have drifted, so neither the speed nor the accuracy of this interaction is guaranteed. In the Transformer, by contrast, self-attention gives any two frames a direct interaction and thus a direct dependency, no matter how far apart they are, which improves training accuracy.
Optionally, the training module includes: a first training unit configured to pre-train with a first corpus to obtain a first model; and a second training unit configured to iteratively train the first model with a second corpus to obtain the model, where the second corpus includes multiple sets of data, each set of data including a text and the category corresponding to that text.
The first model is pre-trained on the first corpus and then iteratively trained on the second corpus to obtain the final model. Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and the category corresponding to that text. Through these different training stages, the model can be continuously tuned and updated, effectively improving its stability.
Optionally, the first training unit is configured to: train with the first corpus through BERT to obtain the first model, where part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
The above BERT includes a Transformer encoder. When predicting the masked content, all tokens corresponding to a masked word are masked. At the same time, with the overall masking rate kept unchanged, the first model can independently predict the token of each masked word.
Optionally, the text includes work order text, and the category includes the type of the work order, where the type includes at least one category.
The above work order text may include, but is not limited to, 95598 work orders, where work order types can be divided according to application requirements; for example, different work order types can be defined according to distance, entry time, work order level, and so on.
Embodiment 3
According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium includes a stored program, and the device on which the storage medium resides is controlled to execute any one of the above text classification processing methods when the program runs.
Embodiment 4
According to another aspect of the embodiments of the present invention, a processor is also provided. The processor is configured to run a program, and any one of the above text classification processing methods is executed when the program runs.
The sequence numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own focus. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

  1. A text classification processing method, comprising:
    obtaining text to be classified;
    inputting the text to be classified into a model, wherein the model is obtained through machine learning training using training data;
    taking the output obtained from the model as the category corresponding to the text to be classified; and
    saving the text to be classified and its corresponding category.
  2. The method according to claim 1, wherein, before obtaining the text to be classified, the method further comprises:
    training with multiple sets of training data through machine learning to obtain the model.
  3. The method according to claim 2, wherein training through machine learning to obtain the model comprises:
    pre-training with a first corpus to obtain a first model; and
    iteratively training the first model with a second corpus to obtain the model, wherein the second corpus comprises multiple sets of data, each set of data comprising a text and the category corresponding to that text.
  4. The method according to claim 3, wherein pre-training with the first corpus to obtain the first model comprises:
    training with the first corpus through BERT to obtain the first model, wherein part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  5. The method according to any one of claims 1 to 4, wherein the text comprises work order text, and the category comprises a type of the work order, wherein the type comprises at least one category.
  6. A text classification processing apparatus, comprising:
    an acquisition module, configured to obtain text to be classified;
    an input module, configured to input the text to be classified into a model, wherein the model is obtained through machine learning training using training data;
    an output module, configured to take the output obtained from the model as the category corresponding to the text to be classified; and
    a saving module, configured to save the text to be classified and its corresponding category.
  7. The apparatus according to claim 6, further comprising:
    a training module, configured to train with multiple sets of training data through machine learning to obtain the model.
  8. The apparatus according to claim 7, wherein the training module comprises:
    a first training unit, configured to pre-train with a first corpus to obtain a first model; and
    a second training unit, configured to iteratively train the first model with a second corpus to obtain the model, wherein the second corpus comprises multiple sets of data, each set of data comprising a text and the category corresponding to that text.
  9. The apparatus according to claim 8, wherein the first training unit is configured to:
    train with the first corpus through BERT to obtain the first model, wherein part of the content of each piece of corpus in the corpus is masked during the training, and the training is used to predict the masked content.
  10. The apparatus according to any one of claims 6 to 9, wherein the text comprises work order text, and the category comprises a type of the work order, wherein the type comprises at least one category.
PCT/CN2020/092099 2019-12-25 2020-05-25 Method and device for text classification WO2021128721A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911360673.7 2019-12-25
CN201911360673.7A CN111209394A (en) 2019-12-25 2019-12-25 Text classification processing method and device

Publications (1)

Publication Number Publication Date
WO2021128721A1 true WO2021128721A1 (en) 2021-07-01

Family

ID=70786462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092099 WO2021128721A1 (en) 2019-12-25 2020-05-25 Method and device for text classification

Country Status (2)

Country Link
CN (1) CN111209394A (en)
WO (1) WO2021128721A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861201A (en) * 2020-07-17 2020-10-30 南京汇宁桀信息科技有限公司 Intelligent government affair order dispatching method based on big data classification algorithm
CN112949674A (en) * 2020-08-22 2021-06-11 上海昌投网络科技有限公司 Multi-model fused corpus generation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213860A (en) * 2018-07-26 2019-01-15 中国科学院自动化研究所 Merge the text sentiment classification method and device of user information
CN109670167A (en) * 2018-10-24 2019-04-23 国网浙江省电力有限公司 A kind of electric power customer service work order emotion quantitative analysis method based on Word2Vec
CN109710825A (en) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 Webpage harmful information identification method based on machine learning
US10354203B1 (en) * 2018-01-31 2019-07-16 Sentio Software, Llc Systems and methods for continuous active machine learning with document review quality monitoring
CN110489521A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Text categories detection method, device, electronic equipment and computer-readable medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method


Also Published As

Publication number Publication date
CN111209394A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN109635117B (en) Method and device for recognizing user intention based on knowledge graph
WO2021051517A1 (en) Information retrieval method based on convolutional neural network, and device related thereto
CN109033284A (en) The power information operational system database construction method of knowledge based map
WO2023093116A1 (en) Method and apparatus for determining industrial chain node of enterprise, and terminal and storage medium
US11741094B2 (en) Method and system for identifying core product terms
CN108021651B (en) Network public opinion risk assessment method and device
US20220277005A1 (en) Semantic parsing of natural language query
CN110866799A (en) System and method for monitoring online retail platform using artificial intelligence
CN108108426A (en) Understanding method, device and the electronic equipment that natural language is putd question to
CN112559687B (en) Question identification and query method and device, electronic equipment and storage medium
CN108874783A (en) Power information O&M knowledge model construction method
WO2023065642A1 (en) Corpus screening method, intention recognition model optimization method, device, and storage medium
CN112966089A (en) Problem processing method, device, equipment, medium and product based on knowledge base
US20220100967A1 (en) Lifecycle management for customized natural language processing
WO2021128721A1 (en) Method and device for text classification
CN111242710A (en) Business classification processing method and device, service platform and storage medium
CN107480270A (en) A kind of real time individual based on user feedback data stream recommends method and system
CN111694957B (en) Method, equipment and storage medium for classifying problem sheets based on graph neural network
KR20210063882A (en) A method and an apparatus for analyzing marketing information based on knowledge graphs supporting efficient classifying documents processing
KR20210063878A (en) A method and an apparatus for providing chatbot services of analyzing marketing information
CN113553431A (en) User label extraction method, device, equipment and medium
Lo et al. An emperical study on application of big data analytics to automate service desk business process
KR20210063879A (en) Computer program and recording medium for providing chatbot services of analyzing marketing information
WO2023225093A1 (en) System for and a method of graph model generation
CN116090450A (en) Text processing method and computing device

Legal Events

Code Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20905055; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20905055; Country of ref document: EP; Kind code of ref document: A1)