
CN109189932B - Text classification method and device and computer-readable storage medium - Google Patents


Info

Publication number
CN109189932B
CN109189932B (application CN201811035883.4A)
Authority
CN
China
Prior art keywords
classification
text
word
label
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811035883.4A
Other languages
Chinese (zh)
Other versions
CN109189932A (en)
Inventor
林江华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201811035883.4A
Publication of CN109189932A
Application granted
Publication of CN109189932B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text classification method and apparatus, and a computer-readable storage medium. The text classification method comprises the following steps: classifying a plurality of labeled corpora using a text classification model to obtain a model classification label for each labeled corpus; selecting labeled corpora whose model classification labels are inconsistent with the corresponding labeled classification labels as sample corpora; converting the text in each sample corpus into a word list; classifying the word combinations extracted from each word list according to the model classification labels to obtain the word combinations under each model classification label; and generating a classification adjustment template from the word combinations, wherein the classification adjustment template comprises an original classification label, template content and an adjusted classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the labeled classification label of the sample corpus corresponding to the word combination.

Description

Text classification method and device and computer-readable storage medium
Technical Field
The present disclosure relates to the field of computers, and in particular, to a text classification method and apparatus, and a computer-readable storage medium.
Background
Text classification techniques are widely used in electronic text information processing. The development of deep learning has further expanded the application scenarios of text classification.
Text classification techniques based on deep learning generally involve: determining a classification standard; collecting and labeling corpora to form a corpus; training a classification model with the corpus; and classifying other texts with the trained classification model.
Disclosure of Invention
Due to the limitations of the corpus and of deep learning, the accuracy of the classification model cannot reach 100%, and the remaining gap in accuracy is difficult to close by further optimizing the classification model.
In view of this, the present disclosure provides a text classification scheme that can further improve the accuracy of text classification.
According to some embodiments of the present disclosure, there is provided a text classification method including: classifying a plurality of labeled corpora using a text classification model to obtain a model classification label for each labeled corpus; selecting labeled corpora whose model classification labels are inconsistent with the corresponding labeled classification labels as sample corpora; converting the text in each sample corpus into a word list; classifying the word combinations extracted from each word list according to the model classification labels to obtain the word combinations under each model classification label; and generating a classification adjustment template from the word combinations, wherein the classification adjustment template comprises an original classification label, template content and an adjusted classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the labeled classification label of the sample corpus corresponding to the word combination.
In some embodiments, the text classification method further comprises: deleting word combinations that appear under multiple model classification labels simultaneously.
In some embodiments, the text classification method further comprises: deleting word combinations whose number of occurrences in the sample corpora is smaller than a threshold.
In some embodiments, a word combination that appears multiple times in one sample corpus is counted only once.
In some embodiments, the classification adjustment template further comprises a priority reflecting the likelihood that the adjusted classification label is correct.
In some embodiments, the priority is expressed as priority = b / a, where a and b respectively represent the number of times the word combination in the template content appears in sample corpora under the original classification label and under the adjusted classification label.
In some embodiments, the priority is expressed as priority = (b / a) × (1 - 1 / (2c)), where c denotes the total number of sample corpora under the original classification label of the classification adjustment template.
In some embodiments, the text classification method further comprises: classifying the text to be classified using the text classification model to obtain a model classification label of the text to be classified; converting the text to be classified into a word list; taking a classification adjustment template meeting the following conditions as a matching result: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template; determining the matching result with the highest priority as the matching classification adjustment template in the case that at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold; and modifying the model classification label of the text to be classified into the adjusted classification label of the matching classification adjustment template as the classification result.
In some embodiments, the text is converted into a word list by performing word segmentation and stop-word removal on the text.
In some embodiments, the order between words in the word list is the same as in the corresponding text.
According to further embodiments of the present disclosure, there is provided a text classification apparatus including: a classification unit configured to classify the plurality of labeled corpora using the text classification model to obtain the model classification label of each labeled corpus; a selection unit configured to select labeled corpora whose model classification labels are inconsistent with the corresponding labeled classification labels as sample corpora; a conversion unit configured to convert the texts in the sample corpora into word lists respectively; a grouping unit configured to classify the word combinations extracted from each word list according to the model classification labels to obtain the word combinations under each model classification label; and a generating unit configured to generate a classification adjustment template from the word combinations, wherein the classification adjustment template comprises an original classification label, template content and an adjusted classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the labeled classification label of the sample corpus corresponding to the word combination.
In some embodiments, the text classification apparatus further comprises: a deletion unit configured to delete word combinations that appear under multiple model classification labels simultaneously, or word combinations whose number of occurrences in the sample corpora is smaller than a threshold.
In some embodiments, the text classification apparatus further comprises: a matching unit configured to take as a matching result a classification adjustment template satisfying the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template; a determining unit configured to determine the matching result with the highest priority as the matching classification adjustment template if there is at least one matching result and the priority of the highest-priority matching result is greater than or equal to a priority threshold; and an adjusting unit configured to modify the model classification label of the text to be classified into the adjusted classification label of the matching classification adjustment template as the classification result.
According to still further embodiments of the present disclosure, there is provided a text classification apparatus including: a memory and a processor coupled to the memory, the processor configured to perform the text classification method of any of the above embodiments based on instructions stored in the memory.
According to further embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the text classification method according to any of the embodiments described above.
In the above embodiments, the classification results of the text classification model are reprocessed to generate classification adjustment templates, so as to improve the accuracy of text classification. The generated classification adjustment templates do not affect the model training process or external callers, and can adapt to different model training approaches.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 illustrates a flow diagram of some embodiments of a text classification method according to the present disclosure;
FIG. 2 illustrates a flow diagram of further embodiments of a text classification method according to the present disclosure;
FIG. 3 illustrates a flow diagram of further embodiments of a text classification method according to the present disclosure;
FIG. 4 illustrates a block diagram of some embodiments of a text classification apparatus according to the present disclosure;
FIG. 5 shows a block diagram of further embodiments of a text classification apparatus according to the present disclosure;
FIG. 6 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the respective portions shown in the drawings are not drawn to scale.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 illustrates a flow diagram of some embodiments of a text classification method according to the present disclosure. As shown in FIG. 1, the text classification method includes steps S1-S5.
In step S1, the text classification model is used to classify the tagged corpora to obtain model classification tags of the tagged corpora.
In some embodiments, the text classification model is trained based on a deep-learning neural network. The labeled corpora can be obtained from the corpus used to train the text classification model. Each labeled corpus can contain fields such as the text and its labeled classification tag.
In step S2, a labeled corpus in which the model classification label is inconsistent with the corresponding labeled classification label is selected as a sample corpus.
In some embodiments, the labeled corpora whose model classification labels are inconsistent with the corresponding labeled classification labels can be filtered out by field matching. During training, the model classification labels of the labeled corpora are compared with the corresponding labeled classification labels, and the text classification model is adjusted using the comparison results so that the proportion of consistent labels increases. However, limited by the deep learning technique and the size of the corpus, the proportion of consistent labels cannot reach 100%, so the classification accuracy of the text classification model cannot reach the expected value. The present disclosure further processes the labeled corpora with inconsistent labels, on the basis of a fully trained text classification model, to generate classification adjustment templates that adjust the classification results of the text classification model, thereby further improving the accuracy of text classification.
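As a minimal sketch of steps S1 and S2 (the `classify` callable below is an assumed stand-in for the trained text classification model, and the tuple layout is illustrative, not prescribed by the disclosure):

```python
# Sketch of steps S1-S2: classify the labeled corpora with the trained model
# and keep the mismatches as sample corpora. `classify` is an assumed callable
# standing in for the trained text classification model.
def select_sample_corpora(labeled_corpora, classify):
    samples = []
    for text, labeled_tag in labeled_corpora:
        model_tag = classify(text)      # model classification label
        if model_tag != labeled_tag:    # inconsistent with the labeled tag
            samples.append((text, model_tag, labeled_tag))
    return samples
```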
In step S3, the texts in the sample corpuses are converted into word lists respectively.
In some embodiments, the text is converted into a word list by performing word segmentation and stop-word removal on the text. Stop words are, for example, words that have no influence on the semantics of the text.
Each sample corpus corresponds to one word list, in which the order of the words is the same as in the corresponding text. For example, for a sample corpus labeled "weather" whose text means roughly "wear as much as you can wear", the word list obtained after word segmentation is "can", "wear", "how much", "wear", "how much". As can be seen, the same word may appear at different positions in one list, such as "wear" and "how much" in this example.
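A minimal sketch of this conversion follows; the jieba segmenter and the tiny stop-word set are illustrative assumptions, since the disclosure does not prescribe a particular segmenter or stop-word list:

```python
import jieba  # assumed third-party segmenter, not mandated by the disclosure

STOP_WORDS = {"的", "了", "吗"}  # illustrative stop words only

def to_word_list(text: str) -> list[str]:
    # Segment the text and drop stop words; word order and duplicate words
    # are preserved, matching the behavior described above.
    return [w for w in jieba.lcut(text) if w not in STOP_WORDS]
```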
In step S4, the word combinations extracted from each word list are classified according to the model classification labels, and the word combinations under each model classification label are obtained.
In some embodiments, word combinations are extracted from the word list of each sample corpus with the word order kept unchanged, that is, the order of the words within one combination matches their order in the list. The length of the extracted word combinations can be chosen according to actual needs, for example 1 to 3. Again taking the sample corpus labeled "weather" above, and assuming only word combinations of length 2 are taken, the combinations "can wear", "wear how much", "how much wear", and so on are extracted.
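The extraction can be sketched as order-preserving n-grams over the word list; the length range of 1 to 3 is the example value above, not a fixed requirement:

```python
def extract_combinations(words: list[str], max_len: int = 3) -> list[tuple[str, ...]]:
    # Collect every contiguous run of 1..max_len words, keeping word order,
    # so each combination is an order-preserving n-gram of the word list.
    combos = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            combos.append(tuple(words[i:i + n]))
    return combos
```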
In step S5, a classification adjustment template is generated from the word combinations.
A classification adjustment template is a data record that specifies how to adjust a model classification result satisfying certain conditions to another classification. The classification adjustment template comprises an original classification label, template content, and an adjusted classification label. The template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the labeled classification label of the sample corpus corresponding to the word combination.
The generated classification adjustment templates are described below in conjunction with Tables 1-3. Table 1 shows a set of sample corpora with their model classification labels and labeled classification labels. As shown in Table 1, for the word combination "temperature very high", 2 classification adjustment templates can be generated: the template contents both include "temperature very high" and the original classification labels are both "weather", but the adjusted classification label of the 1st classification adjustment template is "mobile phone", while that of the 2nd classification adjustment template is "universe".
Model classification label | Labeled classification label | Sample corpus
weather | weather | The temperature in Beijing is very high
weather | weather | What will the weather be like tomorrow
weather | mobile phone | The temperature gets very high when playing games
weather | mobile phone | The charging temperature is so high it burns your hand
weather | universe | The surface temperature of the sun is very high

Table 1
The classification adjustment template may also include a priority, which reflects the likelihood that the adjusted classification label is correct. In some embodiments, the priority may be expressed as priority = b / a, where a and b respectively represent the number of times the word combination in the template content appears in sample corpora under the original classification label and under the adjusted classification label. According to the data shown in Table 1, the 1st classification adjustment template has a = 4 and b = 2, i.e., a priority of 0.5; the 2nd classification adjustment template has a = 4 and b = 1, i.e., a priority of 0.25.
Table 2 and Table 3 show examples of the generated 1st and 2nd classification adjustment templates, respectively.
Field name | Field value
Original classification label | weather
Template content | temperature very high
Adjusted classification label | mobile phone
Priority | 0.5

Table 2
Field name | Field value
Original classification label | weather
Template content | temperature very high
Adjusted classification label | universe
Priority | 0.25

Table 3
In other embodiments, the priority may also be expressed as priority = (b / a) × (1 - 1 / (2c)), where c represents the total number of sample corpora under the original classification label of the classification adjustment template. According to the data shown in Table 1, c = 5, so the priority of the 1st classification adjustment template becomes 0.45 and the priority of the 2nd classification adjustment template becomes about 0.23.
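A sketch of the template structure and both priority formulas follows; the smoothed form (b / a) × (1 - 1 / (2c)) is reconstructed to match the worked numbers above (0.45 and about 0.23) and should be read as an assumption rather than the definitive formula of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class AdjustmentTemplate:
    original_label: str        # model classification label
    content: tuple[str, ...]   # word combination in the template content
    adjusted_label: str        # labeled classification label
    priority: float            # likelihood that the adjusted label is correct

def priority_simple(a: int, b: int) -> float:
    # a, b: occurrences of the word combination in sample corpora under the
    # original and the adjusted classification label, respectively.
    return b / a

def priority_smoothed(a: int, b: int, c: int) -> float:
    # c: total sample corpora under the original label. The (1 - 1/(2c))
    # factor is reconstructed from the worked example and is an assumption.
    return (b / a) * (1 - 1 / (2 * c))

# Worked numbers from Table 1: a = 4, b = 2 (or 1), c = 5.
t1 = AdjustmentTemplate("weather", ("temperature", "very high"), "mobile phone",
                        priority_simple(4, 2))       # 0.5
t2 = AdjustmentTemplate("weather", ("temperature", "very high"), "universe",
                        priority_smoothed(4, 1, 5))  # 0.225, i.e. about 0.23
```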
According to actual needs, templates can also be added manually in the format of the classification adjustment template, as a supplement to the automatically generated ones. The automatically generated classification adjustment templates can also be modified or deleted manually to improve their effect. Manually curated classification adjustment templates have a higher priority, which can be set to 1.
FIG. 2 illustrates a flow diagram of further embodiments of a text classification method according to the present disclosure. FIG. 2 differs from FIG. 1 in that, after the word combinations under each model classification label are obtained, the text classification method further includes steps S41-S42.
In step S41, word combinations that simultaneously appear under a plurality of model classification labels are deleted.
The same word combination may occur in different sample corpora. A word combination that appears under several model classification labels has little influence on the classification result of a text. For example, words such as "me" and "of" may appear in many sample corpora but contribute little to the semantics of the text, so such word combinations can be deleted. That is, no classification adjustment template is generated for them. This reduces the work of generating classification adjustment templates without noticeably affecting the improvement in classification accuracy.
In step S42, word combinations that occur fewer than a threshold number of times in the sample corpora are deleted.
In some embodiments, when counting the number of occurrences of a word combination in the sample corpora, multiple occurrences of the same word combination within one sample corpus are counted only once. For example, for the sample corpus labeled "weather" above, the word combinations "wear", "how much", and "wear how much" are each counted once, even though "wear" and "how much" occur twice in the text.
A word combination that appears in fewer than a threshold number (e.g., 5) of sample corpora carries too little weight to justify a classification adjustment. For example, if the word combination "can wear" appears in only 4 sample corpora and the threshold is set to 5, that word combination can be deleted. That is, no classification adjustment template is generated for it. This reduces the work of generating classification adjustment templates without noticeably affecting the improvement in classification accuracy.
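Steps S41-S42 can be sketched together as one filtering pass; representing each corpus's combinations as a set implements the count-once-per-corpus rule above, and the input layout is an illustrative assumption:

```python
from collections import Counter, defaultdict

def filter_combinations(corpus_combos, min_count: int = 5):
    # corpus_combos: iterable of (model_label, combinations_of_one_corpus);
    # deduplicating per corpus counts a combination at most once per sample
    # corpus, as described above.
    labels_per_combo = defaultdict(set)
    corpus_counts = Counter()
    for label, combos in corpus_combos:
        for combo in set(combos):
            labels_per_combo[combo].add(label)  # model labels it occurs under
            corpus_counts[combo] += 1           # number of corpora containing it
    return {
        combo for combo in corpus_counts
        if len(labels_per_combo[combo]) == 1    # step S41: unique to one label
        and corpus_counts[combo] >= min_count   # step S42: frequent enough
    }
```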
It should be appreciated that only one of steps S41 and S42 may be performed. Also, step S42 may be performed before step S41 or simultaneously with it; the execution order between steps S41 and S42 has no influence on the implementation of the text classification scheme of the present disclosure.
In the above embodiments, the classification results of the text classification model are reprocessed to generate classification adjustment templates, so as to improve the accuracy of text classification. The generated classification adjustment templates do not affect the model training process or external callers, and can adapt to different model training approaches.
Fig. 3 illustrates a flow diagram of further embodiments of text classification methods according to the present disclosure. As shown in FIG. 3, the text classification method further includes steps S6-S10.
In step S6, the text to be classified is classified by using the text classification model, so as to obtain a model classification label of the text to be classified.
The text to be classified is classified using the same text classification model that classified the labeled corpora, giving a preliminary classification result. The preliminary classification result includes fields such as the text and its model classification label.
The application of the classification adjustment template is described below using the text to be classified "the charging temperature is very high and it explodes easily". In step S6, the model classification label obtained by classifying this text is, for example, "weather".
In step S7, the text to be classified is converted into a word list. Similar to step S3, the text may be converted by performing word segmentation and stop-word removal. For example, the text to be classified above can be converted into the word list "charging", "temperature", "very high", "easy", "explode".
In step S8, a classification adjustment template satisfying the following conditions is taken as a matching result: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template.
The process of extracting word combinations is similar to that in step S4. For example, the word combinations "charging temperature", "temperature very high", "very high easy", "easy explode", and so on may be extracted from the word list of the text to be classified.
As described above, the template contents of the 1st and 2nd classification adjustment templates both include the word combination "temperature very high", and their original classification labels are both "weather". Both classification adjustment templates therefore satisfy the conditions and can serve as matching results. Assuming no other word combination matches a classification adjustment template, 2 matching results are obtained for the example text to be classified.
In step S9, a matching classification adjustment template is determined according to the matching result.
When at least one matching result exists, the matching result with the highest priority is screened out. In some embodiments, a priority threshold may be set, and the highest-priority matching result is determined to be the matching classification adjustment template only if its priority is greater than or equal to the threshold. The priority threshold may be set according to the actual application.
According to Tables 2 and 3, the priorities of the 1st and 2nd classification adjustment templates, both matching results, are 0.5 and 0.25 respectively. With the priority threshold set to 0.5, the priority of the 1st classification adjustment template satisfies the condition, so the matching result with the highest priority, i.e., the 1st classification adjustment template, is taken as the matching classification adjustment template.
Conversely, if the priority threshold is set above 0.5, for example 0.6, none of the matching results in the above example satisfies the condition; it is then determined that no matching classification adjustment template exists, no classification adjustment is performed, and the model classification label obtained in step S6 is directly output as the classification result.
If there is no matching result at all, it is likewise determined that no matching classification adjustment template exists and no classification adjustment is performed, i.e., the model classification label obtained in step S6 is directly output as the classification result.

In step S10, the model classification label of the text to be classified is modified into the adjusted classification label of the matching classification adjustment template, as the classification result.
According to Table 2, the adjusted classification label of the 1st classification adjustment template, which is the matching classification adjustment template, is "mobile phone", so "mobile phone" is taken as the classification result of the text "the charging temperature is very high and it explodes easily".
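Steps S8-S10 can be sketched as follows, reusing `extract_combinations` and `AdjustmentTemplate` from the sketches above; the priority threshold of 0.5 is the example value from the description:

```python
def adjust_classification(model_label: str, word_list: list[str],
                          templates, priority_threshold: float = 0.5) -> str:
    # Step S8: a template matches if its original label equals the model label
    # and its content is among the combinations extracted from the word list.
    combos = set(extract_combinations(word_list))
    matches = [t for t in templates
               if t.original_label == model_label and t.content in combos]
    if not matches:
        return model_label                 # no match: keep the model label
    # Step S9: take the highest-priority match, subject to the threshold.
    best = max(matches, key=lambda t: t.priority)
    if best.priority < priority_threshold:
        return model_label                 # below threshold: keep the model label
    # Step S10: output the adjusted classification label.
    return best.adjusted_label
```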
In the above embodiments, a classification adjustment template is introduced after the text classification model to adjust its classification results, which can further improve the accuracy of text classification and correct classification errors in a targeted way.
Fig. 4 illustrates a block diagram of some embodiments of a text classification apparatus according to the present disclosure.
As shown in fig. 4, the text classification apparatus 4 includes a classification unit 41, a selection unit 42, a conversion unit 43, a grouping unit 44, and a generation unit 45.
The classification unit 41 is configured to classify text using a text classification model. In some embodiments, the classification unit 41 is configured to classify the plurality of labeled corpora using the text classification model to obtain the model classification label of each labeled corpus, for example by executing step S1. In other embodiments, the classification unit 41 may be further configured to classify the text to be classified using the text classification model to obtain its model classification label, for example by executing step S6.
The selecting unit 42 is configured to select the labeled corpus in which the model classification label is inconsistent with the corresponding labeled classification label as the sample corpus, for example, to perform step S2.
The conversion unit 43 is configured to convert the text into word lists, respectively. In some embodiments, the conversion unit 43 is configured to convert the texts in the sample corpuses into word lists respectively, for example, execute step S3. In other embodiments, the converting unit 43 is configured to convert the text to be classified into a word list, for example, to execute step S7.
The grouping unit 44 is configured to classify the word combinations extracted from the word lists according to the model classification labels, obtaining the word combinations under each model classification label, for example by executing step S4.
The generating unit 45 is configured to generate classification adjustment templates from the word combinations, for example by executing step S5. As mentioned above, a generated classification adjustment template includes the original classification label, the template content and the adjusted classification label, where the template content includes the word combination, the original classification label is the model classification label corresponding to the word combination, and the adjusted classification label is the labeled classification label of the sample corpus corresponding to the word combination.
In some embodiments, the text classification apparatus 4 further comprises a deletion unit 46. In some embodiments, the deletion unit 46 is configured to delete word combinations that appear under multiple model classification labels simultaneously, for example by executing step S41. In other embodiments, the deletion unit 46 is configured to delete word combinations that appear in the sample corpora fewer times than the threshold, for example by executing step S42. Using the deletion unit reduces the work of generating classification adjustment templates without noticeably affecting the improvement in classification accuracy.
In further embodiments, the text classification apparatus 4 further comprises a matching unit 47, a determining unit 48 and an adjusting unit 49.
The matching unit 47 is configured to take as a matching result a classification adjustment template that satisfies the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template. For example, the matching unit 47 may execute step S8.
The determining unit 48 is configured to determine the matching classification adjustment template according to the matching results, for example by executing step S9. In some embodiments, the matching result with the highest priority is determined to be the matching classification adjustment template when there is at least one matching result and the priority of the highest-priority matching result is greater than or equal to the priority threshold.
The adjusting unit 49 is configured to modify the model classification label of the text to be classified into the adjusted classification label of the matching classification adjustment template as the classification result, for example by executing step S10.
Fig. 5 shows a block diagram of further embodiments of a text classification apparatus according to the present disclosure.
As shown in fig. 5, the apparatus 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51. The memory 51 is used for storing instructions for executing the corresponding embodiment of the text classification method. The processor 52 is configured to perform the text classification method in any of the embodiments of the present disclosure based on instructions stored in the memory 51.
In the above embodiment, the classification result is adjusted through the classification adjustment template of the text classification device, so that the accuracy of text classification can be improved.
In addition to the text classification method and apparatus, embodiments of the present disclosure may take the form of a computer program product embodied on one or more non-volatile storage media containing computer program instructions. Accordingly, embodiments of the present disclosure also include a computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the text classification method of any of the foregoing embodiments.
FIG. 6 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
As shown in FIG. 6, computer system 60 may take the form of a general purpose computing device. Computer system 60 includes a memory 610, a processor 620, and a bus 600 that connects the various system components.
The memory 610 may include, for example, system memory, non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs. The system memory may include volatile storage media such as Random Access Memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions to perform a corresponding embodiment of a text classification method. Non-volatile storage media include, but are not limited to, magnetic disk storage, optical storage, flash memory, and the like.
The processor 620 may be implemented as discrete hardware components, such as a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gates or transistors, and the like. Accordingly, each of the units, such as the matching unit and the determining unit, may be implemented by a Central Processing Unit (CPU) executing instructions in a memory to perform the corresponding step, or by a dedicated circuit that performs the corresponding step.
Bus 600 may use any of a variety of bus architectures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
Computer system 60 may also include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, as well as the memory 610 and the processor 620, may be connected by the bus 600. The input-output interface 630 provides a connection interface for input-output devices such as a display, a mouse, and a keyboard. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as a floppy disk, a USB flash drive, and an SD card.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable text classification apparatus to produce a machine, such that the execution of the instructions by the processor results in an apparatus that implements the functions specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable memory that can direct a computer to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the flowchart and/or block diagram block or blocks.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
So far, some embodiments of the present disclosure have been described in detail by way of examples. It should be understood that the above examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Variations, modifications, substitutions, variations, combinations, and alterations of the above embodiments may be made by those skilled in the art without departing from the scope of the present disclosure.

Claims (12)

1. A method of text classification, comprising:
classifying the plurality of labeled corpora by using a text classification model to obtain a model classification label of each labeled corpus;
selecting a labeled corpus with a model classification label inconsistent with a corresponding labeled classification label as a sample corpus;
respectively converting texts in each sample corpus into a word list;
classifying the word combinations extracted from each word list according to the model classification labels to obtain the word combinations under each model classification label;
generating a classification adjustment template according to a word combination, wherein the classification adjustment template comprises an original classification label, a priority, template content and an adjustment classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, the adjustment classification label is the labeled classification label of the sample corpus corresponding to the word combination, and the priority reflects the likelihood that the adjustment classification label is correct;
classifying the texts to be classified by using the text classification model to obtain model classification labels of the texts to be classified;
converting the text to be classified into a word list;
taking a classification adjustment template meeting the following conditions as a matching result: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template;
determining the matching result with the highest priority as a matching classification adjustment template in the case that at least one matching result exists and the priority of the highest-priority matching result is greater than or equal to a priority threshold;
and modifying the model classification label of the text to be classified into an adjustment classification label of the matching classification adjustment template as a classification result.
2. The text classification method of claim 1, further comprising: and deleting the word combinations which simultaneously appear under a plurality of model classification labels.
3. The text classification method of claim 1, further comprising: and deleting the word combinations with the occurrence times smaller than the threshold value in the sample corpus.
4. The text classification method according to claim 3, wherein a word combination that appears a plurality of times in one sample corpus is counted only once.
5. The text classification method of claim 1, wherein the priority is expressed as priority = b / a, where a and b respectively represent the number of times the word combination in the template content appears in sample corpora under the original classification label and under the adjustment classification label.
6. The text classification method of claim 5, wherein the priority is expressed as priority = (b / a) × (1 - 1 / (2c)), where c denotes the total number of sample corpora under the original classification label of the classification adjustment template.
7. The text classification method according to any one of claims 1 to 6, wherein the text is converted into a word list by performing word segmentation and stop-word removal on the text.
8. The text classification method according to any one of claims 1 to 6, wherein an order between words in the word list is the same as in the corresponding text.
9. A text classification apparatus comprising:
the classification unit is configured to classify the plurality of labeled corpora by using the text classification model to obtain model classification labels of the labeled corpora;
the selection unit is configured to select a labeling corpus with a model classification label inconsistent with a corresponding labeling classification label as a sample corpus;
the conversion unit is configured to convert texts in the sample corpora into word lists respectively;
the grouping unit is configured to classify the word combinations extracted from each word list according to the model classification labels to obtain the word combinations under each model classification label;
the generating unit is configured to generate a classification adjustment template according to a word combination, the classification adjustment template comprises an original classification label, a priority, template content and an adjustment classification label, the template content comprises the word combination, the original classification label is the model classification label corresponding to the word combination, the adjustment classification label is the labeled classification label of the sample corpus corresponding to the word combination, and the priority reflects the likelihood that the adjustment classification label is correct;
wherein:
the classification unit is also configured to classify the text to be classified by using the text classification model to obtain a model classification label of the text to be classified;
the conversion unit is further configured to convert the text to be classified into a word list;
the text classification apparatus further includes:
a matching unit configured to take as a matching result a classification adjustment template satisfying the following conditions: the model classification label of the text to be classified is consistent with the original classification label of the classification adjustment template, and at least one word combination extracted from the word list of the text to be classified is contained in the template content of the classification adjustment template;
a determining unit configured to determine the matching result with the highest priority as a matching classification adjustment template if there is at least one matching result and the priority of the highest-priority matching result is greater than or equal to a priority threshold;
and an adjusting unit configured to modify the model classification label of the text to be classified into an adjustment classification label of the matching classification adjustment template as a classification result.
10. The text classification apparatus of claim 9, further comprising:
and the deleting unit is configured to delete the word combinations which simultaneously appear under the plurality of model classification labels or delete the word combinations of which the appearance times in the sample corpus are less than a threshold value.
11. A text classification apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the text classification method of any of claims 1-8 based on instructions stored in the memory.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the text classification method according to any one of claims 1 to 8.
CN201811035883.4A 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium Active CN109189932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811035883.4A CN109189932B (en) 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811035883.4A CN109189932B (en) 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN109189932A CN109189932A (en) 2019-01-11
CN109189932B true CN109189932B (en) 2021-02-26

Family

ID=64914969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811035883.4A Active CN109189932B (en) 2018-09-06 2018-09-06 Text classification method and device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN109189932B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674263B (en) * 2019-12-04 2022-02-08 广联达科技股份有限公司 Method and device for automatically classifying model component files

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951472A (en) * 2017-03-06 2017-07-14 华侨大学 A kind of multiple sensibility classification method of network text
CN107894980A (en) * 2017-12-06 2018-04-10 陈件 A kind of multiple statement is to corpus of text sorting technique and grader

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8868402B2 (en) * 2009-12-30 2014-10-21 Google Inc. Construction of text classifiers
CN104182423A (en) * 2013-05-27 2014-12-03 华东师范大学 Conditional random field-based automatic Chinese personal name recognition method
US9396724B2 (en) * 2013-05-29 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and apparatus for building a language model
CN104915327B (en) * 2014-03-14 2019-01-29 腾讯科技(深圳)有限公司 A kind of processing method and processing device of text information
CN107291775B (en) * 2016-04-11 2020-07-31 北京京东尚科信息技术有限公司 Method and device for generating repairing linguistic data of error sample
CN105955975A (en) * 2016-04-15 2016-09-21 北京大学 Knowledge recommendation method for academic literature
CN108108355A (en) * 2017-12-25 2018-06-01 北京牡丹电子集团有限责任公司数字电视技术中心 Text emotion analysis method and system based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951472A (en) * 2017-03-06 2017-07-14 华侨大学 A kind of multiple sensibility classification method of network text
CN107894980A (en) * 2017-12-06 2018-04-10 陈件 A kind of multiple statement is to corpus of text sorting technique and grader

Also Published As

Publication number Publication date
CN109189932A (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN109471933B (en) Text abstract generation method, storage medium and server
CN110427487B (en) Data labeling method and device and storage medium
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN108170680A (en) Keyword recognition method, terminal device and storage medium based on Hidden Markov Model
US11238050B2 (en) Method and apparatus for determining response for user input data, and medium
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
JPWO2012147428A1 (en) Text clustering apparatus, text clustering method, and program
US9262400B2 (en) Non-transitory computer readable medium and information processing apparatus and method for classifying multilingual documents
CN110717040A (en) Dictionary expansion method and device, electronic equipment and storage medium
WO2021159803A1 (en) Text summary generation method and apparatus, and computer device and readable storage medium
CN108763202A (en) Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN111339775A (en) Named entity identification method, device, terminal equipment and storage medium
CN112380348B (en) Metadata processing method, apparatus, electronic device and computer readable storage medium
CN111737420A (en) Class case retrieval method, system, device and medium based on dispute focus
CN113934848B (en) Data classification method and device and electronic equipment
CN111444712A (en) Keyword extraction method, terminal and computer readable storage medium
CN109189932B (en) Text classification method and device and computer-readable storage medium
CN115392235A (en) Character matching method and device, electronic equipment and readable storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN110874408A (en) Model training method, text recognition device and computing equipment
CN118447514A (en) Intelligent identifying and classifying method and device for fund bulletin
CN112632956A (en) Text matching method, device, terminal and storage medium
CN111241269A (en) Short message text classification method and device, electronic equipment and storage medium
CN116010545A (en) Data processing method, device and equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant