
CN105760526B - News classification method and device - Google Patents

News classification method and device

Info

Publication number
CN105760526B
Authority
CN
China
Prior art keywords
news
score
unit
manuscript
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610115723.5A
Other languages
Chinese (zh)
Other versions
CN105760526A (en
Inventor
钱烽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201610115723.5A
Publication of CN105760526A
Application granted
Publication of CN105760526B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present invention provide a news classification method and apparatus: extracting the news title of a news manuscript; performing target category matching on the news title to obtain a first matching result; calculating a first score of the first matching result, and, when the first score is determined to meet a first preset condition, dividing the news manuscript into the target category corresponding to the first matching result. Because the proposed scheme avoids manually reading each news manuscript and classifying it according to labels of its content, it addresses the low efficiency, poor timeliness, and low accuracy of the prior art.

Description

News classification method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a news classification method and device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
News is information transmitted through media such as newspapers, radio, television, and the internet. It mainly reports newly occurred or newly changed facts, so the timeliness of news is important.
In daily life, news needs to be classified so that readers can quickly find the news they care about. The common classification method at present is to manually read each news manuscript and classify it according to labels of its content, for example, labeling the corresponding region according to the manuscript content, classifying the manuscripts by region, and aggregating them into local news of that region.
Disclosure of Invention
However, the current method requires manual processing and therefore suffers from low efficiency, poor timeliness, and low accuracy, which makes the process very tedious.
Therefore, an improved news classification method and apparatus are needed to solve the problems of low efficiency, poor timeliness and low accuracy in the prior art.
In this context, embodiments of the present invention are intended to provide a news classification method and apparatus.
In a first aspect of embodiments of the present invention, there is provided a method of news classification, comprising:
extracting news titles of the news manuscripts;
performing target category matching on the news headlines to obtain a first matching result;
and calculating a first score of the first matching result, and dividing the news manuscript into target categories corresponding to the first matching result when the first score meets a first preset condition.
In an embodiment, according to the method in the foregoing embodiment of the present invention, performing target category matching on the news headline to obtain a first matching result includes:
performing region name matching on the news title to obtain at least one region name;
calculating a first score of the first matching result, comprising:
for each of the at least one region name, respectively executing:
determining a basic score corresponding to the region name and the number of times the region name appears in the news title;
taking the product of the basic score and the number of times as a first initial score corresponding to the region name;
determining a maximum value and a second largest value of all the first initial scores, wherein the second largest value is a first initial score that is smaller than the maximum value and larger than all the remaining first initial scores;
dividing the maximum value by the second largest value to obtain a ratio as the first score;
dividing the news manuscript into target categories corresponding to the first matching result, including:
assigning the news manuscript to the region name corresponding to the maximum of all the first initial scores.
In some embodiments, the method according to any of the above embodiments of the present invention, wherein determining that the first score satisfies a first preset condition includes:
determining that the first score is greater than or equal to 1.5.
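The title-based scoring described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the region names, the base scores in `BASE_SCORES`, plain substring counting, and the handling of a single matched name or a tie at the top (neither of which the text covers) are all assumptions.

```python
from typing import Dict, Optional

# Hypothetical base scores per region name; the text gives no actual values.
BASE_SCORES: Dict[str, float] = {"Hangzhou": 2.0, "Zhejiang": 1.0, "Shanghai": 2.0}
FIRST_THRESHOLD = 1.5  # first preset condition stated in the text


def initial_scores(text: str, base_scores: Dict[str, float]) -> Dict[str, float]:
    """Initial score = base score * number of occurrences, per matched name."""
    scores: Dict[str, float] = {}
    for name, base in base_scores.items():
        count = text.count(name)  # plain substring matching is an assumption
        if count > 0:
            scores[name] = base * count
    return scores


def first_score(scores: Dict[str, float]) -> float:
    """Ratio of the maximum initial score to the second largest one."""
    values = list(scores.values())
    top = max(values)
    if values.count(top) > 1:
        return 1.0  # assumption: a tie at the top is treated as ambiguous
    lower = [v for v in values if v < top]
    if not lower:
        return float("inf")  # assumption: only one region name matched
    return top / max(lower)


def classify_by_title(title: str) -> Optional[str]:
    """Return the winning region name, or None if the title is inconclusive."""
    scores = initial_scores(title, BASE_SCORES)
    if not scores:
        return None
    if first_score(scores) >= FIRST_THRESHOLD:
        return max(scores, key=lambda name: scores[name])
    return None
```

A `None` result here corresponds to the case where the first preset condition is not met and a further step is needed.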
In some embodiments, according to the method of any one of the above embodiments of the present invention, if it is determined that the first score does not satisfy the first preset condition, the method further includes:
extracting the news text content of the news manuscript;
performing target category matching on the news text content to obtain a second matching result;
and calculating a second score of the second matching result, and dividing the news manuscript into target categories corresponding to the second matching result when the second score meets a second preset condition.
In some embodiments, according to the method of any one of the above embodiments of the present invention, performing target category matching on the news text content to obtain a second matching result, including:
performing region name matching on the news text content to obtain at least one region name;
calculating a second score of the second matching result, comprising:
for each of the at least one region name, respectively executing:
determining a basic score corresponding to the region name and the number of times the region name appears in the news text content;
taking the product of the basic score and the number of times as a second initial score corresponding to the region name;
determining the maximum value of all the second initial scores and the number of times a target region name appears in the news text content, wherein the target region name is the region name corresponding to the maximum value;
subtracting the number of times corresponding to each of the remaining region names from the number of times the target region name appears in the news text content, and taking the resulting value as the second score;
wherein the remaining region names are the region names, among the at least one region name, other than the region name corresponding to the maximum value;
dividing the news manuscript into target categories corresponding to the second matching result, including:
assigning the news manuscript to the region name corresponding to the maximum of all the second initial scores.
In some embodiments, the method according to any of the above embodiments of the present invention, wherein determining that the second score satisfies a second preset condition includes:
determining that the second score is greater than or equal to 3.
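The body-text step can be sketched in the same way. This is one reading of the text, in which the occurrence counts of all remaining region names are subtracted from the count of the target region name; the base scores, region names, and substring counting are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical base scores per region name; the text gives no actual values.
BASE_SCORES: Dict[str, float] = {"Hangzhou": 2.0, "Zhejiang": 1.0, "Shanghai": 2.0}
SECOND_THRESHOLD = 3  # second preset condition stated in the text


def classify_by_body(body: str) -> Optional[str]:
    """Return the target region name if the second score is at least 3."""
    # Occurrence count of every region name found in the body text.
    counts = {name: body.count(name) for name in BASE_SCORES if name in body}
    if not counts:
        return None
    # Second initial score = base score * occurrence count.
    scores = {name: BASE_SCORES[name] * n for name, n in counts.items()}
    target = max(scores, key=lambda name: scores[name])
    # Second score = target count minus the counts of all remaining names.
    remaining = sum(n for name, n in counts.items() if name != target)
    second_score = counts[target] - remaining
    return target if second_score >= SECOND_THRESHOLD else None
```

Note that the target is chosen by initial score, while the second score itself is computed from raw occurrence counts, as the text describes.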
In some embodiments, according to the method of any one of the above embodiments of the present invention, after determining that the second score does not satisfy the second preset condition, the method further includes:
predicting the probability of the region of the news manuscript according to the classification model;
and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
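The threshold check on the model output might look like the sketch below. The function name `pick_region` and the threshold value 0.8 are assumptions; the text does not state a threshold.

```python
from typing import Dict, Optional


def pick_region(probabilities: Dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Return the most probable region only if its probability exceeds the threshold."""
    if not probabilities:
        return None
    region = max(probabilities, key=lambda name: probabilities[name])
    # The text requires the probability to be larger than the threshold.
    return region if probabilities[region] > threshold else None
```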
In some embodiments, before predicting the probability of the region to which the news article belongs according to the classification model, the method according to any of the above embodiments of the present invention further comprises:
acquiring a training corpus, wherein the training corpus comprises the news manuscript and its corresponding region name when the first score meets the first preset condition, and/or the news manuscript and its corresponding region name when the second score meets the second preset condition; and
obtaining the classification model based on the training corpus.
In some embodiments, the method according to any of the above embodiments of the present invention, obtaining the classification model based on the corpus includes:
extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
encoding each news manuscript into a feature vector according to the corresponding manuscript attributes and the keywords;
carrying out feature selection and feature combination on the training corpus coded into the feature vector;
and training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
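As an illustration of the keyword extraction step only, the sketch below ranks the terms of each manuscript by a basic TF-IDF weight. The exact weighting variant, the tokenization, and the `top_k` parameter are assumptions; feature encoding and the multi-class logistic training are not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(docs: List[List[str]], top_k: int = 3) -> List[List[str]]:
    """Return the top_k terms of each tokenized document by TF-IDF weight."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    keywords: List[List[str]] = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        weights: Dict[str, float] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        keywords.append(ranked[:top_k])
    return keywords
```

Terms that occur in every manuscript receive a weight of zero under this variant, so they sink to the bottom of the ranking.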
In a second aspect of the embodiments of the present invention, there is provided an apparatus for classifying news, including:
the extracting unit is used for extracting news titles of the news manuscripts;
the matching unit is used for carrying out target category matching on the news headlines to obtain a first matching result;
a calculating unit, configured to calculate a first score of the first matching result;
the judging unit is used for judging whether the first score meets a first preset condition or not;
and the classification unit is used for classifying the news manuscript into a target category corresponding to the first matching result when the judgment unit judges that the first score meets the first preset condition.
In an embodiment, according to the apparatus in the foregoing embodiment of the present invention, when the matching unit performs target category matching on the news headline and obtains a first matching result, the method specifically includes:
performing area name matching on the news headlines to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
the product calculating unit is used for taking the product of the basic score and the times as a first initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value and a second largest value of all the first initial scores, where the second largest value is a first initial score that is smaller than the maximum value and is larger than all remaining first initial scores of all the first initial scores except the maximum value;
the determining unit is further configured to divide the maximum value by the second largest value to obtain a ratio, which is used as the first score;
the classification unit is specifically configured to: and dividing the news manuscript into the region names corresponding to the maximum first initial scores in all the first initial scores.
In some embodiments, in the apparatus according to any one of the above embodiments of the present invention, when the determining unit determines that the first score satisfies a first preset condition, specifically:
determining that the first score is greater than or equal to 1.5.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the extracting unit is further configured to extract news body content of the news manuscript;
the matching unit is also used for carrying out target category matching on the news text content to obtain a second matching result;
the calculating unit is further used for calculating a second score of the second matching result;
the judging unit is further configured to judge whether the second score meets a second preset condition;
the classification unit is further configured to, when the judgment unit judges that the second score meets a second preset condition, classify the news manuscript into a target category corresponding to the second matching result.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, when the matching unit performs target category matching on the news text content and obtains a second matching result, the method specifically includes:
performing area name matching on the news text content to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
the product calculating unit is used for taking the product of the basic score and the times as a second initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value of all the second initial scores and a number of times that a target area name appears in the news text content, where the target area name is an area name corresponding to the maximum value;
the calculating unit is further configured to subtract the number of times corresponding to each of the remaining area names from the number of times the target area name appears in the news text content, and to take the resulting value as the second score;
wherein the remaining area names are area names of the at least one area name except the area name corresponding to the maximum value;
the classification unit is specifically configured to: and dividing the news manuscript into the region names corresponding to the maximum second initial scores in all the second initial scores.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, when the determining unit determines that the second score satisfies a second preset condition, specifically:
determining that the second score is greater than or equal to 3.
In some embodiments, the apparatus according to any of the above embodiments of the present invention, further comprises an algorithm unit for predicting a probability of a region to which the news article belongs according to a classification model; and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the algorithm unit includes an obtaining unit and a training unit, wherein:
the acquiring unit is used for acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition;
and the training unit is used for obtaining the classification model based on the training corpus.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the algorithm unit further includes an encoding unit and a feature processing unit, wherein:
the extraction unit is also used for extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
the encoding unit is used for encoding each news manuscript into a feature vector according to the corresponding manuscript attributes and the corresponding keywords;
the feature processing unit is used for performing feature selection and feature combination on the training corpus coded into the feature vectors;
the training unit is further used for training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
In a third aspect of embodiments of the present invention, there is provided a method of news classification, including:
carrying out target category matching on the news manuscript to obtain a matching result;
calculating the score of the matching result, and dividing the news manuscript into target categories corresponding to the matching result when the score meets a preset condition;
training a classification model based on the news articles and the corresponding target categories; and classifying the news manuscript based on the classification model when the score does not meet the preset condition.
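The overall flow of this aspect can be sketched as a two-stage cascade: rule-based matching first, with the classification model as a fallback. The callables `rule` and `model`, the `corpus` list, and the threshold 0.8 are hypothetical stand-ins; model training itself is left out of the sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Callable[[str], Optional[str]]      # returns a category, or None if inconclusive
Model = Callable[[str], Dict[str, float]]  # returns category -> probability


def classify(
    manuscript: str,
    rule: Rule,
    model: Model,
    corpus: List[Tuple[str, str]],
    threshold: float = 0.8,  # hypothetical; the text gives no value
) -> Optional[str]:
    """Classify by rule when the score condition is met, else fall back to the model."""
    label = rule(manuscript)
    if label is not None:
        # Rule-labelled manuscripts become training examples for the model.
        corpus.append((manuscript, label))
        return label
    probabilities = model(manuscript)
    if not probabilities:
        return None
    best = max(probabilities, key=lambda category: probabilities[category])
    return best if probabilities[best] > threshold else None
```

Passing the rule and the model in as callables keeps the sketch independent of any particular matching rule or learning library.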
In an embodiment, according to the method described in the above embodiment of the present invention, performing target category matching on a news manuscript to obtain a matching result includes:
extracting news titles of the news manuscripts; and
performing target category matching on the news headlines to obtain a first matching result;
calculating the score of the matching result, and dividing the news manuscript into target categories corresponding to the matching result when the score meets the preset condition, wherein the steps of:
and calculating a first score of the first matching result, and dividing the news manuscript into target categories corresponding to the first matching result when the first score meets a first preset condition.
In some embodiments, according to the method of any one of the above embodiments of the present invention, performing target category matching on the news headline to obtain a first matching result includes:
performing region name matching on the news title to obtain at least one region name;
calculating a first score of the first matching result, comprising:
for each of the at least one region name, respectively executing:
determining a basic score corresponding to the region name and the number of times the region name appears in the news title;
taking the product of the basic score and the number of times as a first initial score corresponding to the region name;
determining a maximum value and a second largest value of all the first initial scores, wherein the second largest value is a first initial score that is smaller than the maximum value and larger than all the remaining first initial scores;
dividing the maximum value by the second largest value to obtain a ratio as the first score;
dividing the news manuscript into target categories corresponding to the first matching result, including:
assigning the news manuscript to the region name corresponding to the maximum of all the first initial scores.
In some embodiments, according to the method of any one of the above embodiments of the present invention, if it is determined that the first score does not satisfy the first preset condition, the method further includes:
extracting the news text content of the news manuscript;
performing target category matching on the news text content to obtain a second matching result;
and calculating a second score of the second matching result, and dividing the news manuscript into target categories corresponding to the second matching result when the second score meets a second preset condition.
In some embodiments, according to the method of any one of the above embodiments of the present invention, performing target category matching on the news text content to obtain a second matching result, including:
performing region name matching on the news text content to obtain at least one region name;
calculating a second score of the second matching result, comprising:
for each of the at least one region name, respectively executing:
determining a basic score corresponding to the region name and the number of times the region name appears in the news text content;
taking the product of the basic score and the number of times as a second initial score corresponding to the region name;
determining the maximum value of all the second initial scores and the number of times a target region name appears in the news text content, wherein the target region name is the region name corresponding to the maximum value;
subtracting the number of times corresponding to each of the remaining region names from the number of times the target region name appears in the news text content, and taking the resulting value as the second score;
wherein the remaining region names are the region names, among the at least one region name, other than the region name corresponding to the maximum value;
dividing the news manuscript into target categories corresponding to the second matching result, including:
assigning the news manuscript to the region name corresponding to the maximum of all the second initial scores.
In some embodiments, in the method according to any of the above embodiments of the present invention, training a classification model based on the news manuscript and the corresponding target category includes:
acquiring a training corpus, wherein the training corpus comprises the news manuscript and its corresponding region name when the first score meets the first preset condition, and/or the news manuscript and its corresponding region name when the second score meets the second preset condition; and
obtaining the classification model based on the training corpus.
In some embodiments, the method according to any of the above embodiments of the present invention, obtaining the classification model based on the corpus includes:
extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
encoding each news manuscript into a feature vector according to the corresponding manuscript attributes and the keywords;
carrying out feature selection and feature combination on the training corpus coded into the feature vector;
and training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
In some embodiments, the method according to any of the above embodiments of the present invention, classifying the news article based on the classification model, includes:
predicting the probability of the region of the news manuscript according to the classification model;
and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
In some embodiments, in the method according to any of the above embodiments of the present invention, training a classification model based on the news manuscript and the corresponding target category includes:
periodically training the classification model based on the news manuscript and the corresponding target category.
In a fourth aspect of the embodiments of the present invention, there is provided an apparatus for news classification, including:
the matching unit is used for carrying out target category matching on the news manuscript to obtain a matching result;
the calculating unit is used for calculating the score of the matching result;
the judging unit is used for judging whether the score meets a preset condition or not;
the classification unit is used for classifying the news manuscript into a target category corresponding to the matching result when the judgment unit judges that the score meets a preset condition;
the algorithm unit is used for training a classification model based on the news manuscript and the corresponding target class;
the classification unit is further configured to classify the news manuscript based on the classification model when the judgment unit judges that the score does not satisfy the preset condition.
In one embodiment, the apparatus according to the above embodiments of the present invention, further includes an extracting unit, configured to extract a news headline of the news manuscript;
the matching unit is specifically used for performing target category matching on the news headlines to obtain a first matching result;
the calculating unit is specifically configured to calculate a first score of the first matching result;
the classification unit is specifically configured to, when the determination unit determines that the first score meets a first preset condition, classify the news article into a target category corresponding to the first matching result.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, when the matching unit performs target category matching on the news headline and obtains a first matching result, the method specifically includes:
performing area name matching on the news headlines to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
the product calculating unit is used for taking the product of the basic score and the times as a first initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value and a second largest value of all the first initial scores, where the second largest value is a first initial score that is smaller than the maximum value and is larger than all remaining first initial scores of all the first initial scores except the maximum value;
the determining unit is further configured to divide the maximum value by the second largest value to obtain a ratio, which is used as the first score;
the classification unit is specifically configured to: and dividing the news manuscript into the region names corresponding to the maximum first initial scores in all the first initial scores.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the extracting unit is further configured to extract news body content of the news manuscript;
the matching unit is also used for carrying out target category matching on the news text content to obtain a second matching result;
the calculating unit is further used for calculating a second score of the second matching result;
the judging unit is further configured to judge whether the second score meets a second preset condition;
the classification unit is further configured to, when the judgment unit judges that the second score meets a second preset condition, classify the news manuscript into a target category corresponding to the second matching result.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, when the matching unit performs target category matching on the news text content and obtains a second matching result, the method specifically includes:
performing area name matching on the news text content to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
the product calculating unit is used for taking the product of the basic score and the times as a second initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value of all the second initial scores and a number of times that a target area name appears in the news text content, where the target area name is an area name corresponding to the maximum value;
the calculating unit is further configured to subtract the number of times corresponding to each of the remaining area names from the number of times the target area name appears in the news text content, and to take the resulting value as the second score;
wherein the remaining area names are area names of the at least one area name except the area name corresponding to the maximum value;
the classification unit is specifically configured to: and dividing the news manuscript into the region names corresponding to the maximum second initial scores in all the second initial scores.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the algorithm unit includes an obtaining unit and a training unit, wherein:
the acquiring unit is used for acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
the training unit is used for obtaining the classification model based on the training corpus.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the algorithm unit further includes an encoding unit and a feature processing unit, wherein:
the extraction unit is also used for extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
the encoding unit is used for encoding each news manuscript into a feature vector according to the corresponding manuscript attributes and the corresponding keywords;
the feature processing unit is used for performing feature selection and feature combination on the training corpus coded into the feature vectors;
the training unit is further used for training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
In some embodiments, according to the apparatus of any one of the above embodiments of the present invention, the algorithm unit is specifically configured to predict a probability of a region to which the news article belongs according to a classification model; and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
In some embodiments, according to the apparatus of any of the above embodiments of the present invention, the algorithm unit is specifically configured to train a classification model periodically based on the news article and the corresponding target category.
In the embodiment of the invention, a news classification method is provided: extracting the news title of a news manuscript; performing target category matching on the news title to obtain a first matching result; calculating a first score of the first matching result; and, when the first score meets a first preset condition, classifying the news manuscript into the target category corresponding to the first matching result. Because this scheme avoids having a person read each news manuscript and label its category according to the manuscript content, it overcomes the low efficiency, poor timeliness and low accuracy of the prior art.
In the embodiment of the invention, a further news classification method is also provided: performing target category matching on a news manuscript to obtain a matching result; calculating the score of the matching result and, when the score meets a preset condition, classifying the news manuscript into the target category corresponding to the matching result; training a classification model based on the news manuscripts and the corresponding target categories; and, when the score is judged not to meet the preset condition, classifying the news manuscript based on the classification model. Because this scheme likewise avoids having a person read each news manuscript and label its category according to the manuscript content, it also overcomes the low efficiency, poor timeliness and low accuracy of the prior art.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1A schematically illustrates a flow chart for performing news categorization according to an embodiment of the present invention;
FIG. 1B schematically illustrates a flow diagram for sorting by news body content according to an embodiment of the present invention;
FIG. 1C schematically shows a flow chart for deriving a classification model according to an embodiment of the invention;
FIG. 1D schematically illustrates a flow diagram for news article classification according to a classification model, according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow diagram for performing news categorization according to an embodiment of the present invention;
FIG. 3 schematically shows a schematic view of an apparatus for classifying news according to an embodiment of the present invention;
FIG. 4 schematically shows another schematic view of an apparatus for news classification according to another embodiment of the present invention;
FIG. 5 schematically shows another schematic diagram of an apparatus for news classification according to another embodiment of the present invention;
FIG. 6 schematically shows another schematic view of an apparatus for news classification according to another embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Reference in the specification to "an embodiment" or "an implementation" may mean either one embodiment or one implementation or some instances of embodiments or implementations.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a news classification method and device are provided.
It is to be noted that any number of elements in the figures are provided by way of example and not limitation, and any nomenclature is used for distinction only and not in any limiting sense.
Technical terms involved in the present invention will be briefly described below so that the related person can better understand the present solution.
Supervised machine learning classification algorithm: may refer to an algorithm that, given a training data set labeled with classes, fits the training data set using a mathematical model and an optimization algorithm to obtain a fitted mathematical model, which may then be used to predict the classes of samples whose classes are unknown. Examples include the logistic classification algorithm, the naive Bayes algorithm and the support vector machine algorithm.
Classification model: may refer to the mathematical model obtained by fitting a training data set using a supervised machine learning classification algorithm.
Training corpus: may refer to a training data set of text type whose categories have been labeled.
Bootstrap: may refer to a way of achieving a certain effect at system start-up by relying on the system's own strategy, without the help of external resources.
Accuracy: may refer to the proportion of correct predictions obtained when a classification model, trained by a supervised machine learning classification algorithm, predicts a group of test samples of unknown class and the predictions are compared with the real classes of the test samples. The accuracy can be used to measure the classification capability of the classification algorithm.
AC automaton algorithm: may refer to an algorithm that quickly finds the number of occurrences of words in a text by constructing a dictionary tree (trie); it is often used by search engine systems for text word frequency matching, and has higher query efficiency than a hash table.
Threshold: also called a critical value; may refer to the lowest or highest value at which an effect can be produced.
Multi-classification logistic model: a supervised machine learning classification algorithm that adopts a sigmoid function as the fitting hypothesis and classifies among more than two classes.
Vector space model: the processing of text content is simplified to vector operations in vector space, and it expresses semantic similarity in spatial similarity. When documents are represented as vectors in document space, the similarity between documents can be measured by calculating the cosine distance between the vectors.
TF-IDF (term frequency-inverse document frequency) algorithm: a weighting technique commonly used in information retrieval and data mining to evaluate the importance of each term in a document. The importance of a term increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. The TF-IDF algorithm is often applied by search engines as a measure of the degree of correlation between a document and a user query.
Feature selection: refers to selecting the N most important features from the original M features of a training data set, so as to optimize the classification effect of a machine learning algorithm. Feature selection reduces the dimensionality of a data set by keeping only the most effective of the original features; it is an important means of improving the performance of a learning algorithm and a key data preprocessing step in pattern recognition.
Feature combination: refers to linearly or nonlinearly combining the original M features of a training data set to obtain N new features, which are concatenated to the original features. The resulting M + N features are then used by a machine learning classification algorithm to optimize its effect.
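The vector space model and TF-IDF weighting defined above can be illustrated with a minimal sketch. The weighting formula used here (term count divided by document length, multiplied by log(N / document frequency)) is only one common variant and is an assumption, as are the function names and the toy tokenized documents; the specification does not fix any of them.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus; "news" occurs in every document.
docs = [
    ["hangzhou", "football", "league", "news"],
    ["hangzhou", "football", "cup", "news"],
    ["beijing", "house", "price", "news"],
]
vectors = tf_idf_vectors(docs)
sim_same_topic = cosine_similarity(vectors[0], vectors[1])
sim_other_topic = cosine_similarity(vectors[0], vectors[2])
```

In this sketch a term that occurs in every document receives weight zero, so the two football manuscripts end up more similar to each other than either is to the house-price manuscript.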
Summary of The Invention
The inventor finds that, in the prior art, news is classified manually, which is inefficient, untimely and inaccurate; if manual classification can be avoided, the efficiency, timeliness and accuracy of news classification can be improved.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application Scenario Overview
For example, for a news manuscript with the title of 'Anhui long-distance female passenger lost connection midway', a news title of 'Anhui long-distance female passenger lost connection midway' is extracted, then target category matching is performed on the news title to obtain a first matching result, a first score of the first matching result is calculated, and when the first score meets a first preset condition, the news manuscript is divided into target categories corresponding to the first matching result.
The news manuscript referred in the present invention may be an online news manuscript, or may be other news manuscripts, which is not limited herein.
Exemplary method
A method for news classification according to an exemplary embodiment of the present invention is described with reference to FIGS. 1A and 2. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 1A schematically shows a flow diagram of a method 10 for news classification according to an embodiment of the present invention. As shown in fig. 1A, the method may include steps 100, 110, and 120.
The method 10 begins at step 100, where a news headline of a news article is extracted.
The news article in the embodiment of the present invention may be an online news article, and certainly, may also be a news article of other media, which is not limited specifically herein.
In the embodiment of the present invention, there are various ways to extract the news headlines of the news manuscript, and no specific limitation is made herein.
After step 100, a step 110 may also be performed, in which the news headlines are subject to target category matching, resulting in a first matching result.
In the embodiment of the present invention, when performing target category matching on the news headline to obtain the first matching result, optionally, the following method may be adopted:
performing region name matching on the news headline to obtain at least one region name.
For example, the news title is "Beijing house price compares with Shanghai, Shenzhen house price", and since the news title matches with 3 region names, 3 region names are obtained.
In the embodiment of the present invention, when the target category matching is performed on the news headline, an AC automaton algorithm may be used, and of course, other methods may also be used, which are not described in detail herein.
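As an illustration of the AC automaton mentioned above, the following is a minimal Aho-Corasick sketch that counts how often each region name occurs in a headline. It is not the implementation used in the invention; the function names and the data layout (parallel lists for transitions, failure links and outputs) are assumptions made for this example.

```python
from collections import Counter, deque

def build_automaton(words):
    """Build a trie with failure links; node 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first pass: depth-1 nodes keep the root as failure link.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def count_matches(text, automaton):
    """Scan the text once and count every occurrence of every word."""
    goto, fail, out = automaton
    counts = Counter()
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        counts.update(out[node])
    return counts

automaton = build_automaton(["Beijing", "Shanghai", "Shenzhen"])
title = "Beijing house price compares with Shanghai, Shenzhen house price"
title_counts = dict(count_matches(title, automaton))
```

The text is scanned once regardless of how many region names are in the dictionary, which is the property that makes this kind of matcher attractive for large region lists.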
It should be noted that some proper nouns may also include the area name, and in order to improve the accuracy of news classification, the area name included in these proper nouns is not used as the area name obtained by matching in the present invention.
For example, the terms "Hangzhou road", "Shanghai Volkswagen" and the like are not used as the names of the regions in the present invention.
In embodiments of the invention, proper nouns including the names of regions may be stored, for example, by extracting words related to the names of regions from a public dictionary.
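The exclusion of proper nouns described above could be sketched as follows. Plain substring counting stands in for the AC automaton here, and the function name, the sample headline and the two small lists are hypothetical values chosen for illustration only.

```python
def match_region_names(text, region_names, proper_nouns):
    """Count region names in text, ignoring those inside stored proper nouns."""
    # Blank out each stored proper noun so the region name it contains
    # is no longer visible to the matcher.
    for noun in proper_nouns:
        text = text.replace(noun, " " * len(noun))
    counts = {}
    for name in region_names:
        occurrences = text.count(name)
        if occurrences > 0:
            counts[name] = occurrences
    return counts

# Hypothetical dictionaries; a real list would come from a public dictionary.
REGION_NAMES = ["Hangzhou", "Shanghai", "Ningbo"]
PROPER_NOUNS = ["Hangzhou Road", "Shanghai Volkswagen"]

headline = "Shanghai Volkswagen opens showroom on Hangzhou Road in Ningbo"
matched = match_region_names(headline, REGION_NAMES, PROPER_NOUNS)
```

Only "Ningbo" survives as a matched region name in this headline, because the other two names occur solely inside stored proper nouns.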
After step 110, step 120 may be further performed, in which a first score of the first matching result is calculated, and when it is determined that the first score meets a first preset condition, the news article is divided into target categories corresponding to the first matching result.
In the embodiment of the invention, target category matching is performed on the news headline to obtain the first matching result by performing region name matching on the news headline to obtain at least one region name. In that case, when calculating the first score of the first matching result, optionally, the following manner may be adopted:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
taking the product of the basic score and the times as a first initial score corresponding to the any region name;
determining the maximum value and the second maximum value of all the first initial scores, wherein the second maximum value is the first initial score that is smaller than the maximum value and larger than all the remaining first initial scores;
dividing the maximum value by the second maximum value, and taking the resulting ratio as the first score;
in this case, dividing the news manuscript into target categories corresponding to the first matching result includes:
classifying the news manuscript under the region name corresponding to the maximum of all the first initial scores.
In the embodiment of the present invention, the region names may be classified into three categories, a provincial level, a city level and a district level, as shown in table 1.
TABLE 1 Region name classification
Serial number | Region name | Upper-level region | Region level
1 | Beijing | None | Provincial level
2 | Zhejiang | None | Provincial level
3 | Hangzhou | Zhejiang | City level
4 | Ningbo | Zhejiang | City level
5 | West Lake | Hangzhou | District level
In the embodiment of the present invention, the basic scores corresponding to region names of different levels may be different. For example, the basic score corresponding to a provincial-level region name may be greater than the basic score corresponding to a city-level region name, and the basic score corresponding to a city-level region name may be greater than the basic score corresponding to a district-level region name.
In the embodiment of the present invention, when it is determined that the first score meets the first preset condition, optionally, the following manner may be adopted:
determining that the first score is greater than or equal to 1.5.
For example, for a news title of "Wuhan, Nanjing people live in the new Hangzhou state of Hangzhou", the three place names "Wuhan", "Nanjing" and "Hangzhou" are matched, with "Hangzhou" appearing twice and the other two once each. Assuming that the basic score of each of the three place names is 10, their first initial scores are 10, 10 and 20, respectively. The maximum value is 20, corresponding to Hangzhou; the second maximum value is 10, corresponding to Wuhan or Nanjing; the ratio of the maximum value to the second maximum value is 2, which is greater than 1.5, so the first preset condition is met. The news manuscript is therefore classified under the place name "Hangzhou" corresponding to the maximum value.
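The title scoring rule and the worked example above can be sketched in a few lines. The function names are hypothetical, basic scores are assumed to be positive, and two boundary cases that the text does not spell out are handled by assumption: a title matching a single region name always passes, and when the two highest initial scores are equal the ratio is taken as 1.

```python
def first_score(counts, base_scores):
    """Return (best_region, first_score) for a non-empty {region: count} dict."""
    # First initial score = basic score x number of occurrences in the title.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    ranked = sorted(initial.items(), key=lambda item: item[1], reverse=True)
    best_name, best_value = ranked[0]
    if len(ranked) == 1:
        # Assumption: with no competing region name, the rule always passes.
        return best_name, float("inf")
    # Assumption: on a tie for first place the ratio is simply 1.
    return best_name, best_value / ranked[1][1]

def classify_by_title(counts, base_scores, threshold=1.5):
    """Return the region name if the first preset condition holds, else None."""
    if not counts:
        return None
    name, score = first_score(counts, base_scores)
    return name if score >= threshold else None

# Values taken from the worked example: Hangzhou appears twice.
example_counts = {"Wuhan": 1, "Nanjing": 1, "Hangzhou": 2}
example_base = {"Wuhan": 10, "Nanjing": 10, "Hangzhou": 10}
example_result = first_score(example_counts, example_base)
example_region = classify_by_title(example_counts, example_base)
```

Returning None when the condition fails lets a caller fall through to a later classification stage.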
In addition to the above-described classification method, there may be other arbitrary classification methods, such as classifying a news article into a place name having the largest number of occurrences, with the number of occurrences being a first score.
The foregoing describes classifying news according to the news title. In practical applications, when the first score does not satisfy the first preset condition, the news manuscript cannot be classified according to its title; it can then be further classified according to the news text content. Therefore, in the embodiment of the present invention, if it is determined that the first score does not satisfy the first preset condition, the method further includes the following operations:
extracting the news text content of the news manuscript;
performing target category matching on the news text content to obtain a second matching result;
and calculating a second score of the second matching result, and dividing the news manuscript into target categories corresponding to the second matching result when the second score meets a second preset condition.
That is, classification is first performed using the first matching result of the news headline of the news manuscript; if the news manuscript cannot be classified according to its news headline, it can then be classified using its news body content. This flow is shown in FIG. 1B.
In the embodiment of the present invention, when performing target category matching on the news text content to obtain a second matching result, optionally, the following method may be adopted:
performing area name matching on the news text content to obtain at least one area name;
calculating a second score for the second match, comprising:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
taking the product of the basic score and the times as a second initial score corresponding to the any region name;
determining the maximum value of all the second initial scores and the number of times that a target region name appears in the news text content, wherein the target region name is the region name corresponding to the maximum value;
subtracting the number of times corresponding to each of the remaining region names from the number of times that the target region name appears in the news text content, and taking the resulting value as the second score;
wherein the remaining region names are the region names, among the at least one region name, other than the region name corresponding to the maximum value;
in this case, dividing the news manuscript into target categories corresponding to the second matching result includes:
classifying the news manuscript under the region name corresponding to the maximum of all the second initial scores.
For example, the content of the news text is "last week, 2015 Asia Pacific China expecting the Hangzhou project elite league to sound in heavy rain. 8 hang city amateur football teams oligomerize cottage football training bases, and they will be expanded here to a volume of three weeks. The activities are sponsored by the Hangzhou city football association and born by the Hangzhou city football management center, and are part of the 34 th west lake cup super league in Hangzhou city. Fans coming to war include amateur football club representatives from areas such as Ningbo, in addition to local football fans". The region names obtained by matching against this news text content are Hangzhou and Ningbo.
In the embodiment of the present invention, there are various ways to determine that the second score satisfies the second preset condition, and optionally, the following ways may be adopted:
determining that the second score is greater than or equal to 3.
For example, the content of the news text is "last week, 2015 Asia Pacific China expecting the Hangzhou project elite league to sound in heavy rain. 8 hang city amateur football teams oligomerize cottage football training bases, and they will be expanded here to a volume of three weeks. The activities are sponsored by the Hangzhou city football association and born by the Hangzhou city football management center, and are part of the 34 th west lake cup super league in Hangzhou city. The fans who come to war, except local football fans, also include amateur football club representatives from areas such as Ningbo, and the like". The region names obtained by matching against the news text content are Hangzhou and Ningbo; Hangzhou appears 4 times and Ningbo appears 1 time. The first initial score of Hangzhou is 10 × 4 = 40, and the first initial score of Ningbo is 10 × 1 = 10. The value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo is 4, which is greater than 1.5, so the news manuscript is classified as a local news manuscript of Hangzhou. If that ratio were less than 1.5, it would then be judged whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3; in the above example the difference is 4 - 1 = 3, so the news manuscript would likewise be classified as a local news manuscript of Hangzhou.
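The body-content rule based on the difference in occurrence counts can be sketched as below, using the counts from the example (Hangzhou 4, Ningbo 1, basic score 10 each). The text subtracts the count of each remaining region name from the count of the target region name; this sketch assumes the condition must hold for every remaining name, so it reports the smallest difference. The function names and the single-match behavior are likewise assumptions.

```python
def second_score(counts, base_scores):
    """Return (target_region, second_score) for a non-empty {region: count} dict."""
    # Second initial score = basic score x number of occurrences in the body.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    target = max(initial, key=initial.get)
    other_counts = [n for name, n in counts.items() if name != target]
    if not other_counts:
        # Assumption: with no competing region name, use the target's own count.
        return target, counts[target]
    # Assumption: the margin over the closest competitor is what must pass.
    return target, counts[target] - max(other_counts)

def classify_by_body(counts, base_scores, threshold=3):
    """Return the region name if the second preset condition holds, else None."""
    if not counts:
        return None
    target, score = second_score(counts, base_scores)
    return target if score >= threshold else None

body_counts = {"Hangzhou": 4, "Ningbo": 1}
body_base = {"Hangzhou": 10, "Ningbo": 10}
body_result = second_score(body_counts, body_base)
body_region = classify_by_body(body_counts, body_base)
```

With these inputs the difference is exactly 3, which meets the "greater than or equal to 3" condition stated above.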
The method is bootstrapped, does not need any manual input, can process large-scale local news classification requests in real time, has high efficiency and good timeliness, and meets the functional requirements of Internet news products.
In the foregoing, classification is performed according to the news title and, if that is not possible, according to the news body content. If the news manuscript still cannot be classified according to the news body content, it may be classified according to a classification model. Therefore, in the embodiment of the present invention, after it is determined that the second score does not satisfy the second preset condition, the method further includes the following operations:
predicting the probability of the region of the news manuscript according to the classification model;
and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
In the embodiment of the present invention, before predicting the probability of the region to which the news article belongs according to the classification model, the method further includes the following operations:
acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
and obtaining the classification model based on the training corpus.
Here, "before" only indicates a logical dependency. In practice, the step of predicting the probability of the region to which the news manuscript belongs according to the classification model, and the steps of acquiring the training corpus and obtaining the classification model based on it, are performed in parallel, each as required.
In the embodiment of the present invention, when obtaining the classification model based on the training corpus, optionally, the following manner may be adopted:
extracting key words from each news manuscript in the training corpus by adopting a vector space model and a TF-IDF algorithm;
according to the corresponding manuscript attributes and the keywords, coding each news manuscript into a feature vector;
carrying out feature selection and feature combination on the training corpus coded into the feature vector;
and training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
FIG. 1C illustrates the main process of obtaining a classification model according to one embodiment: obtaining a training corpus, extracting a keyword from each news manuscript in the training corpus, coding each news manuscript into a feature vector according to corresponding manuscript attributes and the keyword, then, performing feature selection and feature combination on the training corpus coded into the feature vector, and then, training the training corpus after performing the feature selection and feature combination to obtain the classification model.
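The encoding and training steps of FIG. 1C can be illustrated with a small pure-Python sketch. It trains one sigmoid (logistic) scorer per region in a one-vs-rest arrangement and normalizes the scores into probabilities; this arrangement, the vocabulary, the toy corpus and all names are assumptions for illustration, and the feature selection and feature combination steps are omitted for brevity.

```python
import math

# Hypothetical feature vocabulary: keywords plus a sending-media attribute.
VOCAB = ["west_lake", "qiantang", "bund", "pudong",
         "media:hangzhou_daily", "media:shanghai_post"]

def encode(keywords, media):
    """Encode keywords and the manuscript attribute as a binary feature vector."""
    tokens = set(keywords) | {"media:" + media}
    return [1.0 if term in tokens else 0.0 for term in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_one_vs_rest(samples, labels, epochs=200, lr=0.5):
    """Fit one logistic scorer per class with stochastic gradient descent."""
    model = {}
    for cls in sorted(set(labels)):
        weights, bias = [0.0] * len(samples[0]), 0.0
        for _ in range(epochs):
            for x, label in zip(samples, labels):
                target = 1.0 if label == cls else 0.0
                pred = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
                error = pred - target
                weights = [w - lr * error * xi for w, xi in zip(weights, x)]
                bias -= lr * error
        model[cls] = (weights, bias)
    return model

def predict_proba(model, x):
    """Normalize the per-class sigmoid scores so that they sum to 1."""
    raw = {cls: sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
           for cls, (ws, b) in model.items()}
    total = sum(raw.values())
    return {cls: p / total for cls, p in raw.items()}

# Toy training corpus of (keywords, sending media, region label) triples.
corpus = [
    (["west_lake"], "hangzhou_daily", "Hangzhou"),
    (["qiantang"], "hangzhou_daily", "Hangzhou"),
    (["bund"], "shanghai_post", "Shanghai"),
    (["pudong"], "shanghai_post", "Shanghai"),
]
features = [encode(keywords, media) for keywords, media, _ in corpus]
regions = [region for _, _, region in corpus]
model = train_one_vs_rest(features, regions)
probabilities = predict_proba(model, encode(["west_lake"], "hangzhou_daily"))
```

In a production setting a library implementation of multi-class logistic regression would normally replace the hand-written training loop; the loop is written out here only to keep the example self-contained.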
In embodiments of the invention, the classification model may be updated periodically, for example once a day.
In the embodiment of the present invention, the manuscript attribute includes manuscript sending media information and/or manuscript sending time information and the like.
In the embodiment of the invention, if the news manuscript cannot be classified according to the classification model, the news manuscript can be judged to be unclassifiable.
FIG. 1D is a schematic flow chart of classifying news manuscripts according to the classification model: the region to which the news manuscript belongs and the corresponding probability are predicted according to the classification model, and it is determined whether the probability is greater than a threshold value; if so, the news manuscript is taken as a news manuscript of that region; otherwise, it is determined that the news manuscript cannot be classified.
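The decision in FIG. 1D reduces to comparing the highest predicted probability with a threshold. A minimal sketch follows; the function name and the threshold value of 0.8 are hypothetical, since the specification does not state a value.

```python
def classify_with_model(probabilities, threshold=0.8):
    """Return the most probable region if its probability exceeds the threshold.

    `probabilities` maps region names to predicted probabilities. None is
    returned when the manuscript is judged to be unclassifiable.
    """
    if not probabilities:
        return None
    region = max(probabilities, key=probabilities.get)
    return region if probabilities[region] > threshold else None

confident = classify_with_model({"Hangzhou": 0.9, "Ningbo": 0.1})
uncertain = classify_with_model({"Hangzhou": 0.55, "Ningbo": 0.45})
```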
According to the method, scoring rules are first used for bootstrap classification to obtain local news manuscripts with high accuracy; a supervised machine learning algorithm then trains a classification model based on these local news manuscripts, and the remaining news manuscripts are classified by the model as a supplement. No manual input is required, large-scale local news classification requests can be processed in real time, and the functional requirements of internet news products are met.
Fig. 2 schematically shows a flow diagram of a method 20 for news classification according to an embodiment of the present invention. As shown in fig. 2, the method may include steps 200, 210, and 220.
The method 20 begins at step 200, where target category matching is performed on a news manuscript to obtain a matching result.
The news article in the embodiment of the present invention may be an online news article, and certainly, may also be a news article of other media, which is not limited specifically herein.
In one embodiment, performing target category matching on the news manuscript includes performing target category matching on the title of the news manuscript.
In the embodiment of the present invention, there are various ways to extract the news headlines of the news manuscript, and no specific limitation is made herein.
In one embodiment, performing target category matching on the news manuscript includes performing target category matching on the body content of the news manuscript.
In one embodiment, performing target category matching on the news manuscript includes performing target category matching on the full text of the news manuscript. In one embodiment, performing target category matching on the news manuscript includes performing target category matching on the title of the news manuscript and, if classification cannot be achieved according to the title, continuing to perform target category matching on the body content of the news manuscript.
After step 200, step 210 may also be performed, in which a score of the matching result is calculated, and when it is determined that the score meets a preset condition, the news article is divided into target categories corresponding to the matching result.
After step 210, step 220 may also be performed, wherein a classification model is trained based on the news contribution and the corresponding target category; and classifying the news manuscript based on the classification model when the score does not meet the preset condition.
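Steps 200 to 220 can be read as a two-stage cascade in which rule-based results also serve as training data for the model. The sketch below shows only that control flow; the function name, the manuscript layout and the toy rule and model functions are hypothetical stand-ins for the scoring rule and the trained classification model.

```python
def classify_and_collect(manuscript, rule, model, training_corpus):
    """Apply the scoring rule first; fall back to the model if the rule fails."""
    region = rule(manuscript)
    if region is not None:
        # A rule-based result doubles as a labeled training sample.
        training_corpus.append((manuscript, region))
        return region, "rule"
    region = model(manuscript)
    return region, ("model" if region is not None else "unclassified")

# Hypothetical stand-ins for the scoring rule and the trained model.
def toy_rule(manuscript):
    return "Hangzhou" if manuscript["title"].count("Hangzhou") >= 2 else None

def toy_model(manuscript):
    return "Ningbo" if "Ningbo" in manuscript["body"] else None

training_corpus = []
first = classify_and_collect(
    {"title": "Hangzhou fans fill Hangzhou stadium", "body": ""},
    toy_rule, toy_model, training_corpus)
second = classify_and_collect(
    {"title": "Port traffic rises", "body": "Ningbo port reported growth."},
    toy_rule, toy_model, training_corpus)
third = classify_and_collect(
    {"title": "Weather update", "body": "Rain expected."},
    toy_rule, toy_model, training_corpus)
```

Only manuscripts classified by the rule are added to the training corpus in this sketch, which mirrors the bootstrap idea of training the model on rule-labeled manuscripts.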
In the embodiment of the present invention, optionally, when performing target category matching on the news manuscript to obtain a matching result, the following method may be adopted:
extracting the news title of the news manuscript; and
performing target category matching on the news title to obtain a first matching result;
and when calculating the score of the matching result and, upon judging that the score meets a preset condition, dividing the news manuscript into the target category corresponding to the matching result, optionally, the following method may be adopted:
calculating a first score of the first matching result and, when the first score meets a first preset condition, dividing the news manuscript into the target category corresponding to the first matching result.
In the embodiment of the present invention, when the target category matching is performed on the news headline, an AC automaton algorithm may be used, and of course, other methods may also be used, which are not described in detail herein.
In the embodiment of the present invention, optionally, when performing target category matching on the news headline to obtain a first matching result, the following method may be adopted:
performing region name matching on the news headline to obtain at least one region name.
For example, the news title is "Beijing house price compares with Shanghai, Shenzhen house price", and since the news title matches with 3 region names, 3 region names are obtained.
It should be noted that some proper nouns may also include the area name, and in order to improve the accuracy of news classification, the area name included in these proper nouns is not used as the area name obtained by matching in the present invention.
For example, the terms "Hangzhou road", "Shanghai Volkswagen" and the like are not used as the names of the regions in the present invention.
In embodiments of the invention, proper nouns including the names of regions may be stored, for example, by extracting words related to the names of regions from a public dictionary.
Calculating a first score for the first match result, comprising:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
taking the product of the basic score and the times as a first initial score corresponding to the any region name;
determining the maximum value and the second maximum value of all the first initial scores, wherein the second maximum value is the first initial score that is smaller than the maximum value and larger than all the remaining first initial scores;
dividing the maximum value by the second maximum value, and taking the resulting ratio as the first score;
in this case, dividing the news manuscript into target categories corresponding to the first matching result includes:
classifying the news manuscript under the region name corresponding to the maximum of all the first initial scores.
In the embodiment of the present invention, the region names may be classified into three categories, a provincial level, a city level and a district level, as shown in table 1.
In the embodiment of the present invention, the basic scores corresponding to region names of different levels may differ. For example, the basic score corresponding to a provincial-level region name may be greater than that of a city-level region name, and the basic score corresponding to a city-level region name may be greater than that of a district-level region name.
In the embodiment of the present invention, when it is determined that the first score meets the first preset condition, optionally, the following manner may be adopted:
determining that the first score is greater than or equal to 1.5.
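The first-score rule can be sketched as below. The function names and the uniform basic score of 10 in the usage note are illustrative assumptions. The source does not say what happens when no second largest value exists; the sketch assumes a single matched name is unambiguous (infinite ratio) and a tie among all names is ambiguous (ratio 1.0).

```python
import math


def first_score(counts, base_scores):
    """Return (best_region, first_score) for the region names matched in a title.

    counts: region name -> number of occurrences in the title (at least one entry).
    base_scores: region name -> basic score for that name's level.
    """
    # First initial score = basic score x number of occurrences.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    best = max(initial, key=initial.get)
    maximum = initial[best]
    # Second largest value: the largest initial score strictly below the maximum.
    smaller = [value for value in initial.values() if value < maximum]
    if smaller:
        return best, maximum / max(smaller)
    # Assumption: the source leaves this case undefined.
    return best, (math.inf if len(initial) == 1 else 1.0)


def meets_first_condition(score, threshold=1.5):
    """First preset condition: the first score is greater than or equal to 1.5."""
    return score >= threshold
```

With counts `{"Wuhan": 1, "Nanjing": 1, "Hangzhou": 2}` and a basic score of 10 for each name, the initial scores are 10, 10 and 20, so the ratio is 2.0 and the condition holds.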
The above describes classifying news according to news titles. In practical applications, when the first score does not satisfy the first preset condition, the news cannot be classified according to its title; in that case, the news may be further classified according to the news text content. Alternatively, the news may be classified directly according to the news text content, or directly according to the full text of the news, where classification according to the full text may use the same method as classification according to news titles or the same method as classification according to news text content. In one embodiment, a method for classifying news according to news text content comprises the following steps:
extracting the news text content of the news manuscript;
performing target category matching on the news text content to obtain a second matching result;
and calculating a second score of the second matching result, and dividing the news manuscript into target categories corresponding to the second matching result when the second score meets a second preset condition.
In one embodiment, the first matching result of the news headlines of the news manuscript is firstly used for classification; if the news articles cannot be classified according to their news headlines, the news articles may then be classified using the news body content of the news articles, as shown in FIG. 1B.
In the embodiment of the present invention, when performing target category matching on the news text content to obtain a second matching result, optionally, the following method may be adopted:
performing area name matching on the news text content to obtain at least one area name;
when calculating the second score of the second matching result, optionally, the following manner may be adopted:
for any one of the at least one region name, respectively executing:
determining the basic score corresponding to the region name and the number of times the region name appears in the news text content;
taking the product of the basic score and the number of times as the second initial score corresponding to the region name;
determining the maximum value among all second initial scores and the number of times the target region name appears in the news text content, wherein the target region name is the region name corresponding to the maximum value;
subtracting the number of times corresponding to each of the remaining region names from the number of times the target region name appears in the news text content, and taking the resulting value as the second score;
wherein the remaining region names are the region names, among the at least one region name, other than the region name corresponding to the maximum value;
when the news manuscript is divided into the target categories corresponding to the second matching results, optionally, the following method may be adopted:
and dividing the news manuscript into the region names corresponding to the maximum second initial scores in all the second initial scores.
For example, the news text content is "Last week, the 2015 Asia Pacific China Hangzhou elite league kicked off in heavy rain. Eight amateur football teams from the city gathered at the football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Management Center, and is part of the 34th West Lake Cup Super League of Hangzhou. Besides local football fans, the spectators include representatives of amateur football clubs from areas such as Ningbo." The region names obtained by matching are Hangzhou and Ningbo.
In the embodiment of the present invention, there are various ways to determine that the second score satisfies the second preset condition, and optionally, the following ways may be adopted:
determining that the second score is greater than or equal to 3.
For example, the news text content is "Last week, the 2015 Asia Pacific China Hangzhou elite league kicked off in heavy rain. Eight amateur football teams from the city gathered at the football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Management Center, and is part of the 34th West Lake Cup Super League of Hangzhou. Besides local football fans, the spectators include representatives of amateur football clubs from areas such as Ningbo." Matching the text content yields the region names Hangzhou and Ningbo; "Hangzhou" occurs 4 times and "Ningbo" occurs once. The first initial score of Hangzhou is 10 × 4 = 40 and the first initial score of Ningbo is 10 × 1 = 10. It is first judged that the value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo is greater than 1.5, so the news manuscript is classified as a local news manuscript of Hangzhou. If the value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo were less than 1.5, it would then be judged whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3. In the above example, this difference is 4 − 1 = 3, so the news manuscript would likewise be classified as a local news manuscript of Hangzhou.
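The second-score rule for news text content can be sketched as follows. The function names are hypothetical. The source wording leaves open how several remaining region names are combined; this sketch assumes their occurrence counts are summed before being subtracted from the target count.

```python
def second_score(counts, base_scores):
    """Return (target_region, second_score) for region names matched in the body.

    counts: region name -> number of occurrences in the text (at least one entry).
    base_scores: region name -> basic score for that name's level.
    """
    # Second initial score = basic score x number of occurrences.
    initial = {name: base_scores[name] * n for name, n in counts.items()}
    target = max(initial, key=initial.get)
    # Assumption: subtract the summed occurrences of all remaining region names.
    remaining = sum(n for name, n in counts.items() if name != target)
    return target, counts[target] - remaining


def meets_second_condition(score, threshold=3):
    """Second preset condition: the second score is greater than or equal to 3."""
    return score >= threshold
```

With counts `{"Hangzhou": 4, "Ningbo": 1}` and a basic score of 10 for both names, the target is Hangzhou and the second score is 4 − 1 = 3, which meets the threshold.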
In the embodiment of the present invention, when training a classification model based on the news manuscript and the corresponding target category, optionally, the following method may be adopted:
acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
and obtaining the classification model based on the training corpus.
In the embodiment of the present invention, when obtaining the classification model based on the training corpus, the following method may optionally be adopted:
extracting keywords from each news manuscript in the training corpus by using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
according to the corresponding manuscript attributes and the keywords, coding each news manuscript into a feature vector;
carrying out feature selection and feature combination on the training corpus coded into the feature vector;
and training on the training corpus subjected to feature selection and feature combination by using a multi-class logistic model to obtain the classification model.
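The keyword-extraction and encoding steps can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions: documents are assumed to be pre-tokenized and non-empty, the TF-IDF weighting uses relative term frequency times `log(N / df)` (one common variant, not necessarily the one used in the invention), and the one-hot encoding and the `media` attribute name are hypothetical. Feature selection, feature combination and the logistic training itself are not shown.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """docs: list of non-empty token lists. Returns the top_k TF-IDF tokens per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    keywords = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda token: (-scores[token], token))
        keywords.append(ranked[:top_k])
    return keywords


def encode(keywords, attributes, vocabulary):
    """One-hot encode keywords plus manuscript attributes over a fixed vocabulary."""
    features = set(keywords) | {f"{key}={value}" for key, value in attributes.items()}
    return [1 if feature in features else 0 for feature in vocabulary]
```

Each encoded vector, paired with the region name assigned by the scoring rules, would form one training sample for the multi-class model.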
In the embodiment of the present invention, when the news articles are classified based on the classification model, optionally, the following method may be adopted:
predicting the probability of the region of the news manuscript according to the classification model;
and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
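The thresholded prediction step can be sketched as follows. The sketch assumes the trained multi-class model exposes one raw score (logit) per region, which a softmax turns into probabilities; the default threshold of 0.8 is a hypothetical value, since the source does not give one.

```python
import math


def softmax(logits):
    """Convert region -> raw score into region -> probability."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {region: math.exp(value - peak) for region, value in logits.items()}
    total = sum(exps.values())
    return {region: value / total for region, value in exps.items()}


def classify_by_model(logits, threshold=0.8):
    """Return the predicted region if its probability exceeds the threshold, else None."""
    probabilities = softmax(logits)
    region = max(probabilities, key=probabilities.get)
    return region if probabilities[region] > threshold else None
```

A return value of `None` corresponds to the case in which the manuscript cannot be classified.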
In the embodiment of the present invention, when training a classification model based on the news manuscript and the corresponding target category, optionally, the following method may be adopted:
periodically training a classification model based on the news manuscript and the corresponding target category.
FIG. 1C is the main process of obtaining a classification model: obtaining a training corpus, extracting a keyword from each news manuscript in the training corpus, coding each news manuscript into a feature vector according to corresponding manuscript attributes and the keyword, then, performing feature selection and feature combination on the training corpus coded into the feature vector, and then, training the training corpus after performing the feature selection and feature combination to obtain the classification model.
In embodiments of the invention, the classification model may be updated periodically, for example once a day.
In the embodiment of the present invention, the manuscript attribute includes manuscript sending media information and/or manuscript sending time information and the like.
In the embodiment of the invention, if the news manuscript cannot be classified according to the classification model, it may be determined that the news manuscript cannot be classified.
Fig. 1D is a schematic flowchart of classifying news manuscripts according to the classification model: the region to which the news manuscript belongs and the corresponding probability are predicted according to the classification model, and it is determined whether the probability is greater than a threshold value. If so, the news manuscript is taken as a news manuscript of that region; otherwise, it is determined that the manuscript cannot be classified.
Exemplary device
Having introduced the method of an exemplary embodiment of the present invention, next, the apparatus 30, 40 for news classification of an exemplary embodiment of the present invention will be described with reference to fig. 3, 4, respectively, the apparatus 30 including an extracting unit 300, a matching unit 310, a calculating unit 320, a judging unit 330, and a classifying unit 340, wherein:
an extracting unit 300 for extracting a news title of a news manuscript;
the matching unit 310 is configured to perform target category matching on the news headlines to obtain a first matching result;
a calculating unit 320, configured to calculate a first score of the first matching result;
a determining unit 330, configured to determine whether the first score meets a first preset condition;
the classifying unit 340 is configured to, when the determining unit 330 determines that the first score meets the first preset condition, classify the news article into a target category corresponding to the first matching result.
The news article in the embodiment of the present invention may be an online news article, and certainly, may also be a news article of other media, which is not limited specifically herein.
In the embodiment of the present invention, there are various ways for the extracting unit 300 to extract the news headlines of the news manuscript, and the ways are not specifically limited herein.
In the embodiment of the present invention, when the matching unit 310 performs target category matching on the news headline to obtain a first matching result, the method specifically includes:
performing region name matching on the news headline to obtain at least one region name.
For example, for the news title "Beijing house prices compared with Shanghai and Shenzhen house prices", the title matches three region names, so three region names are obtained.
In the embodiment of the present invention, when the matching unit 310 matches the target category of the news headline, an AC automaton algorithm may be used, or other methods may also be used, which are not described in detail herein.
It should be noted that some proper nouns may also include the area name, and in order to improve the accuracy of news classification, the area name included in these proper nouns is not used as the area name obtained by matching in the present invention.
For example, terms such as "Hangzhou Road" and "Shanghai Volkswagen" are not treated as region names in the present invention.
In embodiments of the invention, proper nouns including the names of regions may be stored, for example, by extracting words related to the names of regions from a public dictionary.
In this embodiment of the present invention, optionally, the calculating unit 320 includes a determining unit 320A and a product calculating unit 320B, where:
the determining unit 320A is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
the product calculating unit 320B is configured to take a product of the basic score and the number of times as a first initial score corresponding to the arbitrary region name;
the determining unit 320A is further configured to determine a maximum value and a second largest value of all first initial scores, where the second largest value is a first initial score that is smaller than the maximum value and is larger than all remaining first initial scores of all first initial scores except the maximum value;
the determining unit 320A is further configured to divide the maximum value by the second largest value to obtain a ratio, which is used as the first score;
the classification unit 340 is specifically configured to: and dividing the news manuscript into the region names corresponding to the maximum first initial scores in all the first initial scores.
In the embodiment of the present invention, the region names may be classified into three categories, a provincial level, a city level and a district level, as shown in table 1.
In the embodiment of the present invention, the basic scores corresponding to region names of different levels may differ. For example, the basic score corresponding to a provincial-level region name may be greater than that of a city-level region name, and the basic score corresponding to a city-level region name may be greater than that of a district-level region name.
In this embodiment of the present invention, optionally, when the determining unit 330 determines that the first score meets the first preset condition, specifically:
determining that the first score is greater than or equal to 1.5.
For example, for a news title of "Wuhan, Nanjing people live in the new Hangzhou state of Hangzhou", three place names, "Wuhan", "Nanjing" and "Hangzhou", are matched. Assuming that the basic score of each of the three place names is 10, their first initial scores are 10, 10 and 20 respectively, since "Hangzhou" appears twice. The maximum value is 20, corresponding to Hangzhou, and the second largest value is 10, corresponding to Wuhan or Nanjing. The ratio of the maximum value to the second largest value is 2, which is greater than 1.5, so the preset condition is met. The news manuscript is classified into the category of the place name "Hangzhou" corresponding to the maximum value.
In addition to the above-described classification method, there may be other arbitrary classification methods, such as classifying a news article into a place name having the largest number of occurrences, with the number of occurrences being a first score.
The foregoing describes classifying news according to news titles. In practical applications, when the first score does not satisfy the first preset condition, the news cannot be classified according to its title; in that case, the news may be further classified according to the news text content. Accordingly, the extracting unit 300 is further configured to extract the news text content of the news manuscript;
the matching unit 310 is further configured to perform target category matching on the news text content to obtain a second matching result;
the calculating unit 320 is further configured to calculate a second score of the second matching result;
the determining unit 330 is further configured to determine whether the second score meets a second preset condition;
the classifying unit 340 is further configured to, when the determining unit 330 determines that the second score meets a second preset condition, classify the news article into a target category corresponding to the second matching result.
That is, the classifying unit 340 first classifies using the first matching result of the news headline of the news manuscript; if the news manuscript cannot be classified according to its news headline, it can then be classified using its news text content, as shown in Fig. 1B.
In this embodiment of the present invention, optionally, when the matching unit 310 performs target category matching on the news text content to obtain a second matching result, specifically:
performing area name matching on the news text content to obtain at least one area name;
the calculation unit 320 includes a determination unit 320A and a product calculation unit 320B, wherein:
the determining unit 320A is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
the product calculating unit 320B is configured to take a product of the basic score and the number of times as a second initial score corresponding to the arbitrary region name;
the determining unit 320A is further configured to determine a maximum value of all the second initial scores and a number of times that a target region name appears in the news text content, where the target region name is a region name corresponding to the maximum value;
the calculating unit 320 is further configured to subtract the number of times corresponding to each of the remaining region names from the number of times the target region name appears in the news text content, and to take the resulting value as the second score;
wherein the remaining area names are area names of the at least one area name except the area name corresponding to the maximum value;
the classification unit 340 is specifically configured to: and dividing the news manuscript into the region names corresponding to the maximum second initial scores in all the second initial scores.
For example, the news text content is "Last week, the 2015 Asia Pacific China Hangzhou elite league kicked off in heavy rain. Eight amateur football teams from the city gathered at the football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Management Center, and is part of the 34th West Lake Cup Super League of Hangzhou. Besides local football fans, the spectators include representatives of amateur football clubs from areas such as Ningbo." The region names obtained by matching the news text content are Hangzhou and Ningbo.
In this embodiment of the present invention, optionally, when the determining unit 330 determines that the second score meets the second preset condition, specifically:
determining that the second score is greater than or equal to 3.
For example, the news text content is "Last week, the 2015 Asia Pacific China Hangzhou elite league kicked off in heavy rain. Eight amateur football teams from the city gathered at the football training base, where they will compete for three weeks. The event is sponsored by the Hangzhou Football Association and organized by the Hangzhou Football Management Center, and is part of the 34th West Lake Cup Super League of Hangzhou. Besides local football fans, the spectators include representatives of amateur football clubs from areas such as Ningbo." Matching the news text content yields the region names Hangzhou and Ningbo; "Hangzhou" occurs 4 times and "Ningbo" occurs once. The first initial score of Hangzhou is 10 × 4 = 40 and the first initial score of Ningbo is 10 × 1 = 10. The judging unit 330 first judges that the value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo is greater than 1.5, and the classification unit 340 classifies the news manuscript as a local news manuscript of Hangzhou. If the value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo were less than 1.5, the judging unit 330 would then judge whether the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is greater than or equal to 3. In the above example, this difference is 4 − 1 = 3, so the classification unit 340 would likewise classify the news manuscript as a local news manuscript of Hangzhou.
The method is bootstrapped, does not need any manual input, can process large-scale local news classification requests in real time, has high efficiency and good timeliness, and meets the functional requirements of Internet news products.
As described above, the classification unit 340 first classifies according to the news title and, if that is not possible, classifies according to the news text content. If classification according to the news text content is also not possible, classification according to a classification model may be performed. Therefore, in the embodiment of the present invention, the apparatus further includes an algorithm unit 350, configured to predict, according to the classification model, the probability of the region to which the news manuscript belongs and, when the probability is judged to be greater than the threshold value, to take the news manuscript as a news manuscript of that region.
In this embodiment of the present invention, optionally, the algorithm unit 350 includes an obtaining unit 350A and a training unit 350B, where:
the obtaining unit 350A is configured to obtain a corpus, where the corpus includes the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition;
the training unit 350B is configured to obtain the classification model based on the training corpus.
It should be noted that the step of predicting the probability of the region to which the news manuscript belongs according to the classification model and the step of obtaining the corpus and obtaining the classification model based on the corpus are performed in parallel according to respective requirements.
In this embodiment of the present invention, optionally, the algorithm unit 350 further includes an encoding unit 350C and a feature processing unit 350D, where:
the extracting unit 300 is further configured to extract keywords from each news manuscript in the training corpus by using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
the encoding unit 350C is further configured to encode each news manuscript into a feature vector according to the corresponding manuscript attribute and the corresponding keyword;
the feature processing unit 350D is configured to perform feature selection and feature combination on the corpus encoded into the feature vector;
the training unit 350B is further configured to train on the training corpus after feature selection and feature combination by using a multi-class logistic model, so as to obtain the classification model.
FIG. 1C illustrates the main process of obtaining a classification model according to one embodiment: the obtaining unit 350A obtains a training corpus, the extracting unit 300 extracts a keyword from each news manuscript in the training corpus, the encoding unit 350C encodes each news manuscript into a feature vector according to the corresponding manuscript attribute and the keyword, the feature processing unit 350D performs feature selection and feature combination on the training corpus encoded into the feature vector, and the training unit 350B performs training on the training corpus after feature selection and feature combination to obtain the classification model.
In embodiments of the invention, the classification model may be updated periodically, for example once a day.
In the embodiment of the present invention, the manuscript attribute includes manuscript sending media information and/or manuscript sending time information and the like.
In the embodiment of the present invention, if the classifying unit 340 cannot classify the news articles according to the classification model, it may be determined that the news articles cannot be classified.
Fig. 1D is a schematic flowchart of classifying news manuscripts according to the classification model: the region to which the news manuscript belongs and the corresponding probability are predicted according to the classification model, and it is determined whether the probability is greater than a threshold value. If so, the news manuscript is taken as a news manuscript of that region; otherwise, it is determined that the manuscript cannot be classified.
According to the scheme, the scoring rule is firstly used for bootstrap classification to obtain local news manuscripts with high accuracy, then a supervised machine learning algorithm trains a classification model based on the manuscripts, and other news manuscripts are supplemented and classified, so that the purpose of no manual input is achieved, large-scale local news classification requests can be processed in real time, and the functional requirements of internet news products are met.
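The bootstrapped scheme can be sketched as a cascade. This is a simplified illustration under assumptions: `title_rule`, `body_rule` and `model` are hypothetical callables standing in for the title scoring rule, the text-content scoring rule and the trained classification model, each returning a region name or `None`.

```python
def cascade_classify(manuscript, title_rule, body_rule, model, corpus):
    """Try the title rule, then the body rule, then the trained model.

    title_rule, body_rule: callables returning a region name, or None when
        the corresponding preset condition is not met.
    model: callable returning a region name, or None when the predicted
        probability does not exceed the threshold.
    corpus: list collecting (manuscript, region) pairs labelled by the rules,
        to be used later as the training corpus.
    """
    for rule in (title_rule, body_rule):
        region = rule(manuscript)
        if region is not None:
            corpus.append((manuscript, region))  # rule-labelled training sample
            return region
    # Fall back to the model; None means the manuscript cannot be classified.
    return model(manuscript)
```

Only rule-labelled manuscripts are added to the corpus, which reflects the idea that the high-accuracy rule output supervises the model without manual labelling.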
Referring to fig. 4, the apparatus 40 includes a matching unit 400, a calculating unit 410, a determining unit 420, a classifying unit 430 and an algorithm unit 440, wherein:
the matching unit 400 is used for performing target category matching on the news manuscript to obtain a matching result;
a calculating unit 410, configured to calculate a score of the matching result;
a judging unit 420, configured to judge whether the score meets a preset condition;
a classifying unit 430, configured to, when the determining unit 420 determines that the score meets a preset condition, classify the news article into a target category corresponding to the matching result;
an algorithm unit 440, configured to train a classification model based on the news contribution and the corresponding target category;
the classifying unit 430 is further configured to classify the news manuscript based on the classification model when the judging unit 420 judges that the score does not satisfy the preset condition.
The news article in the embodiment of the present invention may be an online news article, and certainly, may also be a news article of other media, which is not limited specifically herein.
In one embodiment, the target category matching performed by the matching unit 400 on the news manuscript includes target category matching on the title of the news manuscript.
In the embodiment of the present invention, there are various ways to extract the news headlines of the news manuscript, and no specific limitation is made herein.
In one embodiment, the target category matching performed by the matching unit 400 on the news manuscript includes target category matching on the text content of the news manuscript.
In one embodiment, the target category matching performed by the matching unit 400 on the news manuscript includes target category matching on the full text of the news manuscript.
In one embodiment, the performing of the target category matching on the news manuscript by the matching unit 400 includes performing the target category matching on the title of the news manuscript first, and if the classification cannot be performed according to the title, continuing performing the target category matching on the text content of the news manuscript.
In this embodiment of the present invention, optionally, the apparatus further includes an extracting unit 450, configured to extract a news title of the news manuscript;
the matching unit 400 is specifically configured to perform target category matching on the news headlines to obtain a first matching result;
the calculating unit 410 is specifically configured to calculate a first score of the first matching result;
the classifying unit 430 is specifically configured to, when the determining unit 420 determines that the first score meets a first preset condition, classify the news article into a target category corresponding to the first matching result.
In the embodiment of the present invention, when the matching unit 400 matches the target category of the news headline, an AC automaton algorithm may be used, or other methods may be used, which are not described in detail herein.
In this embodiment of the present invention, optionally, when the matching unit 400 performs target category matching on the news headline to obtain a first matching result, the method specifically includes:
performing region name matching on the news headline to obtain at least one region name.
For example, for the news title "Beijing house prices compared with Shanghai and Shenzhen house prices", the title matches three region names, so three region names are obtained.
It should be noted that some proper nouns may also include the area name, and in order to improve the accuracy of news classification, the area name included in these proper nouns is not used as the area name obtained by matching in the present invention.
For example, terms such as "Hangzhou Road" and "Shanghai Volkswagen" are not treated as region names in the present invention.
In embodiments of the invention, proper nouns including the names of regions may be stored, for example, by extracting words related to the names of regions from a public dictionary.
The calculation unit 410 includes a determination unit 410A and a product calculation unit 410B, wherein:
the determining unit 410A is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
the product calculating unit 410B is configured to take a product of the basic score and the number of times as a first initial score corresponding to the arbitrary region name;
the determining unit 410A is further configured to determine a maximum value and a second largest value of all first initial scores, where the second largest value is a first initial score that is smaller than the maximum value and is larger than all remaining first initial scores of all first initial scores except the maximum value;
the determining unit 410A is further configured to divide the maximum value by the second largest value to obtain a ratio, which is used as the first score;
the classification unit 430 is specifically configured to: classify the news manuscript into the category of the region name corresponding to the maximum of all the first initial scores.
In the embodiment of the present invention, the region names may be classified into three categories, a provincial level, a city level and a district level, as shown in table 1.
In the embodiment of the present invention, the basic scores corresponding to region names of different levels may be different. For example, the basic score corresponding to a provincial-level region name may be greater than the basic score corresponding to a city-level region name, and the basic score corresponding to a city-level region name may be greater than the basic score corresponding to a district-level region name.
In this embodiment of the present invention, when the determining unit 420 determines that the first score meets the first preset condition, optionally, the following manner may be adopted:
determining that the first score is greater than or equal to 1.5.
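A minimal sketch of the first-score rule described above: each region's first initial score is its basic score times its occurrence count, and the first score is the maximum divided by the second largest value. The function names and the basic score values are hypothetical. The text does not say what happens when only one region is matched or when all regions tie, so both cases below are labeled assumptions.

```python
import math

FIRST_THRESHOLD = 1.5  # first preset condition stated in the text


def first_score(counts, base_scores):
    """Return (best_region, ratio) from region counts found in a title."""
    # First initial score = basic score x number of occurrences.
    initial = {region: base_scores[region] * n for region, n in counts.items()}
    if not initial:
        return None, 0.0
    best = max(initial, key=initial.get)
    top = initial[best]
    # Second largest value: the largest score strictly below the maximum.
    lower = [score for score in initial.values() if score < top]
    if lower:
        return best, top / max(lower)
    # Assumptions (not specified in the text): a lone region is decisive,
    # while regions tied for the maximum give a ratio of 1.
    return best, (math.inf if len(initial) == 1 else 1.0)


def meets_first_condition(score):
    """Check the first preset condition (score >= 1.5)."""
    return score >= FIRST_THRESHOLD
```

Under these assumptions, a title that mentions three regions once each with equal basic scores yields a ratio of 1 and is not classified from the title alone.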
In practical applications, when the first score does not satisfy the first preset condition, the news manuscript cannot be classified according to its news title. In this case, the news manuscript may be further classified according to its news body content, or classified directly according to its full text. Classification according to the full text may use the same method as classification according to the news title, or the same method as classification according to the news body content. In an embodiment, the extracting unit 450 is further configured to extract news body content of the news manuscript;
the matching unit 400 is further configured to perform target category matching on the news text content to obtain a second matching result;
the calculating unit 410 is further configured to calculate a second score of the second matching result;
the determining unit 420 is further configured to determine whether the second score meets a second preset condition;
the classifying unit 430 is further configured to, when the determining unit 420 determines that the second score meets a second preset condition, classify the news article into a target category corresponding to the second matching result.
In one embodiment, the first matching result of the news headlines of the news manuscript is firstly used for classification; if the news articles cannot be classified according to their news headlines, the news articles may then be classified using the news body content of the news articles, as shown in FIG. 1B.
In this embodiment of the present invention, optionally, when the matching unit 400 performs target category matching on the news text content to obtain a second matching result, the method specifically includes:
performing area name matching on the news text content to obtain at least one area name;
the calculation unit 410 includes a determination unit 410A and a product calculation unit 410B, wherein:
the determining unit 410A is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
the product calculating unit 410B is configured to take a product of the basic score and the number of times as a second initial score corresponding to the arbitrary region name;
the determining unit 410A is further configured to determine a maximum value of all the second initial scores and a number of times that a target area name appears in the news text content, where the target area name is a area name corresponding to the maximum value;
the calculating unit 410 is further configured to subtract, from the number of times that the target area name appears in the news text content, the number of times corresponding to each of the remaining area names, and to use the resulting value as the second score;
wherein the remaining area names are area names of the at least one area name except the area name corresponding to the maximum value;
the classification unit 430 is specifically configured to: classify the news manuscript into the category of the region name corresponding to the maximum of all the second initial scores.
For example, the content of the news text is "last week, 2015 Asia Pacific China expecting the Hangzhou project elite league to sound in heavy rain. 8 hang city amateur football teams oligomerize cottage football training bases, and they will be expanded here to a volume of three weeks. The activities are sponsored by the Hangzhou city football association and born by the Hangzhou city football management center, and are part of the 34th west lake cup super league in Hangzhou city. Fans coming to war include amateur football club representatives from areas such as Ningbo, in addition to local football fans", and the region names obtained by matching are Hangzhou and Ningbo.
In this embodiment of the present invention, there are various ways for determining, by the determining unit 420, that the second score satisfies the second preset condition, and optionally, the following ways may be adopted:
determining that the second score is greater than or equal to 3.
For example, the content of the news text is "last week, 2015 Asia Pacific China expecting the Hangzhou project elite league to sound in heavy rain. 8 hang city amateur football teams oligomerize cottage football training bases, and they will be expanded here to a volume of three weeks. The activities are sponsored by the Hangzhou city football association and born by the Hangzhou city football management center, and are part of the 34th west lake cup super league in Hangzhou city. The fans who come to war, except local football fans, also include amateur football club representatives from areas such as Ningbo, and the like". The region names obtained by matching the text content are Hangzhou and Ningbo; Hangzhou occurs 4 times and Ningbo occurs 1 time. The first initial score of Hangzhou is 10 × 4 = 40 and the first initial score of Ningbo is 10 × 1 = 10. The value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo (40 / 10 = 4) is judged to be greater than 1.5, so the news manuscript is classified as a local news manuscript of Hangzhou. If the value obtained by dividing the first initial score of Hangzhou by the first initial score of Ningbo were less than 1.5, it would then be judged whether the value obtained by subtracting the number of occurrences of Ningbo from the number of occurrences of Hangzhou is greater than or equal to 3. In the above example, the number of occurrences of Hangzhou minus the number of occurrences of Ningbo is 4 - 1 = 3; thus, the news manuscript is classified as a local news manuscript of Hangzhou.
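The body-text rule in this example can be sketched as below. The function names are hypothetical. When more than two regions are matched, the text speaks of subtracting the count of "each" remaining region; this sketch assumes the margin must hold against every remaining region and therefore uses the smallest difference. The single-region case is likewise an assumption.

```python
SECOND_THRESHOLD = 3  # second preset condition stated in the text


def second_score(counts, base_scores):
    """Return (target_region, margin) from region counts found in the body."""
    # Second initial score = basic score x number of occurrences.
    initial = {region: base_scores[region] * n for region, n in counts.items()}
    if not initial:
        return None, 0
    target = max(initial, key=initial.get)
    others = [n for region, n in counts.items() if region != target]
    if not others:
        # Assumption: with no competing region, use the target's own count.
        return target, counts[target]
    # Assumption: take the smallest margin over all remaining regions.
    return target, counts[target] - max(others)


def meets_second_condition(score):
    """Check the second preset condition (score >= 3)."""
    return score >= SECOND_THRESHOLD
```

With the counts from the example (Hangzhou 4, Ningbo 1, basic score 10 each), the margin is 3, which meets the second preset condition.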
In this embodiment of the present invention, optionally, the algorithm unit 440 includes an obtaining unit 440A and a training unit 440B, where:
the obtaining unit 440A is configured to obtain a corpus, where the corpus includes the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
the training unit 440B is configured to obtain the classification model based on the training corpus.
In this embodiment of the present invention, optionally, the algorithm unit 440 further includes an encoding unit 440C and a feature processing unit 440D, where:
the extracting unit 450 is further configured to extract keywords from each news manuscript in the training corpus by using a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
the encoding unit 440C is configured to encode each news manuscript into a feature vector according to the corresponding manuscript attribute and the corresponding keyword;
the feature processing unit 440D is configured to perform feature selection and feature combination on the corpus encoded as feature vectors;
the training unit 440B is configured to train the corpus after feature selection and feature combination by using a multi-classification logistic model, so as to obtain the classification model.
In this embodiment of the present invention, optionally, the algorithm unit 440 is specifically configured to predict a probability of an area to which the news article belongs according to a classification model; and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
In this embodiment of the present invention, optionally, the algorithm unit 440 is specifically configured to periodically train a classification model based on the news article and the corresponding target category.
FIG. 1C is the main process of obtaining a classification model: the obtaining unit 440A obtains a corpus, the extracting unit 450 extracts a keyword from each news manuscript in the corpus, the encoding unit 440C encodes each news manuscript into a feature vector according to the corresponding manuscript attribute and the keyword, the feature processing unit 440D performs feature selection and feature combination on the corpus encoded into the feature vector, and the training unit 440B performs training on the corpus after feature selection and feature combination to obtain the classification model.
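The keyword extraction step of this process can be illustrated with a small, dependency-free TF-IDF sketch. It assumes each news manuscript has already been tokenized into a list of words (word segmentation is outside the scope of this sketch), and it uses the plain tf × log(N / df) weighting, which is only one of several common TF-IDF variants. The function name `tfidf_keywords` and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords
```

The selected keywords, together with manuscript attributes such as the publishing media, would then be encoded into a feature vector before the classification model is trained.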
In embodiments of the invention, the classification model may be updated periodically, for example once a day.
In the embodiment of the present invention, the manuscript attribute includes manuscript sending media information and/or manuscript sending time information and the like.
In the embodiment of the invention, if the news manuscript cannot be classified according to the classification model, the news manuscript is judged to be unclassifiable.
FIG. 1D is a schematic flow chart of classifying a news manuscript according to the classification model: the region to which the news manuscript belongs, and the corresponding probability, are predicted according to the classification model, and it is determined whether the probability is greater than a threshold. If so, the news manuscript is taken as a news manuscript of that region; otherwise, the news manuscript is determined to be unclassifiable.
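The overall decision order (title rule, then body rule, then model prediction with a probability threshold) can be sketched as follows. The function names are hypothetical, the model output is represented simply as a mapping from region to predicted probability, and the threshold value of 0.5 is an assumption, since the text does not state one.

```python
def classify_with_model(probabilities, threshold=0.5):
    """Return the most probable region, or None if it is not above threshold."""
    if not probabilities:
        return None
    region = max(probabilities, key=probabilities.get)
    return region if probabilities[region] > threshold else None


def classify(title_result, body_result, probabilities, threshold=0.5):
    """Use the title result, then the body result, then the model fallback."""
    for result in (title_result, body_result):
        if result is not None:
            return result
    # None here means the manuscript is treated as unclassifiable.
    return classify_with_model(probabilities, threshold)
```

Here `title_result` and `body_result` stand for the region chosen by the rule-based stages, or `None` when the corresponding preset condition was not met.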
Exemplary device
Having described the method and apparatus of an exemplary embodiment of the present invention, an apparatus for news classification according to another exemplary embodiment of the present invention is described next.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
In some possible embodiments, the apparatus for news classification according to the present invention may comprise at least one processing unit and at least one storage unit, wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the news classification method according to various exemplary embodiments of the present invention described in the above "exemplary methods" section of this specification. For example, the processing unit may perform step 100 as shown in fig. 1A: extracting the news title of the news manuscript; step 110: performing target category matching on the news headline to obtain a first matching result; step 120: calculating a first score of the first matching result, and classifying the news manuscript into the target category corresponding to the first matching result when the first score meets a first preset condition. For another example, the processing unit may perform step 200 as shown in fig. 2: performing target category matching on the news manuscript to obtain a matching result; step 210: calculating the score of the matching result, and classifying the news manuscript into the target category corresponding to the matching result when the score meets a preset condition; step 220: training a classification model based on the news manuscripts and the corresponding target categories, and classifying the news manuscript based on the classification model when the score does not meet the preset condition.
An apparatus 50 for news classification according to this embodiment of the present invention is described below with reference to fig. 5. The apparatus 50 for news classification shown in fig. 5 is only an example, and should not impose any limitation on the function and the scope of use of the embodiment of the present invention.
As shown in fig. 5, the apparatus 50 for news classification is embodied in the form of a general purpose computing device. The components of the apparatus 50 for news classification may include, but are not limited to: at least one processing unit 516, at least one storage unit 528, and a bus 518 that connects the various system components, including the storage unit 528 and the processing unit 516.
Bus 518 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
Storage unit 528 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 530 and/or cache memory 532, and may further include Read Only Memory (ROM) 534.
The storage unit 528 may also include a program/utility 540 having a set (at least one) of program modules 542, such program modules 542 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The apparatus for news classification 50 may also communicate with one or more external devices 514 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the apparatus for news classification 50, and/or with any device (e.g., router, modem, etc.) that enables the apparatus for news classification 50 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 522. Also, the apparatus 50 for news classification may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through the network adapter 520. As shown, the network adapter 520 communicates with the other modules of the apparatus 50 for news classification via the bus 518. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the apparatus 50 for news classification, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Exemplary program product
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the steps in the method for news classification according to various exemplary embodiments of the present invention described in the above "exemplary method" section of this specification. For example, the terminal device may perform step 100 as shown in fig. 1A: extracting the news title of the news manuscript; step 110: performing target category matching on the news headline to obtain a first matching result; step 120: calculating a first score of the first matching result, and classifying the news manuscript into the target category corresponding to the first matching result when the first score meets a first preset condition. For another example, the terminal device may perform step 200 as shown in fig. 2: performing target category matching on the news manuscript to obtain a matching result; step 210: calculating the score of the matching result, and classifying the news manuscript into the target category corresponding to the matching result when the score meets a preset condition; step 220: training a classification model based on the news manuscripts and the corresponding target categories, and classifying the news manuscript based on the classification model when the score does not meet the preset condition.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 6, a program product 60 for news classification according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several devices or sub-devices of the apparatus for news classification are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the devices described above may be embodied in one device, according to embodiments of the invention. Conversely, the features and functions of one device described above may be further divided so as to be embodied by a plurality of devices.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects mean that features in these aspects cannot be combined to advantage; this division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (30)

1. A method of news classification, comprising:
extracting news titles of the news manuscripts;
performing target category matching on the news headlines to obtain a first matching result;
calculating a first score of the first matching result, and when the first score meets a first preset condition, dividing the news manuscript into a target category corresponding to the first matching result;
performing target category matching on the news headlines to obtain a first matching result, wherein the step of performing target category matching on the news headlines comprises the following steps:
performing area name matching on the news headlines to obtain at least one area name;
calculating a first score for the first match result, comprising:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
taking the product of the basic score and the times as a first initial score corresponding to the any region name;
determining a maximum value and a second largest value of all first initial scores, wherein the second largest value is a first initial score which is smaller than the maximum value and is larger than all remaining first initial scores except the maximum value;
dividing the maximum value by the second maximum value to obtain a ratio as the first score;
dividing the news manuscript into target categories corresponding to the first matching results, wherein the target categories comprise:
and dividing the news manuscript into the category of the region name corresponding to the maximum first initial score in all the first initial scores.
2. The method of claim 1, determining that the first score satisfies a first preset condition, comprising:
determining that the first score is greater than or equal to 1.5.
3. The method of claim 1, if it is determined that the first score does not satisfy the first predetermined condition, the method further comprising:
extracting the news text content of the news manuscript;
performing target category matching on the news text content to obtain a second matching result;
and calculating a second score of the second matching result, and dividing the news manuscript into target categories corresponding to the second matching result when the second score meets a second preset condition.
4. The method of claim 3, wherein performing a target category matching on the news body content to obtain a second matching result comprises:
performing area name matching on the news text content to obtain at least one area name;
calculating a second score for the second match, comprising:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
taking the product of the basic score and the times as a second initial score corresponding to the any region name;
determining the maximum value of all the second initial scores and the frequency of occurrence of a target region name in the news text content, wherein the target region name is the region name corresponding to the maximum value;
subtracting the times corresponding to each of the remaining area names from the times of the target area name appearing in the news text content to obtain a value as the second score;
wherein: the remaining area names are area names except the area name corresponding to the maximum value in the at least one area name;
dividing the news manuscript into target categories corresponding to the second matching results, including:
and dividing the news manuscript into the category of the region name corresponding to the maximum second initial score in all the second initial scores.
5. The method of claim 3 or 4, determining that the second score satisfies a second preset condition, comprising:
determining that the second score is greater than or equal to 3.
6. The method of claim 3, after determining that the second score does not satisfy the second preset condition, the method further comprising:
predicting the probability of the region of the news manuscript according to the classification model;
and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
7. The method of claim 6, prior to predicting a probability of a region to which the news article belongs based on a classification model, the method further comprising:
acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
obtaining the classification model based on the training corpus.
8. The method of claim 7, deriving the classification model based on the training corpus, comprising:
extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
according to the corresponding manuscript attributes and the keywords, coding each news manuscript into a feature vector;
carrying out feature selection and feature combination on the training corpus coded into the feature vector;
and training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
9. An apparatus for news classification, comprising:
the extracting unit is used for extracting news titles of the news manuscripts;
the matching unit is used for carrying out target category matching on the news headlines to obtain a first matching result;
a calculating unit, configured to calculate a first score of the first matching result;
the judging unit is used for judging whether the first score meets a first preset condition or not;
the classification unit is used for classifying the news manuscript into a target category corresponding to the first matching result when the judgment unit judges that the first score meets the first preset condition;
wherein,
the matching unit performs target category matching on the news headlines, and when a first matching result is obtained, the method specifically comprises the following steps:
performing area name matching on the news headlines to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
the product calculating unit is used for taking the product of the basic score and the times as a first initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value and a second largest value of all the first initial scores, where the second largest value is a first initial score that is smaller than the maximum value and is larger than all remaining first initial scores of all the first initial scores except the maximum value;
the determining unit is further configured to divide the maximum value by the second largest value to obtain a ratio, which is used as the first score;
the classification unit is specifically configured to: and dividing the news manuscript into the category of the region name corresponding to the maximum first initial score in all the first initial scores.
10. The apparatus according to claim 9, wherein when the determining unit determines that the first score satisfies a first preset condition, the determining unit specifically:
determining that the first score is greater than or equal to 1.5.
11. The apparatus of claim 9, the extracting unit further configured to extract news body content of the news article;
the matching unit is also used for carrying out target category matching on the news text content to obtain a second matching result;
the calculating unit is further used for calculating a second score of the second matching result;
the judging unit is further configured to judge whether the second score meets a second preset condition;
the classification unit is further configured to classify the news article into a target category corresponding to the second matching result when the judgment unit judges that the second score meets a second preset condition.
12. The apparatus according to claim 11, wherein the matching unit performs object category matching on the news text content, and when a second matching result is obtained, the method specifically comprises:
performing area name matching on the news text content to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
the product calculating unit is used for taking the product of the basic score and the times as a second initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value of all the second initial scores and a number of times that a target area name appears in the news text content, where the target area name is an area name corresponding to the maximum value;
the calculating unit is further configured to subtract, from the number of times that the target area name appears in the news text content, the number of times corresponding to each of the remaining area names, and to use the resulting value as the second score;
wherein the remaining area names are area names of the at least one area name except the area name corresponding to the maximum value;
the classification unit is specifically configured to: and dividing the news manuscript into the category of the region name corresponding to the maximum second initial score in all the second initial scores.
13. The apparatus according to claim 11 or 12, wherein, when judging that the second score meets the second preset condition, the judging unit specifically performs:
determining that the second score is greater than or equal to 3.
14. The apparatus of claim 11, further comprising an algorithm unit for predicting, based on a classification model, the probability of the region to which the news manuscript belongs; and, when the probability is judged to be greater than a threshold, taking the news manuscript as a news manuscript of that region.
15. The apparatus of claim 14, the algorithm unit comprising an acquisition unit and a training unit, wherein:
the acquiring unit is used for acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition;
and the training unit is used for obtaining the classification model based on the training corpus.
16. The apparatus of claim 15, the algorithm unit further comprising an encoding unit and a feature processing unit, wherein:
the extraction unit is also used for extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
the encoding unit is used for encoding each news manuscript into a characteristic vector according to the corresponding manuscript attribute and the corresponding keyword;
the feature processing unit is used for performing feature selection and feature combination on the training corpus coded into the feature vectors;
the training unit is further used for training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
17. A method of news classification, comprising:
carrying out target category matching on the news manuscript to obtain a matching result;
calculating the score of the matching result, and when the score meets a preset condition, dividing the news manuscript into a target category corresponding to the matching result;
training a classification model based on the news articles and the corresponding target categories; when the score is judged not to meet the preset condition, classifying the news manuscript based on the classification model;
the method for matching the news manuscript with the target category to obtain the matching result comprises the following steps:
extracting news titles of the news manuscripts; and
performing target category matching on the news headlines to obtain a first matching result;
calculating the score of the matching result and, when the score meets the preset condition, dividing the news manuscript into the target category corresponding to the matching result comprises:
calculating a first score of the first matching result, and when the first score meets a first preset condition, dividing the news manuscript into a target category corresponding to the first matching result;
the target category matching of the news headlines to obtain a first matching result includes:
performing area name matching on the news headlines to obtain at least one area name;
calculating a first score for the first match result, comprising:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
taking the product of the basic score and the times as a first initial score corresponding to the any region name;
determining a maximum value and a second largest value of all first initial scores, wherein the second largest value is a first initial score which is smaller than the maximum value and is larger than all remaining first initial scores except the maximum value;
dividing the maximum value by the second maximum value to obtain a ratio as the first score;
dividing the news manuscript into the target category corresponding to the first matching result comprises:
and dividing the news manuscript into the category of the region name corresponding to the maximum first initial score in all the first initial scores.
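The first-score computation recited in claim 17 can be sketched as follows. This is only an illustrative sketch: the base scores in `BASE_SCORES`, the function name `first_score`, and the use of plain substring counting are assumptions, since the claim does not fix any of them. The claim also does not say what happens when no second largest value exists, so the sketch returns `None` for the ratio in that case.

```python
# Hypothetical base scores per region name; the claim gives no concrete values.
BASE_SCORES = {"Hangzhou": 2.0, "Zhejiang": 1.0, "Beijing": 2.0}


def first_score(title, base_scores=BASE_SCORES):
    """Return (region, ratio) for a news title.

    region is the name with the largest first initial score, and ratio is
    that maximum divided by the second largest first initial score.
    Returns (None, None) if no region name appears in the title, and
    (region, None) if there is no strictly smaller second largest value.
    """
    initial = {}
    for name, base in base_scores.items():
        count = title.count(name)  # number of occurrences in the title
        if count:
            initial[name] = base * count  # first initial score
    if not initial:
        return None, None
    region = max(initial, key=initial.get)
    top = initial[region]
    # Second largest value: smaller than the maximum, larger than the rest.
    lower = [score for score in initial.values() if score < top]
    if not lower:
        return region, None  # assumption: ratio is undefined here
    return region, top / max(lower)
```

A caller would then compare the returned ratio with the first preset condition, whose concrete form is not stated in this claim.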
18. The method of claim 17, if it is determined that the first score does not satisfy the first predetermined condition, the method further comprising:
extracting the news text content of the news manuscript;
performing target category matching on the news text content to obtain a second matching result;
and calculating a second score of the second matching result, and dividing the news manuscript into target categories corresponding to the second matching result when the second score meets a second preset condition.
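The fallback order implied by claims 17 and 18 (title rule first, then body rule, then the trained classification model) can be pictured as a simple chain. The sketch below is an assumption about how such a chain might be wired: `classify_cascade` and the rule callables are hypothetical names, and each rule is assumed to return a region name when its preset condition is met and `None` otherwise.

```python
def classify_cascade(manuscript, rules):
    """Apply rules in order and return the first region that a rule accepts.

    Each rule is a callable taking the manuscript and returning a region
    name, or None when its score does not meet its preset condition.
    """
    for rule in rules:
        region = rule(manuscript)
        if region is not None:
            return region
    return None  # no rule was confident enough to assign a region
```

In practice the list would hold a title rule, a body rule, and a model-based rule in that order.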
19. The method of claim 18, performing a target category matching on the news body content to obtain a second matching result, comprising:
performing area name matching on the news text content to obtain at least one area name;
calculating a second score for the second match, comprising:
for any one of the at least one region name, respectively executing:
determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
taking the product of the basic score and the times as a second initial score corresponding to the any region name;
determining the maximum value of all the second initial scores and the frequency of occurrence of a target region name in the news text content, wherein the target region name is the region name corresponding to the maximum value;
subtracting the times corresponding to each of the remaining area names from the times of the target area name appearing in the news text content to obtain a value as the second score;
wherein: the remaining area names are area names except the area name corresponding to the maximum value in the at least one area name;
dividing the news manuscript into target categories corresponding to the second matching results, including:
and dividing the news manuscript into the category of the region name corresponding to the maximum second initial score in all the second initial scores.
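A minimal sketch of the second-score computation in claim 19 is given below. It assumes that "subtracting the times corresponding to each of the remaining area names" means subtracting all of their occurrence counts in turn, which is equivalent to subtracting their sum; the function name `second_score` and the base scores passed in are hypothetical.

```python
def second_score(body, base_scores):
    """Return (target_region, score) for the news text content.

    target_region is the name with the largest second initial score
    (base score times occurrence count). score is the target's count
    minus the counts of all remaining matched region names.
    Returns (None, None) if no region name appears in the body.
    """
    counts = {}
    for name in base_scores:
        count = body.count(name)  # occurrences in the text content
        if count:
            counts[name] = count
    if not counts:
        return None, None
    initial = {name: base_scores[name] * c for name, c in counts.items()}
    target = max(initial, key=initial.get)
    others = sum(c for name, c in counts.items() if name != target)
    return target, counts[target] - others
```

The returned score would then be checked against the second preset condition before the manuscript is assigned to the target region.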
20. The method of claim 18, training a classification model based on the news manuscript and corresponding target category, comprising:
acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
obtaining the classification model based on the training corpus.
21. The method of claim 20, obtaining the classification model based on the training corpus, comprising:
extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
according to the corresponding manuscript attributes and the keywords, coding each news manuscript into a feature vector;
carrying out feature selection and feature combination on the training corpus coded into the feature vector;
and training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
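The keyword-extraction step of claim 21 can be illustrated with a small pure-Python TF-IDF sketch. It covers only that step, not feature encoding, feature selection, or the multi-class logistic model. The function name `tfidf_keywords`, the pre-tokenized input, and the particular TF and IDF formulas (relative term frequency and `log(N / df)`) are assumptions; the claim does not specify a variant.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k TF-IDF keywords for each tokenized document.

    docs is a list of token lists. TF is the relative frequency of a
    token in its document; IDF is log(N / df), where df is the number
    of documents containing the token.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    keywords = []
    for tokens in docs:
        term_freq = Counter(tokens)
        total = len(tokens)
        scores = {
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords.append(ranked[:top_k])
    return keywords
```

Tokens that occur in every document get an IDF of zero and therefore rank last, which is the usual reason for preferring TF-IDF over raw counts.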
22. The method of claim 17, classifying the news manuscript based on the classification model, comprising:
predicting the probability of the region of the news manuscript according to the classification model;
and when the probability is judged to be larger than the threshold value, taking the news manuscript as the news manuscript of the region to which the news manuscript belongs.
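The thresholded prediction in claim 22 might look like the sketch below. It assumes the classification model yields one raw score per region that is turned into probabilities with a softmax, as a multi-class logistic model would; the function name `predict_region` and the default threshold of 0.5 are hypothetical, since the claim does not state a value.

```python
import math


def predict_region(region_scores, threshold=0.5):
    """Return (region, probability), with region None below the threshold.

    region_scores maps region names to raw model scores. The scores are
    converted to probabilities with a numerically stable softmax, and the
    most probable region is accepted only if its probability is strictly
    greater than the threshold.
    """
    peak = max(region_scores.values())
    exps = {r: math.exp(s - peak) for r, s in region_scores.items()}
    total = sum(exps.values())
    region = max(exps, key=exps.get)
    probability = exps[region] / total
    if probability > threshold:
        return region, probability
    return None, probability
```

A manuscript for which no region clears the threshold is simply left without a regional label in this sketch.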
23. The method of claim 17, training a classification model based on the news manuscript and corresponding target category, comprising:
periodically training the classification model based on the news manuscript and the corresponding target category.
24. An apparatus for news classification, comprising:
the matching unit is used for carrying out target category matching on the news manuscript to obtain a matching result;
the calculating unit is used for calculating the score of the matching result;
the judging unit is used for judging whether the score meets a preset condition or not;
the classification unit is used for classifying the news manuscript into a target category corresponding to the matching result when the judgment unit judges that the score meets a preset condition;
the algorithm unit is used for training a classification model based on the news manuscript and the corresponding target category;
the classification unit is further configured to classify the news manuscript based on the classification model when the judgment unit judges that the score does not meet the preset condition;
the device also comprises an extraction unit, a news analysis unit and a control unit, wherein the extraction unit is used for extracting news titles of the news manuscripts;
the matching unit is specifically used for performing target category matching on the news headlines to obtain a first matching result;
the calculating unit is specifically configured to calculate a first score of the first matching result;
the classification unit is specifically configured to, when the judgment unit judges that the first score meets a first preset condition, classify the news manuscript into a target category corresponding to the first matching result;
when performing target category matching on the news headlines to obtain a first matching result, the matching unit is specifically configured for:
performing area name matching on the news headlines to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to any region name and the frequency of the region name appearing in the news title;
the product calculating unit is used for taking the product of the basic score and the times as a first initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value and a second largest value of all the first initial scores, where the second largest value is a first initial score that is smaller than the maximum value and is larger than all remaining first initial scores of all the first initial scores except the maximum value;
the determining unit is further configured to divide the maximum value by the second largest value to obtain a ratio, which is used as the first score;
the classification unit is specifically configured to: and dividing the news manuscript into the category of the region name corresponding to the maximum first initial score in all the first initial scores.
25. The apparatus of claim 24, wherein the extraction unit is further configured to extract the news text content of the news manuscript;
the matching unit is also used for carrying out target category matching on the news text content to obtain a second matching result;
the calculating unit is further used for calculating a second score of the second matching result;
the judging unit is further configured to judge whether the second score meets a second preset condition;
the classification unit is further configured to classify the news article into a target category corresponding to the second matching result when the judgment unit judges that the second score meets a second preset condition.
26. The apparatus according to claim 25, wherein, when performing target category matching on the news text content to obtain a second matching result, the matching unit is specifically configured for:
performing area name matching on the news text content to obtain at least one area name;
the calculation unit includes a determination unit and a product calculation unit, wherein:
the determining unit is configured to, for any one of the at least one region name, respectively perform: determining a basic score corresponding to the any region name and the frequency of the any region name appearing in the news text content;
the product calculating unit is used for taking the product of the basic score and the times as a second initial score corresponding to the any region name;
the determining unit is further configured to determine a maximum value of all the second initial scores and a number of times that a target area name appears in the news text content, where the target area name is an area name corresponding to the maximum value;
the calculating unit is further configured to use, as the second score, the value obtained by subtracting the number of occurrences of each of the remaining area names from the number of times that the target area name appears in the news text content;
wherein the remaining area names are area names of the at least one area name except the area name corresponding to the maximum value;
the classification unit is specifically configured to: and dividing the news manuscript into the category of the region name corresponding to the maximum second initial score in all the second initial scores.
27. The apparatus of claim 25, the algorithm unit comprising an acquisition unit and a training unit, wherein:
the acquiring unit is used for acquiring a training corpus, wherein the training corpus comprises the news manuscript and the area name corresponding to the news manuscript when the first score meets a first preset condition, and/or the news manuscript and the area name corresponding to the news manuscript when the second score meets a second preset condition; and
the training unit is used for obtaining the classification model based on the training corpus.
28. The apparatus of claim 27, the algorithm unit further comprising an encoding unit and a feature processing unit, wherein:
the extraction unit is also used for extracting keywords from each news manuscript in the training corpus by adopting a vector space model and the term frequency-inverse document frequency (TF-IDF) algorithm;
the encoding unit is used for encoding each news manuscript into a characteristic vector according to the corresponding manuscript attribute and the corresponding keyword;
the feature processing unit is used for performing feature selection and feature combination on the training corpus coded into the feature vectors;
the training unit is further used for training the training corpus subjected to feature selection and feature combination by adopting a multi-classification logistic model to obtain the classification model.
29. The apparatus according to claim 24, the algorithm unit being configured to predict, based on the classification model, the probability of the region to which the news manuscript belongs; and, when the probability is judged to be greater than a threshold, to take the news manuscript as a news manuscript of that region.
30. The apparatus of claim 24, the algorithm unit being specifically configured to periodically train the classification model based on the news manuscript and the corresponding target category.
CN201610115723.5A 2016-03-01 2016-03-01 A kind of method and apparatus of news category Active CN105760526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610115723.5A CN105760526B (en) 2016-03-01 2016-03-01 A kind of method and apparatus of news category

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610115723.5A CN105760526B (en) 2016-03-01 2016-03-01 A kind of method and apparatus of news category

Publications (2)

Publication Number Publication Date
CN105760526A CN105760526A (en) 2016-07-13
CN105760526B true CN105760526B (en) 2019-05-07

Family

ID=56332195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610115723.5A Active CN105760526B (en) 2016-03-01 2016-03-01 A kind of method and apparatus of news category

Country Status (1)

Country Link
CN (1) CN105760526B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202057B (en) * 2016-08-30 2019-07-12 东软集团股份有限公司 The recognition methods of similar news information and device
CN108090099B (en) * 2016-11-22 2022-02-25 科大讯飞股份有限公司 Text processing method and device
CN106503266A (en) * 2016-11-30 2017-03-15 政和科技股份有限公司 Document Classification Method and device
CN109816134B (en) * 2017-11-22 2021-07-20 北京京东尚科信息技术有限公司 Method and device for predicting delivery address and storage medium
CN107889068A (en) * 2017-12-11 2018-04-06 成都欧督系统科技有限公司 Message broadcast controlling method based on radio communication
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 A kind of method, apparatus and electronic equipment of article content classification
CN110674290B (en) * 2019-08-09 2023-03-10 国家计算机网络与信息安全管理中心 Relationship prediction method, device and storage medium for overlapping community discovery
CN110750697B (en) * 2019-10-30 2022-07-29 汉海信息技术(上海)有限公司 Merchant classification method, device, equipment and storage medium
CN111209390B (en) * 2020-01-06 2023-09-05 新方正控股发展有限责任公司 News display method and system and computer readable storage medium
CN111324735A (en) * 2020-02-20 2020-06-23 湖南芒果听见科技有限公司 Method and terminal for automatically classifying hourly essentials

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis
CN104346411A (en) * 2013-08-09 2015-02-11 北大方正集团有限公司 Method and equipment for clustering multiple manuscripts
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN104881458A (en) * 2015-05-22 2015-09-02 国家计算机网络与信息安全管理中心 Labeling method and device for web page topics

Also Published As

Publication number Publication date
CN105760526A (en) 2016-07-13

Similar Documents

Publication Publication Date Title
CN105760526B (en) A kind of method and apparatus of news category
CN108509474B (en) Synonym expansion method and device for search information
CN107862027B (en) Retrieve intension recognizing method, device, electronic equipment and readable storage medium storing program for executing
CN106156204B (en) Text label extraction method and device
CN106547871B (en) Neural network-based search result recall method and device
Qian et al. Social event classification via boosted multimodal supervised latent dirichlet allocation
CN102799647B (en) Method and device for webpage reduplication deletion
US8819024B1 (en) Learning category classifiers for a video corpus
US9087297B1 (en) Accurate video concept recognition via classifier combination
WO2017101342A1 (en) Sentiment classification method and apparatus
US20220318275A1 (en) Search method, electronic device and storage medium
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN110390094B (en) Method, electronic device and computer program product for classifying documents
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
CN103699625A (en) Method and device for retrieving based on keyword
CN111291177A (en) Information processing method and device and computer storage medium
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN112559747B (en) Event classification processing method, device, electronic equipment and storage medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
US20240168984A1 (en) Method of retrieving document and apparatus for retrieving document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant