
CN117391071B - News topic data mining method, device and storage medium - Google Patents

Info

Publication number
CN117391071B
Authority
CN
China
Prior art keywords
event
news
vector
trend
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311639781.4A
Other languages
Chinese (zh)
Other versions
CN117391071A (en)
Inventor
谢红韬
袁公萍
陈林翠
张瑶
严增勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd filed Critical CETC Big Data Research Institute Co Ltd
Priority to CN202311639781.4A priority Critical patent/CN117391071B/en
Publication of CN117391071A publication Critical patent/CN117391071A/en
Application granted granted Critical
Publication of CN117391071B publication Critical patent/CN117391071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a news topic data mining method, a news topic data mining device and a storage medium. The news topic data mining method comprises the following steps: collecting time sequence data of news manuscript quantity, and dividing the time sequence data through a preconfigured time window; converting the time sequence data into a one-dimensional vector based on the time scale of the time window; calculating a first-order differential vector of the one-dimensional vector; traversing the first-order differential vector through a sign function to generate a trend vector; traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a preconfigured correction rule; performing first-order differential calculation on the corrected trend vector to obtain a second-order differential value; dividing the time sequence data into a plurality of independent event groups according to the second-order differential value; acquiring text data of all news in each event group; converting the text data into TF-IDF vectors; performing density-based text clustering on the TF-IDF vectors to obtain a plurality of event news groups; and analyzing word frequency and part of speech through an NLP tool to generate a corresponding event title.

Description

News topic data mining method, device and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a news topic data mining method, a news topic data mining device, and a storage medium.
Background
In the information age today, public opinion analysis and news event discovery are increasingly important. With the rapid development of big data technology, processing huge news data and extracting valuable information therefrom has become a complex and vital task. This not only helps to understand social dynamics and public concerns, but also provides rapid and accurate decision support for governments, businesses and individuals.
However, conventional news event discovery methods have a series of problems that restrict their effectiveness and practicality in today's complex information environment. First, these methods often have difficulty accurately grasping the context of an event's development, and cannot capture the complete process of an event from initial occurrence through gradual warming to dissipation. Second, redundant events are a serious problem: when processing massive news data, traditional methods tend to generate a large number of similar events, which greatly hinders users in screening and acquiring key information. Finally, timeliness is also a significant problem; conventional approaches tend to fall short, particularly in situations where real-time feedback and decision making are required.
Conventional news event discovery methods are based on manual summarization or simple clustering algorithms. Because of their low accuracy and long processing time, they cannot cope well with the diversity and rapid change of modern social information. When processing massive amounts of news data, these methods often have difficulty meeting users' need for precise key information, and often do not provide an effective means for a deep understanding of the event development process.
Therefore, an innovative news event discovery method is needed to solve the above problems, improve accuracy, reduce redundant events, and enhance timeliness, so as to better adapt to the needs of the information age. The method aims at providing a brand new and efficient news event discovery algorithm by integrating key steps such as time sequence data analysis and text clustering so as to meet urgent requirements of modern society on accurate and real-time news information.
Disclosure of Invention
In order to solve the technical problems, the application provides a news topic data mining method, a news topic data mining device and a storage medium.
The following describes the technical solutions provided in the present application:
the first aspect of the application provides a news topic data mining method, which comprises the following steps:
collecting time sequence data of news manuscript sending quantity, and dividing the time sequence data through a preconfigured time window;
converting the time sequence data into a one-dimensional vector based on the time scale of the time window;
calculating a first-order differential vector of the one-dimensional vector;
traversing the first-order differential vector through a sign function to generate a trend vector;
traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a pre-configured correction rule;
performing first-order differential calculation on the corrected trend vector to obtain a second-order differential value;
dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
for each event group, acquiring text data of all news in the event group;
converting the text data into TF-IDF vectors by a feature extraction method;
performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
and carrying out word frequency part of speech analysis on each event news group through an NLP tool to generate a corresponding event title.
Optionally, the converting the text data into TF-IDF vectors by the feature extraction method includes:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
the inverse document frequency in the text data is calculated by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm, and IDF(t, D) represents the inverse document frequency of the given word t;
the TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t, d, D) represents the TF-IDF value of the given word t in the text data d; the values of all words in d form the TF-IDF vector of d.
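The three formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the texts have already been segmented into word lists, and the function name `tf_idf` and the sample documents are hypothetical, not taken from the application.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of word lists (already segmented).
    Returns one {word: TF-IDF value} dict per document."""
    n_docs = len(docs)  # total_documents(D)
    # documents_containing_term(t, D): number of documents containing each word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)    # count(t, d)
        total_terms = len(doc)   # total_terms(d)
        vectors.append({
            word: (count / total_terms) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

# Hypothetical event group with three short texts
docs = [["flood", "rescue", "city"], ["flood", "warning"], ["election", "city"]]
vectors = tf_idf(docs)
```

A word that occurs in every text of the event group gets IDF = log(1) = 0, so it contributes nothing to the vector, which is the intended effect of the inverse document frequency.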
Optionally, the performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups includes:
step one: determining a sample set D = (x1, x2, ..., xm), neighborhood parameters (ε, MinPts), and a sample distance measure, wherein ε represents the neighborhood distance threshold, MinPts represents the minimum number of sample points that the neighborhood of a point should contain, and the sample set is a set of a plurality of TF-IDF vectors;
step two: calculating the ε-neighborhood sub-sample set Nε(xj) of each sample xj based on the sample distance measure, wherein the ε-neighborhood represents a circular area centered on the sample xj with radius ε, the sample distance measure is used for determining the distance between samples, and the ε-neighborhood sub-sample set Nε(xj) contains all other samples whose distance from the sample xj does not exceed ε;
step three: comparing the number of samples |Nε(xj)| in the ε-neighborhood sub-sample set Nε(xj) with MinPts, and adding each sample xj for which |Nε(xj)| is greater than MinPts to the core object sample set Ω;
step four: when the core object sample set Ω is not empty, randomly selecting a core object o from the core object sample set Ω, and executing the following algorithm:
initializing the current cluster core object queue Ωcur = {o};
updating the class sequence number k = k + 1;
initializing the current cluster sample set Ck = {o};
updating the unvisited sample set Γ = Γ - {o};
step five: if the current cluster core object queue Ωcur is empty, the generation of the current cluster Ck is finished; after the cluster Ck is generated, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω - Ck;
step six: if the current cluster core object queue Ωcur is not empty, the following algorithm is performed:
taking out a core object o' from the current cluster core object queue Ωcur;
determining the ε-neighborhood sub-sample set Nε(o') by the neighborhood distance threshold ε;
letting Δ = Nε(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ - Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) - {o'};
returning to step five;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, resulting in a plurality of event news groups.
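Steps one to seven can be sketched as follows. The sketch makes several assumptions that are not stated in the application: samples are dense numeric tuples, the distance measure is Euclidean, and the random choice in step four is replaced by taking the smallest index so that the result is reproducible. It reads the steps literally (a neighborhood excludes the sample itself, and a core object needs strictly more than MinPts neighbors); many DBSCAN implementations instead count the sample itself and use "at least MinPts". The function name `density_cluster` is hypothetical.

```python
import math

def density_cluster(samples, eps, min_pts):
    """Return clusters as sorted lists of sample indices; unclustered samples are noise."""
    m = len(samples)
    # Step two: neighborhood of each sample (other samples within distance eps)
    neighbors = [
        {j for j in range(m)
         if j != i and math.dist(samples[i], samples[j]) <= eps}
        for i in range(m)
    ]
    # Step three: core objects have more than min_pts neighbors
    core = {i for i in range(m) if len(neighbors[i]) > min_pts}
    remaining_core = set(core)   # core object sample set
    unvisited = set(range(m))    # unvisited sample set
    clusters = []                # cluster partition
    while remaining_core:        # step four
        o = min(remaining_core)  # deterministic stand-in for the random pick
        queue = [o]              # current cluster core object queue
        cluster = {o}            # current cluster sample set
        unvisited.discard(o)
        while queue:             # step six
            o_prime = queue.pop(0)
            delta = neighbors[o_prime] & unvisited
            cluster |= delta
            unvisited -= delta
            queue.extend(sorted(delta & core))  # newly reached core objects
        clusters.append(sorted(cluster))        # step five
        remaining_core -= cluster
    return clusters              # step seven

# Two tight groups and one isolated point (hypothetical data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
groups = density_cluster(points, eps=1.5, min_pts=1)
```

In this toy input the isolated point at (50, 50) has no neighbors, so it never becomes a core object and is left out of every cluster.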
Optionally, dividing the time series data into a plurality of independent event groups according to the second-order differential value includes:
identifying peaks and troughs in the time sequence data according to the second-order differential value;
the time series data is divided into a plurality of independent event clusters based on the peaks and valleys.
Optionally, the traversing from the tail of the trend vector, and correcting the zero value in the trend vector according to a preconfigured correction rule includes:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1, if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = -1, if Trend(i) = 0 and Trend(i+1) < 0;
where Trend(i) represents the i-th element of the trend vector, the elements being visited from the tail.
Optionally, the performing word frequency part of speech analysis on each event news group through the NLP tool, and generating the corresponding event title includes:
counting the word frequencies of the nouns, prepositions and verbs contained in the titles of each event news group, and determining the noun, the preposition and the verb with the highest word frequency;
the sum of the highest word frequencies of the three parts of speech is calculated by the following equation:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the sum of the highest word frequencies, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
traversing the title of each event news group and generating event titles based on the keyword arrays.
Optionally, the traversing each event news group and generating the event title based on the keyword array includes:
calculating, for each title in the event news group, its inclusion degree with respect to the keyword array and its word count;
the title with the largest inclusion degree and the smallest word count is taken as the event title.
Optionally, the calculating the inclusion degree of each event news group to the keyword array includes:
for each title, calculating the ratio of the number of keywords contained in the title to the total number of keywords in the keyword array.
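The title selection described above can be sketched as follows. The sketch assumes the titles are already segmented into word lists and that a non-empty keyword array has been determined beforehand; it also assumes that "largest inclusion degree and smallest word count" means ranking by inclusion degree first and breaking ties by the shorter title. The function name `pick_event_title` and the sample data are hypothetical.

```python
def pick_event_title(titles, keywords):
    """titles: list of word lists; keywords: non-empty keyword array.
    Returns the word list of the selected title."""
    keyword_set = set(keywords)

    def inclusion(title):
        # share of the keyword array that appears in this title
        return len(keyword_set & set(title)) / len(keyword_set)

    # highest inclusion degree first; among ties, the title with fewer words
    return max(titles, key=lambda title: (inclusion(title), -len(title)))

titles = [
    ["city", "flood", "rescue", "continues", "overnight"],
    ["flood", "rescue", "continues"],
    ["mayor", "visits", "flood", "zone"],
]
keywords = ["flood", "rescue", "continues"]
best = pick_event_title(titles, keywords)
```

Here the first two titles both contain all three keywords, so the shorter one is selected.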
A second aspect of the present application provides a news topic data mining apparatus, including:
the acquisition unit is used for acquiring time sequence data of news manuscript sending quantity and dividing the time sequence data through a preconfigured time window;
A conversion unit for converting the time series data into a one-dimensional vector based on the time scale of the time window;
a first-order calculation unit for calculating a first-order difference vector of the one-dimensional vector;
the trend vector generation unit is used for traversing the first-order difference vector through a sign function to generate a trend vector;
the correcting unit is used for traversing from the tail part of the trend vector and correcting zero values in the trend vector according to a preset correcting rule;
the second-order computing unit is used for carrying out first-order difference computation on the corrected trend vector to obtain a second-order difference value;
the event group dividing unit is used for dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
a text data obtaining unit, configured to obtain, for each event group, text data of all news in the event group;
a vector conversion unit for converting the text data into TF-IDF vectors by a feature extraction method;
the clustering unit is used for carrying out text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
the event title generation unit is used for carrying out word frequency part of speech analysis on each event news group through the NLP tool to generate a corresponding event title.
A third aspect of the present application provides a news topic data mining apparatus, the apparatus comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program that the processor invokes to perform the method of the first aspect or of any optional implementation of the first aspect.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a program which, when executed on a computer, performs the method of the first aspect or of any optional implementation of the first aspect.
From the above technical scheme, the application has the following advantages:
1. By adopting time sequence data mining, the change trend of the news manuscript quantity can be captured better; with the regularity of event manuscript sending as a guide, event groups can be extracted rapidly and accurately, so that the evolution process of news topics can be understood more comprehensively.
2. The trend vector is generated through the first-order difference and the sign function, so that the change trend of the news manuscript quantity can be reflected more clearly, and key time points such as occurrence, burst and dissipation can be identified.
3. A pre-configured correction rule is introduced to correct zero values in the trend vector. This helps to more accurately identify the start and end of the trend, improving the accuracy of event discovery.
4. The time sequence data is divided into a plurality of independent event groups by utilizing the second-order differential value, so that different events are distinguished more carefully, and the discovered events are more specific and targeted.
5. Through TF-IDF vector and text clustering based on density, text data in the event group is effectively extracted and clustered, and events are better organized and understood.
6. Word frequency and part-of-speech analysis is performed with an NLP tool, so that key information is extracted from the text and a representative event title is generated.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of one embodiment of a method for news topic data mining provided in the present application;
FIG. 2 is a schematic diagram of timing data in the present application;
FIG. 3 is a schematic diagram of identifying peaks and troughs of time series data in the present application;
FIG. 4 is a schematic diagram of event clusters based on peak-trough partitioning in the present application;
FIG. 5 is a schematic structural diagram of one embodiment of a news topic data mining apparatus provided in the present application;
FIG. 6 is a schematic structural diagram of another embodiment of the news topic data mining apparatus provided in the present application.
Detailed Description
It should be noted that the method provided in the present application may be applied to a terminal or a system, and may also be applied to a server. For example, the terminal may be a smart phone, a computer, a tablet computer, a smart television, a smart watch, a portable computer terminal, or a fixed terminal such as a desktop computer. For convenience of explanation, the terminal is taken as the execution body in the present application.
Referring to fig. 1, the present application first provides an embodiment of a news topic data mining method, which includes:
S101, collecting time sequence data of news manuscript quantity, and dividing the time sequence data through a preconfigured time window;
In this step, time series data of the news release amount, that is, the number of news releases recorded in time order, is first collected. Then, the time series data is divided by a preset time window, and the whole time series is cut into a plurality of small time window segments for subsequent processing.
In this step, time sequence data of the news manuscript quantity is acquired with a data acquisition tool or platform. This may include retrieving relevant information from news websites, social media platforms, or other data sources. The collected data should contain the timestamp of each news release and the corresponding manuscript quantity. The size of the time window and the sliding step are preset: the window size is the time range covered by each window, and the sliding step is the time interval between windows. In this embodiment, the time interval may be in units of hours.
Dividing the whole time sequence data according to a preset time window. The entire time series data can be segmented by moving one step at a time in a sliding window manner. The data within each time window forms a sub-sequence representing the news posting conditions within that window.
For each time window, the news manuscript amount in the window can be counted, and the data are arranged according to the time sequence to form a one-dimensional vector. Each vector element represents a contribution amount within a respective time window. And traversing all time windows to obtain a one-dimensional vector of the whole time sequence data. This vector may reflect the trend of news manuscript amount over time.
FIG. 2 is a schematic diagram of the time sequence data.
S102, converting the time sequence data into a one-dimensional vector based on the time scale of the time window;
In this step, the time series data is converted into a one-dimensional vector by arranging the news manuscript amounts of the time windows in order. Each element of the one-dimensional vector corresponds to a time window, and its value represents the amount of news contribution within that time window.
The goal of converting the time series data into a one-dimensional vector is to provide a more convenient form of data for subsequent analysis and processing. Each time window of the time sequence data is traversed in turn, and the news manuscript quantity within the window is counted. The counted quantity may be the number of news items, the click-through count, or another suitable indicator describing news activity. A one-dimensional vector is then constructed with the manuscript quantity of each time window as an element. The elements should be arranged in the order of the time windows, i.e. in the time order of the time series data.
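Steps S101 and S102 can be sketched as follows. The sketch assumes non-overlapping one-hour windows (sliding step equal to the window size) and that each news item is represented only by its release timestamp; the function name `window_counts` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def window_counts(timestamps, start, n_windows, window=timedelta(hours=1)):
    """Count news items per fixed window and return the counts in time order."""
    counts = Counter()
    for ts in timestamps:
        index = (ts - start) // window  # index of the window containing ts
        if 0 <= index < n_windows:      # ignore items outside the covered range
            counts[index] += 1
    return [counts[i] for i in range(n_windows)]

start = datetime(2023, 12, 1, 0, 0)
timestamps = [
    datetime(2023, 12, 1, 0, 5),
    datetime(2023, 12, 1, 0, 40),
    datetime(2023, 12, 1, 2, 15),
    datetime(2023, 12, 1, 3, 59),
]
X = window_counts(timestamps, start, n_windows=4)
```

Windows without any news item appear as explicit zeros, which keeps the vector aligned with the time axis for the later difference steps.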
S103, calculating a first-order differential vector of the one-dimensional vector;
In this step, the first-order difference refers to the difference between adjacent elements. In this embodiment, the first-order difference operation is applied to the one-dimensional vector to calculate the change of the news manuscript quantity between adjacent time windows. This can be achieved by traversing the one-dimensional vector and calculating the difference between each pair of adjacent elements.
The following example illustrates this:
Given a one-dimensional vector X with elements X_1, X_2, …, X_n, where n is the length of the vector.
The first-order differential vector Diff is defined as:
Diff_i=X_{i+1}-X_i
where Diff is the first-order differential vector, i is the index of the vector, and 1 ≤ i ≤ n-1. Diff_i represents the difference between the (i+1)-th element and the i-th element of the original vector.
In mathematical notation:
Diff=[Diff_1,Diff_2,…,Diff_{n-1}]
This vector Diff is the first-order difference vector of the one-dimensional vector X.
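As a small worked example of the definition, the following sketch computes the first-order difference of a hypothetical count vector. The code uses 0-based indices, whereas the formula above is 1-based; the function name `first_difference` is an illustrative choice.

```python
def first_difference(x):
    """Difference between each element and its predecessor (length n - 1)."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]

X = [3, 5, 5, 9, 4, 4, 2]   # hypothetical hourly manuscript counts
Diff = first_difference(X)  # [2, 0, 4, -5, 0, -2]
```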
S104, traversing the first-order differential vector through a sign function to generate a trend vector;
In this step, the signs of the elements of the first-order differential vector are extracted through a sign function to generate a trend vector. The trend vector reflects the trend of the news manuscript amount, namely occurrence, burst, dissipation and the like.
Definition of the sign function:
sign(x)={
1,if x>0
0,if x=0
-1,if x<0
}
first order differential vector:
Diff=[Diff_1,Diff_2,...,Diff_{n-1}]
trend vector:
Trend=[sign(Diff_1),sign(Diff_2),...,sign(Diff_{n-1})]
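The sign function and the resulting trend vector can be illustrated with a short sketch; the input values are hypothetical.

```python
def sign(x):
    """Return 1 for positive, 0 for zero and -1 for negative values."""
    return (x > 0) - (x < 0)

Diff = [2, 0, 4, -5, 0, -2]      # hypothetical first-order difference vector
Trend = [sign(d) for d in Diff]  # [1, 0, 1, -1, 0, -1]
```

The zeros in the result correspond to windows in which the manuscript quantity did not change.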
S105, traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a preset correction rule;
The goal of this step is to correct the trend vector to increase its accuracy: the vector is traversed from its tail and corrected according to the preconfigured correction rules. In the time-series data of news manuscript amount, the corrected trend vector captures the trend of event development more accurately.
In some cases, the raw data may contain some noise or fluctuations, resulting in transient zero values in the trend vector. By modifying these zero values, the trend vector can be smoothed to more closely match the actual contribution change trend.
The correction may increase the sensitivity of the trend vector to changes in the manuscript amount. If there are consecutive zero values in the trend vector, some subtle changes may be missed. By correction, the actual manuscript quantity change can be better captured, and the sensitivity of the algorithm is improved.
In some cases, the raw data may cause zero values to occur due to acquisition or processing uncertainties. Correction helps to reduce such errors and make the trend vector more accurately reflect the actual situation.
A specific modified embodiment is provided below:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1, if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = -1, if Trend(i) = 0 and Trend(i+1) < 0;
where Trend(i) represents the i-th element of the trend vector, the elements being visited from the tail.
In this embodiment, correction rules are pre-configured, primarily for zero values in the trend vector. The correction rules may include some of the following possible scenarios:
If a certain zero value in the trend vector is followed by a positive value, the zero value is corrected to a positive value.
If a certain zero value in the trend vector is followed by a negative value, the zero value is corrected to a negative value.
Depending on the actual situation, other correction rules may also need to be considered to ensure the rationality and validity of the correction.
For each zero value, the correction is performed according to a pre-configured correction rule. The corrected trend vector can more accurately reflect the change trend of the news manuscript quantity.
The following is an example of pseudo code for this step:
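# Walk from the second-to-last element toward the head (0-based indices);
# the last element has no successor and is left unchanged.
for i in range(len(Trend) - 2, -1, -1):
    if Trend[i] == 0:
        if Trend[i + 1] > 0:
            Trend[i] = 1
        elif Trend[i + 1] < 0:
            Trend[i] = -1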
In this example, it is assumed that a zero value in the trend vector followed by a positive value is corrected to a positive value, and a zero value followed by a negative value is corrected to a negative value.
S106, performing first-order difference calculation on the corrected trend vector to obtain a second-order difference value;
In this step, a first-order difference operation is performed on the corrected trend vector to calculate the second-order difference value. The first-order difference, already used in the previous steps, represents the change between adjacent time points; the second-order difference represents the change of the first-order difference.
An example of pseudo code is provided below:
Corrected trend vector:
Trend = [Trend_1, Trend_2, ..., Trend_n]
First-order differential operation:
FirstDiff = [Trend[i] - Trend[i-1] for i in range(1, n)]
Second-order differential operation:
SecondDiff = [FirstDiff[i] - FirstDiff[i-1] for i in range(1, n-1)]
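A short worked example may help to read the result. Following the wording of S106, the sketch below applies a single difference to a corrected trend vector; the name `trend_diff` and the input values are hypothetical. Apart from any uncorrected trailing zeros, a corrected trend vector contains only 1 and -1, so the difference takes the values -2 (the trend turns from rising to falling), 2 (from falling to rising) or 0 (no turn).

```python
def difference(values):
    """Difference between each element and its predecessor."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]

Trend = [1, 1, 1, -1, -1, -1, 1, 1]  # hypothetical corrected trend vector
trend_diff = difference(Trend)       # [0, 0, -2, 0, 0, 2, 0]
```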
S107, dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
In this step, the time series data is divided into a plurality of independent event groups according to the second-order differential value. The specific implementation can be carried out according to the following steps:
First, a partitioning rule on the second-order differential value needs to be defined to determine when to divide the time series data into a new event group. This may be achieved by setting a threshold, detecting peaks and valleys, etc.
Traversing the calculated second order difference value from beginning to end. And according to the dividing rule, finding the positions of the second-order differential values meeting the dividing condition, and dividing the time sequence data into different event groups at the positions. For each event group, the time points at which it starts and ends are determined, resulting in a time range for each individual event group.
The following is an example of pseudo code for this process:
# second-order difference values
SecondDiff = [SecondDiff_1, SecondDiff_2, ..., SecondDiff_{n-1}]
# division rule (example: start a new event when the second-order difference value exceeds a threshold)
threshold = 0.5
# initialize the event group list
event_groups = []
# traverse the second-order difference values
for i in range(len(SecondDiff)):
    if SecondDiff[i] > threshold:
        # according to the division rule, split off a new event
        event_group = {
            'start_time': i,  # event start time
            'end_time': i + 1,  # event end time
            'data': time_series[i:i + 1]  # event data
        }
        event_groups.append(event_group)
# obtain the time range and data of each event group
for event_group in event_groups:
    start_time = event_group['start_time']
    end_time = event_group['end_time']
    event_data = event_group['data']
    # the data of each event group can be further processed, stored, analyzed, etc.
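The threshold rule can also be read as a way of cutting the series into contiguous segments. The sketch below is one possible interpretation rather than the method of the application: it assumes that index i of the second-order values corresponds to position i + 1 of the series, and the function name `split_by_threshold` and the sample values are hypothetical.

```python
def split_by_threshold(series, second_diff, threshold=0.5):
    """Cut `series` at every position whose second-order value exceeds `threshold`."""
    # Assumption: second_diff[i] is aligned with series[i + 1].
    cut_points = [i + 1 for i, value in enumerate(second_diff) if value > threshold]
    bounds = [0] + cut_points + [len(series)]
    groups = []
    for start, end in zip(bounds, bounds[1:]):
        if end > start:
            groups.append({
                "start_time": start,        # index of the first time window
                "end_time": end,            # index one past the last time window
                "data": series[start:end],  # article counts in the segment
            })
    return groups

series = [3, 4, 9, 8, 2, 5, 11, 10]   # illustrative article counts per window
second_diff = [0, -2, 0, 2, 0, 0]     # illustrative second-order values
groups = split_by_threshold(series, second_diff)
for group in groups:
    print(group["start_time"], group["end_time"], group["data"])
```

Every time window ends up in exactly one segment, which keeps the later per-group news lookup simple.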
In another alternative implementation, the peaks and troughs of the second-order differential value can be identified, and the data can be divided according to them.
The specific steps are: identifying peaks and troughs in the time series data according to the second-order differential value; and dividing the time series data into a plurality of independent event groups based on the peaks and troughs.
In this alternative implementation, the time series data is divided at the peaks and troughs of the second-order differential value, which are identified by an algorithm or rule. One option is to find turning points in the second-order differential value, where it changes from positive to negative or from negative to positive; another is to find extreme points (maxima or minima), which may represent peaks or troughs. The time series data is then divided into a plurality of independent event groups according to the identified peaks and troughs. The time period between each peak and trough may be regarded as one event group, covering the emergence, burst, and dissipation of the news posting volume.
Determining the peaks and troughs by extreme points means taking local maxima as peaks and local minima as troughs. The following is one possible implementation:
First, the second-order differential values are traversed to find the positions whose value is larger than that of both adjacent points; these positions are the peaks.
Next, the second-order differential values are traversed to find the positions whose value is smaller than that of both adjacent points; these positions are the troughs.
The positions of the peaks and troughs are then merged and sorted in time order to obtain a peak-trough sequence.
Referring to fig. 3 and 4, fig. 3 is a schematic diagram of identified peaks and troughs, and fig. 4 is a schematic diagram of dividing a plurality of event groups based on the peaks and troughs.
The time series data is divided into a plurality of independent event groups according to the positions of the peaks and troughs. Each event group corresponds to one peak-to-trough time period and covers the emergence, burst, and dissipation of the news posting volume.
The following is an example of pseudo code for this process:
# second-order difference values
SecondDiff = [SecondDiff_1, SecondDiff_2, ..., SecondDiff_{n-1}]
# find peaks
peaks = [i for i in range(1, len(SecondDiff) - 1) if SecondDiff[i-1] < SecondDiff[i] > SecondDiff[i+1]]
# find troughs
valleys = [i for i in range(1, len(SecondDiff) - 1) if SecondDiff[i-1] > SecondDiff[i] < SecondDiff[i+1]]
# merge peaks and troughs and sort them
peaks_and_valleys = sorted(peaks + valleys)
# divide event groups according to peaks and troughs
event_groups = []
for i in range(1, len(peaks_and_valleys)):
    start_index = peaks_and_valleys[i-1]
    end_index = peaks_and_valleys[i]
    event_group = {
        'start_time': start_index,
        'end_time': end_index,
        'data': time_series[start_index:end_index]  # event data
    }
    event_groups.append(event_group)
# obtain the time range and data of each event group
for event_group in event_groups:
    start_time = event_group['start_time']
    end_time = event_group['end_time']
    event_data = event_group['data']
    # the data of each event group can be further processed, stored, analyzed, etc.
Determining peaks and troughs by extreme points and determining them by turning points each have their own advantages.
Advantages of determining peaks and troughs by extreme points:
Intuitiveness: extreme points generally correspond to significant changes in the data and are relatively easy to understand and interpret. A peak generally indicates that the data reaches its highest level within a certain period of time, while a trough indicates that the data bottoms out.
Stability: extreme points reflect the overall trend of the data to some extent, so the resulting peaks and troughs are generally stable with respect to the overall characteristics of the data.
Advantages of determining peaks and troughs by turning points:
Flexibility: turning points cope with transient changes in the data more flexibly than extrema alone. In some cases, short and sharp fluctuations may not stand out as extreme points but are more easily identified as turning points.
Adaptability: turning points adapt better to data of different shapes and distributions. In some cases the data has no significant extrema but does contain rapid changes in the news posting volume, which turning points capture better.
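The two detection options compared above can be placed side by side in a short sketch. It assumes that a turning point is the first index at which the sign of the second-order value differs from the previous non-zero sign; the function names and the sample values are illustrative only.

```python
def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def turning_points(values):
    """Indices where the sign flips relative to the previous non-zero value."""
    points = []
    last_sign = 0
    for i, value in enumerate(values):
        current = sign(value)
        if current == 0:
            continue  # zeros carry no direction and are skipped
        if last_sign != 0 and current != last_sign:
            points.append(i)
        last_sign = current
    return points

def extreme_points(values):
    """Indices of strict local maxima and minima, in time order."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
        or values[i - 1] > values[i] < values[i + 1]
    ]

second_diff = [2, 1, -1, -3, -1, 2, 0, -2]  # illustrative values
print(turning_points(second_diff))
print(extreme_points(second_diff))
```

On this input the two rules return different index sets, which illustrates why the choice between them affects where event groups are cut.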
S108, for each event group, acquiring text data of all news in the event group;
In this step, the text data of all news in each divided event group is acquired. The previously divided event groups are traversed one by one. The time range recorded for an event group determines which news belongs to it: a corresponding database or data source is queried for the news published within that time range, and text data, including headlines, body text, and so on, is extracted from the returned news.
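A minimal sketch of this lookup is given below. It assumes an in-memory list in place of the database, a fixed time window, and a half-open time range; the record fields (`title`, `body`, `published`) and the function name `news_in_group` are hypothetical.

```python
from datetime import datetime, timedelta

def news_in_group(news_items, series_start, window, group):
    """Return title plus body of every item published inside the group's time range."""
    # Window indices are converted to timestamps; the range is start <= t < end.
    start = series_start + group["start_time"] * window
    end = series_start + group["end_time"] * window
    return [
        item["title"] + "\n" + item["body"]
        for item in news_items
        if start <= item["published"] < end
    ]

series_start = datetime(2023, 12, 1)
window = timedelta(hours=1)
news_items = [
    {"title": "Flood hits city", "body": "Heavy rain overnight.",
     "published": datetime(2023, 12, 1, 2, 30)},
    {"title": "Rescue under way", "body": "Teams deployed.",
     "published": datetime(2023, 12, 1, 3, 10)},
    {"title": "Election results", "body": "Votes counted.",
     "published": datetime(2023, 12, 1, 9, 0)},
]
group = {"start_time": 2, "end_time": 5}
texts = news_in_group(news_items, series_start, window, group)
print(len(texts))
```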
S109, converting the text data into TF-IDF vectors through a feature extraction method;
in the step, feature extraction is carried out on the acquired text data, and a TF-IDF method is adopted to convert the text data into TF-IDF vectors. The TF-IDF vector is used to represent the characteristics of the text data reflecting the importance of each word in the text.
In implementing the conversion of text data into TF-IDF vectors, some common Natural Language Processing (NLP) libraries and tools may be used to simplify the task. The following is a specific implementation:
The acquired text data is first preprocessed, including removing stop words, punctuation marks, and special characters, and converting the text to lowercase. The text data is then segmented into words or lexical units, for which a word segmentation tool such as NLTK (Natural Language Toolkit) or spaCy may be used. The TF-IDF value of each word is calculated using the TF-IDF algorithm: TF (Term Frequency) indicates how frequently a word occurs in a text, IDF (Inverse Document Frequency) indicates the inverse document frequency of the word, and TF-IDF is the product of the two. The calculated TF-IDF values are combined into a vector in which each word corresponds to one dimension.
The TF-IDF conversion in this step can be implemented using the TfidfVectorizer class of the scikit-learn library, which encapsulates the TF-IDF calculation and vectorization process.
One specific TF-IDF vector generation implementation is provided below:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
Calculating an inverse document frequency in the text data by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm, and IDF(t, D) represents the inverse document frequency of the given word t;
the TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t, d, D) represents the TF-IDF value of the given word t in the text data d, and the values of all words together form the TF-IDF vector of d.
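The three formulas above can be sketched directly in pure Python. The sketch assumes that each text has already been segmented into a list of tokens; the function name `tf_idf_vectors` and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: list of token lists. Returns (vocabulary, one dense vector per document)."""
    vocabulary = sorted({term for doc in documents for term in doc})
    # documents_containing_term(t, D): number of documents in which each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    total_docs = len(documents)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total_terms = len(doc)
        vector = []
        for term in vocabulary:
            tf = counts[term] / total_terms if total_terms else 0.0
            idf = math.log(total_docs / doc_freq[term])  # natural logarithm
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors

docs = [
    ["flood", "city", "rescue"],
    ["flood", "warning"],
    ["election", "city", "vote", "vote"],
]
vocab, vectors = tf_idf_vectors(docs)
print(vocab)
print(vectors[2])
```

With this unsmoothed IDF, a word that appears in every text of the event group receives weight zero. Library implementations may differ: scikit-learn's TfidfVectorizer, for example, applies IDF smoothing and L2 normalization by default, so its values will not match this sketch exactly.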
In the news data mining processing scene of the scheme, the text data is converted into the TF-IDF vector, and the method has the following advantages:
the TF-IDF vector can effectively convert text information into numerical characteristics, and the important information of words in the text is reserved. This helps the machine learning algorithm understand and process the text data. TF-IDF takes into account the importance of each word in the text and helps emphasize key words by calculating the relative importance of each word in the text collection. In news data mining, this enables better capture of key information in news headlines and content. The TF-IDF vector is a sparse vector in which the vast majority of elements are zero. This sparsity helps reduce the need for storage and computing resources when processing large-scale text data. TF-IDF vectors are commonly used for text clustering and classification tasks. By using the TF-IDF vector, patterns, trends and topics can be found in mining news data, and effective classification and clustering of news can be achieved. For some words that frequently appear in the entire set of text but are not important in the particular text, the TF-IDF suppresses its effect by reducing its weight so that the model is more focused on those words that are more critical in the particular text. The use of TF-IDF vectors can be conveniently integrated with a variety of machine learning algorithms, including clustering, classification, and other text analysis tasks.
S110, performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
in this step, density-based text clustering is performed on the TF-IDF vectors to aggregate text data into a plurality of event news groups. Each event news group contains news that is similar in text feature space.
In the step of implementing density-based text clustering, it may be implemented as follows:
Each news text is represented by the TF-IDF method as a high-dimensional numerical vector, in which each dimension corresponds to a vocabulary item.
For the density-based text clustering, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm may be selected. The algorithm automatically finds clusters of arbitrary shape based on the density of data points.
The algorithm has two key parameters: a density radius (e) and a minimum number of data points (MinPts). The density radius determines the neighborhood of a point, while the minimum number of data points is the number of data points that the neighborhood must contain for the point to act as a core point and form a cluster.
With the TF-IDF vectors as input, text clustering is performed using the algorithm. According to the density of the texts in TF-IDF space, the algorithm aggregates similar texts into one cluster and identifies outliers (noise).
After execution of the algorithm, a plurality of text clusters will be obtained, each cluster representing an event news group. The text in these clusters is highly similar and may represent the same event or topic. Each event news group is analyzed for specific news content contained therein, for example, meaning of each cluster may be analyzed by keywords, topics, etc.
One specific example of obtaining multiple event news clusters through density-based clustering is provided below:
in this embodiment, performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups includes:
step one: determining a sample set D = (x1, x2, ..., xm), neighborhood parameters (e, MinPts), and a sample distance metric, wherein e represents a neighborhood distance threshold, MinPts represents the minimum number of sample points that the neighborhood of a point must contain, and the sample set is a set of a plurality of TF-IDF vectors;
step two: calculating the e-neighborhood sub-sample set Ne(Xj) of each sample Xj based on the sample distance metric;
step three: comparing the number of elements |Ne(Xj)| of the e-neighborhood sub-sample set Ne(Xj) with MinPts, and adding each sample Xj for which |Ne(Xj)| is larger than MinPts to a core object sample set Ω, wherein the e-neighborhood represents a circular area centered on the sample Xj with radius e, the sample distance metric is used to determine the distance between samples, and the e-neighborhood sub-sample set Ne(Xj) contains all other samples whose distance from the sample Xj does not exceed e;
The neighborhood is obtained as follows:
for each sample Xj, the distance between it and all other samples in the data set is calculated.
Taking the sample Xj as the center and e as the radius, all samples whose distance from Xj does not exceed e are found and put into Ne(Xj).
step four: when the core object sample set Ω is not empty, randomly selecting a core object o from the core object sample set Ω, and executing the following algorithm:
initializing a current cluster core object queue Ωcur = {o};
initializing a class sequence number k = k + 1;
initializing a current cluster sample set Ck = {o};
updating the unvisited sample set Γ = Γ - {o};
step five: if the current cluster core object queue Ωcur is empty, finishing the generation of the current cluster Ck; after generating the cluster Ck, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω - Ck;
step six: if the current cluster core object queue Ωcur is not empty, then the following algorithm is performed:
taking out a core object o' from the current cluster core object queue Ωcur;
determining the e-neighborhood sub-sample set Ne(o') by the neighborhood distance threshold e;
letting Δ = Ne(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ - Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) - {o'};
repeating step five;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, resulting in a plurality of event news groups.
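Steps one to seven can be sketched in pure Python as follows. This is an illustrative sketch, not a production implementation: Euclidean distance is assumed as the sample distance metric, the random choice of a core object is replaced by taking the smallest index so that the result is reproducible, and the core-object test follows the wording of step three (strictly more than MinPts other samples), whereas many DBSCAN descriptions count the point itself and use "at least MinPts".

```python
import math

def dbscan(samples, eps, min_pts):
    """Return (clusters, noise): clusters is a list of index sets, noise an index set."""
    def distance(a, b):
        # Euclidean distance is an assumption; the text leaves the metric open.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    m = len(samples)
    # Step two: e-neighborhood of every sample (all other samples within eps).
    neighbors = [
        {j for j in range(m) if j != i and distance(samples[i], samples[j]) <= eps}
        for i in range(m)
    ]
    # Step three: core objects have more than min_pts neighbors.
    core = {i for i in range(m) if len(neighbors[i]) > min_pts}

    unvisited = set(range(m))
    remaining_core = set(core)
    clusters = []
    while remaining_core:
        # Step four: seed a new cluster from a remaining core object.
        seed = min(remaining_core)
        queue = [seed]
        cluster = {seed}
        unvisited.discard(seed)
        # Steps five and six: expand until the core-object queue is empty.
        while queue:
            current = queue.pop(0)
            delta = neighbors[current] & unvisited
            cluster |= delta
            unvisited -= delta
            queue.extend(sorted(delta & core))
        clusters.append(cluster)
        remaining_core -= cluster
    # Step seven: samples never reached from a core object are noise.
    return clusters, unvisited

samples = [[0, 0], [0, 1], [1, 0], [1, 1],
           [10, 10], [10, 11], [11, 10], [11, 11],
           [50, 50]]
clusters, noise = dbscan(samples, eps=1.5, min_pts=2)
print(clusters, noise)
```

In practice the inputs would be the TF-IDF vectors of one event group, and a distance better suited to sparse text vectors, such as cosine distance, may be preferable.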
In this embodiment, the density-based clustering algorithm adapts to the distribution of the data without the number of clusters being specified in advance, so no prior knowledge of the data structure is needed and event news groups of different sizes can be handled flexibly. The algorithm is robust to noisy data and can mark outliers (points not belonging to any cluster) as noise. News data may contain some unrelated or abnormal news, and these noise points do not interfere with the generation of normal event news groups.
Unlike K-means and similar traditional algorithms, density-based clustering can form clusters of arbitrary shape, which suits event news groups with complex shapes and distributions.
The posting density of event news may be higher in some time periods and lower in others; with suitably chosen neighborhood parameters, the algorithm can capture this variation. The clustering results are relatively easy to interpret: each cluster represents an event news group, and the news within a cluster is relatively similar, which facilitates understanding and analysis.
S111, analyzing word frequency parts of speech of each event news group through an NLP tool, and generating corresponding event titles.
The news in each event news group undergoes word frequency and part-of-speech analysis with an NLP (natural language processing) tool. To perform this step, a suitable natural language processing tool that supports word frequency and part-of-speech analysis of text, such as NLTK (Natural Language Toolkit), spaCy, or Stanford NLP, is first selected.
Before the NLP tool is used, each news text is preprocessed, including removing stop words and punctuation marks and performing stemming or lemmatization, to reduce noise and redundant information in the text.
The news texts in each event news group are analyzed with the NLP tool, word frequencies are counted, and the keywords that occur most frequently are found. Word frequency information may be obtained by counting the number of occurrences of each word in the event news group. Part-of-speech analysis is then performed with the NLP tool to determine the part of speech of each word in the text. This helps in understanding the grammatical role of each word and in extracting key information such as nouns and verbs.
According to the word frequency and part-of-speech analysis results, frequently occurring keywords such as nouns and verbs are selected and combined with the context to construct a representative event title. The title may be generated according to a rule, such as forming the title from the several words with the highest word frequency, or according to an algorithm that produces a summarizing title. The generated event titles are associated with the corresponding event news groups to form the final results for subsequent analysis and display.
A specific embodiment for generating event titles is provided below, which includes:
counting the word frequency of nouns, prepositions and verbs contained in the titles of each event news group, and determining the nouns, prepositions and verbs with the highest word frequency;
the sum of the maximum word frequencies of the three word classes is calculated by the following equation:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the sum of the maximum word frequencies of the three word classes, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
Traversing the title of each event news group and generating event titles based on the keyword arrays.
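The threshold calculation can be sketched as follows. Two assumptions are made that the text does not state: word frequencies are treated as relative frequencies (count divided by the total number of tokens), so that 1 - Cnvp is not negative, and the tokens arrive already tagged with a part of speech. The tag names `noun`, `verb`, `prep` and the function name `keyword_array` are hypothetical.

```python
from collections import Counter

def keyword_array(tagged_tokens):
    """tagged_tokens: list of (word, part_of_speech). Returns (threshold, keywords)."""
    counts = Counter(word for word, _ in tagged_tokens)
    total = sum(counts.values())
    freq = {word: count / total for word, count in counts.items()}  # relative frequency
    pos_of = {word: pos for word, pos in tagged_tokens}

    def top(pos):
        # Highest word frequency among the words of one part of speech.
        return max((f for w, f in freq.items() if pos_of[w] == pos), default=0.0)

    c_nvp = top("noun") + top("verb") + top("prep")
    c_sum = sum(freq.values())
    threshold = ((1 - c_nvp) / c_sum) * (c_nvp / 3)
    keywords = sorted(w for w, f in freq.items() if f > threshold)
    return threshold, keywords

tokens = (
    [("flood", "noun")] * 6 + [("hits", "verb")] * 3 + [("in", "prep")] * 3
    + [("city", "noun")] * 2 + [("rescue", "noun")] * 2
    + [("late", "adj"), ("heavy", "adj"), ("rain", "noun"), ("falls", "verb")]
)
threshold, keywords = keyword_array(tokens)
print(threshold, keywords)
```

Here Cnvp is 0.3 + 0.15 + 0.15 = 0.6 and Csum is 1, so the threshold is 0.4 × 0.2 = 0.08; the words that occur only once (relative frequency 0.05) fall below it and are left out of the keyword array.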
According to the embodiment, automatic word frequency statistics and threshold calculation are carried out on the text data, so that automatic processing of event titles is realized, and the burden of manual operation is reduced. By considering the word frequency of keywords such as nouns, verbs, prepositions and the like and combining the calculation of the sum of the maximum word frequency, the key information in the title can be effectively extracted, and the representative event title can be generated. The threshold mode is adopted, so that the keyword selection has certain flexibility. The threshold value can be adjusted according to actual conditions so as to meet the requirements in different scenes.
The word frequency part of speech analysis is carried out by utilizing a Natural Language Processing (NLP) tool, so that the deep understanding of text data is enhanced, and the accuracy of title generation is improved. The method based on statistics and threshold is suitable for news events in different fields and topics, and has certain universality and adaptability.
The foregoing describes in detail an embodiment of a news topic data mining method provided in the present application, and the following describes in detail an embodiment of a news topic data mining apparatus provided in the present application:
Referring to fig. 4, the present application first provides an embodiment of a news topic data mining apparatus, including:
the acquisition unit 401 is configured to acquire time sequence data of a news manuscript amount, and divide the time sequence data through a preconfigured time window;
a conversion unit 402, configured to convert the time series data into a one-dimensional vector based on a time scale of the time window;
a first-order calculation unit 403 for calculating a first-order differential vector of the one-dimensional vector;
a trend vector generating unit 404, configured to traverse the first-order difference vector through a sign function, and generate a trend vector;
a correction unit 405, configured to traverse from the tail of the trend vector, and correct zero values in the trend vector according to a pre-configured correction rule;
a second-order computing unit 406, configured to perform a first-order difference computation on the corrected trend vector to obtain a second-order difference value;
an event group dividing unit 407 configured to divide the time-series data into a plurality of independent event groups according to the second-order differential value;
a text data obtaining unit 408, configured to obtain, for each event group, text data of all news in the event group;
A vector conversion unit 409 for converting the text data into TF-IDF vectors by a feature extraction method;
a clustering unit 410, configured to perform text clustering on the TF-IDF vector based on density, so as to obtain a plurality of event news groups;
the event title generating unit 411 is configured to perform word frequency part of speech analysis on each event news group through the NLP tool, and generate a corresponding event title.
Optionally, the vector conversion unit 409 is specifically configured to:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
the inverse document frequency in the text data is calculated by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm, and IDF(t, D) represents the inverse document frequency of the given word t;
The TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t, d, D) represents the TF-IDF value of the given word t in the text data d, and the values of all words together form the TF-IDF vector of d.
Optionally, the event group dividing unit 407 is specifically configured to:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1 if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = -1 if Trend(i) = 0 and Trend(i+1) < 0;
where Trend(i) represents the i-th element of the trend vector when traversing from the tail.
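The correction rule can be sketched as a single backward pass. The sketch assumes that Trend(i+1) is the element that follows Trend(i) in time, so that traversing from the tail lets each zero inherit an already corrected successor; the function name `correct_zeros` and the sample vector are illustrative.

```python
def correct_zeros(trend):
    """Replace zeros by the sign of the following element, scanning from the tail."""
    corrected = list(trend)
    # Start at the second-to-last element: the last one has no successor.
    for i in range(len(corrected) - 2, -1, -1):
        if corrected[i] == 0:
            if corrected[i + 1] > 0:
                corrected[i] = 1
            elif corrected[i + 1] < 0:
                corrected[i] = -1
    return corrected

trend = [1, 0, 0, -1, 0, 1, 0]
print(correct_zeros(trend))
```

Because the pass runs backwards, a run of consecutive zeros is filled with the sign of the first non-zero value after it; a zero in the final position is left unchanged under this reading.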
Optionally, the event title generating unit 411 is specifically configured to:
counting the word frequency of nouns, prepositions and verbs contained in the titles of each event news group, and determining the nouns, prepositions and verbs with the highest word frequency;
the sum of the maximum word frequencies of the three word classes is calculated by the following equation:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the sum of the maximum word frequencies of the three word classes, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
Traversing the title of each event news group and generating event titles based on the keyword arrays.
Optionally, the event title generating unit 411 is specifically configured to:
calculating, for the titles of each event news group, the inclusion degree with respect to the keyword array and the word count;
taking the title with the largest inclusion degree and the smallest word count as the event title.
Optionally, the event title generating unit 411 is specifically configured to:
for each title, the ratio of the number of keywords contained therein to the total number of keyword arrays is calculated.
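The selection rule can be sketched as below, assuming that titles are already segmented into token lists and that ties on inclusion degree are broken by the smaller word count. The function names `inclusion` and `pick_event_title` and the sample titles are illustrative.

```python
def inclusion(title_tokens, keywords):
    """Share of the keyword array that appears in the title."""
    keyword_set = set(keywords)
    if not keyword_set:
        return 0.0
    return len(keyword_set & set(title_tokens)) / len(keyword_set)

def pick_event_title(titles, keywords):
    """Choose the title with the largest inclusion degree, then the fewest words."""
    return max(titles, key=lambda tokens: (inclusion(tokens, keywords), -len(tokens)))

keywords = ["flood", "city", "rescue", "hits"]
titles = [
    ["major", "flood", "hits", "city", "overnight"],
    ["flood", "hits", "city"],
    ["rescue", "teams", "arrive"],
]
best = pick_event_title(titles, keywords)
print(" ".join(best))
```

The first two titles both cover three of the four keywords, so the shorter one is selected.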
Referring to fig. 6, the present application further provides a news topic data mining apparatus, including:
a processor 601, a memory 602, an input/output unit 603, and a bus 604;
the processor 601 is connected to the memory 602, the input-output unit 603, and the bus 604;
the memory 602 holds a program, which the processor 601 invokes to perform any of the methods described above.
The present application also relates to a computer readable storage medium having a program stored thereon, characterized in that the program, when run on a computer, causes the computer to perform any of the methods as above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Claims (10)

1. A news topic data mining method, the method comprising:
collecting time sequence data of news manuscript sending quantity, and dividing the time sequence data through a preconfigured time window;
converting the time sequence data into a one-dimensional vector based on the time scale of the time window;
Calculating a first-order differential vector of the one-dimensional vector;
traversing the first-order differential vector through a symbol function to generate a trend vector;
traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a pre-configured correction rule;
performing first-order differential calculation on the corrected trend vector to obtain a second-order differential value;
dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
for each event group, acquiring text data of all news in the event group;
converting the text data into TF-IDF vectors by a feature extraction method;
performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
for each event news group, performing word frequency part-of-speech analysis through an NLP tool to generate a corresponding event title;
performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups, wherein the steps comprise:
step one: determining a sample set D = (x1, x2, ..., xm), neighborhood parameters (e, MinPts), and a sample distance metric, wherein e represents a neighborhood distance threshold, MinPts represents the minimum number of sample points that the neighborhood of a point must contain, and the sample set is a set of a plurality of TF-IDF vectors;
step two: calculating the e-neighborhood sub-sample set Ne(Xj) of each sample Xj based on the sample distance metric, wherein the e-neighborhood represents a circular area centered on the sample Xj with radius e, the sample distance metric is used to determine the distance between samples, and the e-neighborhood sub-sample set Ne(Xj) contains all other samples whose distance from the sample Xj does not exceed e;
step three: comparing the number of elements |Ne(Xj)| of the e-neighborhood sub-sample set Ne(Xj) with MinPts, and adding each sample Xj for which |Ne(Xj)| is larger than MinPts to a core object sample set Ω;
step four: when the core object sample set Ω is not empty, randomly selecting a core object o from the core object sample set Ω, and executing the following algorithm:
initializing a current cluster core object queue Ωcur = {o};
initializing a class sequence number k = k + 1;
initializing a current cluster sample set Ck = {o};
updating the unvisited sample set Γ = Γ - {o};
step five: if the current cluster core object queue Ωcur is empty, finishing the generation of the current cluster Ck; after generating the cluster Ck, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω - Ck;
step six: if the current cluster core object queue Ωcur is not empty, then the following algorithm is performed:
taking out a core object o' from the current cluster core object queue Ωcur;
determining the e-neighborhood sub-sample set Ne(o') by the neighborhood distance threshold e;
letting Δ = Ne(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ - Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) - {o'};
repeating step five;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, resulting in a plurality of event news groups.
2. The news topic data mining method of claim 1, wherein the converting the text data into TF-IDF vectors by the feature extraction method includes:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
the inverse document frequency in the text data is calculated by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm operation, and IDF(t, D) represents the inverse document frequency of the given word t;
The TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
wherein TF-IDF(t, d, D) represents the TF-IDF vector component for the given word t in the text data d.
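The three formulas of this claim can be combined into a short Python sketch. It assumes the text data has already been segmented into word lists (the claim does not name a segmenter), and the function name `tf_idf_vectors` and the alphabetical vocabulary ordering are illustrative choices rather than anything stated in the patent.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of word lists for one event group. Returns (vocab, vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)  # total_documents(D)
    # documents_containing_term(t, D): count each word once per document.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)   # count(t, d)
        total = len(doc)        # total_terms(d)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors
```

Note that a word occurring in every text of the group gets IDF = log(1) = 0, so it contributes nothing to any vector under the formula as written.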
3. The news topic data mining method of claim 1, wherein dividing the time series data into a plurality of independent event groups based on the second order differential value includes:
identifying peaks and troughs in the time sequence data according to the second-order differential value;
dividing the time series data into a plurality of independent event groups based on the peaks and troughs.
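As a sketch of how peaks and troughs could be read off the second-order difference values: if the corrected trend vector holds only +1 and -1, its first-order difference takes values in {-2, 0, 2}, where -2 marks a change from rising to falling (a peak) and 2 marks a change from falling to rising (a trough). The claim does not state the exact splitting rule, so cutting the series at each trough is an assumption here, and both function names are hypothetical.

```python
def peaks_and_troughs(second_diff):
    """Return (peak_indices, trough_indices) as positions in the original series.

    second_diff[i] = trend[i + 1] - trend[i] and trend[i] describes the move
    from position i to position i + 1, so the turning point is position i + 1.
    """
    peaks = [i + 1 for i, d in enumerate(second_diff) if d == -2]
    troughs = [i + 1 for i, d in enumerate(second_diff) if d == 2]
    return peaks, troughs


def split_event_groups(series, second_diff):
    """Split the series into half-open index ranges, cutting at each trough (assumed rule)."""
    _, troughs = peaks_and_troughs(second_diff)
    bounds = [0] + troughs + [len(series)]
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if end > start]
```

For the series [1, 3, 2, 5, 4] the trend vector is [1, -1, 1, -1] and the second-order difference is [-2, 2, -2], which gives peaks at positions 1 and 3 and a trough at position 2.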
4. The news topic data mining method of claim 1, wherein traversing from the tail of the trend vector, correcting zero values in the trend vector according to a pre-configured correction rule includes:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1 if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = -1 if Trend(i) = 0 and Trend(i+1) < 0;
wherein Trend(i) represents the i-th element of the trend vector, the elements being visited starting from the tail.
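The correction rule can be sketched as below. Traversing from the tail matters because each zero copies the sign of its successor, so a whole run of zeros (a plateau in the series) inherits the sign of the first non-zero element to its right. The function name `correct_trend` is hypothetical, and leaving a trailing zero unchanged is an assumption, since the rule as written gives it no successor to copy from.

```python
def correct_trend(trend):
    """Replace zeros in a trend vector of -1/0/+1 values, scanning from the tail."""
    corrected = list(trend)
    # The last element has no successor, so the scan starts one before it.
    for i in range(len(corrected) - 2, -1, -1):
        if corrected[i] == 0:
            if corrected[i + 1] > 0:
                corrected[i] = 1
            elif corrected[i + 1] < 0:
                corrected[i] = -1
    return corrected
```

For example, `correct_trend([1, 0, 0, -1])` yields `[1, -1, -1, -1]`, so the plateau is treated as part of the descent.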
5. The news topic data mining method of claim 1, wherein the performing word frequency part of speech analysis on each event news group through the NLP tool to generate the corresponding event title includes:
Counting the word frequency of nouns, prepositions and verbs contained in the titles of each event news group, and determining the nouns, prepositions and verbs with the highest word frequency;
the sum of the highest word frequencies of the three parts of speech is calculated by the following formula:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the sum of the highest word frequencies, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all words whose word frequency is greater than the keyword word frequency threshold as keywords, and forming a keyword array from them;
traversing the titles of each event news group and generating the event title based on the keyword array.
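A minimal sketch of the threshold computation follows. The claim does not say whether word frequencies are raw counts or relative frequencies; the sketch defaults to relative frequencies (count divided by the total number of words), which is an assumption, and exposes a `normalize` flag for the other reading. The part-of-speech tags `"n"`, `"v"` and `"p"` and the function name `keyword_array` are likewise hypothetical; tagged input is expected from whatever NLP tool is in use.

```python
from collections import Counter


def keyword_array(tagged_words, normalize=True):
    """tagged_words: list of (word, pos) pairs taken from the titles of one group.

    Returns (threshold, keywords) using the formula
    C_threshold = ((1 - Cnvp) / Csum) * (Cnvp / 3).
    """
    tagged_words = list(tagged_words)
    if not tagged_words:
        return 0.0, []
    counts = Counter(word for word, _ in tagged_words)
    scale = sum(counts.values()) if normalize else 1
    freq = {word: count / scale for word, count in counts.items()}
    pos_of = dict(tagged_words)  # if a word has several tags, the last one wins

    def top(tag):
        # Highest frequency among words carrying the given part-of-speech tag.
        return max((f for w, f in freq.items() if pos_of[w] == tag), default=0.0)

    cnvp = top("n") + top("v") + top("p")   # Cn + Cv + Cp
    csum = sum(freq.values())               # Csum
    threshold = ((1 - cnvp) / csum) * (cnvp / 3)
    keywords = [w for w, f in freq.items() if f > threshold]
    return threshold, keywords
```

With raw counts (`normalize=False`) the factor (1 - Cnvp) is negative whenever Cnvp exceeds 1, so every word would pass the threshold; this is one reason the relative-frequency reading is used as the default here.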
6. The news topic data mining method of claim 5 wherein traversing each event news group and generating event titles based on the keyword array includes:
calculating, for each title in each event news group, its inclusion degree with respect to the keyword array and its word count;
taking the title with the largest inclusion degree and the smallest word count as the event title.
7. The news topic data mining method of claim 6, wherein said calculating the inclusion of each event news group into the keyword array includes:
for each title, calculating the ratio of the number of keywords contained in the title to the total number of keywords in the keyword array.
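Claims 6 and 7 together describe a simple selection rule, sketched below. Titles are assumed to be already segmented into word lists. The claim does not say how "largest inclusion degree" and "smallest word count" are combined, so the sketch ranks by inclusion degree first and uses word count only to break ties; that ordering and both function names are assumptions.

```python
def inclusion_degree(title_words, keywords):
    """Share of the keyword array that appears in the title."""
    keywords = set(keywords)
    if not keywords:
        return 0.0
    return len(keywords & set(title_words)) / len(keywords)


def pick_event_title(titles, keywords):
    """Pick the title with the highest inclusion degree, preferring fewer words on ties."""
    return max(titles, key=lambda t: (inclusion_degree(t, keywords), -len(t)))
```

`pick_event_title` raises `ValueError` on an empty list of titles, which callers would need to handle for empty event news groups.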
8. A news topic data mining apparatus, comprising:
the acquisition unit is used for acquiring time series data of the number of news articles published and dividing the time series data by a preconfigured time window;
a conversion unit for converting the time series data into a one-dimensional vector based on the time scale of the time window;
a first-order calculation unit for calculating a first-order difference vector of the one-dimensional vector;
the trend vector generation unit is used for traversing the first-order difference vector with a sign function to generate a trend vector;
the correcting unit is used for traversing from the tail part of the trend vector and correcting zero values in the trend vector according to a preset correcting rule;
the second-order computing unit is used for carrying out first-order difference computation on the corrected trend vector to obtain a second-order difference value;
the event group dividing unit is used for dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
a text data obtaining unit, configured to obtain, for each event group, text data of all news in the event group;
A vector conversion unit for converting the text data into TF-IDF vectors by a feature extraction method;
the clustering unit is used for carrying out text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
the event title generation unit is used for carrying out word frequency part-of-speech analysis on each event news group through the NLP tool to generate a corresponding event title;
the clustering unit is specifically configured to perform the following steps:
step one: determining a sample set D = (x1, x2, ..., xm), neighborhood parameters (e, MinPts), and a sample distance measure, wherein e represents a neighborhood distance threshold, MinPts represents the minimum number of sample points that the neighborhood of a point must contain, and the sample set is a set of a plurality of TF-IDF vectors;
step two: calculating the e-neighborhood sub-sample set Ne(xj) of each sample xj based on the sample distance measure, wherein the e-neighborhood represents a circular area centered on the sample xj with radius e, the sample distance measure is used for determining the distance between samples, and the e-neighborhood sub-sample set Ne(xj) contains all other samples whose distance from the sample xj does not exceed e;
step three: comparing the number of samples |Ne(xj)| in the e-neighborhood sub-sample set Ne(xj) with MinPts, and adding each sample xj for which |Ne(xj)| is larger than MinPts to a core object sample set Ω;
step four: when the core object sample set Ω is not empty, randomly selecting a core object o from the core object sample set Ω, and executing the following algorithm:
initializing a current cluster core object queue Ωcur = {o};
updating the class sequence number k = k + 1;
initializing a current cluster sample set Ck = {o};
updating the unvisited sample set Γ = Γ - {o};
step five: if the current cluster core object queue Ωcur is empty, finishing the generation of the current cluster Ck; after generating the cluster Ck, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω - Ck;
step six: if the current cluster core object queue Ωcur is not empty, then the following algorithm is performed:
taking out a core object o' from the current cluster core object queue Ωcur;
determining the e-neighborhood sub-sample set Ne(o') of o' by the neighborhood distance threshold e;
letting Δ = Ne(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ - Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) - {o'};
repeating step five;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, thereby obtaining a plurality of event news groups.
9. A news topic data mining apparatus, the apparatus comprising:
A processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program which the processor invokes to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium having a program stored thereon, which when executed on a computer performs the method of any of claims 1 to 7.
CN202311639781.4A 2023-12-04 2023-12-04 News topic data mining method, device and storage medium Active CN117391071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311639781.4A CN117391071B (en) 2023-12-04 2023-12-04 News topic data mining method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311639781.4A CN117391071B (en) 2023-12-04 2023-12-04 News topic data mining method, device and storage medium

Publications (2)

Publication Number Publication Date
CN117391071A CN117391071A (en) 2024-01-12
CN117391071B CN117391071B (en) 2024-02-27

Family

ID=89465162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311639781.4A Active CN117391071B (en) 2023-12-04 2023-12-04 News topic data mining method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117391071B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118332035B (en) * 2024-06-17 2024-08-27 湖北华中电力科技开发有限责任公司 Data processing method and system for power system

Citations (8)

Publication number Priority date Publication date Assignee Title
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN105335349A (en) * 2015-08-26 2016-02-17 天津大学 Time window based LDA microblog topic trend detection method and apparatus
CN113009906A (en) * 2021-03-04 2021-06-22 青岛弯弓信息技术有限公司 Big data prediction analysis method and system based on industrial Internet
CN113627788A (en) * 2021-08-10 2021-11-09 中国电信股份有限公司 Service policy determination method and device, electronic equipment and storage medium
CN113742464A (en) * 2021-07-28 2021-12-03 北京智谱华章科技有限公司 News event discovery algorithm and device based on heterogeneous information network
CN114648388A (en) * 2022-04-01 2022-06-21 左黎明 Big data analysis method and system for dealing with personalized service customization
CN115730589A (en) * 2022-11-04 2023-03-03 中电科大数据研究院有限公司 News propagation path generation method based on word vector and related device
WO2023134075A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Text topic generation method and apparatus based on artificial intelligence, device, and medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10157223B2 (en) * 2016-03-15 2018-12-18 Accenture Global Solutions Limited Identifying trends associated with topics from natural language text

Non-Patent Citations (4)

Title
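Multi-dimension topic mining based on hierarchical semantic graph model; Zhang Tingting et al.; IEEE Access; 2020-03-30; Vol. 8; 64820-64835 *
The detection of low-rate DoS attacks using the SADBSCAN algorithm; Tang Dan et al.; Information Sciences; 2021-07-01; Vol. 565; 229-247 *
Research on key technologies of news identification and prototype implementation; 邓媛丹; China Master's Theses Full-text Database, Information Science and Technology; 2022-01-15 (No. 01); I138-2753 *
Research on density clustering algorithms combining sampling and grouping; 郑言蹊; China Master's Theses Full-text Database, Information Science and Technology; 2022-04-15 (No. 04); I138-410 *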

Also Published As

Publication number Publication date
CN117391071A (en) 2024-01-12

Similar Documents

Publication Publication Date Title
US11573996B2 (en) System and method for hierarchically organizing documents based on document portions
US6826576B2 (en) Very-large-scale automatic categorizer for web content
US8526735B2 (en) Time-series analysis of keywords
CN107229668B (en) Text extraction method based on keyword matching
CN110825877A (en) Semantic similarity analysis method based on text clustering
Selvakuberan et al. Feature selection for web page classification
CN107463548B (en) Phrase mining method and device
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
JP4885842B2 (en) Search method for content, especially extracted parts common to two computer files
CN117391071B (en) News topic data mining method, device and storage medium
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN117235137B (en) Professional information query method and device based on vector database
CN113032573A (en) Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
JPH06282587A (en) Automatic classifying method and device for document and dictionary preparing method and device for classification
Triwijoyo et al. Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms
CN116186067A (en) Industrial data table storage query method and equipment
CN115329173A (en) Method and device for determining enterprise credit based on public opinion monitoring
Williams Results of classifying documents with multiple discriminant functions
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Selivanov et al. Package ‘text2vec’
CN111737469A (en) Data mining method and device, terminal equipment and readable storage medium
Luo et al. A comparison of som based document categorization systems
CN116414939B (en) Article generation method based on multidimensional data
CN114297479B (en) API recommendation method combining LDA topic model and GloVe word vector technology
CN109977269B (en) Data self-adaptive fusion method for XML file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant