CN104281607A

CN104281607A - Microblog hot topic analyzing method

Info

Publication number: CN104281607A
Application number: CN201310284081.8A
Authority: CN
Inventors: 肖江; 严时浪; 肖伦文
Original assignee: SHANGHAI RUIYING SOFTWARE TECHNOLOGY Co Ltd
Current assignee: SHANGHAI RUIYING SOFTWARE TECHNOLOGY Co Ltd
Priority date: 2013-07-08
Filing date: 2013-07-08
Publication date: 2015-01-14

Abstract

The invention discloses a microblog hot topic analysis method comprising the following steps: a microblog collection module obtains microblog data according to a collection strategy by combining a web crawler with third-party microblog API technology; keywords and sensitive words are retrieved from a word bank, and word segmentation is used to identify keywords and sensitive words in the microblog text data; the microblog web page text data is filtered according to the identified keywords, sensitive words, and emotional tendency words; a hot topic module uses cluster analysis to mark the content enclosed between "#" and "#" and between "[" and "]" as a topic, and counts the number of microblog comments; a hot people module analyzes the number of microblog fans and the number of comments using cluster analysis; a microblog early warning module identifies microblog information on the network that is related to the keywords and sensitive words; and an analysis and statistics module automatically generates a brief report from the relevant data analyzed by the system. The method improves both the accuracy of topic analysis and the efficiency of detection.

Description

Microblog hot topic analysis method

Technical Field

The invention relates to an analysis method, in particular to a microblog hot topic analysis method.

Background

A microblog is a platform for sharing, spreading, and acquiring information based on user relationships. A user can post updates of about 140 characters through the Web, WAP, and various client applications and share them instantly. As a network platform for fast sharing and dissemination, the microblog is characterized by a huge volume of information and highly dispersed, diverse content. In China, Sina Weibo and Tencent Weibo are the most popular microblog systems; according to public data, Sina Weibo has more than 200 million registered users and Tencent Weibo has more than 300 million. A public opinion analysis system based on the microblog social network can gather hot topics in microblog public opinion, track and analyze them, and provide a public opinion early warning function. At present, the main ways of discovering discussion hotspots on a microblog platform are hot topic discovery based on word frequency and text classification.

Word frequency statistics is the main way of finding discussion hotspots on current microblog platforms. The method derives from the traditional tf-idf indexing method. Within a given time range, the platform segments and screens the words of the microblogs posted by all users and builds an inverted index. The words are then sorted by frequency, and the highest-ranked words become the hot topics of the microblog; users can take the words provided by the platform and find the related microblog entries through the internal inverted index. The frequency statistics method is simple and easy to implement, works efficiently with manual intervention, and is widely adopted by service providers. However, it is essentially unable to handle semantic phenomena such as synonyms and word ambiguity, which greatly interfere with its results. Because the method relies on word matching, text matching produces both false positives and misses. On a microblog platform the content is vast and strongly shaped by individual users, so the accuracy of hot topic discovery based on text matching cannot be guaranteed. In addition, a single hot word gives the user only one-sided information; it is more an index to the information than the information itself. Improving the user experience therefore requires a certain amount of manual screening, which reduces the efficiency of the system. Moreover, the frequency statistics method offers little help in meeting the growing demand for personalized recommendation.
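The frequency-based approach described above can be illustrated with a minimal sketch. It assumes the posts have already been segmented into words, ranks words by the number of posts that mention them, and uses hypothetical names (`build_inverted_index`, `hot_words`) and made-up sample posts; it is not taken from any particular platform.

```python
from collections import Counter, defaultdict

def build_inverted_index(posts):
    """Map each word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, words in posts.items():
        for word in words:
            index[word].add(post_id)
    return index

def hot_words(posts, stop_words=frozenset(), top_n=3):
    """Rank words by the number of posts that mention them."""
    counts = Counter()
    for words in posts.values():
        counts.update(set(words) - stop_words)
    # Sort by count (descending), then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical, already-segmented posts.
posts = {
    1: ["earthquake", "in", "sichuan"],
    2: ["earthquake", "relief", "donation"],
    3: ["quake", "relief", "in", "sichuan"],
}
index = build_inverted_index(posts)
ranking = hot_words(posts, stop_words=frozenset({"in"}), top_n=2)
```

In this sample, "quake" and "earthquake" are counted as unrelated words, which is the synonym weakness the paragraph above points out.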

Traditional text classification methods can also be applied to a microblog platform for screening hotspot information. The automatic classifiers in wide use today include the Bayesian classifier, the instance-based kNN classifier, the support vector machine, and others. Because the number of microblog users is very large, the topics they follow are very broad, and users clearly influence one another, the user network as a whole captures hot events very quickly. If a classifier could be designed to fit the current hotspot event, the trend of the information in that category could be detected in real time. However, hotspot events and topics are unknown before they occur, so the problem shifts to fixed monitoring of certain specific, sensitive topics. The classifier method works well for screening specific topics, but because the text content on a microblog is distributed over a very wide range, it is almost impossible to design a complete dictionary-style classifier into whose categories all information falls. Hot topic discovery requires rapid capture of many different topics, and a general classifier is not adequate for such a task. In addition, because news is bursty and uncertain, tracking the trend of hotspot information on a microblog requires that the classifier's results be monitored at low cost.

As described above, the conventional microblog hot topic analysis algorithm has the following two problems:

First, the traditional microblog hot topic analysis method pays no attention to the word-level accuracy of search results. Because it is constrained by the relationships among the segmented words, it is essentially unable to handle highly disruptive phenomena such as synonyms and word ambiguity, which seriously affects the user experience. The wording people use is highly arbitrary and uncertain, so users searching massive amounts of information are often troubled by results whose text is similar but whose content is substantially irrelevant. Microblog hot topic analysis must therefore take the word-level accuracy of search results into account, and search results must distinguish between similar words.

Secondly, the traditional microblog hot topic analysis method does not pay attention to the real-time performance of the search results, namely the generation time of the hot topic analysis results has no or little influence on the result ranking. However, the microblog messages have strong real-time performance and are dynamically generated by microblog users, and the contents of the microblog messages often relate to real-time messages and contents, so that the real-time performance of search results must be considered in the microblog hot topic analysis method, and the generation time of the search results must be used as a basis for ranking.

However, research on microblog hot topic analysis methods is limited, and current work focuses mainly on passive data acquisition for known topics, so the timeliness of microblog public opinion discovery cannot be guaranteed. Public opinion analysis and early warning usually require a large number of web crawlers to collect, read, and write massive amounts of data, and traditional file storage or database storage cannot meet the performance requirements of this work.

Disclosure of Invention

The invention aims to provide a microblog hot topic analysis method that solves the technical problems described above.

The invention solves the technical problems through the following technical scheme: a microblog hot topic analysis method is characterized by comprising the following steps:

step one, a microblog collection module obtains microblog data according to a collection strategy by combining a web crawler with third-party microblog API technology;

step two, keywords and sensitive words are retrieved from a word bank, and word segmentation is used to identify keywords and sensitive words in the microblog text data;

step three, the microblog web page text data is filtered according to the identified keywords, sensitive words, and emotional tendency words, and the filtering records are stored;

step four, a hot topic module uses cluster analysis to mark the content enclosed between "#" and "#" and between "[" and "]" as a topic, and analyzes the current hot topics from statistics such as the number of microblog comments and the number of forwards, which greatly improves the accuracy of topic analysis;

step five, a hot people module uses cluster analysis to analyze the number of microblog fans and the number of comments and determines the hot people under specified conditions;

step six, a microblog early warning module identifies microblog information on the network that is related to the keywords and sensitive words and sends timely early warning notifications to the user;

step seven, an analysis and statistics module automatically generates a brief report from the relevant data analyzed by the system, for use in further analysis.
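Steps two and three can be sketched as follows. This is a minimal illustration under stated assumptions: the posts are already segmented into tokens, the three word lists are tiny made-up examples, and `find_hits` and `filter_posts` are hypothetical helper names. A post is kept if it hits any list, and one filter record is stored per kept post.

```python
def find_hits(tokens, lexicon):
    """Return the lexicon words found in the tokens, in order of first appearance."""
    seen = []
    for token in tokens:
        if token in lexicon and token not in seen:
            seen.append(token)
    return seen

def filter_posts(posts, lexicons):
    """Keep posts that hit at least one lexicon; return one filter record per kept post."""
    records = []
    for post_id, tokens in posts:
        hits = {name: find_hits(tokens, words) for name, words in lexicons.items()}
        if any(hits.values()):
            records.append({"id": post_id, **hits})
    return records

# Hypothetical word lists and already-segmented posts.
lexicons = {
    "keywords": {"campus"},
    "sensitive": {"fire"},
    "emotion": {"terrible"},
}
posts = [
    (1, ["campus", "fire", "terrible"]),
    (2, ["nice", "weather"]),
    (3, ["campus", "exam"]),
]
records = filter_posts(posts, lexicons)
```

Post 2 hits none of the lists and is dropped; the records for posts 1 and 3 show which words caused each post to be kept.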

Preferably, the data collected in step one includes not only data from the domestic Sina and Tencent microblogs but also data from the foreign Twitter microblog.

Preferably, in step two, in addition to the sensitive words specified by relevant national laws and regulations, the user can define keywords and sensitive words.

Preferably, the hot topics of interest in the fourth step can be viewed not only by content, but also by source and propagation trend.

Preferably, the early warning notification in step six is sent by email, website prompt, and mobile phone.

Preferably, after the required information is analyzed in the seventh step, the microblog system user is bound with the system through a microblog account.

Preferably, the microblog hot topic analysis method is applied to a microblog early warning system, and the microblog early warning system comprises a microblog acquisition module, a microblog analysis module, a microblog service module and a microblog data warehouse.

The positive effects of the invention are as follows. The invention provides a breadth-first web page collection technique based on time judgment. A time analyzer is added to the web page collection process to judge whether the times in a page to be collected are earlier than a preset time point, and thereby to decide whether the page is used only for breadth collection. This avoids collecting outdated, useless information, improves collection efficiency, and ensures collection coverage. The invention also provides an agglomerative hierarchical clustering algorithm for topic detection. Because wording on microblogs is flexible, a cluster analysis model is used to analyze the current hot topics, which greatly improves the accuracy of topic analysis as well as the efficiency and quality of topic detection.

The invention further provides a method by which a microblog early warning system monitors microblog information. Data is collected from three microblog systems on the Internet, namely Sina, Tencent, and Twitter, using microblog data collection technology. The collected mass data undergoes word segmentation, sensitive word processing, and text cluster analysis, and the current hot topics are analyzed. Users can thus browse the latest microblog hotspots conveniently and in a timely manner, trace the source of a microblog, view sensitive microblogs and trend analyses, receive early warning of dangerous information, and finally configure the content they care about to be shown in a statistical report.

The invention applies web page collection, text analysis, and text mining technologies to microblog public opinion analysis, studies a discovery model for network hot topics, and implements a public opinion analysis system based on the microblog social network. It meets the current demand for microblog public opinion analysis and fills the gap in mining important sources of public opinion.
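The text names agglomerative hierarchical clustering but does not specify a similarity measure or linkage rule. The following sketch therefore makes its own assumptions: posts are represented as sets of segmented words, similarity is the Jaccard coefficient, linkage is single-link, and merging stops when no pair of clusters reaches a threshold. All function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerative_clusters(docs, threshold=0.5):
    """Repeatedly merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(docs))]

    def similarity(c1, c2):
        # Single linkage: similarity of the closest pair of members.
        return max(jaccard(docs[i], docs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, (x, y) = max(
            ((similarity(c1, c2), (x, y))
             for x, c1 in enumerate(clusters)
             for y, c2 in enumerate(clusters) if x < y),
            key=lambda item: item[0],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical posts, already reduced to word sets.
docs = [
    {"college", "exam", "results"},
    {"college", "exam", "tips"},
    {"heat", "wave", "warning"},
    {"heat", "wave", "city"},
]
topics = agglomerative_clusters(docs, threshold=0.5)
```

Each resulting cluster groups posts that share enough vocabulary to be treated as one candidate topic; the threshold controls how aggressively posts are merged.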

Drawings

FIG. 1 is a flowchart of a microblog hot topic analysis method.

Detailed Description

The following provides a detailed description of the preferred embodiments of the present invention with reference to the accompanying drawings.

As shown in fig. 1, the microblog hot topic analysis method of the invention includes the following steps:

The data collected in the first step includes not only data from the domestic Sina and Tencent microblogs but also data from the foreign Twitter microblog.

In the second step, in addition to the sensitive words specified by relevant national laws and regulations, the user can define keywords and sensitive words.

In the fourth step, not only the content but also the source and the propagation trend of the interested hot topic can be viewed.

The early warning notification in the sixth step can be sent in various ways, such as email, website prompt, and mobile phone.

After the required information is analyzed in the seventh step, the microblog system user can bind a microblog account to the system and perform operations similar to those available on the Sina, Tencent, and Twitter microblogs, such as following, commenting, and publishing microblogs.

Given that microblog information is highly time-sensitive, is updated and spread quickly, and involves strong user interaction, the invention designs a breadth-first web page collection technique based on time judgment. Its core idea has two aspects. First, link information is automatically extracted from web pages through the link relationships among microblog pages, the original pages are fetched by following those links, and the process repeats until the original pages of the whole microblog site have been collected. Second, if all the information on a page is dated earlier than the preset time, no deep collection is performed from that page; the page is used only for breadth collection.
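One way to read the time-judgment rule is sketched below. The sketch assumes an in-memory dictionary of hypothetical pages instead of real HTTP fetching, and interprets "no deep collection" as "the page itself is collected, but its outgoing links are not followed". Both the interpretation and the name `crawl` are assumptions, not details given in the text.

```python
from collections import deque
from datetime import datetime

def crawl(pages, seed, cutoff):
    """Breadth-first traversal that does not expand pages whose items are all older than cutoff."""
    visited, order = {seed}, []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        page = pages.get(url)
        if page is None:
            continue
        order.append(url)
        # Time judgment: a page holding only stale items is collected but not followed deeper.
        if page["times"] and all(t < cutoff for t in page["times"]):
            continue
        for link in page["links"]:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

# Hypothetical site: each page lists the timestamps of its items and its outgoing links.
pages = {
    "/home":  {"times": [datetime(2013, 7, 8)], "links": ["/a", "/b"]},
    "/a":     {"times": [datetime(2013, 1, 1)], "links": ["/a/old"]},
    "/b":     {"times": [datetime(2013, 7, 7)], "links": ["/b/new"]},
    "/a/old": {"times": [datetime(2012, 5, 1)], "links": []},
    "/b/new": {"times": [datetime(2013, 7, 8)], "links": []},
}
order = crawl(pages, "/home", cutoff=datetime(2013, 7, 1))
```

Here `/a` contains only items older than the cutoff, so `/a/old` is never fetched, while the fresh branch under `/b` is followed.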

The invention can be applied to a microblog early warning system. For example, through the system user interface it can be configured as a microblog early warning and monitoring system for colleges and universities, which monitors all microblog information related to the institutions, follows their hot topics and hot people, tracks emergencies involving them in a timely manner, and gives early warning of microblog content that negatively affects a designated institution, thereby protecting the institution's image, improving education quality, and maintaining social harmony and stability.

The microblog early warning system applied to the invention comprises a microblog acquisition module, a microblog analysis module, a microblog service module, a microblog data warehouse and the like.

(I) Microblog collection module: this module is responsible for real-time collection, tracking, and monitoring of the three microblog systems Sina, Tencent, and Twitter on the Internet. A key technology in the module is intelligent information collection: intelligent, distributed, cooperating crawlers are used, the number of crawler servers and the number of crawlers can be configured dynamically, and the computing resources used for collection are increased or decreased dynamically under different collection requirements. Microblog information is collected from the Internet by a crawler module in the web page collection subsystem. The crawler module can be configured with the number of crawlers, the collection speed, the initial URLs, a regular expression that URLs must match to be collected, the termination condition of the crawler threads, and other constraints. The collected web page information is then cleaned by a web page cleaning module, which removes noise data such as copyright descriptions and extracts data such as microblog text, link addresses, and collection time.
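The configurable constraints listed above could be grouped in a single settings object. The sketch below is only an illustration: the field names, the page-count termination condition, and the `example.com` URLs are all hypothetical and not taken from the described system.

```python
import re
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    crawler_count: int          # number of crawler workers
    requests_per_second: float  # collection speed limit
    start_urls: list            # initial URLs
    url_pattern: str            # regular expression a URL must match to be collected
    max_pages: int              # assumed termination condition for a crawler thread

    def accepts(self, url):
        """True if the URL satisfies the collection pattern."""
        return re.fullmatch(self.url_pattern, url) is not None

    def should_stop(self, pages_collected):
        """True once the crawler thread has reached its page budget."""
        return pages_collected >= self.max_pages

# Hypothetical configuration; example.com stands in for a real microblog site.
config = CrawlerConfig(
    crawler_count=4,
    requests_per_second=2.0,
    start_urls=["https://example.com/u/1"],
    url_pattern=r"https://example\.com/u/\d+",
    max_pages=1000,
)
```

Keeping the constraints in one object makes it straightforward to start more or fewer crawlers with the same rules when the collection demand changes.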

(II) Microblog analysis module: the information obtained by the microblog collection module undergoes deduplication, propagation chain analysis, trend analysis, and similar processing to obtain valuable microblog information, analyze public opinion hotspots in real time, and identify trends in the microblog information. The microblog analysis module specifically comprises:

page filtering, which analyzes and filters the content of microblog web pages, automatically removes useless information, and accurately extracts the main information of the target content;

propagation chain analysis, which tracks the source, repost count, publisher, and other related elements of a hot topic over a period of time and finally produces a propagation chain analysis graph;

automatic classification, which scans microblog content according to keyword rules defined by the user, identifies the microblogs containing the keywords, and labels them with categories automatically; a classification feature vector space model is obtained by training on samples, and microblogs are then classified automatically according to their feature vectors;

multiple clustering, which applies a multiple clustering algorithm to the content of microblogs to classify massive amounts of microblog information intelligently;

hotspot and keyword discovery, which uses a hotspot weight calculation model to analyze the popularity of microblogs and automatically finds hot words, helping the user understand network hotspots intuitively;

trend analysis, which, for high-attention events triggered by microblogs, identifies the outbreak point and development of the event in a timely manner and provides the hot events of different time periods;

tendency analysis, which applies text clustering and positive/negative sentiment analysis to netizens' comments on microblogs, summarizes their main viewpoints, and counts the distribution of positive and negative tendencies;

public opinion research and judgment, which builds on the analysis functions above and performs source analysis, authenticity analysis, classification analysis, directional analysis, correction analysis, and the like, so that hotspots and public opinion trends can be understood comprehensively and in a timely manner, and social emergencies and public opinion crises can be handled flexibly.

(III) Microblog service module: this is the module the user interacts with directly. Through it the user can clearly understand the functions of the microblog early warning system, learn about the latest hotspots across the microblog in a more specific and convenient way, set keywords for matters of interest, search by those keywords, and obtain required information in a timely manner. The microblog service module specifically comprises:

monitoring settings, which monitor information related to microblog users through keyword settings, key people settings, region settings, and key monitoring word settings;

topic tracking, in which the microblog system analyzes hot topics from the microblogs collected from the network;

hot people, in which the microblog system analyzes hot people from the microblogs collected from the network;

emergencies, that is, events that occur within a short time (within 24 hours) and cause a strong reaction on the network;

microblog search, in which the user can search all microblogs captured by the microblog system to obtain the microblog data they want;

statistical analysis, in which the corresponding modules of the microblog system are analyzed statistically: marking statistics, marking reports, topic statistics, topic reports, monitoring word statistics, and user behavior statistics;

microblog early warning, in which the microblog system analyzes microblogs according to the keywords set by the user and displays them on the microblog early warning page;

online microblog, in which the microblog system user can perform operations similar to those available on the Sina, Tencent, and Twitter microblogs, such as following, commenting, and publishing microblogs.

(IV) Microblog data warehouse: the warehouse can store massive amounts of unstructured information. It uses a real-time dynamic indexing technique, so that indexes are updated quickly and synchronously whenever data is added, deleted, or modified, without rebuilding the whole index or any part of it. Data can therefore be retrieved immediately after it changes, which ensures that information search is real-time and effective and meets the core retrieval requirement of public opinion applications. The microblog data warehouse specifically comprises:

a database storage service capable of storing massive unstructured information and calling the information of the database at any time;

the data index service adopts a real-time dynamic index technology, and ensures the real-time performance and effectiveness of information search.
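The idea of updating an index in place on every add, delete, and modify, without any rebuild, can be sketched with a small in-memory inverted index. This is an assumption-laden illustration (the class name `LiveIndex`, token-list documents, and single-term search are all hypothetical), not the indexing technology the system actually uses.

```python
from collections import defaultdict

class LiveIndex:
    """In-memory inverted index that is updated in place on every change."""

    def __init__(self):
        self.docs = {}                    # doc id -> list of tokens
        self.postings = defaultdict(set)  # token -> set of doc ids

    def add(self, doc_id, tokens):
        """Add a document, or replace it if the id already exists."""
        if doc_id in self.docs:
            self.delete(doc_id)
        self.docs[doc_id] = list(tokens)
        for token in tokens:
            self.postings[token].add(doc_id)

    def delete(self, doc_id):
        """Remove a document and only the postings that refer to it."""
        for token in self.docs.pop(doc_id, []):
            self.postings[token].discard(doc_id)
            if not self.postings[token]:
                del self.postings[token]

    def search(self, token):
        """Return the ids of documents containing the token."""
        return sorted(self.postings.get(token, set()))

index = LiveIndex()
index.add(1, ["campus", "fire"])
index.add(2, ["campus", "exam"])
before_update = index.search("campus")
index.add(1, ["campus", "rain"])   # modify document 1
index.delete(2)                    # delete document 2
```

Each operation touches only the postings of the affected document, so a change is searchable as soon as the call returns.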

The microblog early warning system has the following specific functions:

(1) Microblog information collection: data is collected from the three microblog systems Sina, Tencent, and Twitter on the Internet and sent to (2) for analysis.

(2) Microblog analysis: the collected information undergoes deduplication, propagation chain analysis, trend analysis, and similar processing. Effective intelligence data is extracted and then passed to (3) for intelligence mining and analysis.

(3) Intelligence mining: the intelligence is mined further, for example for target information and dynamics, and is then processed further in (4) and (5).

(4) Microblog service: information is displayed through an interface according to the user's requirements. The functions available to the user include monitoring settings, topic tracking, hot people, emergencies, microblog search, statistical analysis, online microblog, microblog early warning, and the like.

(5) Microblog data warehouse: the mined information is stored in the microblog data warehouse, ready to be searched and used at any time, which ensures that information search is real-time and effective.

Compared with the prior art, the invention has the following advantages and beneficial effects. The microblog collection module collects data from the three microblog systems Sina, Tencent, and Twitter on the Internet and passes it to the microblog analysis module for deduplication, trend analysis, and similar processing. After the effective information is extracted, it is displayed to the user through an interface, where the available functions include monitoring settings, topic tracking, hot people, emergencies, microblog search, statistical analysis, online microblog, microblog early warning, and the like. The interface is user-friendly and offers many functions: the microblog system can be monitored comprehensively, hot topics can be reported in real time, and extreme speech can be tracked and flagged with early warnings. The system uses intelligent information collection technology with intelligent, distributed, cooperating crawlers; the number of crawler servers and the number of crawlers can be configured dynamically, and the computing resources used for collection are increased or decreased dynamically under different collection requirements. The system collects microblog information from the Internet through the crawler module in the web page collection subsystem, and the crawler module can be configured with the number of crawlers, the collection speed, the initial URLs, a regular expression that URLs must match to be collected, the crawler thread termination condition, and other constraints. For each collected web page, the web page cleaning module removes noise data such as advertisements, navigation information, pictures, and copyright descriptions, extracts data such as microblog text, link addresses, and collection time, and stores the data in a database.

The following operations are performed on each piece of microblog data acquired by the microblog search engine:

Step 1-1) the acquired data is stored mainly as two types: user data (User) and microblog data (Tweet);

Step 1-2) a relational database is used to store the User and Tweet data for subsequent associated queries.

Step 2-1) Chinese word segmentation is applied to the microblog content field (content) of the Tweet data;

Step 2-2) an inverted index is built with full-text retrieval technology to serve as the query index for data analysis;

Step 2-3) while the content field is being indexed, the content enclosed by the "#" and "[ ]" symbols is extracted as a tag;

Step 2-4) an inverted index is built for the tag field.

Step 3-1) a timer program queries the Tweet data every hour and counts all tags collected within the past hour; the query condition is time = [now() - 1h TO now()] & facet.field = tag;

Step 3-2) the tags are sorted in descending order by their count tag_count, and the top 100 tags are taken.

Step 4-1) the 100 tags taken in step 3-2) are traversed and each is segmented with Chinese word segmentation; each resulting word is a term;

Step 4-2) the full-text retrieval server is queried again. When a tag has no more than 3 terms, all terms must match; when it has more than 3 terms, at least 75% of the terms must match. If the number of terms is less than or equal to 3, the query condition is (content = term1 AND term2 AND term3) & time = [now() - 24h TO now()]; if the number of terms is greater than 3, the query condition is (content = (term1 AND term2 AND term3) OR (term4 OR term5 …)) & time = [now() - 24h TO now()];

Step 4-3) the current microblog data corresponding to the 100 tags is queried in this way, and the tags are then sorted in descending order by the number of matching microblogs t_count, giving the 100 hot topics of the day.
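Steps 2-3) through 4-3) can be sketched in plain Python. The sketch replaces the full-text retrieval server with a scan over an in-memory list, splits tags on whitespace instead of using Chinese word segmentation, and treats a term as matched when it occurs as a substring of the content; these substitutions, the sample tweets, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta

TAG_PATTERN = re.compile(r"#([^#]+)#|\[([^\[\]]+)\]")

def extract_tags(content):
    """Return the text enclosed by #...# or [...] in a microblog (step 2-3)."""
    return [a or b for a, b in TAG_PATTERN.findall(content)]

def top_tags(tweets, now, window=timedelta(hours=1), limit=100):
    """Count the tags of tweets inside the window, most frequent first (steps 3-1, 3-2)."""
    counts = Counter()
    for tweet in tweets:
        if now - window <= tweet["time"] <= now:
            counts.update(extract_tags(tweet["content"]))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

def required_matches(term_count):
    """All terms must match for up to 3 terms, otherwise at least 75% of them (step 4-2)."""
    return term_count if term_count <= 3 else math.ceil(0.75 * term_count)

def topic_heat(terms, tweets, now, window=timedelta(hours=24)):
    """Number of tweets in the window whose content contains enough of the terms."""
    needed = required_matches(len(terms))
    return sum(
        1 for tweet in tweets
        if now - window <= tweet["time"] <= now
        and sum(term in tweet["content"] for term in terms) >= needed
    )

def hot_topics(tweets, now):
    """Rank the hourly top tags by their 24-hour tweet count (step 4-3)."""
    scored = [(tag, topic_heat(tag.split(), tweets, now)) for tag, _ in top_tags(tweets, now)]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample data.
now = datetime(2013, 7, 8, 12, 0)
tweets = [
    {"time": now - timedelta(minutes=10), "content": "#college exam# results are out"},
    {"time": now - timedelta(minutes=30), "content": "[college exam] good luck"},
    {"time": now - timedelta(minutes=50), "content": "#heat wave# stay inside"},
    {"time": now - timedelta(hours=5), "content": "the college entrance exam was hard"},
    {"time": now - timedelta(hours=30), "content": "#college exam# last year"},
]
ranking = hot_topics(tweets, now)
```

The fourth tweet carries no tag but still counts toward the "college exam" topic, because the second query matches on terms in the content rather than on the tag itself.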

The advantages of the invention are as follows: cluster analysis improves the accuracy of current microblog retrieval results. The statistical calculation method is simple and efficient and significantly improves real-time performance; the microblog system can be monitored comprehensively and in a timely manner, hot topics can be reported in real time, and extreme speech can be tracked intelligently and flagged with early warnings.

In one embodiment, the collector may collect microblog messages periodically. However, collecting from all users at the same fixed interval makes the collector inefficient, because a large share of microblog users post infrequently, for example once every few days. If there are many such users, collecting from each of them every 3 minutes, for example, greatly reduces efficiency.

Various modifications and changes may be made to the present invention by those skilled in the art. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims

1. A microblog hot topic analysis method is characterized by comprising the following steps:

2. The microblog-based emergency analysis method according to claim 1, wherein the data collected in the first step includes not only domestic Sina and Tencent microblog data but also foreign Twitter microblog data.

3. The microblog-based emergency analysis method according to claim 1, wherein in the second step, in addition to the sensitive words specified by relevant national laws and regulations, keywords and sensitive words are defined by the user.

4. The microblog-based emergency analysis method according to claim 1, wherein in the fourth step, not only the content but also the source and the propagation trend of the interested hot topic can be viewed.

5. The microblog-based emergency analysis method according to claim 1, wherein the warning notification in the sixth step is sent by email, website prompt, and mobile phone.

6. The method for analyzing emergency events based on microblogs according to claim 1, wherein in the seventh step, after the required information is analyzed, the user of the microblog system is bound with the system through a microblog account.

7. The microblog-based emergency analysis method according to claim 1, wherein the microblog hot topic analysis method is applied to a microblog early warning system, and the microblog early warning system comprises a microblog acquisition module, a microblog analysis module, a microblog service module and a microblog data warehouse.