CN104281607A - Microblog hot topic analyzing method - Google Patents
- Publication number
- CN104281607A (application CN201310284081.8A)
- Authority
- CN
- China
- Prior art keywords
- microblog
- analysis
- data
- hot
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a microblog hot topic analysis method. The method comprises the following steps: a microblog acquisition module obtains microblog data according to an acquisition strategy by combining a web crawler with a microblog third-party API; keywords and sensitive words are retrieved from a word bank, and word segmentation is used to identify keywords and sensitive words in the microblog text data; the microblog webpage text data are filtered according to the identified keywords, sensitive words and emotional tendency words; a hot topic module marks the content enclosed between # and # symbols and between [ ] symbols as a topic through a cluster analysis technology and counts the number of microblog comments; a hot people module analyzes the number of microblog fans and the number of comments through the cluster analysis technology; a microblog early warning module identifies microblog information related to the keywords and sensitive words in the network microblogs; and an analysis and statistics module automatically generates a brief report from the related data analyzed in the system. The accuracy of topic analysis and the efficiency of detection are improved.
Description
Technical Field
The invention relates to an analysis method, in particular to a microblog hot topic analysis method.
Background
A microblog is a platform for sharing, spreading and acquiring information based on user relationships. A user can post updates of about 140 characters through WEB, WAP and various client applications and share them instantly. As a network platform for fast sharing and spreading, the microblog is characterized by a huge volume of information and widely dispersed, diverse content. In China, the Sina microblog and the Tencent microblog are the most popular microblog systems; according to public data, the Sina microblog has more than 200 million registered users and the Tencent microblog has more than 300 million. A public opinion analysis system based on the microblog social network can gather hot topics from microblog opinions, track and analyze them, and provide a public opinion early warning function. At present, the main ways of discovering discussion hotspots on a microblog platform are hot topic discovery based on word frequency and text classification.
Word frequency statistics is currently the main way of finding discussion hotspots on a microblog platform. The method derives from the traditional tf-idf indexing method. Within a given time range, the platform performs word segmentation and word screening on the microblogs posted by all users and builds an inverted index. The words are then sorted by frequency, and the highest-ranked words become the hot topics on the platform; users can use these words to find related microblog entries through the internal inverted index. The frequency statistical method is simple and easy to implement, works efficiently with manual intervention, and is widely adopted by service providers. However, it is essentially unable to handle semantic phenomena such as synonymy and polysemy, which interfere heavily with word-based matching and lead to false alarms or missed reports in text matching. On a microblog platform, where the volume of content is large and user expression is highly individual, the accuracy of hot topic discovery based on text matching cannot be well guaranteed. In addition, a single hot word gives the user only one-sided information and is more an index to information than the information itself. Improving the user experience therefore requires a certain amount of manual screening, which reduces the efficiency of the system. Moreover, the frequency statistical method can hardly help meet the growing demand for personalized recommendation.
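The frequency-based approach described above can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the sample posts and the stop-word list are hypothetical, and a plain whitespace split stands in for a real word segmenter.

```python
from collections import defaultdict


def build_inverted_index(posts, stop_words=frozenset()):
    """Map each word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        # Whitespace split is a placeholder for real word segmentation.
        for word in text.lower().split():
            if word not in stop_words:  # "word screening"
                index[word].add(post_id)
    return index


def hot_words(index, top_n=3):
    """Rank words by how many posts contain them (ties broken alphabetically)."""
    ranked = sorted(index.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(word, len(ids)) for word, ids in ranked[:top_n]]


# Hypothetical sample data.
posts = {
    1: "earthquake reported near the coast",
    2: "coast guard responds to earthquake",
    3: "new phone released today",
}
index = build_inverted_index(posts, stop_words={"the", "to", "near"})
top = hot_words(index, top_n=2)
print(top)
```

The same index that produces the ranking also answers lookups such as `index["earthquake"]`, which is how a user would get from a hot word back to the related entries. The sketch also shows the weakness noted in the text: two posts that describe the same event with different words would never be counted together.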
The traditional text classification method can also be applied to a microblog platform for screening hotspot information. Widely used automatic classifiers include the Bayesian classifier, the instance-based kNN classifier, the support vector machine and the like. Because microblog users are very numerous, the topics they follow are very broad, and users clearly influence one another, the user network as a whole captures hot events very quickly. If a classifier could be designed to fit the current hot event, the trend of information in that category could be detected in real time. However, hot events and topics are unknown before they occur, so the problem shifts to fixed monitoring of certain specific, sensitive topics. The classifier method works well for screening specific topics, but because the text content on microblogs covers a very wide range, it is almost impossible to design a complete dictionary-type classifier under which all information falls into specific categories. Hot topic discovery requires rapid capture of many different topics, a task for which a general classifier is not adequate. In addition, because news is bursty and uncertain, tracking the trend of hotspot information on a microblog requires the classifier's results to be monitored at low cost.
As described above, the conventional microblog hot topic analysis algorithm has the following two problems:
Firstly, the traditional microblog hot topic analysis method does not pay attention to the word accuracy of the search results. The traditional method is limited by the relationships among the segmented words, so it is essentially unable to handle strongly interfering phenomena such as synonymy and polysemy, which affects the user experience to a great extent. Because the wording people use when writing is highly random and uncertain, users searching massive information are often troubled by results whose text is similar but whose content is substantially irrelevant. Microblog hot topic analysis must therefore consider the word accuracy of the search results and must distinguish between similar words.
Secondly, the traditional microblog hot topic analysis method does not pay attention to the real-time performance of the search results; that is, the time at which a result was generated has little or no influence on its ranking. However, microblog messages are strongly time-sensitive, are generated dynamically by microblog users, and often concern real-time news and content. The microblog hot topic analysis method must therefore consider the real-time performance of the search results and must use their generation time as a basis for ranking.
However, research on microblog hot topic analysis methods is limited, and current work focuses mainly on passive data acquisition for known topics, so the timeliness of microblog public opinion discovery cannot be guaranteed. Public opinion analysis and early warning usually require a large number of web crawlers that collect massive data and read and write it, and traditional file storage or database storage cannot meet the performance requirements of this work.
Disclosure of Invention
The invention aims to provide a microblog hot topic analysis method that solves the technical problems described above.
The invention solves the technical problems through the following technical scheme: a microblog hot topic analysis method is characterized by comprising the following steps:
step one, a microblog acquisition module acquires microblog data according to an acquisition strategy by combining a web crawler with a microblog third-party API;
step two, keywords and sensitive words are retrieved from a word bank, and word segmentation is used to identify the keywords and sensitive words in the microblog text data;
step three, the microblog webpage text data are filtered according to the identified keywords, sensitive words and emotional tendency words, and filtering records are stored;
step four, a hot topic module marks the content enclosed between # and # symbols and between [ ] symbols as a topic through a cluster analysis technology, and analyzes the current hot topics according to statistics such as the number of microblog comments and the number of forwards, which greatly improves the accuracy of topic analysis;
step five, a hot people module analyzes the number of microblog fans and the number of comments through the cluster analysis technology to determine the hot people under specified conditions;
step six, a microblog early warning module identifies microblog information related to the keywords and sensitive words in the network microblogs and promptly sends an early warning notification to the user;
and step seven, an analysis and statistics module automatically generates a brief report from the related data analyzed in the system for analysis and use.
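Steps two and three can be sketched roughly as follows. This is an illustrative reading rather than the patented implementation: the function names and sample word banks are hypothetical, whitespace splitting stands in for Chinese word segmentation, and the sketch assumes that "filtering" means keeping the posts that hit the word bank and logging a record for each hit.

```python
def match_lexicon(tokens, keywords, sensitive_words):
    """Return the keywords and sensitive words found among the tokens."""
    token_set = set(tokens)
    return {
        "keywords": sorted(token_set & keywords),
        "sensitive": sorted(token_set & sensitive_words),
    }


def filter_posts(posts, keywords, sensitive_words):
    """Keep posts that hit the word bank and store one filtering record per kept post."""
    kept, records = [], []
    for post_id, text in posts:
        tokens = text.lower().split()  # placeholder for a real word segmenter
        hits = match_lexicon(tokens, keywords, sensitive_words)
        if hits["keywords"] or hits["sensitive"]:
            kept.append(post_id)
            records.append({"post_id": post_id, **hits})
    return kept, records


# Hypothetical sample data and word banks.
posts = [
    (1, "campus exam schedule released"),
    (2, "lunch was great"),
    (3, "rumor about campus riot"),
]
kept, records = filter_posts(posts, keywords={"campus"}, sensitive_words={"riot", "rumor"})
print(kept, records)
```

Keeping the hit lists in each record means the early warning in step six could later be driven from the stored records instead of rescanning the text.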
Preferably, the data collected in step one include not only the domestic Sina and Tencent microblogs but also the foreign Twitter microblog.
Preferably, in step two the user can define keywords and sensitive words in addition to the sensitive words specified by relevant national laws and regulations.
Preferably, in step four a hot topic of interest can be viewed not only by content but also by source and propagation trend.
Preferably, the early warning notification in step six is sent through a mailbox, a website prompt or a mobile phone.
Preferably, after the required information is analyzed in step seven, the microblog system user is bound to the system through a microblog account.
Preferably, the microblog hot topic analysis method is applied to a microblog early warning system, and the microblog early warning system comprises a microblog acquisition module, a microblog analysis module, a microblog service module and a microblog data warehouse.
The positive effects of the invention are as follows. The invention provides a breadth-first webpage acquisition technology based on time judgment. A time analyzer added to the webpage collection process judges whether the time in a page to be collected is earlier than a preset time point, and thereby determines whether the page is subjected only to breadth collection. This avoids collecting useless earlier information, improves collection efficiency and ensures collection coverage. The invention also provides an agglomerative hierarchical clustering algorithm for topic detection. Given the flexible wording used in microblogs, a cluster analysis model is used to analyze the current hot topics, which greatly improves the accuracy of topic analysis and the efficiency and quality of topic detection. The invention further provides a method by which a microblog early warning system monitors microblog information: data are acquired from three microblog systems on the Internet (Sina, Tencent and Twitter) through a microblog data acquisition technology; word segmentation, sensitive word processing and text cluster analysis are performed on the acquired mass data; and the current hot topics are analyzed, so that a user can conveniently browse the latest microblog hotspots in a timely manner, trace microblog sources, check sensitive microblogs and trend analyses, receive early warnings of dangerous information, and define the content of interest to be displayed in a statistical report.
The invention applies webpage collection, text analysis and mining technologies to microblog public opinion analysis, studies a discovery model for network hot topics, and realizes a public opinion analysis system based on the microblog social network, which meets the needs of current microblog public opinion analysis and fills the gap in mining important public opinion sources.
Drawings
FIG. 1 is a flowchart of a microblog hot topic analysis method.
Detailed Description
The following provides a detailed description of the preferred embodiments of the present invention with reference to the accompanying drawings.
As shown in fig. 1, the microblog hot topic analysis method of the invention includes the following steps:
step one, a microblog acquisition module acquires microblog data according to an acquisition strategy by combining a web crawler with a microblog third-party API;
step two, keywords and sensitive words are retrieved from a word bank, and word segmentation is used to identify the keywords and sensitive words in the microblog text data;
step three, the microblog webpage text data are filtered according to the identified keywords, sensitive words and emotional tendency words, and filtering records are stored;
step four, a hot topic module marks the content enclosed between # and # symbols and between [ ] symbols as a topic through a cluster analysis technology, and analyzes the current hot topics according to statistics such as the number of microblog comments and the number of forwards, which greatly improves the accuracy of topic analysis;
step five, a hot people module analyzes the number of microblog fans and the number of comments through the cluster analysis technology to determine the hot people under specified conditions;
step six, a microblog early warning module identifies microblog information related to the keywords and sensitive words in the network microblogs and promptly sends an early warning notification to the user;
and step seven, an analysis and statistics module automatically generates a brief report from the related data analyzed in the system for analysis and use.
The data collected in step one include not only the domestic Sina and Tencent microblogs but also the foreign Twitter microblog.
In step two, the user can define keywords and sensitive words in addition to the sensitive words specified by relevant national laws and regulations.
In step four, a hot topic of interest can be viewed not only by content but also by source and propagation trend.
The early warning notification in step six can be sent in various ways, such as through a mailbox, a website prompt or a mobile phone.
After the required information is analyzed in step seven, the microblog system user can be bound to the system through a microblog account and perform operations similar to those on the Sina, Tencent and Twitter microblogs, such as following, commenting and publishing microblogs.
In view of the strong timeliness of microblog information, its high update and transmission speed and its strong user interactivity, the invention designs a breadth-first webpage acquisition technology based on time judgment. The core idea has two aspects. First, link information is automatically extracted from webpages through the link relationships among microblog webpages, the linked webpages are automatically acquired, and this cycle continues until the webpages of the whole microblog are acquired. Second, if all information times on a page are earlier than the preset time, deep acquisition is not carried out and only breadth acquisition is performed through that page.
The invention can be applied to a microblog early warning system. For example, through the system user interface it can be configured as a microblog early warning and monitoring system for colleges and universities, which monitors all microblog information related to a college or university, follows its hot topics and hot people, tracks related emergencies in time, and gives early warning of microblog content that negatively affects the designated institution, thereby helping to maintain the institution's image, improve education quality and maintain social harmony and stability.
The microblog early warning system to which the invention is applied comprises a microblog acquisition module, a microblog analysis module, a microblog service module, a microblog data warehouse and the like.
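The time-judged breadth-first acquisition can be sketched as follows. The sketch makes several assumptions that the text does not state: pages are simulated with an in-memory dictionary instead of being fetched, the function and field names are hypothetical, and "only breadth acquisition" is read as "the stale page is visited but its outgoing links are not followed".

```python
from collections import deque
from datetime import datetime


def crawl(pages, start_url, cutoff):
    """Breadth-first traversal that does not go deeper from stale pages.

    pages: dict mapping url -> {"times": [datetime, ...], "links": [url, ...]},
    an in-memory stand-in for fetching and parsing real webpages.
    """
    queue, seen, visited = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        page = pages.get(url)
        if page is None:
            continue
        visited.append(url)
        # Time judgment: if every timestamp on the page predates the cutoff,
        # skip its outgoing links and continue with the pages already queued.
        if page["times"] and all(t < cutoff for t in page["times"]):
            continue
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph: "b" is stale, so "d" is never reached.
pages = {
    "a": {"times": [datetime(2013, 7, 1)], "links": ["b", "c"]},
    "b": {"times": [datetime(2013, 1, 1)], "links": ["d"]},
    "c": {"times": [datetime(2013, 7, 2)], "links": ["e"]},
    "d": {"times": [datetime(2012, 12, 1)], "links": []},
    "e": {"times": [datetime(2013, 7, 3)], "links": []},
}
order = crawl(pages, "a", cutoff=datetime(2013, 6, 1))
print(order)
```

Pruning at the page level keeps the crawl budget on recent content while still covering every page at the current depth.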
(I) Microblog acquisition module: responsible for real-time acquisition, tracking and monitoring of the three microblog systems Sina, Tencent and Twitter on the Internet. A key technology in this module is intelligent information acquisition: intelligent distributed cooperative crawlers are adopted, the number of crawler servers and the number of crawlers can be dynamically configured, and the computing resources used for acquisition are dynamically increased or decreased according to the acquisition requirements. Microblog information is acquired from the Internet through a crawler module in a webpage acquisition subsystem. The crawler module can be configured with the number of crawlers, the acquisition speed, the initial URL, a regular expression for URLs that meet the acquisition requirements, the crawler thread termination condition and other constraints. The acquired webpages are cleaned by a webpage cleaning module, which removes noise data such as copyright descriptions and extracts data such as microblog texts, link addresses and acquisition time from the related webpages.
(II) Microblog analysis module: performs information deduplication, propagation chain analysis, trend analysis and the like on the information obtained by the microblog acquisition module to obtain valuable microblog information, analyze public opinion hotspots in real time and grasp the trends of microblog information. The microblog analysis module specifically comprises:
page filtering, namely analyzing and filtering the content of microblog webpages, automatically removing useless information and accurately acquiring the main information of the target content;
propagation chain analysis, namely tracking the source, the repost count, the publisher and other related information elements of a hot topic over a period of time, and finally forming a propagation chain analysis graph;
automatic classification, namely scanning microblog contents according to keyword rules defined by the user, identifying the microblogs in which the keywords appear and automatically assigning classification labels; a classification feature vector space model is obtained through sample training, and microblogs are then automatically classified and labeled according to their feature vectors;
multiple clustering, namely performing multiple clustering analysis on microblog content with a multiple clustering algorithm, and performing intelligent classification of massive microblog information;
hotspot and keyword discovery, namely analyzing the popularity of microblogs with a hotspot weight calculation model, automatically discovering hot words in microblogs and helping the user intuitively understand network hotspots;
trend analysis, namely for high-attention events triggered by microblogs, grasping their outbreak points and development in time and providing the hot events of different time periods;
tendency analysis, namely performing cluster analysis and positive/negative sentiment analysis on netizens' comments on microblogs with text clustering and sentiment analysis technologies, summarizing the main viewpoints of netizens, and counting the distribution of their positive and negative tendencies;
public opinion research and judgment, namely performing source analysis, authenticity analysis, classification analysis, directional analysis, correction analysis and the like on the basis of the above analysis functions, so that hotspots and public opinion trends can be understood and grasped comprehensively and in time, and social emergencies and public opinion crises can be dealt with flexibly.
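The clustering used by the analysis module is described only at a high level (an agglomerative hierarchical clustering algorithm is named in the Disclosure). The sketch below is one simple instance of that family, not the patented model: it uses single-link merging with Jaccard similarity over token sets, and the threshold, function names and sample texts are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerative_clusters(docs, threshold=0.3):
    """Single-link agglomerative clustering.

    Starts with one cluster per document and repeatedly merges the two most
    similar clusters until no pair reaches the similarity threshold.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def similarity(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(token_sets[i], token_sets[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical sample texts: two about an earthquake, two about a phone launch.
docs = [
    "earthquake hits coastal city",
    "coastal city earthquake damage",
    "new phone launch event",
    "phone launch event today",
]
clusters = agglomerative_clusters(docs)
print(clusters)
```

Unlike exact word matching, grouping by overall token overlap lets posts that phrase the same topic differently fall into one cluster. The pairwise search is quadratic per merge, so a production system would need a more scalable variant.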
(III) Microblog service module: the part of the system that the user experiences directly. Through it the user can clearly understand the functions of the microblog early warning system, learn the latest hotspots of the whole microblog more specifically and conveniently through his or her own operations, and set keywords for matters of interest and search on them to obtain the required information in time. The microblog service module specifically comprises:
monitoring setting, namely monitoring related information of microblog users through keyword setting, key people setting, area setting and key monitoring word setting;
topic tracking, namely the microblog system analyzes hot topics from the microblogs acquired from the network;
hot people, namely the microblog system analyzes hot people from the microblogs acquired from the network;
emergencies, namely events that occur within a short time (within 24 hours) and cause a strong reaction on the network;
microblog searching, namely the user can search all microblogs captured by the microblog system to obtain the desired microblog data;
statistical analysis, namely statistical analysis of the corresponding modules of the microblog system, including: marking statistics, marking reports, topic statistics, topic reports, monitoring word statistics and user behavior statistics;
microblog early warning, namely the microblog system analyzes microblogs according to the keywords set by the user and displays them on a microblog early warning page;
online microblog, namely the microblog system user can perform operations similar to those on the Sina, Tencent and Twitter microblogs, such as following, commenting and publishing microblogs.
(IV) Microblog data warehouse: can store massive unstructured information and adopts a real-time dynamic indexing technology, so that the index is updated quickly and synchronously when data are added, deleted or modified, without rebuilding the whole index or partially rebuilding it. Data can thus be retrieved immediately after they change, which ensures the real-time performance and effectiveness of information search and meets the core retrieval requirement of public opinion applications. The microblog data warehouse specifically comprises:
a database storage service, which can store massive unstructured information and retrieve the information in the database at any time;
a data index service, which adopts a real-time dynamic indexing technology and ensures the real-time performance and effectiveness of information search.
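The idea of updating an index in place on every addition, deletion and modification, instead of rebuilding it, can be shown with a small in-memory sketch. The class and method names are hypothetical, and the sketch says nothing about how the actual data warehouse implements its real-time dynamic indexing.

```python
from collections import defaultdict


class LiveIndex:
    """Inverted index that is updated incrementally on add, modify and delete."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.docs = {}                    # doc id -> set of terms currently indexed

    def add(self, doc_id, text):
        """Add a document; re-adding an existing id acts as a modification."""
        if doc_id in self.docs:
            self.delete(doc_id)
        terms = set(text.lower().split())
        self.docs[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document by touching only the postings of its own terms."""
        for term in self.docs.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


index = LiveIndex()
index.add(1, "flood warning issued")
index.add(2, "flood relief")
before = index.search("flood")

index.add(1, "storm warning issued")  # modification of document 1
index.delete(2)                       # deletion of document 2
after = index.search("flood")
print(before, after, index.search("storm"))
```

Each change costs time proportional to the size of the changed document, not the size of the whole collection, which is why the new state is searchable immediately.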
The microblog early warning system has the following specific functions:
(1) Microblog information acquisition: data are collected from the three microblog systems Sina, Tencent and Twitter on the Internet and sent to (2) for analysis.
(2) Microblog analysis: information deduplication, propagation chain analysis, trend analysis and the like are performed on the acquired information; effective intelligence data are extracted and passed to (3) for intelligence mining and analysis.
(3) Information mining: the information is mined further, for example for target information and dynamics, and is then further processed through (4) and (5).
(4) Microblog service: the information required by the user is displayed through an interface according to the user's requirements; the functions available to the user include monitoring setting, topic tracking, hot people, emergencies, microblog searching, statistical analysis, online microblog, microblog early warning and the like.
(5) Microblog data warehouse: the mined information is stored in the microblog data warehouse, ready to be searched and used by the user at any time, which ensures the real-time performance and effectiveness of information search.
Compared with the prior art, the invention has the following advantages and beneficial effects. The microblog acquisition module acquires data from the three microblog systems Sina, Tencent and Twitter on the Internet and passes the data to the microblog analysis module for information deduplication, trend analysis and the like. After effective information is extracted, it is displayed to the user through an interface; the functions available to the user include monitoring setting, topic tracking, hot people, emergencies, microblog searching, statistical analysis, online microblog, microblog early warning and the like. The interface is user-friendly and offers many functions: the microblog systems can be monitored comprehensively, hot topics can be fed back in real time, and extreme remarks can be tracked and flagged for early warning. The system adopts an intelligent information acquisition technology with intelligent distributed cooperative crawlers; the number of crawler servers and the number of crawlers can be dynamically configured, and the computing resources used for acquisition are dynamically increased or decreased according to the acquisition requirements. The system acquires microblog information from the Internet through a crawler module in a webpage acquisition subsystem; the crawler module can be configured with the number of crawlers, the acquisition speed, the initial URL, a regular expression for URLs that meet the acquisition requirements, the crawler thread termination condition and other constraints.
For each acquired webpage, a webpage cleaning module removes noise data such as advertisements, navigation information, pictures and copyright descriptions, extracts data such as microblog texts, link addresses and acquisition time from the related webpage, and stores the data in a database.
The following operations are performed on each piece of microblog data acquired by the microblog search engine:
step 1-1) the acquired data is stored mainly as two types: user data (User) and microblog data (Tweet);
step 1-2) a relational database is used to store the User and Tweet data for subsequent association queries;
step 2-1) Chinese word segmentation technology is used to segment the microblog content field (content) of the Tweet data;
step 2-2) full-text retrieval technology is used to build an inverted index that data analysis can query;
step 2-3) while the index is built for the content field, the content tags enclosed by "#" signs or "[ ]" brackets are extracted from the content;
step 2-4) an inverted index is also built for the tag field;
step 3-1) a timer program is set up to query the Tweet data every hour and count all tag data collected within the past hour, with the query condition time = [now() - 1h TO now()] & facet.field = tag;
step 3-2) the tags are sorted in descending order of their data count tag_count, and the top 100 tags are taken;
step 4-1) the 100 tags extracted in step 3-2) are traversed and segmented using Chinese word segmentation technology, each resulting word being a term;
step 4-2) the full-text retrieval server is then queried. When a tag yields 3 or fewer terms, all terms must match; when it yields more than 3 terms, at least 75% of the terms must match. For 3 or fewer terms, the query condition is (content = term1 AND term2 AND term3) & time = [now() - 24h TO now()]; for more than 3 terms, the query condition is (content = (term1 AND term2 AND term3) OR (term4 OR term5 …)) & time = [now() - 24h TO now()];
step 4-3) the current microblog data corresponding to the 100 tags is queried in this way, and the tags are then sorted in descending order of the number of corresponding microblogs, t_count, to obtain the 100 hot topics of the day.
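The tag extraction, hourly ranking and term-matching rule in the steps above can be sketched in a few lines. This is an illustrative in-memory approximation, not the patented implementation: it replaces the full-text retrieval server with plain Python, assumes the terms have already been produced by a word segmenter, and rounds the 75% threshold up. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Matches content enclosed by "#...#" or "[...]" (step 2-3).
TAG_PATTERN = re.compile(r"#([^#]+)#|\[([^\[\]]+)\]")


def extract_tags(content):
    """Return the tags enclosed by '#' signs or '[ ]' brackets."""
    return [a or b for a, b in TAG_PATTERN.findall(content)]


def top_tags(contents, limit=100):
    """Count tags over a batch of microblogs and return the most frequent
    ones as (tag, tag_count) pairs (steps 3-1 and 3-2)."""
    counts = Counter(tag for text in contents for tag in extract_tags(text))
    return counts.most_common(limit)


def min_should_match(term_count):
    """Number of terms that must match (step 4-2): all of them for 3 or
    fewer terms, otherwise at least 75% (rounding up is an assumption)."""
    if term_count <= 3:
        return term_count
    return math.ceil(0.75 * term_count)


def matches(content_terms, tag_terms):
    """Apply the step 4-2 rule to one microblog, given its set of terms."""
    unique_terms = set(tag_terms)
    hits = sum(1 for term in unique_terms if term in content_terms)
    return hits >= min_should_match(len(unique_terms))


posts = [
    "#world cup# what a match [goal]",
    "#world cup# again",
    "nothing here",
    "[goal] #transfer news#",
]
ranking = top_tags(posts, limit=2)
```

In a deployment built on a full-text search server, the counting would normally be done by a facet query on the tag field and the 75% rule by the server's own partial-match facility, so the functions here only mirror the logic for readability.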
The advantages of the invention are as follows. Cluster analysis technology improves the accuracy of the retrieval results for current microblogs. The calculation method used for analysis and statistics is simple and efficient and markedly improves real-time performance, so the microblog systems can be monitored comprehensively and promptly, hot topics can be fed back in real time, and extreme speech can be intelligently tracked and flagged for early warning.
In one embodiment, the collector may collect microblog messages periodically. However, periodically collecting from all users makes the collector inefficient: a large share of microblog users post infrequently, for example updating once every few days, and if there are many such users, polling each of them at a fixed short interval, for example every 3 minutes, greatly reduces efficiency.
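A short calculation makes the efficiency loss concrete. The sketch below assumes a user who posts once every 3 days (one possible reading of "every few days") and a fixed 3-minute polling interval; the function name is hypothetical.

```python
def polls_per_new_post(poll_interval_minutes, posting_period_days):
    """Average number of polls issued per new post for a single user."""
    minutes_between_posts = posting_period_days * 24 * 60
    return minutes_between_posts / poll_interval_minutes


# Assumed values: poll every 3 minutes, user posts once every 3 days.
polls = polls_per_new_post(3, 3)
wasted = polls - 1  # polls that return nothing new
```

Under these assumptions the collector issues 1440 polls for every new post from such a user, and all but one of them return nothing new.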
Various modifications and changes may be made to the present invention by those skilled in the art. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
Claims (7)
1. A microblog hot topic analysis method, characterized by comprising the following steps:
step one, a microblog acquisition module acquires microblog data according to an acquisition strategy by combining a web crawler with third-party microblog API technology;
step two, keywords and sensitive words are retrieved from a word bank and identified in the microblog text data by using word segmentation technology;
step three, the microblog web page text data is filtered according to the identified keywords, sensitive words and emotional tendency words, and the filtering records are stored;
step four, a hot topic module uses cluster analysis technology to mark content enclosed between "# #" and "[ ]" symbols as a topic, and determines the current hot topics from statistics such as the number of microblog comments and the number of forwards, thereby improving the accuracy of topic analysis;
step five, a hot figure module uses cluster analysis technology to analyze the number of microblog followers and the number of comments to determine the hot figures under specified conditions;
step six, a microblog early warning module identifies microblog information related to the keywords and sensitive words from online microblogs and promptly sends an early warning notification to the user;
step seven, an analysis and statistics module automatically generates a brief report from the related data analyzed in the system for analysis and use.
2. The microblog-based emergency analysis method according to claim 1, wherein the data collected in step one includes not only data from the domestic Sina and Tencent microblogs but also data from the foreign Twitter microblog.
3. The microblog-based emergency analysis method according to claim 1, wherein the keywords in step two include, in addition to the sensitive words specified by relevant national laws and regulations, keywords and sensitive words defined by the user.
4. The microblog-based emergency analysis method according to claim 1, wherein in step four, not only the content of a hot topic of interest but also its source and propagation trend can be viewed.
5. The microblog-based emergency analysis method according to claim 1, wherein the early warning notification in step six is sent by e-mail, website prompt, or mobile phone.
6. The microblog-based emergency analysis method according to claim 1, wherein in step seven, after the required information is analyzed, the user of the microblog system is bound to the system through a microblog account.
7. The microblog-based emergency analysis method according to claim 1, wherein the microblog hot topic analysis method is applied to a microblog early warning system, and the microblog early warning system comprises a microblog acquisition module, a microblog analysis module, a microblog service module and a microblog data warehouse.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310284081.8A CN104281607A (en) | 2013-07-08 | 2013-07-08 | Microblog hot topic analyzing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104281607A true CN104281607A (en) | 2015-01-14 |
Family
ID=52256483
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310284081.8A Pending CN104281607A (en) | 2013-07-08 | 2013-07-08 | Microblog hot topic analyzing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104281607A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20150114 |