WO2020000847A1 - News big data-based method and system for monitoring and analyzing risk perception index - Google Patents
- Publication number
- WO2020000847A1 (PCT/CN2018/113857)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- social
- panic
- data
- index
- news
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- the invention belongs to the technical field of big data monitoring analysis and emotion measurement, and particularly relates to a panic index monitoring analysis method and system based on news big data.
- GRPI is largely independent of indexes such as the S&P 500 when measuring panic.
- GRPI has a completely separate data warehouse. It uses global news big data and is calculated through complex algorithms. The efficiency of the GRPI calculation depends greatly on the scale and structure of the big data.
- GRPI, the Global Panic Index (Global Risk Perception Index), is an index standard used to measure how strongly the panic of global media and netizens about events fluctuates, both historically and in real time. It is calculated using media report data and netizens' social activity track data.
- Social panic refers to widespread fear and anxiety caused by something unexpected, such as the public safety panic caused by "911" or the public health panic caused by "avian flu". The degree of public panic also changes with time and other factors, and these events have affected the global public to varying degrees through various media.
- Internet public opinion consists of the influential and tendentious opinions that the public expresses over the Internet on hot spots and focus issues in real life. It is mainly reflected and strengthened through online news and social media.
- Hadoop is an open source distributed computing platform, and its core includes HDFS (Hadoop Distributed File System).
- The high fault tolerance and high scalability of HDFS allow users to deploy Hadoop on low-cost hardware, build distributed clusters, and form distributed systems.
- HBase (Hadoop Database) is a distributed database system built on the distributed file system HDFS that provides high reliability, high performance, column storage, scalability, and real-time reads and writes. It is mainly used to store unstructured and semi-structured loose data.
- Existing big data-based Internet public opinion monitoring and analysis systems obtain the network information gathered by a network information collection module through a network information collation module, and perform keyword extraction on that information.
- the website credit evaluation module evaluates the website of the network information source in real time, and the public opinion tendency analysis module refers to the resulting trust weight when calculating the emotional tendency.
- Existing monitoring of the degree of panic caused by special social events mainly focuses on crowd panic caused by emergencies in the field of public safety, including image analysis of surveillance pictures and simulation of crowd behavior in panic situations.
- Such monitoring is micro-level monitoring of the panicked crowd. It is mainly used to keep the consequences of an accident from escalating, for example into a stampede, and is not applicable to major events.
- the current global monitoring system indicators are not comprehensive enough and the calculation accuracy is low.
- the present invention provides a method and system for monitoring and analyzing network emotional fluctuations based on news big data, which aims to solve the following problems: current methods monitor the degree of panic caused by special social events only from a micro perspective; the monitoring results for panicked crowds are mainly used to limit the escalation of accidents such as trampling, so the impact of an event on society as a whole cannot be observed macroscopically; and the indicators of current monitoring systems are not comprehensive enough, so accuracy is low.
- the present invention is achieved in this way.
- a method for monitoring and analyzing Internet emotional fluctuations based on news big data adopts social risk amplification theory and the theory of panic psychological transmission, divides the panic index into dimensions and indicators, and performs real-time statistics on the data in the database to obtain the specific value of each indicator;
- the calculation model of the panic index is established by using the neural network model, the corpus is input, the weights of each dimension are matched by machine learning, and the index is comprehensively calculated to determine the panic index.
- the method for monitoring and analyzing online mood fluctuation based on news big data specifically includes:
- Step 1: Establish a massive database, including:
- a) Collection: use existing technology to collect news media and social media. Social media is collected in full, mainly with the Python programming language, through the open data interface of Weibo or Facebook, and is stored directly in the data center through message queues. News media is collected from domestic and overseas news web pages using the scheduler, collectors, task manager, text parsing, storage, data management, and so on: a given news data source is traversed to find the list pages for further collection, the scheduler sends the list pages to each collector through the task manager, and the collector crawls the list page to get the HTML page of the article.
- b) Processing: the data is structured through the data management algorithm of the data center to obtain structured data;
- c) Storage: the structured data is stored in the NoSQL database, and the data in the database is then imported into the data governance module through the message queue; the data governance module attaches the corresponding labels to the data through the governance algorithm and applies a mainstream sentiment algorithm to each article to obtain sentiment labels. News media tags include title, abstract, body text, keywords, time, and emotion; social media tags include account name, post content, repost content, comment content, likes, fans, post time, emotion, etc. The processed data can be used for data calling, mining, and machine learning based on further needs.
- d) Use big data to collect all relevant media information to form a database: collect all news and social media data that include the monitoring keywords.
- Step 2: Divide the dimensions and indicators according to the panic index and perform real-time statistics to obtain the specific values of each indicator;
- Step 3: Use the neural network model to build a calculation model of the panic index, input the existing corpus, and rely on machine learning to train the panic index model repeatedly. This includes:
- A. Neural network-based machine learning model construction: before the neural network model is used to build a calculation model of the panic index, the data needs to be normalized; according to the normalized features, the model is trained using a multi-layer fully connected neural network structure, where Layer L1 is the input layer, which represents the value corresponding to each feature; Layer L2 is the hidden layer, in which the hidden features are calculated; and Layer L3 is the output layer, which outputs the final result;
- Another object of the present invention is to provide a security monitoring system using the method for monitoring and analyzing network mood fluctuations based on news big data.
- Another object of the present invention is to provide a security analysis system using the method for monitoring and analyzing network mood fluctuations based on news big data.
- Another object of the present invention is to provide a security early warning system using the method for monitoring and analyzing network mood fluctuations based on news big data.
- the integrated monitoring method for news media and social media of the present invention adds text-based semantic analysis and feature extraction on top of the statistics and analysis of public opinion volume, emotions, and hotspots. This enriches the dimensions of public opinion monitoring and allows the degree of panic around an event to be tracked and monitored more accurately, which resolves the problem of insufficiently comprehensive monitoring system indicators.
- the calculation time of the model parameters and weights of the present invention is equivalent to the current monitoring system time in practical applications, reducing the complexity of the model in actual application.
- the invention analyzes the panic degree of the public from the monitoring of network public opinion and assists various decision-making.
- the degree of social panic in related fields is an important indicator of market changes and an important criterion for investment development. If social panic is not valued, it may affect the survival of the company.
- the immediate monitoring of the degree of public panic in online public opinion has important practical significance for the active resolution of the crisis of online public opinion, the maintenance of social stability, and the promotion of national development.
- the present invention uses crawler technology and other data sources to cover the network and other types of data. Computers are used for automatic collection, intelligent analysis, all-round structuring, and mass storage, which solves the problem of massive coverage, analysis, and accumulation of information sources.
- the invention continuously updates its reserve data as a basis for iterative algorithm learning; the monitoring process takes the keywords entered by the user as the core and counts dimensions such as time, content, quantity, and identity in the dissemination of public opinion. By comprehensively analyzing the transmission characteristics and the multi-factor and joint effects of panic in the dissemination of public opinion, the monitoring results become more accurate.
- the present invention compares real-time and historical data with a database through semantic analysis technology, covering more details of public opinion and analyzing the content tendency of users more comprehensively, which benefits monitoring of the degree of panic.
- the invention collects and analyzes massive data through big data technology, expands the analysis of sample data and cases, and makes full use of a large number of historically accumulated cases. Starting from the social amplification theory of risk, it divides panic into statistical models of multiple indicators and then generates the panic index calculation model through neural network learning, which is more scientific and reasonable.
- the statistical indicators and calculation models have been continuously improved and reached a certain degree of accuracy.
- the present invention adds a statistical module and a calculation module on top of big data network monitoring technology (a "collection and pre-processing system" based on the Hadoop distributed computing platform, which collects network data and then performs data pre-processing).
- these modules use the preset monitoring indicators, standardized statistical models, and neural network intelligent algorithm models to monitor the degree of panic in the public opinion expressed by the public in a specific event occurrence and development scenario.
- the present invention integrates automatic collection and feature extraction to determine multiple dimensions and multiple monitoring indicators of an event. Through statistical analysis of news and social media text information obtained within a certain time range, the real-time panic index of a specific event is obtained. Through the data service provided by the present invention, the government, enterprises, and related organizations can grasp changes in the panic index of an incident immediately, and can respond reasonably and in time when the panic value exceeds a certain range.
- the present invention monitors social panic caused by events from a macro perspective through big data real-time collection technology, big data database technology, big data processing and statistical technology, and neural network algorithms.
- the invention overcomes the shortcomings of existing manual methods, which inefficiently rely on knowledge and experience to sort, discriminate, and analyze the data presented by a big data monitoring system; it expands the traditional monitoring scope of social panic, which is no longer limited to panic at a specific time and place. It offers another way to measure the level of social panic: big data and semantic analysis technology combined with neural network algorithms, which greatly improves the recognition accuracy, discrimination efficiency, and applicable scenarios of social panic monitoring.
- FIG. 1 is a flowchart of a method for monitoring and analyzing a panic index based on news big data provided by the present invention.
- FIG. 2 is a diagram for training a model using a multi-layer fully connected neural network structure according to normalized features provided by the implementation of the present invention
- Layer L 1 is the input layer, which represents the value corresponding to each feature
- Layer L 2 is the hidden layer, and the hidden features are calculated
- Layer L 3 is the output layer.
- FIG. 3 is a diagram of a forward propagation algorithm for outputting a final result in a training model using a multi-layer fully connected neural network structure provided by the implementation of the present invention.
- FIG. 4 is a schematic diagram of an operation process in the calculation of the global panic index provided by the implementation of the present invention.
- FIG. 5 is a schematic diagram of a panic index monitoring and analysis system based on news big data provided by the present invention.
- the present invention adds text-based semantic analysis and feature extraction on top of the statistics and analysis of global volume, emotion, and content. The features include the subject and length of news reports and social commentary, news and social sentiment, regional characteristics, transmission time and path, and stock market indexes. This enriches the dimensions of big data monitoring, allows sensitive and accurate tracking and monitoring of panic levels for the global network as a whole, for specific events, or within a specified time range, and resolves the problem that current global monitoring system indicators are not comprehensive enough and calculation accuracy is low.
- the calculation time of the model parameters and weights in the present invention is equivalent to the current monitoring system time in practical applications, reducing the complexity of the model in actual application.
- the invention collects data through information channels such as the network, and establishes a corresponding database.
- the database construction method is as follows. A thesaurus is established by linguistic experts: (1) construct a multilingual thesaurus for words that express "protesting behavior", "radical speech", "political controversy", and "political mobilization"; (2) construct a multilingual thesaurus for expressions including "collective petitions", "collective strikes", "violent group fights", "violent assaults", "political rallies", "demonstrations", "ethnic conflicts", "religious conflicts", and "turmoil"; (3) construct a multilingual thesaurus for vocabulary expressing helplessness, such as "helplessness", "can't change", and "what can be done".
- the present invention quantifies the concept of panic into a panic index, and divides the measurement of panic into multiple dimensions and indicators based on a statistical model of data based on the theory of social risk amplification and the spread of panic psychology.
- a neural network model is used to build a panic index calculation model. Input the existing corpus and rely on machine learning to form a complete algorithm.
- the panic index algorithm provided by the embodiment of the present invention is based on big data monitoring technology. After the data is collected, the data in the database is further processed. Based on theoretical foundations from sociology and psychology, the concept of social panic is quantified, so that the panic index becomes a measurable indicator of social panic. It can show the social psychological state more simply and conveniently and guide decision-making in all aspects of politics and the economy.
- a method for monitoring and analyzing a panic index based on news big data includes the following steps:
- S102: According to the theory of social risk amplification and the theory of panic psychology, the panic index is divided into multiple dimensions and multiple indicators, real-time statistics are performed on the data in the database, and the specific value of each indicator is obtained.
- S103: Use the neural network model to build a calculation model of the panic index, input the existing corpus, and rely on machine learning to train the panic index model repeatedly so as to match the weights of the various indicators.
- In step S101, the method for forming a database mainly includes the following steps:
- the first step is to collect: use existing technology to collect news media and social media.
- Social media is mainly based on the Python programming language for full collection through Weibo or Facebook's open data interface, and it is directly stored in the data center through message queues.
- the collection of news media covers domestic and overseas news web pages and performs a breadth-first traversal of a given news data source using schedulers, collectors, task managers, text parsing, storage, data management, etc.
- the scheduler sends the list page to each collector through the task manager, and the collector crawls the list page to get the html webpage of the article.
- the second step is processing: through the data management algorithm of the data center, the data is structured to obtain structured data.
- the text parsing module performs text parsing on the html webpage, extracts the article title, article publication time, and article body content, and removes garbled characters in the article.
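The text parsing step can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The class name `ArticleParser`, the function `parse_article`, and the choice to read the title from `<title>` and the body from `<p>` elements are assumptions made for illustration; the patent does not specify how its parsing module locates these fields, and extraction of the publication time is left out because it depends on site-specific markup.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None      # tag currently being captured, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace left over from the markup
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._tag = None


def parse_article(html):
    """Return the article title and body text extracted from an HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": "\n".join(parser.paragraphs)}


# Hypothetical page used only to exercise the parser
page = (
    "<html><head><title> Flood warning issued </title></head>"
    "<body><p>Rivers rose  overnight.</p><p>Residents were evacuated.</p></body></html>"
)
article = parse_article(page)
```

A production collector would also need character-set detection and per-site rules for removing garbled characters, which this sketch does not attempt.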
- the third step is to store: store the processed data in the NoSQL database, and import the data in the database to the data management module through the message queue.
- the data management module attaches the corresponding labels to the data through the management algorithm, and applies a mainstream sentiment algorithm to each article to obtain sentiment labels.
- News media tags include title, abstract, body text, keywords, time, and emotion; social media tags include account name, post content, repost content, comment content, likes, fans, post time, emotion, etc.
- the processed data can be used for data calling, mining, and machine learning based on further needs.
- the fourth step is to use big data to collect all relevant media information to form a database.
- each collected item is tagged with its data content, publishing media or social network publishing account, publishing time, and regional attributes, and is put into the database.
- $W = \{N, S\}$.
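As a rough sketch of how the tagged records and the keyword-based selection might be represented, the following uses plain dataclasses. The class names, field names, numeric sentiment values, and the `select` method are all hypothetical; the field list follows the news and social media tags named in the text, and `N` and `S` stand for the news and social media sets used in the later indicator definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class NewsRecord:
    title: str
    abstract: str
    body: str
    keywords: List[str]
    published_at: datetime
    sentiment: float          # assumed numeric sentiment label
    region: str = ""


@dataclass
class SocialRecord:
    account_name: str
    post_content: str
    published_at: datetime
    sentiment: float          # assumed numeric sentiment label
    likes: int = 0
    fans: int = 0
    region: str = ""


@dataclass
class MediaDatabase:
    news: List[NewsRecord] = field(default_factory=list)
    social: List[SocialRecord] = field(default_factory=list)

    def select(self, keyword):
        """Return (N, S): news and social records that contain the monitoring keyword."""
        key = keyword.lower()
        n = [r for r in self.news if key in (r.title + " " + r.body).lower()]
        s = [r for r in self.social if key in r.post_content.lower()]
        return n, s


# Hypothetical sample data
db = MediaDatabase()
db.news.append(NewsRecord("Unmanned driving trial halted", "", "Regulators paused the trial.",
                          ["unmanned driving"], datetime(2018, 1, 5), -1.0))
db.news.append(NewsRecord("Markets steady", "", "Stocks were flat.",
                          ["markets"], datetime(2018, 1, 6), 0.0))
db.social.append(SocialRecord("user_a", "Is unmanned driving safe?", datetime(2018, 1, 5), -0.5))

N, S = db.select("unmanned driving")
```

In the system described here the records would live in a NoSQL store rather than in memory; the in-memory lists only keep the example self-contained.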
- step S102 the eight dimensions divided by the panic index are: degree of harm, degree of attention, degree of concentration, degree of subjectivity, degree of out of control, degree of strangeness, degree of agitation, and degree of trust.
- the specific calculation methods of the eight dimensions are as follows:
- the direct harm caused by the panic incident includes the number of people affected, the number of casualties, the size of the affected area, the length of time affected (whether it will harm future generations), and the direct economic loss and social consequences.
- $a_1 = \arg\max(TF_{a_1})$;
- the keyword "$a_2$ people are seriously injured" is captured in N
- $TF_{a_2}$ represents the frequency of occurrence of each $a_2$
- the keyword "$a_3$ deaths" is captured
- $TF_{a_3}$ represents the frequency of occurrence of each $a_3$
- $a_3 = \arg\max(TF_{a_3})$ represents the value of $a_3$ with the highest frequency of occurrence.
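The argmax-over-frequency rule for the casualty indicators can be sketched as follows: every number captured by a keyword pattern is counted, and the most frequently reported number is taken as the indicator value. The function name `most_frequent_count`, the English regular expression, and the sample sentences are assumptions for illustration; the patent does not give the actual capture patterns.

```python
import re
from collections import Counter


def most_frequent_count(articles, pattern):
    """Return the captured number reported most often across the articles, or None."""
    counts = Counter()
    for text in articles:
        for match in re.findall(pattern, text):
            counts[int(match)] += 1   # term frequency of each candidate value
    if not counts:
        return None
    # argmax over the term frequencies
    return counts.most_common(1)[0][0]


# Hypothetical sample of news texts N
news = [
    "Officials said 12 people were injured in the blast.",
    "At least 12 people were injured, local media reported.",
    "Early reports claimed 9 people were injured.",
]
a1 = most_frequent_count(news, r"(\d+) people (?:were )?injured")
```

Taking the most frequent value rather than the largest one makes the indicator robust to a single outlying report, which appears to be the intent of the argmax definition.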
- Z: "town / county / township / district"
- S: "city"
- Sh: "province"
- G: "country"
- the protest vocabulary lexicon K is a collection of lexicons explaining the incident at the stage of public opinion.
- the resistance vocabulary lexicon F is a collection of lexicons that describe the rise of an event into an action phase.
- Attention is used to calculate the degree of news media attention brought by the panic incident.
- the number of relevant news reports $b_1$ is the number of news media articles in which the set keywords appear; given $N = \{n_1, n_2, n_3, \ldots, n_n\}$, $b_1$ is the count of $N$:
- the number of related social discussions $b_2$ is the number of social media posts in which the set keywords appear; given $S = \{s_1, s_2, s_3, \ldots, s_m\}$, $b_2$ is the count of $S$:
- $T_{N_n}$: time of the current news report
- $T_{N_1}$: time of the first relevant news report
- $T_{S_m}$: time of the current relevant social media
- $T_{S_1}$: time of the first relevant social media
- the suddenness of panic outbreaks is reflected in the sharp increase or slow increase in the number of reports.
- This indicator uses the concept of half-life statistics, that is, the time taken from the beginning of the report to half of the total of all reports so far.
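The half-life idea can be made concrete with a small sketch: sort the report timestamps and measure the time from the first report to the report at which half of the current total has been reached. The function name `report_half_life` and the rounding rule for odd totals (round up) are assumptions; the text does not state how ties or odd counts are handled.

```python
from datetime import datetime


def report_half_life(timestamps):
    """Time from the first report until half of all reports so far had appeared."""
    if not timestamps:
        raise ValueError("at least one report is required")
    ordered = sorted(timestamps)
    # Index of the report that brings the running count to half of the total
    half_index = (len(ordered) + 1) // 2 - 1
    return ordered[half_index] - ordered[0]


# Hypothetical report times: three reports within hours, one two days later
reports = [
    datetime(2018, 1, 1, 8),
    datetime(2018, 1, 1, 9),
    datetime(2018, 1, 1, 10),
    datetime(2018, 1, 3, 8),
]
half_life = report_half_life(reports)
```

A short half-life relative to the total elapsed time indicates a sharp outbreak, while a half-life close to the total duration indicates slow growth.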
- the sentiment d 1 of the relevant social comment is the mean sentiment of the social media content where the keywords appear:
- the sentiment value specifies the vocabulary, where r is the frequency (number of articles) corresponding to each E value, and m is the total number of social media articles.
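The formula for $d_1$ is not reproduced in this text, so the following is only a sketch under the assumption that it is the frequency-weighted mean $d_1 = \frac{1}{m}\sum E \cdot r$, which is what the definitions of $E$, $r$, and $m$ suggest. The function name and the five-level sentiment scale from -2 to 2 are hypothetical.

```python
def mean_sentiment(value_counts):
    """Frequency-weighted mean sentiment.

    value_counts maps each sentiment value E to r, the number of articles
    that carry that value; m is the total number of articles.
    """
    m = sum(value_counts.values())
    if m == 0:
        raise ValueError("no social media articles")
    return sum(e * r for e, r in value_counts.items()) / m


# Hypothetical distribution over an assumed scale from -2 (strongly negative) to 2
d1 = mean_sentiment({-2: 10, -1: 30, 0: 40, 1: 15, 2: 5})
```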
- $P_{UC>2}$ indicates the number of articles in which "any vocabulary in the set UC appears twice or more", and $m$ indicates the total number of social media articles.
- $P_{UK>2}$ indicates the number of articles in which "any vocabulary in the set UK appears twice or more", and $m$ indicates the total number of social media articles.
- $P_{DU>2}$ indicates the number of articles in which "any vocabulary in the set DU appears twice or more", and $m$ indicates the total number of social media articles.
- AN = {"worried", "annoying", "anxious", ...}
- $P_{AN>2}$ indicates the number of articles in which "any vocabulary in the set AN appears twice or more", and $m$ indicates the total number of social media articles;
- $P_{BL>2}$ indicates the number of articles in which "any vocabulary in the set BL appears more than twice"
- $m$ indicates the total number of articles on social media
- $P_{PR>7}$ indicates the number of articles in which "any vocabulary in the set PR appears twice or more"
- $m$ indicates the total number of articles in social media.
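The lexicon-based indicators above all pair a count of matching articles with the total number of social media articles $m$. A minimal sketch, assuming that each indicator is the ratio of the two and that "appears twice or more" refers to a single lexicon word occurring at least twice in one article, might look like this. The function name `lexicon_share`, the substring-based counting, and the sample posts are illustrative assumptions.

```python
def lexicon_share(articles, lexicon, min_hits=2):
    """Share of articles in which some lexicon word occurs at least min_hits times."""
    m = len(articles)
    if m == 0:
        return 0.0
    p = sum(
        1
        for text in articles
        if any(text.lower().count(word) >= min_hits for word in lexicon)
    )
    return p / m


# Lexicon entries taken from the AN example; the posts are hypothetical
AN = {"worried", "annoying", "anxious"}
posts = [
    "I am worried, really worried about this.",
    "Anxious but hopeful.",
    "Nothing to see here.",
    "So annoying. Annoying and anxious!",
]
share = lexicon_share(posts, AN)
```

Substring counting is a simplification that works for this English sample; multilingual text would need proper tokenization before counting.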
- $A = \{A_1, A_2, \ldots, A_u\}$
- A: all account names in the set
- G: thesaurus representing official identity
- $A = \{A_1, A_2, \ldots, A_u\}$
- A: a database of all social account names
- X: a database of all scholarly vocabulary
- a calculation model of the panic index is established by using a neural network model, inputting an existing corpus, and relying on machine learning to train the panic index model repeatedly, specifically including:
- (1) Machine learning model based on a neural network:
- $x$ is the value of the current sample
- $x_{mean}$ represents the average value of the current feature
- $x_{max}$ represents the maximum value in the current sample
- $x_{min}$ represents the minimum value of the current sample
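The normalization formula itself is missing from this text. The symbols defined above are those of mean normalization, so the sketch below assumes $x' = (x - x_{mean}) / (x_{max} - x_{min})$ applied per feature; this is an assumption, not a formula confirmed by the source, and the function name is hypothetical.

```python
def mean_normalize(values):
    """Scale one feature column with (x - mean) / (max - min)."""
    x_mean = sum(values) / len(values)
    span = max(values) - min(values)
    if span == 0:
        # A constant feature carries no information; map it to zero
        return [0.0 for _ in values]
    return [(x - x_mean) / span for x in values]


# Hypothetical raw values of one indicator across three samples
normalized = mean_normalize([2.0, 4.0, 6.0])
```

After this scaling every feature is centered on zero and spans a range of width one, which keeps indicators with very different units comparable at the input layer.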
- Layer L 1 is an input layer, and in the present invention represents a value corresponding to each feature.
- Layer L 2 is a hidden layer that calculates hidden features.
- Layer L 3 is the output layer and outputs the final result.
- $z^{(l)} = W^{(l-1)} x^{(l-1)} + b^{(l-1)}$;
- $l$ is the index of the current layer
- $L$ is the last layer
- $x^{(1)}$ is the feature of the input
- $W$, $b$ are the weights and offsets
- $h_{W,b}(x)$ is the output.
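Forward propagation through the fully connected layers can be sketched in plain Python. Each layer computes the weighted sum of the previous layer's output plus an offset and then applies an activation function. The sigmoid activation, the function names, and the example weights are assumptions; the text does not name the activation function it uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Propagate input x through a list of (W, b) layers and return the output.

    W is a list of rows (one row per unit in the next layer) and b is the
    matching list of offsets.
    """
    activation = x
    for W, b in layers:
        z = [
            sum(w_ij * a_j for w_ij, a_j in zip(row, activation)) + b_i
            for row, b_i in zip(W, b)
        ]
        activation = [sigmoid(z_i) for z_i in z]   # assumed activation
    return activation


# Hypothetical network: 3 input features (L1), 2 hidden units (L2), 1 output (L3)
layers = [
    ([[0.4, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.2, -0.7]], [0.05]),
]
output = forward([0.2, -0.1, 0.4], layers)
```

The single value in `output` plays the role of $h_{W,b}(x)$; with a sigmoid output unit it always lies strictly between 0 and 1.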
- the multi-layer fully connected neural network training model uses the back-propagation algorithm to optimize the objective function and obtain the optimal model:
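The objective function is not reproduced in this text, so the following sketch assumes a squared-error loss between the network output and an expert score, a sigmoid activation, and plain stochastic gradient descent. It trains a network with one hidden layer by back-propagation; the function names, hyperparameters, and training samples are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, hidden=3, lr=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer network on (features, target) pairs.

    Returns the learned parameters and the mean loss recorded after each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in samples:
            # Forward pass
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(W1, b1)]
            out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
            total += 0.5 * (out - y) ** 2
            # Backward pass: error terms for the output and hidden layers
            d_out = (out - y) * out * (1.0 - out)
            d_h = [d_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Gradient descent update
            for j in range(hidden):
                W2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_h[j] * x[i]
                b1[j] -= lr * d_h[j]
            b2 -= lr * d_out
        losses.append(total / len(samples))
    return (W1, b1, W2, b2), losses


# Hypothetical normalized indicator pairs with expert scores in [0, 1]
samples = [([0.0, 0.0], 0.1), ([1.0, 0.0], 0.5), ([0.0, 1.0], 0.5), ([1.0, 1.0], 0.9)]
params, losses = train(samples)
```

Comparing `losses[0]` with `losses[-1]` shows whether training reduced the error; a real model would be trained on the labeled corpus rather than on four toy samples.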
- for panic prediction, the corresponding features are extracted by the feature extraction method; the extracted features are fed to the input layer of the trained multi-layer fully connected neural network, the result of each layer is obtained through the forward propagation algorithm and used as the input of the next layer, and the final panic value is obtained.
- Machine learning based on neural network models is trained through a corpus.
- the corpus contains the following fields (23 in total), each field sets a different event theme, and searches for text.
- the labeled corpus is used for training of sentiment analysis models based on machine learning.
- the distribution of categories in the labeled corpus should be as uniform as possible, suitable for training of classifiers, and most articles in the labeled corpus must have polarity. Since most corpus resources come from the Internet, there are often some irregularities in its encoding, format, and content, such as: abuse of punctuation marks, multiple spaces, typos, and so on. Therefore, it is necessary to correct these irregular formats before labeling, and uniformly adopt UTF-8 encoding.
- Model training: put no fewer than 10,000 corpus items into the neural network-based machine learning model for learning and training, so that the result calculated from each group of a1 to h4 data equals the corresponding GRPI result of that group. The purpose is that, after the machine-computed results for the multiple indicators pass through the model, the output approaches the expert group's scoring results as closely as possible.
- Panic prediction: based on the trained model, for all news and social media texts at the current moment, the panic index calculation model uses feature extraction to extract the corresponding features; the extracted features are fed to the input layer of the trained multi-layer fully connected neural network, the result of each layer is obtained through the forward propagation algorithm and used as the input of the next layer, and the final panic value is obtained after the calculation of the three-layer model.
- a multi-layer fully connected neural network structure is used to train a model diagram.
- the present invention provides a forward propagation algorithm graph for outputting final results in a training model using a multi-layer fully connected neural network structure.
- GRPI's indicators are defined according to the principles of journalism, psychosocial risk, and Internet emotion fluctuation. First, a series of factors is chosen by selecting the dimensions and measures that have a strong influence on Internet emotion fluctuations, and their weights are measured and analyzed. Then, as the indicators are converted into mathematical models, large-scale machine learning is used to determine indicator weights that are consistent with the goal. Finally, the GRPI index is calculated through combined data mining and data analysis on global news big data, that is, by applying an existing neural network-based machine learning algorithm.
- the machine learning method is set as shown in Figure 3.
- Layer L1 corresponds to the multiple indicators
- Layer L2 is a hidden layer
- Layer L3, $h_{W,b}(x)$, corresponds to the output result.
- During machine training, a training scale of 10,000 groups is set: values are manually entered for 10,000 groups at Layer L1 and matched with 10,000 corresponding Layer L3 output values. The machine then learns in the hidden layer L2 of the network, matches the coefficient weights, and finally outputs the result.
- the main factors considered by the GRPI index are: global news volume and total news volume, global social volume and total social volume, the growth rates of global news volume and social volume, the growth rates of strongly or generally positive and negative news and social posts, trigger factors and keywords that measure the direct physical impact of an event (such as the number of deaths and injuries, the duration of the impact, economic loss, and the geographical scope affected), the degree of attention to the event or subject in news media reports, the concentration of topics about the event or subject, the cycle of decline, and a series of individual and group attitudes expressed through media and social network behavior, such as the perceived ability to control developments, the degree of agitation and anxiety, and the degree of trust in those in power.
- the running process specifically includes the following. When analyzing the panic index of a target event or subject, the background system first retrieves N articles related to the event or subject, then selects the corresponding dimensions based on the attributes of the event and performs the calculation. When predicting the future trend of the panic index, the background system applies statistical machine learning methods, such as regression or classification, to the historical data of the various factors, and then uses the machine learning model to compute the panic trend indicator.
- S203 The system performs data statistics on multiple index contents of corpus 1 to obtain data group C1;
- S205 Perform the second process S201, select corpus 2, and repeat the processes S201-S204 to obtain C2 and E2;
- S208 The model is ready for use after learning. Enter the subject / event subject / target area / target time; for example, enter "unmanned driving", set the panic index monitoring period to January 2018, and set the region range to any country;
- S210 The system extracts and counts data on the topic of "unmanned driving" in this area and time range to obtain a data group;
- S211 The statistical result is input into a machine, and the machine calculates a panic index on the subject of "unmanned driving”.
- a panic index monitoring and analysis system based on news big data includes:
- the database formation module 1 uses big data to collect all relevant media information and form a database;
- the panic index indicator value acquisition module 2 performs real-time statistics on the data in the database according to the multiple dimensions and multiple indicators into which the panic index is divided, to obtain the specific value of each indicator;
- the panic index acquisition module 3 uses a neural network model and a machine learning algorithm to match the weights of each dimension to form a complete model.
- the required data is retrieved from the database, the statistical indicators are calculated, and the panic index is output.
- the invention relies on big data resources from news media and social platforms in more than 200 countries and more than 60 languages worldwide, combines panic and social risk cognition models, applies the theoretical models at the algorithm level, and can calculate the panic index and monitor the development status of various customizable topics and events.
- the present invention is based on a mature theory of social panic and risk cognition.
- the calculation indicators of the panic index draw on the research results of top institutions at home and abroad on social panic and on network monitoring and measurement.
- algorithm update: on the one hand, leading experts and scholars at home and abroad regularly discuss and test the measurement of the indicators and propose amendments; on the other hand, through background technical means, the present invention combines a large amount of authoritative panic-related data and uses machine learning to compare and match it automatically, verifying and updating the algorithm weights.
- the invention adopts a customized visual application method, which can perform macro-monitoring, and realize real-time semantic retrieval and calculation in 5 major risk areas (environment, economy, society, regional politics and technology) and 30 panic topics.
- 30 panic topics: for example, in the economic field, it mainly monitors risk labels such as "energy price shocks", "unemployment", "asset bubble", "deflation", and "financial crisis"; in the technology field, it mainly monitors risk labels such as "network attacks" and "data fraud or theft".
- the invention can also perform micro-customization, and the panic monitoring platform can provide customized search according to the needs of different users.
- Panic monitoring platforms can provide decision-making references in national security, corporate security, corporate crisis emergency management, and industry development crisis scenarios.
- panic monitoring platforms can show expected changes in the market, such as the real estate industry prices and panic indexes in Hong Kong and Xi'an, respectively.
- since GRPI's algorithm incorporates most of the tags of news big data, the tags can be split out and used at the same time. For example, the named entity and semantic recognition technology behind GRPI can help people quickly obtain or compare, across all historical data, the scope of an earthquake and the extent of death and injury (grades 1-10) or the degree of protest discussed (grades 1-10), and analyze the relationship between fluctuations in online public sentiment and social phenomena such as immigration or economic phenomena such as the rise and fall of Bitcoin.
- the GRPI index, as an index for monitoring and predicting the public's emotional volatility, has a certain guiding role for the stock market.
- the data results show that there is a certain negative correlation between the two. That is, when the GRPI index of a listed company is at a high level, its stock price is often in a downward trend, and when the stock index is strengthened, the GRPI index is mostly at a low level.
- the extreme state of the GRPI index is particularly worthy of attention. When the GRPI is at a high extreme value, it usually indicates that major events are brewing or happening. For example, in the global GRPI monitoring shown in the picture above, the high extreme appeared in 2008, when the global financial crisis broke out.
- the method and system provided by the present invention may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
- the computer program product includes one or more computer instructions.
- when the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present invention are wholly or partially generated.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, a computer, a server, or a data center.
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
- the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A news big data-based method and system for monitoring and analyzing a risk perception index, comprising: applying the social amplification of risk theory and the theory of panic psychology propagation to perform real-time statistics on data in a database according to the dimensions and indicators into which the risk perception index is divided, so as to obtain specific values of the indicators; and using a neural network model to construct a calculation model of the risk perception index, inputting a corpus, and determining the risk perception index by using machine learning to match the weight of each dimension and comprehensively calculate the indicators. On the basis of big data monitoring technology, after the data is collected, the data in the database is further processed, and the concept of social panic is quantified on the theoretical basis of sociology and psychology, so that the risk perception index becomes a measurable indication of the degree of social panic. The present invention can show the psychological state of a society more simply and quickly and guide decision-making in politics, the economy, and other areas.
Description
The invention belongs to the technical field of big data monitoring and analysis and emotion measurement, and in particular relates to a method and system for monitoring and analyzing a panic index based on news big data.
Social panic refers to widespread fear and anxiety caused by something unforeseen, such as the public safety panic caused by "9/11" or the public health panic caused by "bird flu". Through various media, such events have affected the global public to varying degrees, and the degree of public panic changes with time and other factors. The well-known Chicago Board Options Exchange Volatility Index (VIX) and the China Volatility Index (iVIX) are, like GRPI, called "panic indexes". The biggest difference between them and the GRPI global panic index is that VIX and iVIX are calculated from the prices of near-month and next-month call and put options on indexes such as the S&P 500; that is, they are compiled from the implied volatility of options. GRPI, by contrast, is essentially independent of indexes such as the S&P 500 when measuring panic. It has a completely separate data warehouse and is computed from global news big data through complex algorithms. The efficiency of the GRPI calculation depends heavily on the scale and degree of structuring of the big data.
The GRPI global panic index (Global Risk Perception Index) is an index used to measure the overall degree of panic fluctuation among global media and netizens regarding events, both historically and at present. It is calculated from media report data and data on netizens' social activity. Social panic refers to widespread fear and anxiety caused by something unforeseen, such as the public safety panic caused by "9/11" or the public health panic caused by "bird flu"; through various media, these events have affected the global public to varying degrees, and the degree of public panic changes with time and other factors. Online public opinion consists of influential and tendentious statements and views, spread through the Internet, that the public holds on hot-button and focal issues in real life; it is mainly reflected and reinforced through online news and social media.
With the rapid development of the Internet, online media, as a new form of information dissemination, has penetrated people's daily lives. People are accustomed to using the Internet to receive and publish information. Once a major domestic or international event occurs, online public opinion forms immediately. People express views and spread ideas through the network, which sometimes creates enormous social force. Because online media is fast and highly interactive, social panic is concentrated in online public opinion. In recent years, with the development of big data technology, a number of big-data-based methods and systems for monitoring and analyzing online public opinion have emerged. For example, an existing big-data-based public opinion analysis method uses the Hadoop distributed computing platform to collect network data and then performs data preprocessing, hot event extraction, and public opinion analysis and deduction. Hadoop is an open source distributed computing platform whose core includes HDFS (Hadoop Distributed File System). The high fault tolerance and high scalability of HDFS allow users to deploy Hadoop on inexpensive hardware and build distributed clusters that form a distributed system. HBase (Hadoop Database) is a distributed database system built on HDFS that provides high reliability, high performance, column-oriented storage, scalability, and real-time reads and writes; it is mainly used to store unstructured and semi-structured loose data. An existing big-data-based Internet public opinion monitoring and analysis system uses a network information collation module to obtain the collected network information and extract keywords from it. A website credit evaluation module evaluates the source websites of the network information in real time, and a public opinion tendency analysis module refers to the class credit weights when calculating sentiment tendency. Public opinion information is obtained from the network, classified according to keywords, and judged as a whole according to sentiment tendency.
At present there are many monitoring systems for news or social media platforms at home and abroad, but their results stop at presenting hot news or topics and forecasting popularity trends. The information they provide is superficial, and making a decision still requires extensive manual analysis and processing. Monitoring of the degree of panic caused by special social events is concentrated mainly in the public safety field, on crowd panic caused by emergencies, and includes image processing and analysis of surveillance footage and simulation of crowd behavior in panic situations. Such monitoring observes panicking crowds at the micro level and is mainly used to keep the consequences of emergencies, such as stampedes, from escalating. It cannot observe the impact of an event on society as a whole at the macro level and is not applicable to events with large influence. In addition, the indicators of current global monitoring systems are not comprehensive enough, and their calculation accuracy is low.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method and system, based on news big data, for monitoring and analyzing online emotional fluctuations. It aims to solve the following problems: current methods monitor the degree of panic caused by special social events only at the micro level of panicking crowds, with results mainly used to keep the consequences of emergencies such as stampedes from escalating; they cannot observe the impact of an event on society as a whole at the macro level and are not applicable to events with large influence; and the indicators of current global monitoring systems are not comprehensive enough, with low calculation accuracy.
The present invention is implemented as follows. A method, based on news big data, for monitoring and analyzing online emotional fluctuations applies the social amplification of risk theory and the theory of panic psychology propagation to perform real-time statistics on the data in a database according to the dimensions and indicators into which the panic index is divided, obtaining specific values for the indicators. It then uses a neural network model to build a calculation model for the panic index, inputs a corpus, and determines the panic index by using machine learning to match the weight of each dimension and combine the indicators.
Further, the news big data-based method for monitoring and analyzing online emotional fluctuations specifically includes:
Step 1: establish a massive database, including:
a) Collection: existing technology is used to collect news media and social media. Social media is collected in full, mainly using the Python programming language, through the open data interfaces of Weibo or Facebook, and is stored directly in the data center through a message queue. News media is collected from domestic and overseas news web pages using a scheduler, collectors, a task manager, text parsing, storage, and data governance. A breadth-first traversal of the given news data sources finds the list pages to be collected further; the scheduler sends the list pages to the collectors through the task manager, and each collector crawls its list pages to obtain the HTML pages of the articles.
b) Processing: the data center's data governance algorithm structures the data to obtain structured data;
c) Storage: the structured data is stored in a NoSQL database, and the data in the database is then imported into the data governance module through a message queue. The data governance module uses governance algorithms to attach the corresponding labels to the data and applies a mainstream sentiment algorithm to each article to compute a sentiment label. For example, news media labels include title, abstract, body text, keywords, time, and sentiment; social media labels include account name, posted content, reposted content, comment content, number of likes, followers, posting time, and sentiment. Depending on further needs, the governed data can be used for data retrieval, mining, and machine learning.
d) Use big data to collect all relevant media information and form a database. All news and social media data containing the monitored keywords is collected; the collected fields include the data content, the publishing outlet or social network account, the publication time, and the regional attribute, and the data is placed in the database. Let the database collection be W, arranged in chronological order, where the news media information is N = {n_1, n_2, n_3, ..., n_n} and the social media information is S = {s_1, s_2, s_3, ..., s_m}, giving a combined data set W = {N, S}.
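The collection flow in a) and the data set in d) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the in-memory `SAMPLE_SITE`, the `Record` fields, and the function names are hypothetical, and no network access or message queue is modeled. It only shows a breadth-first traversal of list pages and the assembly of W = {N, S} in chronological order.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime

# Hypothetical in-memory stand-in for a news site: each list page links to
# further list pages and to article pages. A real collector would fetch HTML.
SAMPLE_SITE = {
    "list/1": {"lists": ["list/2"], "articles": ["a/1", "a/2"]},
    "list/2": {"lists": ["list/1"], "articles": ["a/2", "a/3"]},
}


def discover_articles(site, seed):
    """Breadth-first traversal of list pages; returns article URLs in discovery order."""
    seen_lists = {seed}
    articles = []
    queue = deque([seed])
    while queue:
        page = site[queue.popleft()]
        for url in page["articles"]:
            if url not in articles:      # skip articles linked from several lists
                articles.append(url)
        for nxt in page["lists"]:
            if nxt not in seen_lists:    # visit each list page only once
                seen_lists.add(nxt)
                queue.append(nxt)
    return articles


@dataclass(frozen=True)
class Record:
    """One collected item with the four fields named in step d)."""
    content: str
    source: str          # publishing outlet or social network account
    published: datetime
    region: str


def build_dataset(news, social):
    """Return W = {N, S}, with each sequence sorted chronologically."""
    def by_time(record):
        return record.published
    return {"N": sorted(news, key=by_time), "S": sorted(social, key=by_time)}
```

In this sketch the scheduler, task manager, and collectors are collapsed into a single queue; a distributed deployment would hand each dequeued list page to a separate collector.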
Step 2: perform real-time statistics according to the dimensions and indicators into which the panic index is divided, and obtain the specific value of each indicator;
Step 3: use a neural network model to build a calculation model for the panic index, input an existing corpus, and train the panic index model repeatedly through machine learning. This specifically includes:
A. Building the neural-network-based machine learning model: before the neural network model is used to build the calculation model for the panic index, the data must be normalized. Based on the normalized features, a multi-layer fully connected neural network is used to train the model, where Layer L_1 is the input layer, representing the value of each feature; Layer L_2 is the hidden layer, which computes hidden features; and Layer L_3 is the output layer, which outputs the final result;
B. Manually annotating a batch of corpora for machine learning: no fewer than 10,000 corpus documents are fed into the neural-network-based machine learning model for learning and training, so that the result computed from each group of a1 to h4 data equals the corresponding GRPI result for that group. The aim is that, after passing through the model, the machine's indicator calculation result approaches the expert panel's scoring result as closely as possible. Based on the trained model, for all news and social media text at the current moment, the panic index calculation model extracts the corresponding features using a feature extraction method. The extracted features are fed into the input layer of the multi-layer fully connected neural network, and the forward propagation algorithm produces a result that serves as the input to the next layer; after computation through the three layers, the final panic value is obtained.
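The normalization, three-layer network, and training against expert scores described in A and B can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented model: min-max scaling, sigmoid activations, a squared-error loss, plain stochastic gradient descent, the hidden-layer size, and all function names are choices made here for illustration; the source only specifies normalization, a fully connected input/hidden/output structure, and forward propagation.

```python
import math
import random


def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in rows]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, b1, w2, b2):
    """L_1 (input) -> L_2 (hidden, sigmoid) -> L_3 (single sigmoid output)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b) for ws, b in zip(w1, b1)]
    out = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, out


def train(xs, ys, hidden_size=4, lr=0.5, epochs=500, seed=0):
    """Fit the network to target scores in (0, 1) by per-sample gradient descent."""
    rng = random.Random(seed)
    n_in = len(xs[0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            hidden, out = forward(x, w1, b1, w2, b2)
            d_out = (out - y) * out * (1 - out)          # gradient at the output unit
            for j in range(hidden_size):
                # Hidden gradient uses w2[j] before it is updated below.
                d_h = d_out * w2[j] * hidden[j] * (1 - hidden[j])
                w2[j] -= lr * d_out * hidden[j]
                b1[j] -= lr * d_h
                for i in range(n_in):
                    w1[j][i] -= lr * d_h * x[i]
            b2 -= lr * d_out
    return w1, b1, w2, b2


def mean_squared_error(xs, ys, params):
    """Average squared gap between model output and target score."""
    return sum((forward(x, *params)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Here each row of `xs` stands in for one group of indicator values (the a1 to h4 data in the text) and each entry of `ys` for an expert panel score rescaled to the interval (0, 1); both mappings are assumptions of the sketch.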
Another object of the present invention is to provide a security monitoring system that uses the news big data-based method for monitoring and analyzing online emotional fluctuations.
Another object of the present invention is to provide a security analysis system that uses the news big data-based method for monitoring and analyzing online emotional fluctuations.
Another object of the present invention is to provide a security early warning system that uses the news big data-based method for monitoring and analyzing online emotional fluctuations.
The advantages and positive effects of the present invention are:
(1) The present invention integrates the monitoring of news media and social media. In addition to statistics and analysis of public opinion volume, sentiment, and hot topics, it adds text-based semantic analysis and feature extraction, which enriches the dimensions of public opinion monitoring, allows the degree of panic around an event to be tracked and monitored more accurately, and addresses the problem that the indicators of current monitoring systems are not comprehensive enough. Moreover, once the model is trained, the computation time for the model parameters and weights in practical use is comparable to that of current monitoring systems, which reduces the complexity of applying the model. The invention analyzes the degree of public panic from online public opinion monitoring to support various kinds of decision-making. For enterprises, the degree of social panic in a relevant field is an important indicator of market change and an important criterion for investment and development; neglecting social panic may threaten an enterprise's survival. For relevant government departments, real-time monitoring of the degree of public panic in online public opinion is of practical importance for actively resolving online public opinion crises, maintaining social stability, and promoting national development.
(2) The present invention uses crawler technology and other data sources to cover online and other types of data. Computer technology is used for automatic collection, intelligent parsing, comprehensive structuring, and mass storage of the data, which provides massive coverage of information sources and an accumulation of analysis cases. To improve the accuracy of the monitoring results, the invention continuously updates its reserve data and the basis for iterative algorithm learning. The monitoring process centers on the keywords entered by the user and compiles statistics on the time, content, quantity, identity, and other dimensions of public opinion dissemination. It comprehensively analyzes the dissemination characteristics of public opinion and the individual and combined effects of multiple factors on panic in that dissemination, so the monitoring results are more accurate.
(3) Through semantic analysis technology, the present invention compares real-time and historical data against the database, covers more details of public opinion, analyzes users' content tendencies more comprehensively, and gives a better grasp of the degree of panic. Collecting and analyzing massive data with big data technology enlarges the sample data and cases available for analysis and makes full use of the large number of historically accumulated cases. Starting from the social amplification of risk theory, panic is divided into a statistical model of multiple indicators, and a neural network then learns the panic index calculation model. This is more scientific and reasonable, and the statistical indicators and calculation model are continuously improved to reach a certain accuracy.
(4) On the basis of big data network monitoring technology (a "collection and preprocessing system" that collects network data on the Hadoop distributed computing platform and then preprocesses it), the present invention adds a statistics module and a calculation module. Using preset monitoring indicators, a standardized statistical model, and an intelligent neural network algorithm model, it monitors the degree of panic about a specific event that the public shows in online public opinion as the event occurs and develops.
(5) The present invention combines automatic collection and feature extraction to determine multiple dimensions and multiple monitoring indicators for an event. By statistically analyzing the news and social media text obtained within a given time range, it obtains a real-time panic index for a specific event. Through the data service provided by the invention, governments, enterprises, and related organizations can learn of changes in the event's panic index immediately and respond reasonably and promptly when the panic value exceeds a certain range.
(6) Building on a big data monitoring system, the present invention uses real-time big data collection, big data database technology, big data processing and statistics, and neural network algorithms to monitor, from a macro perspective, the social panic caused by events. It overcomes the drawbacks of existing practice, in which the data presented by a big data monitoring system is sorted, distinguished, and analyzed manually, which is inefficient and whose accuracy depends heavily on knowledge and experience. It also broadens the traditional scope of social panic monitoring beyond crowd panic at a specific time and place: it takes a different route, starting from the overall panic level of society, and uses big data, semantic analysis, and neural network algorithms to greatly improve the accuracy and efficiency of identifying social panic and the range of scenarios in which it applies.
FIG. 1 is a flowchart of the method for monitoring and analyzing a panic index based on news big data provided by an embodiment of the present invention.
FIG. 2 is a diagram of the model trained with a multi-layer fully connected neural network structure on the normalized features, provided by an embodiment of the present invention;
In the figure: Layer L_1 is the input layer, representing the value of each feature; Layer L_2 is the hidden layer, which computes hidden features; Layer L_3 is the output layer.
FIG. 3 is a diagram of the forward propagation algorithm that outputs the final result in the model trained with a multi-layer fully connected neural network structure, provided by an embodiment of the present invention.
FIG. 4 is a schematic diagram of the running process in the calculation of the global panic index provided by an embodiment of the present invention.
FIG. 5 is a schematic diagram of the system for monitoring and analyzing a panic index based on news big data provided by an embodiment of the present invention.
In the figure: 1. database formation module; 2. panic index indicator value acquisition module; 3. panic index acquisition module.
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
In addition to statistics and analysis of global volume, sentiment, and content, the present invention adds text-based semantic analysis and feature extraction covering the topic and length of news reports, social comment content, news and social sentiment, regional characteristics, dissemination duration and path, stock market indexes, and so on. This enriches the dimensions of big data monitoring, provides sensitive and accurate tracking and monitoring of the degree of panic across the global network as a whole, for specific events, or within a specified time range, and addresses the problem that the indicators of current global monitoring systems are not comprehensive enough and their calculation accuracy is low. Once the model is trained, the computation time for the model parameters and weights in practical use is comparable to that of current monitoring systems, which reduces the complexity of applying the model.
The invention collects data through information channels such as the network and establishes a corresponding database. The database is built as follows, with linguistic experts establishing the lexicons: ① build a multilingual lexicon of words expressing protest behavior, including "complaining remarks", "radical speech", "political debate", and "political mobilization"; ② build a multilingual lexicon of words expressing resistance behavior, including "collective petitions", "collective strikes", "violent group fights", "vicious assault incidents", "political rallies", "demonstrations", "ethnic conflicts", "religious conflicts", and "unrest"; ③ build a multilingual lexicon of words expressing helplessness, such as "powerless", "can't change it", and "what can be done"; ④ build a multilingual lexicon of words expressing incomprehension, such as "how could this be", "don't understand", "don't get it", and "unscientific"; ⑤ build a multilingual lexicon of words expressing worry, such as "worried", "annoyed", and "anxious"; and so on.
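One plausible way such lexicons could feed the indicator statistics is to count, per category, how often lexicon terms occur in a text. The sketch below assumes exactly that; the category names, the short English term lists, and the function `count_lexicon_hits` are illustrative placeholders, not the expert-built multilingual lexicons the text describes.

```python
import re
from collections import Counter

# Illustrative English stand-ins for the five lexicon categories in the text.
LEXICONS = {
    "protest": ["radical speech", "political mobilization"],
    "resistance": ["strike", "demonstration", "riot"],
    "helplessness": ["powerless", "can't change"],
    "incomprehension": ["don't understand", "unscientific"],
    "anxiety": ["worry", "anxious"],
}


def count_lexicon_hits(text, lexicons=LEXICONS):
    """Count case-insensitive substring occurrences of each category's terms."""
    lowered = text.lower()
    counts = Counter()
    for category, terms in lexicons.items():
        for term in terms:
            # re.escape keeps apostrophes and other punctuation literal.
            counts[category] += len(re.findall(re.escape(term), lowered))
    return dict(counts)
```

Substring matching is deliberately crude: "worry" also matches inside "worrying", and languages without word boundaries would need a tokenizer, so a production lexicon matcher would differ.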
On the basis of big data monitoring, the present invention quantifies the concept of panic as a panic index. Following the social amplification of risk theory and the theory of how panic spreads, the measurement of panic is divided into a statistical model with multiple dimensions and indicators. A neural network is used to build the calculation model of the panic index, an existing corpus is fed in, and machine learning yields the complete algorithm. The panic index algorithm provided by the embodiment of the present invention is based on big data monitoring: after collection, the data in the database is processed further and, on the basis of social and psychological theory, the concept of social panic is quantified so that the panic index becomes a measurable reading of the degree of social panic. It shows the psychological state of society more simply and conveniently and can guide political and economic decision-making.
The monitoring and analysis method provided by the present invention is described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the method for monitoring and analyzing a panic index based on news big data according to an embodiment of the present invention includes the following steps:
S101: Build a massive database.
S102: Following the social amplification of risk theory and the theory of how panic spreads, compute real-time statistics on the data in the database along the multiple dimensions and indicators into which the panic index is divided, obtaining a specific value for each indicator.
S103: Build the panic index calculation model with a neural network, feed in an existing corpus, and train the model repeatedly through machine learning so that it fits the weight of each indicator.
S104: Each time the panic index of a target is calculated, compute the statistics for each indicator of that target from the massive data, feed the results into the trained panic model, and output the panic index.
In step S101, forming the database mainly includes the following steps:
Step 1, collection: news media and social media are collected with existing techniques. Social media is collected in full, mainly with Python, through the open data interfaces of Weibo or Facebook, and is stored directly in the data center via message queues. News media is collected from domestic and overseas news web pages using a scheduler, collectors, a task manager, text parsing, storage, and data governance: a breadth-first traversal of the given news sources finds list pages that can be collected further, the scheduler sends these list pages to the collectors through the task manager, and each collector crawls its list pages to obtain the HTML pages of the articles.
Step 2, processing: the data governance algorithms of the data center structure the data. For example, the text parsing module parses each HTML page, extracts the article title, publication time, and body text, and removes garbled characters from the article.
Step 3, storage: the structured data is stored in a NoSQL database and then imported through a message queue into the data governance module, which applies governance algorithms to attach the corresponding tags and applies a mainstream sentiment algorithm to each article to obtain a sentiment tag. For example, news media tags include title, abstract, body, keywords, time, and sentiment; social media tags include account name, posted content, reposted content, comment content, number of likes, number of followers, posting time, and sentiment. The governed data can then be queried, mined, or used for machine learning as needed.
Step 4: all related media information is collected with big data techniques to form the database. All news and social media data containing the monitored keywords are collected; each record is tagged with its content, the publishing outlet or social network account, the publication time, and regional attributes, and is placed in the database. The collection is arranged in chronological order, where the news media items are $N = \{n_1, n_2, n_3, \ldots, n_n\}$ and the social media items are $S = \{s_1, s_2, s_3, \ldots, s_m\}$, giving the combined data set $W = \{N, S\}$.
In step S102, the panic index is divided into eight dimensions: harm, attention, concentration, subjectivity, loss of control, unfamiliarity, agitation, and trust. Each dimension is calculated as follows:
(1) Harm
The direct harm caused by the panic event, including the number of people affected, the number of casualties, the size of the affected area, the duration of the impact (whether it extends to future generations), and the resulting direct economic loss and social consequences.
(1.1) Number of people affected $a_1$:
$a_1 = \arg\max(TF_{a_1})$;
The keyword pattern "$a_1$ people affected" is matched in N, which yields zero or more candidate values of $a_1$. $TF_{a_1}$ denotes the frequency with which each candidate value occurs, so $a_1 = \arg\max(TF_{a_1})$ is the candidate value that occurs most often.
(1.2) Number of casualties $a_2$, $a_3$:
Number of seriously injured: $a_2 = \arg\max(TF_{a_2})$;
Number of deaths: $a_3 = \arg\max(TF_{a_3})$;
Likewise, the keyword pattern "$a_2$ people seriously injured" is matched in N; $TF_{a_2}$ is the frequency with which each candidate value of $a_2$ occurs, and $a_2 = \arg\max(TF_{a_2})$ is the value that occurs most often. The keyword pattern "$a_3$ people dead" is matched in all collected news reports; $TF_{a_3}$ is the frequency with which each candidate value of $a_3$ occurs, and $a_3 = \arg\max(TF_{a_3})$ is the value that occurs most often.
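The argmax-over-frequency rule used for $a_1$ to $a_3$ can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `most_frequent_value`, the English regular expression `DEATHS`, and the sample reports are all hypothetical stand-ins for the multilingual keyword patterns the text describes.

```python
import re
from collections import Counter

def most_frequent_value(texts, pattern):
    """Return the number captured by `pattern` that occurs most often, or None."""
    counts = Counter()
    for text in texts:
        for match in re.finditer(pattern, text):
            counts[int(match.group(1))] += 1
    if not counts:
        return None
    # argmax over TF: the candidate value with the highest occurrence frequency
    return counts.most_common(1)[0][0]

# Hypothetical English pattern standing in for the keyword "a_3 people dead"
DEATHS = r"(\d+) (?:people )?(?:dead|deaths|killed)"
reports = ["12 deaths confirmed", "officials say 12 killed", "at least 10 dead"]
a3 = most_frequent_value(reports, DEATHS)
```

Taking the most frequently reported figure rather than the largest or the latest one makes the indicator robust to a single outlier report, at the cost of lagging behind when the reported figure is revised.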
(1.3) Size of the affected area $a_4$:
When the monitoring area is limited to a single country:
$a_4 = \begin{cases} 1, & \text{Z is the most frequent} \\ 2, & \text{S is the most frequent} \\ 3, & \text{Sh is the most frequent} \\ 4, & \text{G is the most frequent} \end{cases}$
where Z = "town / county / township / district", S = "city", Sh = "province", and G = "country".
Named entity recognition is applied to the set N to identify the items that carry regional information, and the frequencies of the four levels "town / county / township / district", "city", "province", and "country" are counted. $a_4 = 1$ when "town / county / township / district" is the most frequent, $a_4 = 2$ when "city" is the most frequent, $a_4 = 3$ when "province" is the most frequent, and $a_4 = 4$ when "country" is the most frequent.
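The mapping from level frequencies to $a_4$ can be sketched as follows. The named entity recognition step is assumed to have been done upstream; the function `region_level`, the English level names, and the sample counts are hypothetical.

```python
# Score assigned to each administrative level, as described for a_4
LEVEL_SCORE = {"town": 1, "city": 2, "province": 3, "country": 4}

def region_level(freq):
    """freq maps a level name to how often entities of that level were recognized."""
    # Ties (including an empty count table) resolve to the lowest level;
    # the source does not specify tie-breaking, so this is an arbitrary choice.
    top = max(LEVEL_SCORE, key=lambda level: freq.get(level, 0))
    return LEVEL_SCORE[top]

a4 = region_level({"town": 3, "city": 11, "province": 5, "country": 2})
```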
When the monitoring area is global:
The regions in which the IP addresses in {W} appear are counted by country. If the IP addresses in {W} appear in one country, the value is f(1); if they appear in two countries, it is f(2); and so on, with x denoting the number of countries in which they appear.
Duration of harm $a_5$:
$a_5 = \arg\max(TF_{a_5})$;
The keyword pattern "expected to recover in $a_5$" is matched in {N}, and the value of $a_5$ with the highest frequency is taken.
Direct economic loss $a_6$:
$a_6 = \arg\max(TF_{a_6})$;
The keyword pattern "loss of $a_6$ yuan" is matched in {N}, and the value of $a_6$ with the highest frequency is taken.
Direct social consequences $a_7$:
where the protest-behavior lexicon K is the set of terms describing an event at the public-opinion stage,
and the resistance-behavior lexicon F is the set of terms describing an event that has escalated to the action stage.
(2) Attention
Attention measures how much news media attention the panic event attracts.
(2.1) The number of related news reports $b_1$ is the number of news media items in which the set keywords appear. Given $N = \{n_1, n_2, n_3, \ldots, n_n\}$, $b_1$ is the count of N:
$b_1 = n$.
(2.2) The number of related social discussions $b_2$ is the number of social media items in which the set keywords appear. Given $S = \{s_1, s_2, s_3, \ldots, s_m\}$, $b_2$ is the count of S:
$b_2 = m$.
(2.3) Report type of the related news:
where n = total number of related news reports ($b_1$) and $Z_i$ = word count of the i-th report.
(2.4) News report duration $b_4$:
where $T_{Nn}$ = time of the current news report and $T_{N1}$ = time of the first related news report.
(2.5) Social discussion duration $b_5$:
where $T_{Sm}$ = time of the current related social media post and $T_{S1}$ = time of the first related social media post.
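The count and duration indicators of the attention dimension can be sketched together. The formulas for $b_4$ and $b_5$ are not reproduced in the text, so the sketch assumes each duration is the latest timestamp minus the earliest one; the unit (hours), the function name `attention_metrics`, and the sample timestamps are also assumptions.

```python
from datetime import datetime

def attention_metrics(news_times, social_times):
    """Each argument lists the publication datetimes of keyword-matching items."""
    def span_hours(times):
        # Assumed reading of b4 / b5: time of the latest item minus time of the first
        if not times:
            return 0.0
        return (max(times) - min(times)).total_seconds() / 3600.0

    return {
        "b1": len(news_times),       # number of related news reports
        "b2": len(social_times),     # number of related social posts
        "b4": span_hours(news_times),
        "b5": span_hours(social_times),
    }

news = [datetime(2018, 1, 1, 8), datetime(2018, 1, 1, 20), datetime(2018, 1, 3, 8)]
social = [datetime(2018, 1, 1, 9), datetime(2018, 1, 2, 9)]
metrics = attention_metrics(news, social)
```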
(3) Concentration
How suddenly the panic breaks out, that is, whether the volume of coverage rises sharply or grows slowly. The indicator borrows the concept of a half-life: the time from the first report until half of the total number of reports to date had been published.
(3.1) News concentration $c_1$:
The time from the first news report until half of all news reports to date had been published.
(3.2) Social concentration $c_2$:
The time from the first social discussion until half of all social discussions to date had been published.
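The half-life idea behind $c_1$ and $c_2$ can be sketched as below. The function `half_life` and the sample timelines are hypothetical, and the rounding rule for an odd number of items (round the halfway count up) is an assumption the text does not settle. Timestamps are given as hours since an arbitrary origin, but any values that support subtraction, such as `datetime` objects, would work.

```python
def half_life(times):
    """Time from the first item until half of all items to date were published."""
    ordered = sorted(times)
    if not ordered:
        return None
    # Index of the item at which the cumulative count first reaches 50%
    half_index = (len(ordered) + 1) // 2 - 1
    return ordered[half_index] - ordered[0]

burst = [0, 1, 2, 3, 40, 80]        # most coverage in the first hours
slow = [0, 20, 40, 60, 80, 100]     # coverage spread evenly
c_burst = half_life(burst)
c_slow = half_life(slow)
```

A short half-life relative to the total span indicates that coverage piled up early, which is the "sharp rise" case the text contrasts with slow growth.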
(4) Subjectivity
The subjective attitude of individuals toward the panic event in the social media set $S = \{s_1, s_2, s_3, \ldots, s_m\}$.
(4.1) The sentiment of related social comments $d_1$ is the mean sentiment of the social media content in which the keywords appear:
$E \in (1, 5)$;
Let the sentiment of each post in the set S be $E = \{E_1, E_2, E_3, \ldots, E_m\}$, $i \in [1, m]$, $E \in (1, 5)$, with a word list specified for each sentiment value from 1 to 5, where r is the frequency (number of posts) of each E level and m is the total number of social media posts.
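The formula for $d_1$ is not reproduced in the text. One reading consistent with the definitions of r and m is a frequency-weighted mean, $d_1 = \sum_E E \cdot r_E / m$; the sketch below assumes that reading, and the function name `mean_sentiment` and the sample scores are hypothetical.

```python
from collections import Counter

def mean_sentiment(scores):
    """scores holds one sentiment level (1 to 5) per social media post."""
    if not scores:
        return None
    freq = Counter(scores)      # r: number of posts at each sentiment level E
    m = len(scores)             # m: total number of social media posts
    return sum(level * r for level, r in freq.items()) / m

d1 = mean_sentiment([1, 2, 2, 5])
```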
(4.2) Proportion of negative words in social media $d_2$:
where $Z_{neg}$ = total word count of negative words and Z = total word count of the posts.
(4.3) Proportion of posts by users in major cities among all social media posts $d_3$:
Tag = {New York, Washington, Silicon Valley, London, Paris, Tokyo, Beijing, Shanghai, Shenzhen, ...}.
(5) Loss of control
The judgment of individuals in the social media set $S = \{s_1, s_2, s_3, \ldots, s_m\}$ as to whether they can control or influence how the panic event develops.
(5.1) Proportion e of social media posts containing words of uncertainty and helplessness:
dic3 = UC = {"helpless", "can't change it", "what can be done", ...};
$P_{UC>2}$ is the number of posts in which any term from the set UC appears two or more times, and m is the total number of social media posts.
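The lexicon-based proportions (e here, and the analogous indicators in the dimensions that follow) share one shape: count the posts in which lexicon terms occur often enough, then divide by m. The formula itself is not reproduced in the text, so the sketch assumes $e = P_{UC>2} / m$. The function `lexicon_ratio`, the English lexicon, and the sample posts are hypothetical; counting hits in total across the lexicon and exposing the threshold as `min_hits` are also assumptions, since the subscript ">2" and the wording "two or more times" do not pin the threshold down.

```python
def lexicon_ratio(posts, lexicon, min_hits=2):
    """Share of posts in which lexicon terms occur at least `min_hits` times in total."""
    if not posts:
        return 0.0

    def hits(text):
        # Plain substring counting; a real system would tokenize per language
        return sum(text.count(term) for term in lexicon)

    qualifying = sum(1 for text in posts if hits(text) >= min_hits)
    return qualifying / len(posts)

UC = ["helpless", "can't change", "what can be done"]   # stand-in helplessness lexicon
posts = [
    "I feel helpless, we can't change this",
    "what can be done?",
    "nice weather today",
    "can't change it, just can't change it",
]
e = lexicon_ratio(posts, UC)
```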
(6) Unfamiliarity
How well individuals in the social media set $S = \{s_1, s_2, s_3, \ldots, s_m\}$ understand the event, that is, whether the event is new or has occurred before, and whether it can be explained by known scientific principles.
(6.1) Proportion of posts containing words that express the unknown $f_1$:
UK = {"unprecedented", "first occurrence", ...};
$P_{UK>2}$ is the number of posts in which any term from the set UK appears two or more times, and m is the total number of social media posts.
(6.2) Proportion of posts containing words that express incomprehension $f_2$:
DU = {"how could this be", "cannot understand", "don't get it", "unscientific", ...};
$P_{DU>2}$ is the number of posts in which any term from the set DU appears two or more times, and m is the total number of social media posts.
(7) Agitation
How worried netizens in the social media set $S = \{s_1, s_2, s_3, \ldots, s_m\}$ are about the panic event, and how strongly they react to that worry.
(7.1) Proportion of posts containing worry words $g_1$:
AN = {"worried", "annoyed", "anxious", ...}
$P_{AN>2}$ is the number of posts in which any term from the set AN appears two or more times, and m is the total number of social media posts.
(7.2) Proportion of posts containing blame words $g_2$:
BL = {"it's all the fault of", "responsible", "take responsibility", "shift the blame", ...}
$P_{BL>2}$ is the number of posts in which any term from the set BL appears two or more times, and m is the total number of social media posts.
(7.3) Proportion of posts containing protest words $g_3$:
PR = {"protest", "oppose", "refuse", ...}
$P_{PR>2}$ is the number of posts in which any term from the set PR appears two or more times, and m is the total number of social media posts.
(8) Trust
The degree of public trust in the government and in experts, as expressed in the social media set $S = \{s_1, s_2, s_3, \ldots, s_m\}$.
(8.1) Number of official statements by government and social organizations $h_1$:
$A = \{A_1, A_2, \ldots, A_u\}$
G = {"government", "association", "organization", ...}
where A = all account names in the set S;
G = the lexicon of terms that indicate an official identity.
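The formula for $h_1$ is not reproduced in the text. A plausible reading of the definitions of A and G is that $h_1$ counts the posts whose account name contains a term from G; the sketch below assumes that reading. The function `count_identity_posts`, the sample accounts, and the case-sensitive substring match are hypothetical choices.

```python
def count_identity_posts(posts, identity_terms):
    """posts is an iterable of (account_name, text) pairs."""
    return sum(
        1
        for account, _text in posts
        # Case-sensitive substring match on the account name (a simplification)
        if any(term in account for term in identity_terms)
    )

G = ["Government", "Association", "Organization"]   # stand-in official-identity lexicon
posts = [
    ("City Government Press Office", "Statement on the incident"),
    ("jane_doe", "Is this true?"),
    ("National Health Association", "Guidance for residents"),
]
h1 = count_identity_posts(posts, G)
```

The same function would cover the expert count $h_3$ by passing the scholar-identity lexicon X instead of G.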
(8.2) Proportion of opposition words in the comments on official statements by government and social organizations $h_2$:
$B = \{B_1, B_2, \ldots, B_v\}$
Y = {"oppose", "don't believe it", "kindly explain", "please explain", ...}
B = all social content posted by accounts whose names contain official-identity terms;
Y = the lexicon of all terms that express opposition.
(8.3) Number of statements by experts $h_3$:
$A = \{A_1, A_2, \ldots, A_u\}$
X = {"expert", "teacher", "scholar", ...}
where A = the database of all social account names;
X = the lexicon of all terms that indicate a scholarly identity.
(8.4) Proportion of opposition words in the comments on expert statements $h_4$:
$C = \{C_1, C_2, \ldots, C_q\}$
Y = {"oppose", "don't believe it", "kindly explain", "please explain", ...}
C = all social content posted by accounts whose names contain scholar-identity terms;
Y = the lexicon of all terms that express opposition.
In step S103, building the panic index calculation model with a neural network, feeding in an existing corpus, and training the model repeatedly through machine learning specifically includes:
(1) Building the neural-network-based machine learning model:
(1.1) Before the neural network is used to build the panic index calculation model, the data is normalized using the normalization formula:
where x is the value of the current sample, $x_{mean}$ is the mean of the current feature, $x_{max}$ is the maximum value among the current samples, and $x_{min}$ is the minimum value among the current samples;
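The normalization formula is not reproduced in the text, but the quantities it names (x, $x_{mean}$, $x_{max}$, $x_{min}$) match standard mean normalization, $x' = (x - x_{mean}) / (x_{max} - x_{min})$. The sketch below assumes that form; the function `mean_normalize` and the handling of a constant feature are hypothetical.

```python
def mean_normalize(column):
    """Mean normalization of one feature column: (x - mean) / (max - min)."""
    mean = sum(column) / len(column)
    span = max(column) - min(column)
    if span == 0:
        # A constant feature carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mean) / span for x in column]

scaled = mean_normalize([10, 20, 30])
```

Normalizing matters here because the indicators live on very different scales (head counts, durations, proportions between 0 and 1), and unscaled inputs would let the largest-valued indicator dominate the gradient updates.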
(1.2) The normalized features are used to train a model with a multi-layer fully connected neural network structure, shown in Figure 2. Layer $L_1$ is the input layer, which in the present invention holds the value of each feature. Layer $L_2$ is the hidden layer, which computes hidden features. Layer $L_3$ is the output layer, which outputs the final result.
(1.3) The training phase of the multi-layer fully connected neural network uses the forward propagation algorithm and the back propagation algorithm:
First, the forward propagation formulas:
$Z^{(l)} = W^{(l-1)} x^{(l-1)} + b^{(l-1)}$;
$a^{(l)} = f(Z^{(l)})$;
$h_{W,b}(x) = a^{(L-1)}$;
where:
l denotes the l-th layer, L is the last layer, $x^{(1)}$ is the input feature vector, W and b are the weights and biases respectively, and $h_{W,b}(x)$ is the output.
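A forward pass of this kind can be sketched in plain Python. The source does not name the activation function f, so the sigmoid used here is an assumption, as are the function names and the example weights, which are arbitrary illustrative values rather than trained parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """layers is a list of (W, b) pairs; W is a list of rows, one row per output unit."""
    a = x
    for W, b in layers:
        # Z = W a + b, computed row by row
        z = [sum(w * v for w, v in zip(row, a)) + bias for row, bias in zip(W, b)]
        # a = f(Z), with f assumed to be the sigmoid
        a = [sigmoid(v) for v in z]
    return a

# Two inputs -> two hidden units -> one output (illustrative weights)
layers = [
    ([[0.4, -0.6], [0.1, 0.8]], [0.0, -0.1]),
    ([[1.2, -0.7]], [0.05]),
]
panic_score = forward([0.3, -0.2], layers)
```

With a sigmoid output the result lies strictly between 0 and 1, so the expert scores used as training targets would need to be scaled to the same range under this assumption.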
Second, the back propagation algorithm:
The multi-layer fully connected neural network is trained by optimizing an objective function J with the back propagation algorithm to obtain the optimal model:
The parameters are then updated according to the back propagation algorithm to obtain the optimal model:
Update parameters:
where the weight term denotes the weight of the i-th node in layer l, the gradient term is the partial derivative of J with respect to the weight W, m is the number of samples, b is the bias, and λ is the regularization parameter, set to 0.1. For all news and social media text at the current moment, panic prediction extracts the corresponding features with the feature extraction method. The extracted features are fed into the input layer of the network; the result of forward propagation through each layer serves as the input of the next layer; and after the three layers have been computed, the final panic value is obtained.
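The update formulas are not reproduced in the text. The symbols it defines (partial derivative of J with respect to W, sample count m, bias b, regularization parameter λ = 0.1) are consistent with the usual gradient descent step with L2 weight decay, $W \leftarrow W - \alpha\left(\tfrac{1}{m}\Delta W + \lambda W\right)$ and $b \leftarrow b - \alpha \tfrac{1}{m}\Delta b$, where ΔW and Δb are gradients summed over the m samples. The sketch assumes that form; the learning rate `alpha` does not appear in the source and is a hypothetical parameter, as is the function name.

```python
def update_layer(W, b, grad_W_sum, grad_b_sum, m, alpha=0.5, lam=0.1):
    """One gradient descent step for a single layer, with L2 decay on the weights only."""
    new_W = [
        [w - alpha * (gw / m + lam * w) for w, gw in zip(row, grad_row)]
        for row, grad_row in zip(W, grad_W_sum)
    ]
    # Biases are conventionally not regularized
    new_b = [bias - alpha * (gb / m) for bias, gb in zip(b, grad_b_sum)]
    return new_W, new_b

new_W, new_b = update_layer(
    W=[[1.0, -2.0]], b=[0.5],
    grad_W_sum=[[4.0, 0.0]], grad_b_sum=[2.0],
    m=4,
)
```

In the second weight of the example the summed gradient is zero, so the only change comes from the decay term pulling the weight toward zero.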
(2) Manually annotating a batch of corpora for machine learning.
(2.1) The neural-network-based machine learning model is trained on a corpus; the larger the corpus, the more accurate the trained model.
The corpus covers the following fields (23 in total); different event topics are set for each field and used to search for text.
For example, for the topic "current politics", X news articles published by "media of a certain country, on a certain event, from June to August 2017" are retrieved. The X articles are processed statistically according to step ② to obtain the values of the multiple indicators for "a certain event, June to August 2017, media of a certain country". An expert panel then scores the panic sentiment of each group of events manually under a unified standard and gives the basis for each score.
Following the same procedure, the annotated corpus is used to train the machine-learning-based sentiment analysis model. The categories in the annotated corpus should be distributed as evenly as possible so that it suits classifier training, and most articles in it should carry polarity. Because most corpus resources come from the Internet, their encoding, format, and content often contain irregularities such as misused punctuation, repeated spaces, and typos. These irregularities must therefore be corrected before annotation, and UTF-8 encoding is used throughout.
(2.2) Model training: no fewer than 10,000 corpus items are fed into the neural-network-based machine learning model for learning and training, so that the result calculated from each group of a1 to h4 data equals the corresponding GRPI result for that group. The aim is that, once the machine-computed indicator values pass through the model, the output approaches the expert panel's score as closely as possible.
(2.3) Panic prediction: with the trained model, for all news and social media text at the current moment, the panic index calculation model extracts the corresponding features with the feature extraction method. The extracted features are fed into the input layer of the network; the result of forward propagation through each layer serves as the input of the next layer; and after the three layers have been computed, the final panic value is obtained.
To further demonstrate the feasibility and scientific soundness of the monitoring method provided by the present invention, its theoretical basis and design principles are described further below with reference to the accompanying drawings.
FIG. 2 shows the multi-layer fully connected neural network structure used, in an embodiment of the present invention, to train the model on the normalized features.
FIG. 3 shows the forward propagation algorithm that outputs the final result in the multi-layer fully connected neural network training model provided by an embodiment of the present invention.
(1) Theory and algorithm principles
In short, the GRPI indicators are defined on the basis of principles from journalism and communication studies, the psychology of social risk, and online sentiment fluctuation. Dimensions and measures that strongly influence online sentiment fluctuation are selected and subjected to a series of factor pre-tests and weight analyses. While the indicators are converted into a mathematical model, large-scale machine learning fits the weight of each indicator ever more closely to the target. Finally, the GRPI is calculated through combined data mining and data analysis over global news big data, using an existing neural-network-based machine learning algorithm. As shown in the figure, the network has three layers: Layer L1 corresponds to the multiple indicators, Layer L2 is a hidden layer, and Layer L3, $h_{W,b}(x)$, is the output for the indicators. For training, the training set is fixed at 10,000 groups: values for 10,000 groups of Layer L1 inputs are supplied manually and matched with 10,000 corresponding Layer L3 output values; the machine then learns in the hidden layer L2, fits the weight of each coefficient, and finally outputs the result.
(2) Factors and indicators
The main factors considered by the GRPI are: news volume per period and total news volume worldwide; social volume per period and total social volume worldwide; the growth rates of global news volume and social volume; the growth rates of strongly or moderately positive and negative news and social content; the triggers and keywords that measure the direct physical impact of an event (such as the extent of deaths and injuries, the duration of the impact, economic loss, and the geographic area affected); the degree of attention that news media coverage gives to the event or subject; how concentrated the outbreak of the topic is on social network platforms and how long it takes to decay; and the individual and group attitudes revealed by media and social network behavior, such as the subject's ability to control how the situation develops, the degree of agitation and worry, and the degree of trust in those in power. Together these are a summary of the multiple indicators.
As shown in FIG. 4, the operating process is as follows. When the panic index of a target event or subject is analyzed, the back end first retrieves N articles related to the event or subject, then selects the closest dimensions according to the attributes of the event and computes them. To predict the future trend of the panic index, the back end uses statistical machine learning methods such as regression or classification to compile the historical data of each factor and then runs the data through the machine learning model to calculate the panic trend indicator.
S201: Enter the topic, event subject, target region, target time, and so on; call this corpus 1;
S202: The back end retrieves X news articles and social media items about corpus 1. (With the news media items $N = \{n_1, n_2, n_3, \ldots, n_n\}$ and the social media items $S = \{s_1, s_2, s_3, \ldots, s_m\}$, this gives $W = \{N, S\}$);
S203: The system compiles statistics on the multiple indicators for corpus 1 to obtain data group C1;
S204: The panic index of corpus 1 is scored manually, giving score E1;
S205: Run process S201 a second time with corpus 2 and repeat processes S201 to S204 to obtain C2 and E2;
S206: Repeat process S205 to obtain C3, E3; C4, E4; ...; Cn, En (the larger n is, the better);
S207: Feed C1 to Cn and E1 to En into the neural-network-based machine learning model for training, so that the result calculated from each group C equals E. The aim is that the result for each new group C passed through the model approaches E as closely as possible;
S208: Once learning is complete, the model is ready for use. Enter the topic, event subject, target region, target time, and so on; for example, enter "driverless vehicles", set the panic index monitoring period to January 2018, and set the region to any one country;
S209: The back end retrieves W news articles and social media items about "driverless vehicles";
S210: The system extracts and compiles the data on the topic "driverless vehicles" within that region and time range to obtain the data group;
S211: The statistical results are fed into the machine, which calculates the panic index for the topic "driverless vehicles";
S212: The machine outputs the panic value for "driverless vehicles".
As shown in FIG. 5, the panic index monitoring and analysis system based on news big data according to an embodiment of the present invention includes:
a database formation module 1, which uses big data techniques to collect all related media information and form the database;
a panic index indicator value acquisition module 2, which compiles real-time statistics on the data in the database along the multiple dimensions and indicators into which the panic index is divided, obtaining a specific value for each indicator;
a panic index acquisition module 3, which uses the neural network model and machine learning algorithm to fit the weight of each dimension and form the complete model; when the panic index is calculated, it retrieves and compiles from the database the data needed for the indicators, feeds the statistics for each dimension into the model, and outputs the panic index.
The following specific application examples further demonstrate the feasibility of the monitoring method provided by the present invention and the reliability of its results, which have strong theoretical and practical value.
The invention draws on big data resources from news media and social platforms in more than 200 countries and more than 60 languages, incorporates cognitive models of panic and social risk, and applies the theoretical model at the algorithm level, so that it can calculate the panic index and monitor the development of events of all kinds on customizable topics.
The invention is built on mature theories of social panic and risk perception. The indicators used to calculate the panic index draw on the research of leading scholars at home and abroad on social panic and on network monitoring and measurement. The algorithm is updated in two ways: on the one hand, leading experts and scholars at home and abroad regularly discuss and test the measurement of the indicators and propose revisions; on the other hand, through back-end technical means, the invention automatically compares against a large body of authoritative panic-related data and applies machine learning to verify and update the algorithm weights.
The invention uses a customizable visualization approach for macro-level monitoring: the system performs real-time semantic retrieval and calculation across 5 major risk areas (environment, economy, society, geopolitics, and technology) and 30 panic topics. In the economic area, for example, it mainly monitors "energy price shocks", "unemployment", "asset bubbles", "deflation", and "financial crises"; in the technology area it mainly monitors topic risk labels such as "cyberattacks" and "data fraud or theft".
The invention also supports micro-level customization: the panic monitoring platform can provide customized searches according to the needs of different users. In scenarios such as national security, corporate security, corporate crisis and emergency management, and industry development crises, the platform can provide a reference for decision-making.
At the national security level, for example, the change in the panic index caused by a particular event serves as a case. The security issues raised by Bitcoin once caused panic among investors. In essence, Bitcoin's surges and crashes are the result of a concentrated release of worry, fear, and greed.
At the industry level, the panic monitoring platform can show changes in market expectations, for example the real estate prices and panic indexes of Hong Kong and Xi'an respectively.
Example 1: Panic index monitoring of enterprise development
The state of an enterprise's development is closely related to its panic index. The figure shows LeEco's panic index and number of users. In August 2016 a capital chain crisis broke out at a certain company; in January 2017 "Sunac China" took a 15 billion yuan stake; in July 2017 a certain party withdrew completely, and the company's corporate value crisis erupted in full: unpaid staff salaries, heavy losses in its financial reports, and so on led several funds to cut their valuation of the company, and the company's panic index fluctuated sharply.
For example, on July 5 a certain company pledged its equity holdings to another company, and the change of leadership at the enterprise caused the panic index to swing sharply. At the same time the company's user numbers kept falling, which to some extent reflects users' loss of trust in the company. The company's monthly user coverage shows that users declined overall in 2017; in July the number of users fell 32% month on month, a severe loss of users, and the panic index at that time showed extremely large swings.
Example 2: Panic index monitoring of social emergencies
Because the GRPI algorithm incorporates the great majority of news big data tags, its components can also be used separately. For example, the named entity and semantic recognition technology behind GRPI can help people quickly retrieve or compare, across all historical data, the affected area and the severity of casualties (level 1 to level 10) of each earthquake; obtain the intensity of protest (level 1 to level 10) in the public discussion that follows the introduction of a policy; and analyze the relationship between swings in public online sentiment and social phenomena such as immigration or economic phenomena such as the rise and fall of Bitcoin.
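The casualty figures mentioned here correspond to indicators such as the death toll, which the claims define as the most frequently reported value, arg max(TF). The following is a minimal sketch of that rule, assuming the figures are pulled from text with a simple regular expression; the pattern, the function name, and the sample sentences are hypothetical stand-ins for the entity recognition the text describes.

```python
import re
from collections import Counter

# Hypothetical pattern: a number followed by "dead" or "killed".
DEATH_PATTERN = re.compile(r"(\d+) (?:dead|killed)")


def most_reported_value(reports, pattern):
    """Return the value that appears most often across reports, or None if none is found."""
    values = [
        int(match.group(1))
        for text in reports
        for match in pattern.finditer(text)
    ]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


# Toy reports, invented for illustration.
reports = [
    "Earthquake leaves 12 dead in the province",
    "Officials confirm 12 killed after the quake",
    "Early estimate: 15 dead",
    "No further casualties reported",
]
deaths = most_reported_value(reports, DEATH_PATTERN)
print(deaths)  # 12
```

Taking the mode rather than the maximum or the latest figure makes the indicator robust to a single outlying report, at the cost of lagging behind when the reported toll is still rising.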
Example 3: Stock market panic index monitoring
As an indicator for monitoring and predicting the volatility of public sentiment, the GRPI index offers some guidance for the stock market. The data show a certain negative correlation between the two: when a listed company's GRPI index is high, its share price tends to be falling, and when the stock index strengthens, the GRPI index mostly dips to low levels. Extreme values of the GRPI index deserve particular attention. When GRPI reaches a high extreme, it often signals that a major event is brewing or under way. For example, the figure above shows global GRPI monitoring: the only two occasions on which the GRPI index broke through 80 both occurred in 2008, when the global financial crisis broke out.
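A negative relationship of this kind could be checked with a Pearson correlation coefficient between a GRPI series and a price series. The sketch below is an assumption about how such a check might look, not the method used in the source; the two series are invented toy values, not data from the patent.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Toy series, invented for illustration: index readings and share prices
# observed on the same five dates.
grpi = [30.0, 45.0, 60.0, 80.0, 55.0]
price = [10.0, 9.0, 7.5, 5.0, 8.0]
r = pearson(grpi, price)
print(r < 0)  # True
```

A coefficient near -1 would indicate that high index readings coincide with low prices; it says nothing about which one leads the other.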
In the above embodiments, the method and system provided by the present invention may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in whole or in part as a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (17)
- A big-data-based panic index monitoring and analysis method, applying the social amplification of risk theory and the theory of panic psychology propagation, characterized in that:
  Step 1: the collected media data are formed into a database;
  Step 2: the data in the database are tallied in real time according to the eight dimensions and 27 indicators into which the panic index is divided, giving the specific value of each indicator;
  Step 3: a neural network model is used to build the panic index calculation model; a corpus is input, the weight of each dimension is matched through machine learning, and the indicators are combined to obtain the panic index.
- The method according to claim 1, characterized in that Step 1 specifically comprises:
  a) Collection: news media and social media are collected. Social media data are collected in full through the open data interfaces of the social media platforms using the Python programming language, and stored directly in the data center via a message queue. News media are collected from domestic and overseas news web pages by means of a scheduler, collectors, a task manager, text parsing, storage, and data governance: a breadth-first traversal of the given news data sources finds the list pages to be collected next, the scheduler sends the list pages to the collectors through the task manager, and the collectors crawl the list pages to obtain the HTML pages of the articles;
  b) Processing: the data governance algorithm of the data center structures the data to obtain structured data;
  c) Storage: the structured data are stored in a NoSQL database, and the data in the database are imported into the data governance module through a message queue; the data governance module labels the data with the corresponding tags using the governance algorithm, and applies a sentiment algorithm to each article to compute its sentiment label;
  d) All relevant media information is collected using big data to form the database: all news and social media data containing the monitored keywords are collected, with collection tags covering data content, publishing media or social network account, publication time, and regional attribute, and placed in the database. Let the database set be $W$, arranged in chronological order, where the news media information is $N = \{n_1, n_2, n_3, \ldots, n_n\}$ and the social media information is $S = \{s_1, s_2, s_3, \ldots, s_m\}$, giving a data collection $W = \{N, S\}$.
- The method according to claim 1, characterized in that Step 3 specifically comprises:
  A. Building the neural-network-based machine learning model: before the neural network model is used to build the panic index calculation model, the data are normalized; the model is then trained on the normalized features using a multi-layer fully connected neural network structure, in which Layer $L_1$ is the input layer, holding the values of the indicator features; Layer $L_2$ is the hidden layer, which computes hidden features; and Layer $L_3$ is the output layer, which outputs the final result;
  B. Manually annotating a batch of corpus texts usable for machine learning: 10,000 or more corpus texts are fed into the neural-network-based machine learning model for learning and training;
  Based on the trained model, for all news and social media texts at the current moment, the panic index calculation model extracts the corresponding features using the feature extraction method; the extracted features are fed to the input layer of the multi-layer fully connected neural network, the result of the forward propagation algorithm is used as the input of the next layer, and after the calculation of the three-layer model the final panic value is obtained.
- The method according to claim 1, 2, or 3, characterized in that the eight dimensions are: degree of harm, degree of attention, degree of concentration, degree of subjectivity, degree of loss of control, degree of unfamiliarity, degree of irritation, and degree of trust; wherein:
  the indicators of the degree of harm are: number of people affected $a_1$, number of serious injuries $a_2$, number of deaths $a_3$, size of the affected area $a_4$, duration of harm $a_5$, direct economic loss $a_6$, and direct social consequences $a_7$;
  the indicators of the degree of attention are: number of related news reports $b_1$, number of related social discussion posts $b_2$, mean word count per report $b_3$, duration of news coverage $b_4$, and duration of social discussion $b_5$;
  the indicators of the degree of concentration are: news concentration $c_1$ and social concentration $c_2$;
  the indicators of the degree of subjectivity are: sentiment of related social comments $d_1$, share of negative words in social media $d_2$, and the ratio $d_3$ of posts by users in major cities to all social media posts in the social discussion;
  the indicator of the degree of loss of control is: share $e$ of words expressing uncertainty or helplessness in related social media posts;
  the indicators of the degree of unfamiliarity are: share of words expressing the unknown $f_1$ and share of words expressing incomprehension $f_2$;
  the indicators of the degree of irritation are: share of worry words $g_1$, share of blame words $g_2$, and share of protest words $g_3$;
  the indicators of the degree of trust are: number of official statements by government and social organizations $h_1$, share of opposition words in comments on those official statements $h_2$, number of expert statements $h_3$, and share of opposition words in comments on expert statements $h_4$.
- The method according to claim 4, characterized in that the indicators of the degree of harm are calculated as follows:
  1) Number of people affected: $a_1 = \arg\max(TF_{a_1})$, where $TF_{a_1}$ is the frequency with which each candidate value of $a_1$ occurs, so that $a_1$ is the value that occurs most frequently;
  2) Number of serious injuries: $a_2 = \arg\max(TF_{a_2})$, where $TF_{a_2}$ is the frequency with which each value of $a_2$ occurs, so that $a_2$ is the value that occurs most frequently;
  3) Number of deaths: $a_3 = \arg\max(TF_{a_3})$, where $TF_{a_3}$ is the frequency with which each value of $a_3$ occurs, so that $a_3$ is the value that occurs most frequently;
  4) Size of the affected area $a_4$:
  when the monitored region is a single country: [formula not reproduced], where $Z$ = "town/county/township/district", $S$ = "city", $Sh$ = "province", and $G$ = "country";
  when the monitored region is global, the regions in which the IP addresses in $W$ appear are counted by country: IP addresses appearing in 1 country are scored $f(1)$, in 2 countries $f(2)$, where $x$ is the number of countries in which they appear: [formula not reproduced];
  5) Duration of harm: $a_5 = \arg\max(TF_{a_5})$; the keyword pattern "expected to recover in $a_5$" is extracted from $N$, and the value of $a_5$ with the highest frequency is taken;
  6) Direct economic loss: $a_6 = \arg\max(TF_{a_6})$; the keyword pattern "loss of $a_6$ yuan" is extracted from $N$, and the value of $a_6$ with the highest frequency is taken;
  7) Direct social consequences $a_7$: [formula not reproduced], where the protest-behavior lexicon $K$ is the set of words describing the event at the public-opinion stage, and the resistance-behavior lexicon $F$ is the set of words describing the event once it has escalated to the action stage.
- The method according to claim 4, characterized in that the indicators of the degree of attention are calculated as follows:
  the total number of related news reports $b_1$ is the number of news media items in which the set keywords appear; given $N = \{n_1, n_2, \ldots, n_n\}$, $b_1$ counts $N$, so $b_1 = n$;
  the number of related social discussion posts $b_2$ is the number of social media items in which the set keywords appear; given $S = \{s_1, s_2, \ldots, s_m\}$, $b_2$ counts $S$, so $b_2 = m$;
  the report type of related news $b_3$ is the mean word count per report: $b_3 = \frac{1}{n}\sum_{i=1}^{n} Z_i$, where $n$ is the total number of related news reports $b_1$ and $Z_i$ is the word count of the $i$-th report;
  duration of news coverage $b_4$: [formula not reproduced];
  duration of social discussion $b_5$: [formula not reproduced].
- The method according to claim 4, characterized in that the indicators of the degree of concentration are calculated as follows:
  news concentration $c_1$: [formula not reproduced], defined in terms of the time at which news coverage began and the time taken to reach half of the total news coverage to date;
  social concentration $c_2$: [formula not reproduced].
- The method according to claim 4, characterized in that the indicators of the degree of subjectivity are calculated as follows:
  subjectivity expresses individuals' subjective attitude toward the panic event in social media $S = \{s_1, s_2, \ldots, s_m\}$;
  the sentiment of related social comments $d_1$ is the mean sentiment of the social media content in which the keywords appear: [formula not reproduced], with $E \in (1, 5)$; let the sentiment of each post in the set $S$ be $E = \{E_1, E_2, \ldots, E_m\}$, where a word list is specified for each sentiment value from 1 to 5, $r$ is the number of posts at each level of $E$, and $m$ is the total number of social media posts;
  share of negative words in social media $d_2$: [formula not reproduced], where $Z_{neg}$ is the total number of negative words and $Z$ is the total word count of the reports;
  the ratio $d_3$ of posts by users in major cities to all social media posts in the social discussion: [formula not reproduced], with Tag = {New York, Washington, Silicon Valley, London, Paris, Tokyo, Shanghai, Beijing, Shenzhen}.
- The method according to claim 4, characterized in that the indicator of the degree of loss of control is calculated as follows:
  loss of control expresses individuals' judgment, in social media $S = \{s_1, s_2, \ldots, s_m\}$, of whether they can control or influence how the panic event develops;
  share $e$ of words expressing uncertainty or helplessness in related social media posts: [formula not reproduced], with
  dic3 = UC = {无能为力 (powerless), 改变不了 (cannot change it), 能做什么 (what can be done)};
  $P_{UC>2}$ is the number of posts in which any word in the set UC appears twice or more, and $m$ is the total number of social media posts.
- The method according to claim 4, characterized in that the indicators of the degree of unfamiliarity are calculated as follows:
  unfamiliarity expresses how well individuals understand the event in social media $S = \{s_1, s_2, \ldots, s_m\}$;
  share of words expressing the unknown $f_1$: [formula not reproduced], with
  UK = {前所未有 (unprecedented), 首次出现 (first occurrence)};
  $P_{UK>2}$ is the number of posts in which any word in the set UK appears twice or more, and $m$ is the total number of social media posts;
  share of words expressing incomprehension $f_2$: [formula not reproduced], with
  DU = {怎么会 (how could this be), 不理解 (do not understand), 不懂 (do not get it), 不科学 (unscientific)};
  $P_{DU>2}$ is the number of posts in which any word in the set DU appears twice or more, and $m$ is the total number of social media posts.
- The method according to claim 4, characterized in that the indicators of the degree of irritation are calculated as follows:
  irritation expresses how worried netizens are about the panic event in social media $S = \{s_1, s_2, \ldots, s_m\}$ and how strongly they react to that worry;
  share of worry words $g_1$: [formula not reproduced], with
  AN = {担心 (worried), 烦 (annoyed), 焦虑 (anxious)};
  $P_{AN>2}$ is the number of posts in which any word in the set AN appears twice or more, and $m$ is the total number of social media posts;
  share of blame words $g_2$: [formula not reproduced], with
  BL = {都怪 (it is all the fault of), 负责 (be responsible), 承担责任 (take responsibility), 甩锅 (shift the blame)};
  $P_{BL>2}$ is the number of posts in which any word in the set BL appears twice or more, and $m$ is the total number of social media posts;
  share of protest words $g_3$: [formula not reproduced], with
  PR = {抗议 (protest), 反对 (oppose), 拒绝 (refuse)};
  $P_{PR>2}$ is the number of posts in which any word in the set PR appears twice or more, and $m$ is the total number of social media posts.
- The method according to claim 4, characterized in that the indicators of the degree of trust are calculated as follows:
  trust expresses the degree of public trust in government and experts in social media $S = \{s_1, s_2, \ldots, s_m\}$;
  number of official statements by government and social organizations $h_1$: [formula not reproduced], with
  $A = \{A_1, A_2, \ldots, A_u\}$;
  G = {政府 (government), 协会 (association), 组织 (organization)};
  where $A$ is all account names in the set $S$, and G is the lexicon of official identities;
  share of opposition words in comments on official statements by government and social organizations $h_2$: [formula not reproduced], with
  $B = \{B_1, B_2, \ldots, B_v\}$;
  Y = {反对 (oppose), 不信 (do not believe), 麻烦解释 (kindly explain), 请解释 (please explain)};
  where $B$ is all social content whose account name contains an official-identity word, and Y is the lexicon of all words meaning "oppose";
  number of expert statements $h_3$: [formula not reproduced], with
  $A = \{A_1, A_2, \ldots, A_u\}$;
  X = {专家 (expert), 老师 (teacher), 学者 (scholar)};
  where $A$ is the database of all social account names, and X is the lexicon of all words denoting scholarly identity;
  share of opposition words in comments on expert statements $h_4$: [formula not reproduced], with
  $C = \{C_1, C_2, \ldots, C_q\}$;
  Y = {反对 (oppose), 不信 (do not believe), 麻烦解释 (kindly explain), 请解释 (please explain)};
  where $C$ is all social content whose account name contains a scholarly-identity word, and Y is the lexicon of all words meaning "oppose".
- The method according to any one of claims 3-12, characterized in that before the neural network model is used to build the panic index calculation model, the data are normalized: [formula not reproduced], where $x$ is the value of the current sample, $x_{mean}$ is the mean of the current feature, $x_{max}$ is the maximum of the current samples, and $x_{min}$ is the minimum of the current samples;
  the model is trained on the normalized features using a multi-layer fully connected neural network structure, in which Layer $L_1$ is the input layer, holding the value of each feature; Layer $L_2$ is the hidden layer, which computes hidden features; and Layer $L_3$ is the output layer, which outputs the final result;
  the training stage of the multi-layer fully connected neural network comprises a forward propagation algorithm and a backpropagation algorithm;
  forward propagation:
  $z^{(l)} = W^{(l-1)} x^{(l-1)} + b^{(l-1)}$;
  $a^{(l)} = f(z^{(l)})$;
  $h_{W,b}(x) = a^{(L-1)}$;
  where $l$ denotes the $l$-th layer, $L$ is the last layer, $x^{(l)}$ is the input feature, $W$ and $b$ are the weights and biases respectively, and $h_{W,b}(x)$ is the output;
  backpropagation:
  the multi-layer fully connected neural network uses the backpropagation algorithm to optimize the objective function and obtain the optimal model: [formula not reproduced];
  the parameters are then updated according to the backpropagation algorithm to obtain the optimal model: [formula not reproduced];
  where $W_i^{(l)}$ is the weight of the $i$-th node in layer $l$, $\partial J / \partial W$ is the partial derivative of the function $J$ with respect to the weight $W$, $m$ is the number of samples, $b$ is the bias value, and $\lambda$ is the regularization parameter, with $\lambda = 0.1$;
  for panic prediction, the corresponding features are extracted using the feature extraction method; the extracted features are fed to the input layer of the multi-layer fully connected neural network, the result of the forward propagation algorithm is used as the input of the next layer, and after the calculation of the three-layer model the final panic value is obtained.
- A big-data-based panic index monitoring and analysis system capable of implementing the monitoring and analysis method according to any one of claims 1-3, characterized in that the system includes:
  a database formation module, which uses big data to collect all relevant media information and form a database;
  a panic index indicator value acquisition module, which tallies the data in the database in real time according to the eight dimensions and 27 indicators into which the panic index is divided, giving the specific value of each indicator;
  a panic index acquisition module, which uses a neural network model and machine learning algorithms to match the weight of each dimension and form a complete model; when calculating the panic index, it retrieves from the database and tallies the data needed for the indicator calculation, feeds the statistical results of each dimension into the model, and finally outputs the panic index.
- A security monitoring system capable of implementing the method according to any one of claims 1-13.
- A security analysis system capable of implementing the method according to any one of claims 1-13.
- A security early warning system capable of implementing the method according to any one of claims 1-13.
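The normalization and forward propagation recited in the claims can be sketched in plain Python. This is a sketch under stated assumptions rather than the claimed implementation: the normalization formula is not reproduced in the text, so mean normalization, (x - x_mean) / (x_max - x_min), is assumed from the symbols the claim lists; the activation f is assumed to be a sigmoid; the layer sizes are toy values; and the weights are arbitrary placeholders rather than trained parameters.

```python
from math import exp


def mean_normalize(column):
    """Assumed normalization of one feature across samples: (x - mean) / (max - min)."""
    lo, hi = min(column), max(column)
    mean = sum(column) / len(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(x - mean) / span for x in column]


def sigmoid(z):
    """Assumed activation function f."""
    return 1.0 / (1.0 + exp(-z))


def dense(weights, biases, inputs):
    """One fully connected layer: a = f(W x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, layers):
    """Forward propagation: each layer's output is the next layer's input."""
    activation = features
    for weights, biases in layers:
        activation = dense(weights, biases, activation)
    return activation


# Normalizing one indicator (for example a death count) across three samples.
print(mean_normalize([0.0, 5.0, 10.0]))  # [-0.5, 0.0, 0.5]

# Placeholder network: 3 input features -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
panic_value = forward([-0.5, 0.0, 0.5], layers)
print(len(panic_value))  # 1
```

With a sigmoid output the result lies between 0 and 1; mapping it onto the index scale used elsewhere in the description would need a further step that the text does not specify.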
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810662593.6 | 2018-06-25 | ||
CN201810662593.6A CN112765442A (en) | 2018-06-25 | 2018-06-25 | Network emotion fluctuation index monitoring and analyzing method and system based on news big data |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020000847A1 (en) | 2020-01-02 |
Family
ID=68985480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/113857 WO2020000847A1 (en) | 2018-06-25 | 2018-11-03 | News big data-based method and system for monitoring and analyzing risk perception index |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112765442A (en) |
WO (1) | WO2020000847A1 (en) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859074A (en) * | 2020-07-29 | 2020-10-30 | 东北大学 | Internet public opinion information source influence assessment method and system based on deep learning |
CN112084324A (en) * | 2020-08-11 | 2020-12-15 | 同济大学 | Traffic social media data processing method based on BERT and DNN models |
CN112418269A (en) * | 2020-10-23 | 2021-02-26 | 西安电子科技大学 | Method, system and medium for predicting social media network event propagation key time |
CN112434933A (en) * | 2020-11-20 | 2021-03-02 | 温州大学瓯江学院 | Quantitative evaluation method for media influence of public social platform |
CN112559845A (en) * | 2020-12-23 | 2021-03-26 | 北京清博大数据科技有限公司 | Method and system for identifying identity and motivation of atypical media account |
CN113128207A (en) * | 2021-05-10 | 2021-07-16 | 安徽博约信息科技股份有限公司 | News speaking right evaluation and prediction method based on big data |
CN113222471A (en) * | 2021-06-04 | 2021-08-06 | 西安交通大学 | Asset wind control method and device based on new media data |
CN113420946A (en) * | 2021-01-20 | 2021-09-21 | 广州麦媒信息科技有限公司 | News media evaluation method |
CN113537206A (en) * | 2020-07-31 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Pushed data detection method and device, computer equipment and storage medium |
CN113569188A (en) * | 2021-06-03 | 2021-10-29 | 大连交通大学 | DI-SCIR-based double-layer coupling social network public opinion propagation model construction method |
CN113742401A (en) * | 2020-05-27 | 2021-12-03 | 阿里巴巴集团控股有限公司 | Data display method, device, equipment and storage medium |
CN113779195A (en) * | 2021-08-31 | 2021-12-10 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Hot event state evaluation method |
CN113807645A (en) * | 2021-07-26 | 2021-12-17 | 北京清博智能科技有限公司 | Industrial chain risk deduction method based on open source information |
CN113946680A (en) * | 2021-10-20 | 2022-01-18 | 河南师范大学 | Online network rumor identification method based on graph embedding and information flow analysis |
CN114021941A (en) * | 2021-11-01 | 2022-02-08 | 航天科工网络信息发展有限公司 | Method for risk assessment by using unstructured data |
CN117131161A (en) * | 2023-10-24 | 2023-11-28 | 北京社会管理职业学院(民政部培训中心) | Electric wheelchair user demand extraction method and system and electronic equipment |
WO2024098516A1 (en) * | 2022-11-07 | 2024-05-16 | 中电科大数据研究院有限公司 | Social network key node mining method and device and storage medium |
CN118051631A (en) * | 2024-02-23 | 2024-05-17 | 武汉理工大学 | Information analysis management method and system for digital new media based on big data |
CN118171920A (en) * | 2024-05-15 | 2024-06-11 | 山东浪潮智慧建筑科技有限公司 | LLM model-based park safety emergency response method, device and medium |
CN118229150A (en) * | 2024-04-09 | 2024-06-21 | 北京麦克斯泰科技有限公司 | Media influence calculation method and system |
CN118227666A (en) * | 2024-04-12 | 2024-06-21 | 中国标准化研究院 | Regional development data comparison query method based on index quantization model |
CN118551094A (en) * | 2024-04-26 | 2024-08-27 | 中国标准化研究院 | Public opinion information generation method based on knowledge graph |
CN118733780A (en) * | 2024-08-30 | 2024-10-01 | 山东福生佳信科技股份有限公司 | Thematic data processing method and system |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114328907A (en) * | 2021-10-22 | 2022-04-12 | 浙江嘉兴数字城市实验室有限公司 | Natural language processing method for early warning risk upgrade event |
CN114116961B (en) * | 2021-10-26 | 2024-09-06 | 福州外语外贸学院 | Information analysis method based on big data |
CN117670413A (en) * | 2023-12-13 | 2024-03-08 | 中教畅享科技股份有限公司 | Market crowd behavior-based market prediction method |
CN118114664A (en) * | 2024-04-25 | 2024-05-31 | 一网互通(北京)科技有限公司 | Data processing method and device of social media mixing platform and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104951548A (en) * | 2015-06-24 | 2015-09-30 | 烟台中科网络技术研究所 | Method and system for calculating negative public opinion index |
CN106408343A (en) * | 2016-09-23 | 2017-02-15 | 广州李子网络科技有限公司 | Modeling method and device for user behavior analysis and prediction based on BP neural network |
CN107592306A (en) * | 2017-09-08 | 2018-01-16 | 四川省绵阳太古软件有限公司 | Information security monitoring management method and system based on environment of internet of things big data |
CN108108454A (en) * | 2017-12-28 | 2018-06-01 | 中译语通科技(青岛)有限公司 | A kind of tourism big data system based on multilingual the analysis of public opinion |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120296845A1 (en) * | 2009-12-01 | 2012-11-22 | Andrews Sarah L | Methods and systems for generating composite index using social media sourced data and sentiment analysis |
CN105068991A (en) * | 2015-07-30 | 2015-11-18 | 成都鼎智汇科技有限公司 | Big data based public sentiment discovery method |
CN105740228B (en) * | 2016-01-25 | 2019-06-04 | 云南大学 | A kind of internet public feelings analysis method and system |
CN106227885A (en) * | 2016-08-08 | 2016-12-14 | 星河互联集团有限公司 | Processing method, device and the terminal of a kind of big data |
CN107229610B (en) * | 2017-03-17 | 2019-06-21 | 咪咕数字传媒有限公司 | A kind of analysis method and device of affection data |
CN107357860A (en) * | 2017-06-30 | 2017-11-17 | 中山大学 | A kind of personal share mood assemblage method based on news data |
2018
- 2018-06-25: CN application CN201810662593.6A, published as CN112765442A (en); status: active, pending
- 2018-11-03: WO application PCT/CN2018/113857, published as WO2020000847A1 (en); status: active, application filing
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113742401A (en) * | 2020-05-27 | 2021-12-03 | 阿里巴巴集团控股有限公司 | Data display method, device, equipment and storage medium |
CN111859074B (en) * | 2020-07-29 | 2023-12-29 | 东北大学 | Network public opinion information source influence evaluation method and system based on deep learning |
CN111859074A (en) * | 2020-07-29 | 2020-10-30 | 东北大学 | Internet public opinion information source influence assessment method and system based on deep learning |
CN113537206B (en) * | 2020-07-31 | 2023-11-10 | 腾讯科技(深圳)有限公司 | Push data detection method, push data detection device, computer equipment and storage medium |
CN113537206A (en) * | 2020-07-31 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Pushed data detection method and device, computer equipment and storage medium |
CN112084324B (en) * | 2020-08-11 | 2024-06-04 | 同济大学 | Traffic social media data processing method based on BERT and DNN models |
CN112084324A (en) * | 2020-08-11 | 2020-12-15 | 同济大学 | Traffic social media data processing method based on BERT and DNN models |
CN112418269A (en) * | 2020-10-23 | 2021-02-26 | 西安电子科技大学 | Method, system and medium for predicting social media network event propagation key time |
CN112418269B (en) * | 2020-10-23 | 2024-04-16 | 西安电子科技大学 | Social media network event propagation key time prediction method, system and medium |
CN112434933A (en) * | 2020-11-20 | 2021-03-02 | 温州大学瓯江学院 | Quantitative evaluation method for media influence of public social platform |
CN112559845A (en) * | 2020-12-23 | 2021-03-26 | 北京清博大数据科技有限公司 | Method and system for identifying identity and motivation of atypical media account |
CN113420946A (en) * | 2021-01-20 | 2021-09-21 | 广州麦媒信息科技有限公司 | News media evaluation method |
CN113420946B (en) * | 2021-01-20 | 2024-02-09 | 广州麦媒信息科技有限公司 | News media evaluation method |
CN113128207B (en) * | 2021-05-10 | 2024-03-29 | 安徽博约信息科技股份有限公司 | News speaking right assessment and prediction method based on big data |
CN113128207A (en) * | 2021-05-10 | 2021-07-16 | 安徽博约信息科技股份有限公司 | News speaking right evaluation and prediction method based on big data |
CN113569188A (en) * | 2021-06-03 | 2021-10-29 | 大连交通大学 | DI-SCIR-based double-layer coupling social network public opinion propagation model construction method |
CN113569188B (en) * | 2021-06-03 | 2024-04-09 | 大连交通大学 | DI-SCIR-based double-layer coupling social network public opinion propagation model construction method |
CN113222471B (en) * | 2021-06-04 | 2023-06-06 | 西安交通大学 | Asset wind control method and device based on new media data |
CN113222471A (en) * | 2021-06-04 | 2021-08-06 | 西安交通大学 | Asset wind control method and device based on new media data |
CN113807645A (en) * | 2021-07-26 | 2021-12-17 | 北京清博智能科技有限公司 | Industrial chain risk deduction method based on open source information |
CN113779195A (en) * | 2021-08-31 | 2021-12-10 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Hot event state evaluation method |
CN113779195B (en) * | 2021-08-31 | 2023-12-22 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Hot event state evaluation method |
CN113946680B (en) * | 2021-10-20 | 2024-04-16 | 河南师范大学 | Online network rumor identification method based on graph embedding and information flow analysis |
CN113946680A (en) * | 2021-10-20 | 2022-01-18 | 河南师范大学 | Online network rumor identification method based on graph embedding and information flow analysis |
CN114021941A (en) * | 2021-11-01 | 2022-02-08 | 航天科工网络信息发展有限公司 | Method for risk assessment by using unstructured data |
WO2024098516A1 (en) * | 2022-11-07 | 2024-05-16 | 中电科大数据研究院有限公司 | Social network key node mining method and device and storage medium |
CN117131161A (en) * | 2023-10-24 | 2023-11-28 | 北京社会管理职业学院(民政部培训中心) | Electric wheelchair user demand extraction method and system and electronic equipment |
CN118051631A (en) * | 2024-02-23 | 2024-05-17 | 武汉理工大学 | Information analysis management method and system for digital new media based on big data |
CN118229150A (en) * | 2024-04-09 | 2024-06-21 | 北京麦克斯泰科技有限公司 | Media influence calculation method and system |
CN118227666A (en) * | 2024-04-12 | 2024-06-21 | 中国标准化研究院 | Regional development data comparison query method based on index quantization model |
CN118551094A (en) * | 2024-04-26 | 2024-08-27 | 中国标准化研究院 | Public opinion information generation method based on knowledge graph |
CN118171920A (en) * | 2024-05-15 | 2024-06-11 | 山东浪潮智慧建筑科技有限公司 | LLM model-based park safety emergency response method, device and medium |
CN118733780A (en) * | 2024-08-30 | 2024-10-01 | 山东福生佳信科技股份有限公司 | Thematic data processing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112765442A (en) | 2021-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020000847A1 (en) | | News big data-based method and system for monitoring and analyzing risk perception index |
Abd El-Jawad et al. | | Sentiment analysis of social media networks using machine learning |
Meng et al. | | Rating the crisis of online public opinion using a multi-level index system |
Su et al. | | Analyzing public sentiments online: Combining human-and computer-based content analysis |
Zhou et al. | | Real world city event extraction from Twitter data streams |
CN111967761A (en) | | Monitoring and early warning method and device based on knowledge graph and electronic equipment |
CN109300042A (en) | | A kind of air control system based on big data |
CN111914141B (en) | | Public opinion knowledge base construction method and public opinion knowledge base |
Jia et al. | | The public sentiment analysis of double reduction policy on weibo platform |
Yuan et al. | | A hybrid method for multi-class sentiment analysis of micro-blogs |
CN112632218A (en) | | Network public opinion monitoring method for enterprise crisis public customs |
Su et al. | | An improved BERT method for the evolution of network public opinion of major infectious diseases: Case Study of COVID-19 |
Tong et al. | | Multimedia network public opinion supervision prediction algorithm based on big data |
CN111241288A (en) | | Emergency sensing system of large centralized power customer service center and construction method |
Zhou | | Detecting the public’s information behaviour preferences in multiple emergency events |
Wan et al. | | Evaluation model of power operation and maintenance based on text emotion analysis |
Mu et al. | | EventSys: Tracking event evolution on microblogging platforms |
Fahim et al. | | Identifying social media content supporting proud boys |
Chen et al. | | A smart urban management information public opinion analysis system |
Li et al. | | Sentiment analysis and prediction model based on Chinese government affairs microblogs |
Vitiugin et al. | | Multilingual Serviceability Model for Detecting and Ranking Help Requests on Social Media during Disasters |
Song et al. | | [Retracted] Network Sentiment Analysis of College Students in Different Epidemic Stages Based on Text Clustering |
Cui et al. | | [Retracted] Big Data Enabled the Development of Public Sports Health Emergency Corpus: Taking MACPHE as an Example |
Cherichi et al. | | Using big data values to enhance social event detection pattern |
Lyu et al. | | Analysis of gender sentiment expression in network based on TF-LDA algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18924850; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 18924850; Country of ref document: EP; Kind code of ref document: A1 |