CN104965894A - Data analysis system for IDC hazardous information monitoring platform - Google Patents
Data analysis system for IDC hazardous information monitoring platform
- Publication number
- CN104965894A (application number CN201510343194.XA)
- Authority
- CN
- China
- Prior art keywords
- search
- unit
- keyword
- harmful information
- idc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9038—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a data analysis system for an IDC hazardous information monitoring platform. A hazardous information searching unit comprises one of, or a combination of, a keyword filter, a label field filter, a metadata field filter and a time filter, and completes accurate searches by means of these filters and their combinations. A keyword processing unit generates a keyword search instruction, and the hazardous information searching unit executes a hazardous information search task according to that instruction. A fuzzy matching unit matches similar words to an input search character string, so that while searching for the search character string the hazardous information searching unit also searches for the similar words and returns their search results. An automatic word partitioning unit automatically extracts keywords from the input search character string, so that the hazardous information searching unit completes an accurate search according to the automatically extracted keywords.
Description
Technical field
The present invention relates to a data analysis system for an IDC harmful information monitoring platform.
Background art
With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. Search engines, as tools that help people retrieve information, have become the entry point and guide for users accessing the Web. However, these general-purpose search engines have certain limitations.
In today's increasingly active online community environment, every Internet user may become a publisher and spreader of harmful information, and the channels through which harmful information spreads are increasingly broad, including blogs, news sites, forums, microblogs and other channels. Web crawlers are the foundational technology that makes search engines possible, and the arrival of the big-data era and the rapid development of Internet technology have made web crawler research more significant. To cope with challenges such as the sharp growth in web data volume, short update cycles for online text and dynamically changing page structures, efficient web crawlers that run continuously have become a research focus in harmful information mining.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a data analysis system for an IDC harmful information monitoring platform that collects data related to sensitive words from massive data, actively discovers harmful web pages, and achieves more accurate search through a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
The object of the invention is achieved through the following technical solution: a data analysis system for an IDC harmful information monitoring platform, comprising a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
The harmful information search unit comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node and performs the harmful information search task locally. The network search port starts the search engines of multiple crawler nodes so that the nodes perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the same network search port.
The harmful information search unit also comprises one of, or a combination of, a keyword filter, a tag field filter, a metadata field filter and a time filter, and completes precise searches by means of these filters and their combinations.
The keyword processing unit generates a keyword search instruction, and the harmful information search unit performs the harmful information search task according to this keyword search instruction.
The fuzzy matching unit matches similar words to the input search string, so that while the harmful information search unit searches for the search string it also searches for the similar words and returns their search results.
The automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit completes a precise search according to the automatically extracted keywords.
The keyword search instruction comprises an ID number, category, event title, keyword options, excluded-keyword options, weight and start time. The excluded-keyword option ensures that a web page containing any keyword listed in it is not matched and is not regarded as a harmful information web page.
The present invention also comprises an automatic summary generation unit, which dynamically generates a web page summary for a target web page according to the input search string and its similar words.
The automatic summary generation unit also performs keyword analysis on a web page through the keyword processing unit, automatically extracting key fields to generate the web page summary.
The present invention also comprises a result statistical analysis unit, which performs statistical analysis on the returned search results. The statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task distribution analysis module.
The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including statistics on harmful information content, counts of matched keywords and counts of web pages by category.
The report generation module generates reports from the search result information.
The task trend analysis module generates increment charts.
The task distribution analysis module generates task lists, website distribution charts and media distribution charts.
The search results include harmful distribution sites, transmission routes, reply rate, click rate and participant information.
The beneficial effects of the invention are as follows. The proposed data analysis system for an IDC harmful information monitoring platform collects data related to sensitive words from massive data and actively discovers harmful information. It records related information such as harmful distribution sites, transmission routes, reply rate, click rate and participants, which assists in analyzing the popularity, importance and development trend of harmful web pages and allows harmful information to be analyzed accurately. Virtual identities of suspicious persons can be set for focused monitoring, and their scope of activity, disseminated content, activity times and so on are analyzed from the collected data. Analysis of data that classifies the nature of speech can be configured. Event popularity can be located and analyzed quickly.
Brief description of the drawings
Fig. 1 is a structural block diagram of the data analysis system of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawing, but the scope of protection of the present invention is not limited to what follows.
As shown in Fig. 1, a data analysis system for an IDC harmful information monitoring platform comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit.
1. The harmful information search unit comprises a local search port and a network search port. The local search port starts the search engine of the local crawler node and performs the harmful information search task locally. The network search port starts the search engines of multiple crawler nodes so that the nodes perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the same network search port.
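The split between the two ports can be sketched as follows. This is a minimal illustration, not the patented implementation: a crawler node is modeled as a plain function over an in-memory index, and the names `local_search`, `network_search` and `make_node` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumption: a crawler node is modeled as a function from a query string
# to a list of matching URLs. Real nodes would be remote services.
CrawlerNode = Callable[[str], List[str]]


def make_node(pages: Dict[str, str]) -> CrawlerNode:
    """Build a toy node that searches an in-memory {url: text} index."""
    return lambda query: [url for url, text in pages.items() if query in text]


def local_search(node: CrawlerNode, query: str) -> List[str]:
    """Local search port: run the task on the local crawler node only."""
    return node(query)


def network_search(nodes: Dict[str, CrawlerNode], query: str) -> List[str]:
    """Network search port: run the task on all nodes at the same time and
    merge the results for the local node, dropping duplicate URLs."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = [pool.submit(node, query) for node in nodes.values()]
        merged: List[str] = []
        seen = set()
        for future in futures:  # iterate in submission order
            for url in future.result():
                if url not in seen:
                    seen.add(url)
                    merged.append(url)
    return merged


if __name__ == "__main__":
    a = make_node({"http://a.example/1": "spam offer"})
    b = make_node({"http://b.example/1": "spam link"})
    print(network_search({"a": a, "b": b}, "spam"))
```

Threads are used here only because the toy nodes are local functions; with real remote nodes the same structure would wrap network calls.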
The harmful information search unit also comprises one of, or a combination of, a keyword filter, a tag field filter, a metadata field filter and a time filter, and completes precise searches by means of these filters and their combinations, for example by assigning search weights to keywords or by searching with weighted combinations of multiple metadata fields.
Keyword filter: supports combining keywords with logical expressions, including AND, OR and NOT.
Tag field filter: supports restricting a search with AND, OR and NOT combinations of multiple tag fields.
Metadata field filter: multiple metadata fields can be defined, and search results are selected by parameter.
Time filter: supports sorting by date, by relevance and by combinations of other fields.
Field tag search works by indexing text as tag fields; the user can select a targeted combination of tags, and the correspondingly restricted results are returned.
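One way to picture the composable filters is as predicates that can be combined with AND, OR and NOT. The sketch below is illustrative only: the `Page` record, the filter constructors and the use of a date range for the time filter are assumptions, not details given in the patent.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class Page:
    """Hypothetical record for a crawled page."""
    url: str
    text: str
    tags: Set[str] = field(default_factory=set)
    metadata: Dict[str, str] = field(default_factory=dict)
    published: date = date(1970, 1, 1)


Filter = Callable[[Page], bool]


def keyword(word: str) -> Filter:
    """Keyword filter: the page text contains the word."""
    return lambda p: word in p.text


def tag(name: str) -> Filter:
    """Tag field filter: the page carries the given tag."""
    return lambda p: name in p.tags


def meta(key: str, value: str) -> Filter:
    """Metadata field filter: a metadata field equals the given value."""
    return lambda p: p.metadata.get(key) == value


def between(start: date, end: date) -> Filter:
    """Time filter (assumed here to be an inclusive date range)."""
    return lambda p: start <= p.published <= end


def all_of(*filters: Filter) -> Filter:  # logical AND
    return lambda p: all(f(p) for f in filters)


def any_of(*filters: Filter) -> Filter:  # logical OR
    return lambda p: any(f(p) for f in filters)


def negate(f: Filter) -> Filter:  # logical NOT
    return lambda p: not f(p)


def search(pages: Iterable[Page], f: Filter) -> List[str]:
    """Return the URLs of the pages accepted by the combined filter."""
    return [p.url for p in pages if f(p)]
```

A query such as "contains pills, does not contain recall, published in 2015" then reads `all_of(keyword("pills"), negate(keyword("recall")), between(date(2015, 1, 1), date(2015, 12, 31)))`.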
The harmful information search unit performs a network-wide search based on hot words from sudden harmful network events, quickly finding the quantity, distribution sites and popularity of harmful content related to the incident.
2. The keyword processing unit generates a keyword search instruction; the harmful information search unit uses Boolean logic expressions and performs the harmful information search task according to this keyword search instruction.
The keyword search instruction comprises an ID number, category, event title, keyword options, excluded-keyword options, weight and start time. The excluded-keyword option ensures that a web page containing any keyword listed in it is not matched and is not regarded as a harmful information web page.
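The instruction fields listed above map naturally onto a small record type. The following sketch assumes that a page matches when it contains at least one keyword and no excluded keyword; the patent does not specify how the keyword options are combined, and the field and function names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class KeywordSearchInstruction:
    """Hypothetical record mirroring the fields named in the text."""
    instruction_id: int
    category: str
    event_title: str
    keywords: List[str]
    excluded_keywords: List[str] = field(default_factory=list)
    weight: float = 1.0
    start_time: datetime = datetime(1970, 1, 1)


def is_harmful(instruction: KeywordSearchInstruction, text: str) -> bool:
    """Return True if the text matches the instruction.

    Any excluded keyword vetoes the match, as the text describes.
    Treating the keywords as an OR list is an assumption of this sketch.
    """
    if any(word in text for word in instruction.excluded_keywords):
        return False
    return any(word in text for word in instruction.keywords)


if __name__ == "__main__":
    instr = KeywordSearchInstruction(
        instruction_id=1,
        category="fraud",
        event_title="lottery scam",
        keywords=["lottery", "prize"],
        excluded_keywords=["official"],
    )
    print(is_harmful(instr, "win a prize now"))
    print(is_harmful(instr, "official lottery results"))
```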
3. The fuzzy matching unit matches similar words to the input search string, so that while the harmful information search unit searches for the search string it also searches for the similar words and returns their search results.
The user can enter a word, a passage or even an entire article; the system analyzes the concepts in the user's search condition and then finds the results the user cares about by conceptual relevance. If the user does not know how to spell the content being queried, a fuzzy search can be used: besides returning the corresponding search results, the system also returns other words close to the input string, allowing the user to find other relevant results.
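The spelling-tolerant part of this behaviour can be approximated with character-level similarity. The sketch below uses `difflib.get_close_matches` from the Python standard library as a stand-in; the patent does not name a matching algorithm, and the `fuzzy_search` function and the toy inverted index are assumptions. Concept-level matching is not covered.

```python
import difflib
from typing import Dict, List


def fuzzy_search(
    query: str,
    vocabulary: List[str],
    index: Dict[str, List[str]],
    n: int = 3,
    cutoff: float = 0.6,
) -> Dict[str, List[str]]:
    """Search for the query and for similar vocabulary words.

    `index` is a toy inverted index mapping a word to the URLs containing it.
    The result maps each searched term (the query first, then its similar
    words) to the URLs found for it.
    """
    similar = [
        word
        for word in difflib.get_close_matches(query, vocabulary, n=n, cutoff=cutoff)
        if word != query
    ]
    return {term: index.get(term, []) for term in [query] + similar}


if __name__ == "__main__":
    vocab = ["gambling", "gamble", "weather"]
    idx = {"gambling": ["u1"], "gamble": ["u2"], "weather": ["u3"]}
    print(fuzzy_search("gamblng", vocab, idx))
```

Because `difflib` compares sequences of characters, the same call also works on Chinese strings, although a production system would more likely use a dedicated similarity measure.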
4. The automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit completes a precise search according to the automatically extracted keywords. Automatic word segmentation is the basis of Chinese information processing and analysis. The unit is based on a dictionary and rules, also makes use of language-model methods based on probabilistic analysis, and can perform segmentation tailored to the particular requirements of different applications.
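The dictionary-based part of such a segmenter can be illustrated with forward maximum matching, a common baseline for Chinese word segmentation. This is a swapped-in textbook technique, not the method claimed in the patent: the rule-based and probabilistic components mentioned above are omitted, and the function names and dictionary are hypothetical.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def extract_keywords(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Keep the segmented tokens that are dictionary words, without repeats."""
    keywords: List[str] = []
    for token in segment(text, dictionary, max_len):
        if token in dictionary and token not in keywords:
            keywords.append(token)
    return keywords


if __name__ == "__main__":
    words = {"有害", "信息", "有害信息", "监测", "平台"}
    print(segment("有害信息监测平台", words))
```

Greedy matching picks the longest word at each step, so "有害信息" is kept whole rather than split into "有害" and "信息"; resolving cases where the greedy choice is wrong is what the probabilistic model would be for.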
5. The present invention also comprises an automatic summary generation unit, which dynamically generates a web page summary for a target web page according to the input search string and its similar words. Different summaries can be generated dynamically for the same page depending on the search string entered. From the summary the user can judge whether the page needs to be opened for inspection, and the dynamic summaries also help the user understand the relationships among the pages in the returned results.
The automatic summary generation unit also performs keyword analysis on a web page through the keyword processing unit, automatically extracting key fields to generate the web page summary. When the user views the full content of a web page, the unit can also generate a summary of the article content automatically; in that case the page does not need to be analyzed against the search string and its similar words.
The automatic summary generation unit can take word frequency, part of speech and position information into account together to extract and analyze keywords accurately and automatically, and generates the web page summary from the keywords it has analyzed.
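A query-dependent summary of this kind can be sketched by scoring sentences on how often they mention the search terms, with a small bonus for sentences near the top of the page. The scoring formula and function names below are assumptions for illustration; part-of-speech information, which the text also mentions, is ignored here.

```python
import re
from collections import Counter
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text: str, query_terms: List[str], max_sentences: int = 1) -> str:
    """Return the sentences that best match the query terms.

    Score = term frequency in the sentence + 1 / (sentence position + 1).
    Sentences containing none of the terms are skipped.
    """
    terms = [t.lower() for t in query_terms]
    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        counts = Counter(re.findall(r"\w+", sentence.lower()))
        frequency = sum(counts[t] for t in terms)
        if frequency == 0:
            continue
        scored.append((frequency + 1.0 / (position + 1), position, sentence))
    # Best score first, earlier sentences winning ties.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))


if __name__ == "__main__":
    page = (
        "The forum opened in May. Users posted gambling links and gambling ads. "
        "Moderators removed spam."
    )
    print(summarize(page, ["gambling"]))
```

Calling `summarize` on the same page with different terms yields different summaries, which is the dynamic behaviour the text describes.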
6. The present invention also comprises a result statistical analysis unit, which performs statistical analysis on the returned search results. The statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task distribution analysis module.
The task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including statistics on harmful information content, counts of matched keywords and counts of web pages by category.
The report generation module generates reports from the search result information, including histograms, line charts, single-bar charts, double-bar charts, triple-bar charts, composite charts and X-Y plots.
The task trend analysis module generates increment charts, including daily, weekly and monthly increment charts.
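The data behind a daily, weekly or monthly increment chart is a count of newly found items per time bucket. A minimal sketch, assuming each search result carries a discovery date and using a hypothetical `increments` helper:

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def increments(dates: Iterable[date], period: str = "day") -> Dict[str, int]:
    """Count items per day, ISO week or month, sorted by bucket label."""

    def bucket(d: date) -> str:
        if period == "day":
            return d.isoformat()
        if period == "week":
            iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
            return f"{iso[0]}-W{iso[1]:02d}"
        if period == "month":
            return f"{d.year}-{d.month:02d}"
        raise ValueError(f"unknown period: {period}")

    counts = Counter(bucket(d) for d in dates)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    found = [date(2015, 6, 15), date(2015, 6, 15), date(2015, 6, 19), date(2015, 7, 1)]
    print(increments(found, "month"))
```

The resulting mapping can be passed to any charting library to draw the increment chart itself.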
The task distribution analysis module generates graphical task lists, website distribution charts and media distribution charts.
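The website and media distributions reduce to counting results per host and per media type. The sketch below assumes each result is a dictionary with hypothetical `url` and `media` keys; the patent does not define a result schema.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping
from urllib.parse import urlparse


def site_distribution(results: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count results per host name, taken from each result's URL."""
    return dict(Counter(urlparse(r["url"]).netloc for r in results))


def media_distribution(results: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count results per media type (for example blog, forum, news)."""
    return dict(Counter(r["media"] for r in results))


if __name__ == "__main__":
    hits = [
        {"url": "http://forum.example.com/t/1", "media": "forum"},
        {"url": "http://forum.example.com/t/2", "media": "forum"},
        {"url": "http://blog.example.org/p/9", "media": "blog"},
    ]
    print(site_distribution(hits))
    print(media_distribution(hits))
```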
The search results include harmful distribution sites, transmission routes, reply rate, click rate and participant information.
The statistical analysis unit provides users with powerful query functions. It analyzes and presents both real-time and historical data, and applies data mining to historical data, which includes historical data, inspection data, network data and monitoring-node data. Query conditions can be set flexibly as required, and several statistical chart forms are provided, such as single-bar charts, double-bar charts, triple-bar charts, composite charts and X-Y plots (plots of coordinate points). The unit can also be combined with a scheduling service to generate reports in several output formats, such as Word, PDF and Excel, and send them to designated users. This enriches the decision-analysis functions and makes it easier for users to query data, analyze trends and formulate adjustment plans. The system is also extensible and lets users edit its display screens.
Claims (7)
1. A data analysis system for an IDC harmful information monitoring platform, characterized in that it comprises a harmful information search unit, an automatic word segmentation unit, a keyword processing unit and a fuzzy matching unit;
the harmful information search unit comprises a local search port and a network search port; the local search port starts the search engine of the local crawler node and performs the harmful information search task locally; the network search port starts the search engines of multiple crawler nodes so that the nodes perform the harmful information search task simultaneously, and the search results are returned to the local crawler node through the same network search port;
the harmful information search unit also comprises one of, or a combination of, a keyword filter, a tag field filter, a metadata field filter and a time filter, and completes precise searches by means of these filters and their combinations;
the keyword processing unit generates a keyword search instruction, and the harmful information search unit performs the harmful information search task according to this keyword search instruction;
the fuzzy matching unit matches similar words to the input search string, so that while the harmful information search unit searches for the search string it also searches for the similar words and returns their search results;
the automatic word segmentation unit automatically extracts keywords from the input search string, so that the harmful information search unit completes a precise search according to the automatically extracted keywords.
2. The data analysis system for an IDC harmful information monitoring platform according to claim 1, characterized in that the keyword search instruction comprises an ID number, category, event title, keyword options, excluded-keyword options, weight and start time; the excluded-keyword option ensures that a web page containing any keyword listed in it is not matched and is not regarded as a harmful information web page.
3. The data analysis system for an IDC harmful information monitoring platform according to claim 1, characterized in that it also comprises an automatic summary generation unit, which dynamically generates a web page summary for a target web page according to the input search string and its similar words.
4. The data analysis system for an IDC harmful information monitoring platform according to claim 3, characterized in that the automatic summary generation unit also performs keyword analysis on a web page through the keyword processing unit, automatically extracting key fields to generate the web page summary.
5. The data analysis system for an IDC harmful information monitoring platform according to claim 1, characterized in that it also comprises a result statistical analysis unit, which performs statistical analysis on the returned search results; the statistical analysis unit comprises a task public-opinion chart generation module, a report generation module, a task article statistics module, a task trend analysis module and a task distribution analysis module.
6. The data analysis system for an IDC harmful information monitoring platform according to claim 5, characterized in that the task public-opinion chart generation module generates a task public-opinion chart from the search conditions and search results, including statistics on harmful information content, counts of matched keywords and counts of web pages by category;
the report generation module generates reports from the search result information;
the task trend analysis module generates increment charts;
the task distribution analysis module generates task lists, website distribution charts and media distribution charts.
7. The data analysis system for an IDC harmful information monitoring platform according to claim 1, characterized in that the search results include harmful distribution sites, transmission routes, reply rate, click rate and participant information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510343194.XA CN104965894A (en) | 2015-06-19 | 2015-06-19 | Data analysis system for IDC hazardous information monitoring platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510343194.XA CN104965894A (en) | 2015-06-19 | 2015-06-19 | Data analysis system for IDC hazardous information monitoring platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104965894A (en) | 2015-10-07 |
Family
ID=54219932
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510343194.XA Pending CN104965894A (en) | 2015-06-19 | 2015-06-19 | Data analysis system for IDC hazardous information monitoring platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104965894A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649366A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Method and device for classifying keyword search results |
CN110674367A (en) * | 2019-09-09 | 2020-01-10 | 广州易起行信息技术有限公司 | Single Chinese character retrieval method and device based on travel industry products |
CN111314292A (en) * | 2020-01-15 | 2020-06-19 | 上海观安信息技术股份有限公司 | Data security inspection method based on sensitive data identification |
CN111858830A (en) * | 2020-03-27 | 2020-10-30 | 北京梦天门科技股份有限公司 | Health supervision law enforcement data retrieval system and method based on natural language processing |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102073683A (en) * | 2010-12-22 | 2011-05-25 | 四川大学 | Distributed real-time news information acquisition system |
CN103399877A (en) * | 2013-08-19 | 2013-11-20 | 四川公用信息产业有限责任公司 | Multi-Android-client service sharing method and system |
CN104281607A (en) * | 2013-07-08 | 2015-01-14 | 上海锐英软件技术有限公司 | Microblog hot topic analyzing method |
Non-Patent Citations (7)
Title |
---|
周亦鹏: "《软件人主题分析和信息检索技术》", 31 August 2012, 北京邮电大学出版社 * |
徐征: "网络不良信息检测系统的设计与实现", 《中国优秀硕士学位论文全文数据库》 * |
王和兴等: "《物联网工程 导论》", 31 December 2014 * |
王守银: "一种网络论坛有害信息监测系统的构建与应用", 《第28次全国计算机安全学术交流会》 * |
王继新等: "《远程教育原理与技术》", 30 November 2005 * |
苏旋: "分布式网络爬虫技术的研究与实现", 《中国优秀硕士学位论文全文数据库》 * |
苏金波等: "基于关键词相关性的有害信息爬虫系统研究", 《计算机技术与发展》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649366A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Method and device for classifying keyword search results |
CN110674367A (en) * | 2019-09-09 | 2020-01-10 | 广州易起行信息技术有限公司 | Single Chinese character retrieval method and device based on travel industry products |
CN110674367B (en) * | 2019-09-09 | 2022-02-01 | 广州易起行信息技术有限公司 | Single Chinese character retrieval method and device based on travel industry products |
CN111314292A (en) * | 2020-01-15 | 2020-06-19 | 上海观安信息技术股份有限公司 | Data security inspection method based on sensitive data identification |
CN111858830A (en) * | 2020-03-27 | 2020-10-30 | 北京梦天门科技股份有限公司 | Health supervision law enforcement data retrieval system and method based on natural language processing |
CN111858830B (en) * | 2020-03-27 | 2023-11-14 | 北京梦天门科技股份有限公司 | Health supervision law enforcement data retrieval system and method based on natural language processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104951539A (en) | Internet data center harmful information monitoring system | |
AU2022201654A1 (en) | System and engine for seeded clustering of news events | |
CN104182389B (en) | A kind of big data analyzing business intelligence service system based on semanteme | |
CN104899324B (en) | One kind monitoring systematic sample training system based on IDC harmful informations | |
US10146878B2 (en) | Method and system for creating filters for social data topic creation | |
CN107729336A (en) | Data processing method, equipment and system | |
CN102915335B (en) | Based on the information correlation method of user operation records and resource content | |
CN106777043A (en) | A kind of academic resources acquisition methods based on LDA | |
CN101751458A (en) | Network public sentiment monitoring system and method | |
CN107885793A (en) | A kind of hot microblog topic analyzing and predicting method and system | |
CN103294664A (en) | Method and system for discovering new words in open fields | |
KR20220064016A (en) | Method for extracting construction safety accident based data mining using big data | |
CN104965823A (en) | Big data based opinion extraction method | |
Wu et al. | Extracting topics based on Word2Vec and improved Jaccard similarity coefficient | |
CN103942268A (en) | Method and device for combining search and application and application interface | |
CA2956627A1 (en) | System and engine for seeded clustering of news events | |
CN107330111A (en) | The search method and device of domain body based on common version body | |
CN104965894A (en) | Data analysis system for IDC hazardous information monitoring platform | |
CN116992010A (en) | Content distribution and interaction method and system based on multi-mode large model | |
CN104636386A (en) | Information monitoring method and device | |
KR102107474B1 (en) | Social issue deduction system and method using crawling | |
Pandya et al. | Mated: metadata-assisted twitter event detection system | |
CN102902705A (en) | Locating ambiguities in data | |
Zhao et al. | Hot question prediction in Stack Overflow | |
KR20220068793A (en) | Method for providing news analysis service using robotic process automation monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20151007 |