CN104484566A - Big data analysis system and big data analysis method - Google Patents
Abstract
The invention relates to the technical field of information processing, and in particular to a big data analysis system and a big data analysis method with low complexity, fast computation, and high search efficiency. The big data analysis system is characterized in that it is provided with a data analysis and extraction module for extracting key information, a data preprocessing module for preprocessing the key information, a network construction module for abstracting the preprocessed data into a network graph model, an operation discovery module for partitioning the network graph model and performing further analysis, and a result output module for outputting the discovery results. Compared with the prior art, the system and method can classify, delimit, and output a potential group by rapidly processing communication data between individuals, and offer notable advantages such as fast processing and high analysis efficiency.
Description
Technical field:
The present invention relates to the technical field of information processing, and specifically to a big data analysis system and method with low complexity, fast computation, and high search efficiency.
Background art:
Big data, also called massive data, refers to data sets so large that current mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that helps enterprises make better business decisions. The strategic significance of big data lies not in holding huge volumes of data but in processing the meaningful data professionally. In other words, if big data is compared to an industry, the key to that industry's profitability is improving its capacity to "process" data and adding value to the data through that processing.
The process of extracting implicit, previously unknown, but potentially useful information from large volumes of incomplete, noisy, fuzzy, and random data is called data mining. Data mining is clearly the key to big data technology.
Summary of the invention:
In view of the shortcomings and defects of the prior art, the present invention proposes a big data analysis system and method with low complexity, fast computation, and high search efficiency.
The present invention is achieved by the following measures:
A big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis; and a result output module for outputting the discovery results.
The present invention also proposes a big data analysis method, characterized in that it comprises the following steps:
Step 1: extract the key information with the data analysis and extraction module, the key information being the call record data between individuals;
Step 2: preprocess the key information obtained in Step 1 with the data preprocessing module;
Step 3: build the network graph model: abstract each individual as a node in the network graph and each contact between individuals as an edge in the network graph, and, using the data extracted in Step 1, store the network graph model in matrix form;
Step 4: set the analysis parameters and the operation threshold, where the analysis parameters include the number of individuals and the operation threshold limits the number of individuals output;
Step 5: run the discovery algorithm to partition the network graph model and perform further analysis;
Step 6: output the operation result.
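The overall flow of Steps 3 to 6 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text does not name the discovery algorithm, so connected components over sufficiently heavy edges stand in for it here; the names `build_graph`, `discover_groups`, `analyze`, `min_weight`, and `output_limit` are hypothetical; and the preprocessing of Step 2 is omitted.

```python
from collections import defaultdict


def build_graph(records):
    """Step 3: map each individual to a node id and each contact to a weighted edge."""
    ids = {}    # individual -> node id
    edges = {}  # (i, j) with i < j -> accumulated contact weight
    for a, b, weight in records:
        i = ids.setdefault(a, len(ids))
        j = ids.setdefault(b, len(ids))
        key = (min(i, j), max(i, j))
        edges[key] = edges.get(key, 0.0) + weight
    return ids, edges


def discover_groups(n_nodes, edges, min_weight):
    """Step 5 stand-in (assumption): connected components over edges >= min_weight."""
    adj = defaultdict(list)
    for (i, j), w in edges.items():
        if w >= min_weight:
            adj[i].append(j)
            adj[j].append(i)
    seen, groups = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups


def analyze(records, min_weight, output_limit):
    """Steps 3 to 6: build the graph, partition it, and output at most output_limit individuals."""
    ids, edges = build_graph(records)
    names = {node: person for person, node in ids.items()}
    groups = discover_groups(len(ids), edges, min_weight)
    # Keep only real groups (more than one member), largest first.
    groups = sorted((g for g in groups if len(g) > 1), key=len, reverse=True)
    result, count = [], 0
    for group in groups:
        if count + len(group) > output_limit:  # Step 4: operation threshold
            break
        result.append([names[i] for i in group])
        count += len(group)
    return result


records = [("a", "b", 2.0), ("b", "c", 3.0), ("d", "e", 0.1), ("f", "g", 5.0)]
groups = analyze(records, min_weight=1.0, output_limit=5)
```

Here `output_limit` plays the role of the operation threshold from Step 4; how the threshold is actually applied is not specified in the text, so capping the total number of reported individuals is just one plausible reading.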
Step 1 of the present invention is implemented through the following steps:
Step 1-1: assign a unique id to each individual; this id later corresponds to a node in the network graph;
Step 1-2: for any two individuals i and j between which communication records exist, compute the contact weight coefficient between i and j from the call duration, the number of calls, and the call frequency, using the formula:
$$W_{ij} = e^{\phi(t) + \theta(n) + \gamma(f)}$$
where $W_{ij}$ is the weight value and $\phi(t)$, $\theta(n)$, $\gamma(f)$ are functions of the call duration t, the number of calls n, and the call frequency f, respectively. The concrete form of each function is determined by the application scenario and the user's experience; an exponential decay function, a linear function, and so on may be chosen. If the user needs to consider further factors, it suffices to add new mapping functions to the exponent.
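A minimal sketch of the weight formula of Step 1-2 follows. The text leaves the form of the three mapping functions to the user, so the linear functions and their coefficients below are purely illustrative assumptions, as are the names `contact_weight` and `extra_terms`.

```python
import math


def contact_weight(t, n, f, phi, theta, gamma, extra_terms=()):
    """W_ij = exp(phi(t) + theta(n) + gamma(f) + any additional mapping terms)."""
    exponent = phi(t) + theta(n) + gamma(f) + sum(extra_terms)
    return math.exp(exponent)


# Hypothetical mapping functions; the coefficients are illustrative only.
def phi(t):
    """Linear in the total call duration t (seconds)."""
    return 0.001 * t


def theta(n):
    """Linear in the number of calls n."""
    return 0.05 * n


def gamma(f):
    """Linear in the call frequency f (calls per day)."""
    return 0.1 * f


w = contact_weight(t=600, n=10, f=2, phi=phi, theta=theta, gamma=gamma)
```

Passing further values through `extra_terms` corresponds to adding new mapping functions to the exponent when more factors need to be considered.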
Step 2 of the present invention specifically comprises the following steps:
Step 2-1: draw data from the database obtained in Step 1 to form a training set X for training the hash function; the size n of the training set is determined by [formula missing from the source text], where $t_{\alpha/2}$ is the value for the chosen confidence level and can be obtained from a table of critical values of the t distribution, and $\varepsilon$ is the maximum permissible error;
Step 2-2: train the hash function with X; first design an objective function that converts the high-dimensional real-valued data into low-dimensional data, the objective function being defined as: [formula missing from the source text];
Step 2-3: binary-encode the instances in the database that do not yet have a binary code. For each instance x, its low-dimensional real-valued representation is obtained by $s = (B'B + 2I)^{-1} B'x$, and its low-dimensional binary code is then obtained through the hash function, where B is the basis space defined in Step 2-2 and I is the identity matrix of matching dimension. Encoding the whole database in this way completes the preprocessing of the data.
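The encoding of Step 2-3 can be sketched in plain Python as follows. Several points are assumptions rather than statements from the text: B is stored as a d-by-k list of rows whose columns are the basis vectors, the linear system is solved by Gaussian elimination instead of forming the inverse explicitly, and the hash function is taken to be a simple per-component threshold, since the text does not define it. The names `solve`, `encode`, and `threshold` are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def encode(B, x, threshold=0.0):
    """Return (s, bits) with s = (B'B + 2I)^-1 B'x and bits = s thresholded (assumed hash)."""
    d, k = len(B), len(B[0])
    # gram = B'B + 2I, a k-by-k matrix
    gram = [[sum(B[r][i] * B[r][j] for r in range(d)) + (2.0 if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    # rhs = B'x, a vector of length k
    rhs = [sum(B[r][i] * x[r] for r in range(d)) for i in range(k)]
    s = solve(gram, rhs)
    bits = [1 if value > threshold else 0 for value in s]
    return s, bits


B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # d = 3, k = 2
s, bits = encode(B, [3.0, 6.0, 9.0], threshold=1.5)
```

Because $B'B + 2I$ is positive definite, the system always has a unique solution, so the elimination never meets a zero pivot.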
In Step 3 of the present invention, the network graph model can be expressed as G = (V, E), where G denotes the network graph, V the set of all nodes in the graph, and E the set of all edges in the graph. Every edge in E carries a weight, and the edge weights are stored in a separate vector list.
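One way to hold G = (V, E) with the edge weights in a separate list is sketched below. The class name `NetworkGraph`, its methods, and the choice of a symmetric (undirected) matrix are assumptions made for illustration; the text only states that the model is stored in matrix form and that the weights live in their own vector list.

```python
class NetworkGraph:
    """G = (V, E) with edge weights kept in a list parallel to the edge list."""

    def __init__(self):
        self.node_ids = {}  # V: individual -> unique node id (Step 1-1)
        self.edges = []     # E: list of (i, j) node-id pairs
        self.weights = []   # weights[e] is the weight of edges[e]

    def node(self, individual):
        """Return the node id of an individual, assigning a new id on first sight."""
        return self.node_ids.setdefault(individual, len(self.node_ids))

    def add_edge(self, a, b, weight):
        """Record a contact between individuals a and b with the given weight."""
        self.edges.append((self.node(a), self.node(b)))
        self.weights.append(weight)

    def to_matrix(self):
        """Build the matrix form of the graph; treated as undirected (assumption)."""
        n = len(self.node_ids)
        matrix = [[0.0] * n for _ in range(n)]
        for (i, j), w in zip(self.edges, self.weights):
            matrix[i][j] = matrix[j][i] = w
        return matrix


g = NetworkGraph()
g.add_edge("alice", "bob", 2.5)
g.add_edge("bob", "carol", 1.0)
matrix = g.to_matrix()
```

Keeping the weights apart from the edge list means they can be recomputed, for example with a different choice of mapping functions, without touching the graph topology.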
Compared with the prior art, the present invention rapidly processes the communication data between individuals and can thereby classify, delimit, and output a potential group; it offers notable advantages such as fast processing and high analysis efficiency.
Brief description of the drawings:
Figure 1 is a structural block diagram of the present invention.
Figure 2 is a flowchart of the present invention.
Detailed description of the embodiments:
The present invention is further described below with reference to the accompanying drawings.
As shown in Figure 1, the present invention proposes a big data analysis system provided with a data analysis and extraction module for extracting key information, a data preprocessing module for preprocessing the key information, a network construction module for abstracting the preprocessed data into a network graph model, an operation discovery module for partitioning the network graph model and performing further analysis, and a result output module for outputting the discovery results.
As shown in Figure 2, the present invention also proposes a big data analysis method. The method comprises Step 1 to Step 6, with Step 1 refined into Steps 1-1 and 1-2, Step 2 refined into Steps 2-1 to 2-3, and the network graph model of Step 3 expressed as G = (V, E), all exactly as set out in the Summary of the invention above.
Claims (5)
1. A big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis; and a result output module for outputting the discovery results.
2. A big data analysis method, characterized in that it comprises the following steps:
Step 1: extract the key information with the data analysis and extraction module, the key information being the call record data between individuals;
Step 2: preprocess the key information obtained in Step 1 with the data preprocessing module;
Step 3: build the network graph model: abstract each individual as a node in the network graph and each contact between individuals as an edge in the network graph, and, using the data extracted in Step 1, store the network graph model in matrix form;
Step 4: set the analysis parameters and the operation threshold, where the analysis parameters include the number of individuals and the operation threshold limits the number of individuals output;
Step 5: run the discovery algorithm to partition the network graph model and perform further analysis;
Step 6: output the operation result.
3. The big data analysis method according to claim 2, characterized in that Step 1 is implemented through the following steps:
Step 1-1: assign a unique id to each individual; this id later corresponds to a node in the network graph;
Step 1-2: for any two individuals i and j between which communication records exist, compute the contact weight coefficient between i and j from the call duration, the number of calls, and the call frequency, using the formula:
$$W_{ij} = e^{\phi(t) + \theta(n) + \gamma(f)}$$
where $W_{ij}$ is the weight value and $\phi(t)$, $\theta(n)$, $\gamma(f)$ are functions of the call duration t, the number of calls n, and the call frequency f, respectively. The concrete form of each function is determined by the application scenario and the user's experience; an exponential decay function, a linear function, and so on may be chosen. If the user needs to consider further factors, it suffices to add new mapping functions to the exponent.
4. The big data analysis method according to claim 2, characterized in that Step 2 specifically comprises the following steps:
Step 2-1: draw data from the database obtained in Step 1 to form a training set X for training the hash function; the size n of the training set is determined by [formula missing from the source text], where $t_{\alpha/2}$ is the value for the chosen confidence level and can be obtained from a table of critical values of the t distribution, and $\varepsilon$ is the maximum permissible error;
Step 2-2: train the hash function with X; first design an objective function that converts the high-dimensional real-valued data into low-dimensional data, the objective function being defined as: [formula missing from the source text], where X is the training set; B is the basis space, each vector of which is a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; $\lambda_1$ and $\lambda_2$ are tunable parameters obtained by ten-fold cross-validation; $w_{i,j}$ is the Gaussian-kernel projection of the Euclidean distance between two instances $X_i$ and $X_j$ in X; $S_i$ and $S_j$ are two vectors in the matrix S; $B_{i,j}$ is the element in row i and column j of the matrix B; i = 1, 2, 3, ..., n indexes the instances; j = 1, 2, 3, ..., k indexes the basis vectors; n is the number of instances; k is the number of basis vectors; and $S \ge 0$ indicates that every element of S is non-negative;
Step 2-3: binary-encode the instances in the database that do not yet have a binary code. For each instance x, its low-dimensional real-valued representation is obtained by $s = (B'B + 2I)^{-1} B'x$, and its low-dimensional binary code is then obtained through the hash function, where B is the basis space defined in Step 2-2 and I is the identity matrix of matching dimension. Encoding the whole database in this way completes the preprocessing of the data.
5. The big data analysis method according to claim 2, characterized in that in Step 3 the network graph model can be expressed as G = (V, E), where G denotes the network graph, V the set of all nodes in the graph, and E the set of all edges in the graph; every edge in E carries a weight, and the edge weights are stored in a separate vector list.
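For illustration only, the pairwise weight $w_{i,j}$ mentioned in claim 4 can be sketched as a Gaussian kernel applied to the Euclidean distance between two instances. The exact kernel form and its bandwidth are not given in the text; the common form $\exp(-\lVert X_i - X_j \rVert^2 / (2\sigma^2))$ and the parameter `sigma` used below are assumptions, as are the function names.

```python
import math


def gaussian_weight(x_i, x_j, sigma=1.0):
    """Assumed form: w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def weight_matrix(X, sigma=1.0):
    """Pairwise weights for all instances in the training set X (a list of vectors)."""
    return [[gaussian_weight(a, b, sigma) for b in X] for a in X]


W = weight_matrix([[0.0, 0.0], [3.0, 4.0]], sigma=5.0)
```

Under this form, identical instances receive weight 1 and the weight decays toward 0 as the instances move apart, so nearby instances dominate the pairwise term of the objective.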
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410783566.6A CN104484566A (en) | 2014-12-16 | 2014-12-16 | Big data analysis system and big data analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104484566A (en) | 2015-04-01 |
Family
ID=52759107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410783566.6A Pending CN104484566A (en) | 2014-12-16 | 2014-12-16 | Big data analysis system and big data analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104484566A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229600A (en) * | 2017-05-31 | 2017-10-03 | 北京邮电大学 | A kind of parallel variance analysis method and device based on big data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402617A (en) * | 2011-12-23 | 2012-04-04 | 天津神舟通用数据技术有限公司 | Easily-compressed database index storage system utilizing fragments and sparse bitmap and corresponding construction, scheduling and query processing methods thereof |
CN102567375A (en) * | 2010-12-27 | 2012-07-11 | 中国移动通信集团公司 | Data mining method and device |
KR20140005474A (en) * | 2012-07-04 | 2014-01-15 | 한국전자통신연구원 | Apparatus and method for providing an application for processing bigdata |
CN103605653A (en) * | 2013-09-29 | 2014-02-26 | 广西师范大学 | Big data searching method based on sparse hash |
Non-Patent Citations (2)
Title |
---|
李孝伟: "基于用户通信行为分析的电信网络社区划分技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
段松青等: "PDM:基于Hadoop的并行数据分析系统", 《湖南大学学报(自然科学版)》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150401 |