CN104484566A - Big data analysis system and big data analysis method

Info

Publication number: CN104484566A
Application number: CN201410783566.6A
Authority: CN
Inventors: 殷晋; 章伟
Original assignee: Wuhu Leruisi Information Consulting Co Ltd
Current assignee: Wuhu Leruisi Information Consulting Co Ltd
Priority date: 2014-12-16
Filing date: 2014-12-16
Publication date: 2015-04-01

Abstract

The invention relates to the technical field of information processing, and in particular to a big data analysis system and a big data analysis method with low complexity, fast computation and high search efficiency. The system comprises: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis; and a result output module for outputting the discovery result. Compared with the prior art, the system and method can classify, delimit and output a potential group by rapidly processing communication data between individuals, and offer fast processing and high analysis efficiency.

Description

Big data analysis system and method

Technical field:

The present invention relates to the technical field of information processing, and specifically to a big data analysis system and method with low complexity, fast computation and high search efficiency.

Background technology:

Big data, also called massive data, refers to data sets so large that current mainstream software tools cannot capture, manage, process and organize them within a reasonable time into information that helps enterprises make better business decisions. The strategic significance of big data lies not in holding huge amounts of data but in processing this meaningful data professionally. In other words, if big data is compared to an industry, the key to making that industry profitable is to improve the "processing capability" for data and to add value through processing.

The process of extracting implicit, previously unknown but potentially useful information from large amounts of incomplete, noisy, fuzzy and random data is called data mining. Data mining is therefore key to big data technology.

Summary of the invention:

To address the shortcomings and defects of the prior art, the present invention proposes a big data analysis system and method with low complexity, fast computation and high search efficiency.

The present invention is achieved by the following measures:

A big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis; and a result output module for outputting the discovery result.

The present invention also proposes a big data analysis method, characterized in that it comprises the following steps:

Step 1: extract key information with the data analysis and extraction module, the key information being the call record data between individuals;

Step 2: preprocess the key information obtained in step 1 with the data preprocessing module;

Step 3: build the network graph model by abstracting each individual as a node in the network graph and each contact between individuals as an edge in the network graph, using the data extracted in step 1, and store the network graph model in the form of a matrix;

Step 4: set the analysis parameters and the operation threshold, where the analysis parameters include the number of individuals and the operation threshold limits the number of individuals output;

Step 5: run the discovery algorithm to partition the network graph model and perform further analysis;

Step 6: output the operation result.
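Steps 4 to 6 can be sketched in Python as follows. The source does not specify the discovery algorithm, so this sketch swaps in a simple stand-in: it drops weak edges and treats each connected component as a group. The function name `discover_groups` and the parameter `min_weight` are hypothetical; `max_individuals` plays the role of the operation threshold that limits how many individuals are output.

```python
from collections import deque


def discover_groups(weights, min_weight, max_individuals):
    """Partition a weighted graph into groups (stand-in for the discovery algorithm).

    weights: symmetric n x n matrix (list of lists); 0 means no contact.
    min_weight: hypothetical cut-off; weaker edges are ignored (use a value > 0).
    max_individuals: cap on the total number of individuals output.
    """
    n = len(weights)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collects one connected component.
        seen[start] = True
        queue, group = deque([start]), []
        while queue:
            u = queue.popleft()
            group.append(u)
            for v in range(n):
                if not seen[v] and weights[u][v] >= min_weight:
                    seen[v] = True
                    queue.append(v)
        groups.append(sorted(group))

    # Keep only real groups (two or more members), largest first.
    groups = [g for g in groups if len(g) > 1]
    groups.sort(key=len, reverse=True)

    # Apply the output limit: stop before the cap would be exceeded.
    output, count = [], 0
    for g in groups:
        if count + len(g) > max_individuals:
            break
        output.append(g)
        count += len(g)
    return output
```

A real deployment would replace the connected-component step with whatever partitioning method the application calls for; only the thresholding and the output cap follow the text directly.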

Step 1 of the present invention is implemented through the following steps:

Step 1-1: assign a unique id to each individual; this id later corresponds to a node in the network graph;

Step 1-2: for any two individuals i and j, if communication records exist between them, compute the contact weight coefficient between i and j from the call duration, the number of calls and the call frequency, using the following formula:

W_{ij} = e^{φ(t) + θ(n) + γ(f)}, where W_{ij} is the weight value and φ(t), θ(n), γ(f) are functions of the call duration t, the number of calls n and the call frequency f, respectively. The concrete form of each function is chosen according to the application scenario and the user's experience; exponential decay functions, linear functions and the like may be used. If the user needs to consider further factors, it suffices to add a new mapping function to the exponent.
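The weight formula of step 1-2 can be illustrated with a minimal Python sketch. The source leaves the form of φ, θ and γ to the user, so the linear functions and their coefficients below are purely hypothetical choices, as is the function name `contact_weight`.

```python
import math


# Hypothetical mapping functions; the source leaves their form to the user.
def phi(t: float) -> float:
    """Linear term for the total call duration t (seconds)."""
    return 0.001 * t


def theta(n: int) -> float:
    """Linear term for the number of calls n."""
    return 0.1 * n


def gamma(f: float) -> float:
    """Linear term for the call frequency f (calls per day)."""
    return 0.5 * f


def contact_weight(t: float, n: int, f: float) -> float:
    """W_ij = exp(phi(t) + theta(n) + gamma(f))."""
    return math.exp(phi(t) + theta(n) + gamma(f))


# Example: 1200 seconds of calls, 8 calls, 0.5 calls per day.
w = contact_weight(t=1200.0, n=8, f=0.5)
```

Because the terms are summed inside the exponent, accounting for an additional factor only requires adding one more mapping function to the sum.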

Step 2 of the present invention specifically comprises the following steps:

Step 2-1: sample data from the database obtained in step 1 to form a training set X for training the hash function. The size n of the training set is determined by a formula (not reproduced in this text) in which t_{α/2} is the critical value for the chosen confidence level, obtainable from a table of t-distribution critical values, and ε is the maximum permissible error;

Step 2-2: train the hash function with X. First, an objective function is designed to map the high-dimensional real-valued data to low-dimensional data. The objective function is defined as:

\min_{B, S} \|X - BS\|_{2} + \lambda_{1} \sum_{i, j} w_{i, j} \|s_{i} - s_{j}\|^{2} + \lambda_{2} \|S\|_{1} \quad \text{s.t.} \quad S > 0, \ \sum_{i} B_{i, j}^{2} \leq 1

where X is the training set; B is the base space, each vector of which is a base vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the base space B; λ_{1} and λ_{2} are adjustable parameters obtained by ten-fold cross-validation; w_{i,j} is the Gaussian-kernel projection of the Euclidean distance between two instances x_{i} and x_{j} in X; s_{i} and s_{j} are two vectors in the matrix S; B_{i,j} is the element in row i and column j of the matrix B; i = 1, 2, 3, ..., n indexes the instances; j = 1, 2, 3, ..., k indexes the base vectors; n is the number of instances; k is the number of base vectors; and S > 0 means that every element of S is non-negative;

Step 2-3: binary-encode the instances in the database that do not yet have a binary code. Specifically, for each instance x, the low-dimensional real value of x is obtained by s = (B'B + 2I)^{-1} B'x, and its low-dimensional binary code is then obtained through the hash function, where B is the base space defined in step 2-2 and I is the identity matrix of matching dimension. Encoding the whole database completes the data preprocessing.
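Steps 2-2 and 2-3 can be sketched with NumPy as follows. The sketch only evaluates the objective for given B and S and encodes a new instance; it does not perform the training itself. Several points are assumptions not stated in the source: instances are stored as columns of X (so X is d x n, B is d x k, S is k x n), the reconstruction term uses the Frobenius norm, the Gaussian kernel has a bandwidth `sigma`, and the final hash step is a simple threshold. All function names are hypothetical.

```python
import numpy as np


def gaussian_weights(X, sigma=1.0):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the columns x_i of X."""
    sq_norms = np.sum(X * X, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def objective(X, B, S, lam1, lam2, sigma=1.0):
    """Evaluate the step 2-2 objective for X (d x n), B (d x k), S (k x n)."""
    W = gaussian_weights(X, sigma)
    reconstruction = np.linalg.norm(X - B @ S)   # Frobenius norm (assumption)
    diffs = S[:, :, None] - S[:, None, :]        # diffs[:, i, j] = s_i - s_j
    smoothness = np.sum(W * np.sum(diffs ** 2, axis=0))
    sparsity = np.sum(np.abs(S))                 # entrywise L1 norm of S
    return reconstruction + lam1 * smoothness + lam2 * sparsity


def is_feasible(B, S):
    """Constraints: S non-negative, each column of B with squared sum <= 1."""
    return bool(np.all(S >= 0) and np.all(np.sum(B ** 2, axis=0) <= 1.0 + 1e-12))


def encode(B, x):
    """Step 2-3: s = (B'B + 2I)^{-1} B'x for a single instance x."""
    k = B.shape[1]
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(B.T @ B + 2.0 * np.eye(k), B.T @ x)


def to_binary(s, threshold=0.0):
    """Hypothetical hash step: one bit per coordinate, set when s > threshold."""
    return (s > threshold).astype(np.uint8)
```

Solving the linear system with `np.linalg.solve` is a design choice for numerical stability; it computes the same s as the explicit inverse in the formula.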

In step 3 of the present invention, the network graph model can be expressed as G = (V, E), where G represents the network graph, V represents the set of all nodes in the graph, and E represents the set of all edges in the graph. Every edge in E carries a weight, and the edge weights are stored in a separate vector list.
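The storage scheme described here (a matrix for the graph plus a separate list of edge weights) might look like the following Python sketch. The function name `build_graph`, the record layout `(caller, callee, weight)` and the choice of an undirected graph are assumptions made for illustration.

```python
def build_graph(records):
    """Build G = (V, E) from contact records.

    records: iterable of (caller, callee, weight) tuples (hypothetical layout).
    Returns the id mapping, the adjacency matrix, the edge list and the
    separate list of edge weights, where weights[e] belongs to edges[e].
    """
    ids = {}       # individual -> unique node id (step 1-1)
    edges = []     # E: pairs of node ids
    weights = []   # separate vector list of edge weights
    for a, b, w in records:
        i = ids.setdefault(a, len(ids))
        j = ids.setdefault(b, len(ids))
        edges.append((i, j))
        weights.append(w)

    n = len(ids)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in edges:
        # Contacts are treated as undirected (assumption).
        adjacency[i][j] = adjacency[j][i] = 1
    return ids, adjacency, edges, weights
```

Keeping the weights in their own list, indexed in parallel with the edge list, mirrors the text; the adjacency matrix then only needs to record whether a contact exists.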

Compared with the prior art, the present invention can classify, delimit and output a potential group by rapidly processing the communication data between individuals, and has the significant advantages of fast processing and high analysis efficiency.

Brief description of the drawings:

Figure 1 is a structural block diagram of the present invention.

Figure 2 is a flow chart of the present invention.

Detailed description of the embodiments:

The present invention is further described below with reference to the accompanying drawings.

As shown in Figure 1, the present invention proposes a big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis; and a result output module for outputting the discovery result.

As shown in Figure 2, the present invention also proposes a big data analysis method, characterized in that it comprises the following steps:

Step 6: output the operation result.

Step 1 of the present invention is implemented through the following steps:

W_{ij} = e^{φ(t) + θ(n) + γ(f)}, where W_{ij} is the weight value and φ(t), θ(n), γ(f) are functions of the call duration t, the number of calls n and the call frequency f, respectively. The concrete form of each function is chosen according to the application scenario and the user's experience; exponential decay functions, linear functions and the like may be used. If the user needs to consider further factors, it suffices to add a new mapping function to the exponent.

Step 2 of the present invention specifically comprises the following steps:

\min_{B, S} \|X - BS\|_{2} + \lambda_{1} \sum_{i, j} w_{i, j} \|s_{i} - s_{j}\|^{2} + \lambda_{2} \|S\|_{1} \quad \text{s.t.} \quad S > 0, \ \sum_{i} B_{i, j}^{2} \leq 1

Claims

1. A big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis; and a result output module for outputting the discovery result.

2. A big data analysis method, characterized in that it comprises the following steps:

Step 6: output the operation result.

3. The big data analysis method according to claim 2, characterized in that step 1 is implemented through the following steps:

4. The big data analysis method according to claim 2, characterized in that step 2 specifically comprises the following steps:

Step 2-1: sample data from the database obtained in step 1 to form a training set X for training the hash function. The size n of the training set is determined by a formula (not reproduced in this text) in which t_{α/2} is the critical value for the chosen confidence level, obtainable from a table of t-distribution critical values, and ε is the maximum permissible error;

5. The big data analysis method according to claim 2, characterized in that in step 3 the network graph model can be expressed as G = (V, E), where G represents the network graph, V represents the set of all nodes in the graph, and E represents the set of all edges in the graph; every edge in E carries a weight, and the edge weights are stored in a separate vector list.