CN104484566A - Big data analysis system and big data analysis method - Google Patents

Big data analysis system and big data analysis method

Info

Publication number
CN104484566A
Authority
CN
China
Prior art keywords
data
data analysis
network
function
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410783566.6A
Other languages
Chinese (zh)
Inventor
殷晋
章伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhu Leruisi Information Consulting Co Ltd
Original Assignee
Wuhu Leruisi Information Consulting Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhu Leruisi Information Consulting Co Ltd
Priority to CN201410783566.6A
Publication of CN104484566A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of information processing, and in particular to a big data analysis system and a big data analysis method with low complexity, fast computation, and high search efficiency. The big data analysis system comprises a data analysis and extraction module for extracting key information, a data preprocessing module for preprocessing the key information, a network construction module for abstracting the preprocessed data into a network graph model, an operation discovery module for partitioning the network graph model and performing further analysis operations, and a result output module for outputting the discovery result. Compared with the prior art, the system and method can classify, delimit, and output a potential group by rapidly processing communication data between individuals, and offer notable advantages such as fast processing speed and high analysis efficiency.

Description

Big data analysis system and method
Technical field:
The present invention relates to the technical field of information processing, and specifically to a big data analysis system and method with low complexity, fast computation, and high search efficiency.
Background art:
Big data, also called massive data, refers to data whose volume is so large that current mainstream software tools cannot capture, manage, process, and organize it within a reasonable time into information that more actively supports business decision-making. The strategic significance of big data lies not in possessing huge amounts of data but in processing these meaningful data professionally. In other words, if big data is compared to an industry, the key to that industry's profitability is to improve the "processing capability" applied to the data and to add value through processing.
The process of extracting implicit, previously unknown, but potentially useful information from large amounts of incomplete, noisy, fuzzy, and random data is called data mining. Clearly, data mining is the key to big data technology.
Summary of the invention:
To address the shortcomings and deficiencies of the prior art, the present invention proposes a big data analysis system and method with low complexity, fast computation, and high search efficiency.
The present invention is achieved by the following measures:
A big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis operations; and a result output module for outputting the discovery result.
The present invention also proposes a big data analysis method, characterized in that it comprises the following steps:
Step 1: extract key information with the data analysis and extraction module, the key information being the call record data between individuals;
Step 2: preprocess the key information obtained in step 1 with the data preprocessing module;
Step 3: build the network graph model: abstract each individual as a node in the network graph model and each contact between individuals as an edge in the network graph, and use the data extracted in step 1 to store the network graph model in matrix form;
Step 4: set the analysis parameters and the operation threshold, where the analysis parameters include the number of individuals and the operation threshold limits the number of individuals output;
Step 5: run the discovery algorithm to partition the network graph model and perform further analysis operations;
Step 6: output the operation result.
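The text does not specify the discovery algorithm of step 5. Purely as an illustration of how steps 4 to 6 could fit together, the sketch below swaps in a simple stand-in technique: it keeps edges whose weight reaches a cutoff, treats each connected component as a candidate group, and limits the number of individuals output. The function name discover_groups and the parameters min_weight and max_output are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def discover_groups(weights, min_weight, max_output):
    """Illustrative stand-in for the unspecified discovery algorithm.

    weights:    {(i, j): w} contact weights between individuals i and j
    min_weight: hypothetical cutoff; weaker edges are ignored
    max_output: operation threshold limiting how many individuals are output
    """
    # Build an undirected neighbour map from the sufficiently strong edges.
    neighbours = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Partition the graph into connected components with a depth-first search.
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))

    # Output the largest groups first, stopping before the limit is exceeded.
    groups.sort(key=len, reverse=True)
    output, remaining = [], max_output
    for members in groups:
        if len(members) > remaining:
            break
        output.append(members)
        remaining -= len(members)
    return output
```

Any community-detection method could take the place of the connected-component search; the sketch only shows where the operation threshold would act on the output.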
Step 1 of the present invention is specifically implemented by the following steps:
Step 1-1: assign a unique id to each individual; this id later corresponds to a node in the network graph;
Step 1-2: for any two individuals i and j, if communication records exist between them, calculate the contact weight coefficient between i and j using the communication duration, number of communications, and communication frequency as parameters, with the following formula:
$W_{ij} = e^{\varphi(t) + \theta(n) + \gamma(f)}$, where $W_{ij}$ is the weight value and $\varphi(t)$, $\theta(n)$, $\gamma(f)$ are functions of the call duration t, the number of calls n, and the call frequency f, respectively. The concrete form of each function is determined by the application scenario and the user's experience; an exponential decay function, a linear function, and so on may be chosen. If the user needs to consider further factors, it suffices to add a new mapping function to the exponent.
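As a minimal sketch of step 1-2, the following code aggregates call records per pair of individuals and evaluates the weight formula. The linear forms chosen for phi, theta, and gamma, their coefficients, the record layout (caller, callee, duration in seconds), the treatment of contacts as undirected, and the definition of frequency as calls per day are all assumptions made for illustration; the patent leaves these choices to the user.

```python
import math
from collections import defaultdict


# Hypothetical linear mapping functions; the patent does not fix their form.
def phi(total_seconds):
    return 0.01 * (total_seconds / 60.0)   # total call duration, in minutes


def theta(num_calls):
    return 0.1 * num_calls                 # number of calls


def gamma(calls_per_day):
    return 0.5 * calls_per_day             # call frequency


def contact_weights(records, period_days):
    """records: iterable of (caller_id, callee_id, duration_seconds)."""
    total_time = defaultdict(float)
    count = defaultdict(int)
    for a, b, duration in records:
        key = (min(a, b), max(a, b))       # assumption: contacts are undirected
        total_time[key] += duration
        count[key] += 1

    weights = {}
    for key, n in count.items():
        t = total_time[key]
        f = n / period_days                # assumption: frequency = calls per day
        weights[key] = math.exp(phi(t) + theta(n) + gamma(f))
    return weights
```

Adding a further factor, as the text describes, would amount to adding one more term inside the call to math.exp.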
Step 2 of the present invention specifically comprises the following steps:
Step 2-1: sample data from the database obtained in step 1 to form a training set X for training the hash function. The size n of the training set is determined by [formula not reproduced], where $t_{\alpha/2}$ is the value for the chosen confidence level, obtainable from a table of t-distribution critical values, and $\varepsilon$ is the maximum permissible error;
Step 2-2: train the hash function with X. First design an objective function that maps the high-dimensional real-valued data to low-dimensional data; the objective function is defined as:
$$\min_{B,S} \|X - BS\|^2 + \lambda_1 \sum_{i,j} w_{i,j} \|s_i - s_j\|^2 + \lambda_2 \|S\|_1, \quad \text{s.t. } S > 0,\ \sum_i B_{i,j}^2 \le 1,$$
where X is the training set; B is the basis space, each vector of which is a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; $\lambda_1$ and $\lambda_2$ are tunable parameters obtained by ten-fold cross-validation; $w_{i,j}$ is the projection, under a Gaussian kernel, of the Euclidean distance between two instances $X_i$ and $X_j$ of X; $s_i$ and $s_j$ are two vectors of the matrix S; $B_{i,j}$ is the element in row i and column j of the matrix B; i = 1, 2, 3, ..., n indexes the instances; j = 1, 2, 3, ..., k indexes the basis vectors; n is the number of instances; k is the number of basis vectors; and S > 0 indicates that every element of S is non-negative;
Step 2-3: binary-encode the instances in the database that do not yet have a binary code. Specifically, for each instance x, obtain its low-dimensional real value by $s = (B^{\top}B + 2I)^{-1}B^{\top}x$, then obtain its low-dimensional binary code with the hash function, where B is the basis space defined in step 2-2 and I is the identity matrix of matching dimension. Encoding the whole database completes the data preprocessing.
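To make steps 2-2 and 2-3 concrete, the sketch below evaluates the objective function for given matrices and computes the projection of step 2-3 for a single instance. It assumes X is stored as a d x n matrix with one instance per column, B as d x k, and S as k x n; the similarity weights are passed in as a ready-made n x n matrix W. The patent does not state how the final hash function turns s into bits, so the thresholding in to_binary is a hypothetical placeholder. All function names are illustrative.

```python
def objective(X, B, S, W, lam1, lam2):
    """||X - BS||^2 + lam1 * sum_ij W[i][j] * ||s_i - s_j||^2 + lam2 * ||S||_1."""
    d, n, k = len(X), len(X[0]), len(S)
    recon = sum((X[r][c] - sum(B[r][j] * S[j][c] for j in range(k))) ** 2
                for r in range(d) for c in range(n))
    smooth = sum(W[i][j] * sum((S[a][i] - S[a][j]) ** 2 for a in range(k))
                 for i in range(n) for j in range(n))
    sparse = sum(abs(v) for row in S for v in row)
    return recon + lam1 * smooth + lam2 * sparse


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def project(B, x):
    """Low-dimensional real value s = (B^T B + 2I)^{-1} B^T x."""
    d, k = len(B), len(B[0])
    gram = [[sum(B[r][i] * B[r][j] for r in range(d)) + (2.0 if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    bt_x = [sum(B[r][i] * x[r] for r in range(d)) for i in range(k)]
    return solve(gram, bt_x)


def to_binary(s, threshold=0.0):
    """Hypothetical hash step: one bit per coordinate, set when above threshold."""
    return [1 if v > threshold else 0 for v in s]
```

Because $B^{\top}B + 2I$ is positive definite, the linear system always has a unique solution, so solving it directly avoids forming the inverse explicitly.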
In step 3 of the present invention, the network graph model can be expressed as G = (V, E), where G is the network graph, V is the set of all nodes in the graph, and E is the set of all edges in the graph. Every edge in E carries a weight, and the edge weights are stored in a separate vector list.
Compared with the prior art, the present invention rapidly processes the communication data between individuals to classify, delimit, and output a potential group, and has notable advantages such as fast processing speed and high analysis efficiency.
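One possible way to hold G = (V, E) in memory, in line with the matrix storage of step 3 and the separate weight vector mentioned above, is sketched here. The class name NetworkGraph, its attributes, and the choice of an undirected 0/1 adjacency matrix are assumptions for illustration only.

```python
class NetworkGraph:
    """Undirected graph: adjacency matrix plus a separate edge-weight list."""

    def __init__(self, node_ids):
        # Map each individual's unique id (step 1-1) to a matrix row/column.
        self.index = {node_id: pos for pos, node_id in enumerate(node_ids)}
        n = len(self.index)
        self.adjacency = [[0] * n for _ in range(n)]
        self.edge_slot = {}     # (row, col) with row <= col -> position in edge_weights
        self.edge_weights = []  # weights stored apart from the matrix

    def add_edge(self, a, b, weight):
        i, j = sorted((self.index[a], self.index[b]))
        self.adjacency[i][j] = self.adjacency[j][i] = 1
        slot = self.edge_slot.get((i, j))
        if slot is None:
            self.edge_slot[(i, j)] = len(self.edge_weights)
            self.edge_weights.append(weight)
        else:
            self.edge_weights[slot] = weight  # update an existing edge

    def weight(self, a, b):
        i, j = sorted((self.index[a], self.index[b]))
        slot = self.edge_slot.get((i, j))
        return None if slot is None else self.edge_weights[slot]
```

Keeping the weights in their own list leaves the adjacency matrix as plain 0/1 structure, so the weights can be recomputed without rebuilding the graph.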
Brief description of the drawings:
Figure 1 is a structural block diagram of the present invention.
Figure 2 is a flowchart of the present invention.
Embodiment:
The present invention is further described below with reference to the accompanying drawings.
As shown in Figure 1, the present invention proposes a big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis operations; and a result output module for outputting the discovery result.
As shown in Figure 2, the present invention also proposes a big data analysis method, characterized in that it comprises the following steps:
Step 1: extract key information with the data analysis and extraction module, the key information being the call record data between individuals;
Step 2: preprocess the key information obtained in step 1 with the data preprocessing module;
Step 3: build the network graph model: abstract each individual as a node in the network graph model and each contact between individuals as an edge in the network graph, and use the data extracted in step 1 to store the network graph model in matrix form;
Step 4: set the analysis parameters and the operation threshold, where the analysis parameters include the number of individuals and the operation threshold limits the number of individuals output;
Step 5: run the discovery algorithm to partition the network graph model and perform further analysis operations;
Step 6: output the operation result.
Step 1 of the present invention is specifically implemented by the following steps:
Step 1-1: assign a unique id to each individual; this id later corresponds to a node in the network graph;
Step 1-2: for any two individuals i and j, if communication records exist between them, calculate the contact weight coefficient between i and j using the communication duration, number of communications, and communication frequency as parameters, with the following formula:
$W_{ij} = e^{\varphi(t) + \theta(n) + \gamma(f)}$, where $W_{ij}$ is the weight value and $\varphi(t)$, $\theta(n)$, $\gamma(f)$ are functions of the call duration t, the number of calls n, and the call frequency f, respectively. The concrete form of each function is determined by the application scenario and the user's experience; an exponential decay function, a linear function, and so on may be chosen. If the user needs to consider further factors, it suffices to add a new mapping function to the exponent.
Step 2 of the present invention specifically comprises the following steps:
Step 2-1: sample data from the database obtained in step 1 to form a training set X for training the hash function. The size n of the training set is determined by [formula not reproduced], where $t_{\alpha/2}$ is the value for the chosen confidence level, obtainable from a table of t-distribution critical values, and $\varepsilon$ is the maximum permissible error;
Step 2-2: train the hash function with X. First design an objective function that maps the high-dimensional real-valued data to low-dimensional data; the objective function is defined as:
$$\min_{B,S} \|X - BS\|^2 + \lambda_1 \sum_{i,j} w_{i,j} \|s_i - s_j\|^2 + \lambda_2 \|S\|_1, \quad \text{s.t. } S > 0,\ \sum_i B_{i,j}^2 \le 1,$$
where X is the training set; B is the basis space, each vector of which is a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; $\lambda_1$ and $\lambda_2$ are tunable parameters obtained by ten-fold cross-validation; $w_{i,j}$ is the projection, under a Gaussian kernel, of the Euclidean distance between two instances $X_i$ and $X_j$ of X; $s_i$ and $s_j$ are two vectors of the matrix S; $B_{i,j}$ is the element in row i and column j of the matrix B; i = 1, 2, 3, ..., n indexes the instances; j = 1, 2, 3, ..., k indexes the basis vectors; n is the number of instances; k is the number of basis vectors; and S > 0 indicates that every element of S is non-negative;
Step 2-3: binary-encode the instances in the database that do not yet have a binary code. Specifically, for each instance x, obtain its low-dimensional real value by $s = (B^{\top}B + 2I)^{-1}B^{\top}x$, then obtain its low-dimensional binary code with the hash function, where B is the basis space defined in step 2-2 and I is the identity matrix of matching dimension. Encoding the whole database completes the data preprocessing.
In step 3 of the present invention, the network graph model can be expressed as G = (V, E), where G is the network graph, V is the set of all nodes in the graph, and E is the set of all edges in the graph. Every edge in E carries a weight, and the edge weights are stored in a separate vector list.
Compared with the prior art, the present invention rapidly processes the communication data between individuals to classify, delimit, and output a potential group, and has notable advantages such as fast processing speed and high analysis efficiency.

Claims (5)

1. A big data analysis system, characterized in that it is provided with: a data analysis and extraction module for extracting key information; a data preprocessing module for preprocessing the key information; a network construction module for abstracting the preprocessed data into a network graph model; an operation discovery module for partitioning the network graph model and performing further analysis operations; and a result output module for outputting the discovery result.
2. A big data analysis method, characterized in that it comprises the following steps:
Step 1: extract key information with the data analysis and extraction module, the key information being the call record data between individuals;
Step 2: preprocess the key information obtained in step 1 with the data preprocessing module;
Step 3: build the network graph model: abstract each individual as a node in the network graph model and each contact between individuals as an edge in the network graph, and use the data extracted in step 1 to store the network graph model in matrix form;
Step 4: set the analysis parameters and the operation threshold, where the analysis parameters include the number of individuals and the operation threshold limits the number of individuals output;
Step 5: run the discovery algorithm to partition the network graph model and perform further analysis operations;
Step 6: output the operation result.
3. The big data analysis method according to claim 2, characterized in that step 1 is specifically implemented by the following steps:
Step 1-1: assign a unique id to each individual; this id later corresponds to a node in the network graph;
Step 1-2: for any two individuals i and j, if communication records exist between them, calculate the contact weight coefficient between i and j using the communication duration, number of communications, and communication frequency as parameters, with the following formula:
$W_{ij} = e^{\varphi(t) + \theta(n) + \gamma(f)}$, where $W_{ij}$ is the weight value and $\varphi(t)$, $\theta(n)$, $\gamma(f)$ are functions of the call duration t, the number of calls n, and the call frequency f, respectively. The concrete form of each function is determined by the application scenario and the user's experience; an exponential decay function, a linear function, and so on may be chosen. If the user needs to consider further factors, it suffices to add a new mapping function to the exponent.
4. The big data analysis method according to claim 2, characterized in that step 2 specifically comprises the following steps:
Step 2-1: sample data from the database obtained in step 1 to form a training set X for training the hash function. The size n of the training set is determined by [formula not reproduced], where $t_{\alpha/2}$ is the value for the chosen confidence level, obtainable from a table of t-distribution critical values, and $\varepsilon$ is the maximum permissible error;
Step 2-2: train the hash function with X. First design an objective function that maps the high-dimensional real-valued data to low-dimensional data; the objective function is defined as:
$$\min_{B,S} \|X - BS\|^2 + \lambda_1 \sum_{i,j} w_{i,j} \|s_i - s_j\|^2 + \lambda_2 \|S\|_1, \quad \text{s.t. } S > 0,\ \sum_i B_{i,j}^2 \le 1,$$
where X is the training set; B is the basis space, each vector of which is a basis vector trained from the training set X; S is the low-dimensional real-valued projection of X onto the basis space B; $\lambda_1$ and $\lambda_2$ are tunable parameters obtained by ten-fold cross-validation; $w_{i,j}$ is the projection, under a Gaussian kernel, of the Euclidean distance between two instances $X_i$ and $X_j$ of X; $s_i$ and $s_j$ are two vectors of the matrix S; $B_{i,j}$ is the element in row i and column j of the matrix B; i = 1, 2, 3, ..., n indexes the instances; j = 1, 2, 3, ..., k indexes the basis vectors; n is the number of instances; k is the number of basis vectors; and S > 0 indicates that every element of S is non-negative;
Step 2-3: binary-encode the instances in the database that do not yet have a binary code. Specifically, for each instance x, obtain its low-dimensional real value by $s = (B^{\top}B + 2I)^{-1}B^{\top}x$, then obtain its low-dimensional binary code with the hash function, where B is the basis space defined in step 2-2 and I is the identity matrix of matching dimension. Encoding the whole database completes the data preprocessing.
5. The big data analysis method according to claim 2, characterized in that in step 3 the network graph model can be expressed as G = (V, E), where G is the network graph, V is the set of all nodes in the graph, and E is the set of all edges in the graph; every edge in E carries a weight, and the edge weights are stored in a separate vector list.
CN201410783566.6A 2014-12-16 2014-12-16 Big data analysis system and big data analysis method Pending CN104484566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410783566.6A CN104484566A (en) 2014-12-16 2014-12-16 Big data analysis system and big data analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410783566.6A CN104484566A (en) 2014-12-16 2014-12-16 Big data analysis system and big data analysis method

Publications (1)

Publication Number Publication Date
CN104484566A true CN104484566A (en) 2015-04-01

Family

ID=52759107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410783566.6A Pending CN104484566A (en) 2014-12-16 2014-12-16 Big data analysis system and big data analysis method

Country Status (1)

Country Link
CN (1) CN104484566A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229600A (en) * 2017-05-31 2017-10-03 北京邮电大学 A kind of parallel variance analysis method and device based on big data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402617A (en) * 2011-12-23 2012-04-04 天津神舟通用数据技术有限公司 Easily-compressed database index storage system utilizing fragments and sparse bitmap and corresponding construction, scheduling and query processing methods thereof
CN102567375A (en) * 2010-12-27 2012-07-11 中国移动通信集团公司 Data mining method and device
KR20140005474A (en) * 2012-07-04 2014-01-15 한국전자통신연구원 Apparatus and method for providing an application for processing bigdata
CN103605653A (en) * 2013-09-29 2014-02-26 广西师范大学 Big data searching method based on sparse hash

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李孝伟: "基于用户通信行为分析的电信网络社区划分技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
段松青等: "PDM:基于Hadoop的并行数据分析系统", 《湖南大学学报(自然科学版)》 *

Similar Documents

Publication Publication Date Title
CN102521656B (en) Integrated transfer learning method for classification of unbalance samples
CN103902988B (en) A kind of sketch shape matching method based on Modular products figure with Clique
CN104536983A (en) Method and device for predicting advertisement click rate
CN106407208A (en) Establishment method and system for city management ontology knowledge base
CN111709244A (en) Deep learning method for identifying causal relationship of contradictory dispute events
CN104679818A (en) Video keyframe extracting method and video keyframe extracting system
CN108376164B (en) Display method and device of potential anchor
KR101968449B1 (en) Automatic inspection system for label type data based on Artificial Intelligence Learning to improve data productivity, and method thereof
CN110555305A (en) Malicious application tracing method based on deep learning and related device
CN109918658A (en) A kind of method and system obtaining target vocabulary from text
CN111143578A (en) Method, device and processor for extracting event relation based on neural network
CN109214407A (en) Event detection model, calculates equipment and storage medium at method, apparatus
CN104182771A (en) Time series data graphics analysis method based on automatic coding technology with packet loss
CN107392311A (en) The method and apparatus of sequence cutting
CN112215398A (en) Power consumer load prediction model establishing method, device, equipment and storage medium
CN109977977A (en) A kind of method and corresponding intrument identifying potential user
CN106503386A (en) The good and bad method and device of assessment luminous power prediction algorithm performance
US10706049B2 (en) Method and apparatus for querying nondeterministic graph
CN103207804A (en) MapReduce load simulation method based on cluster job logging
CN105183806A (en) Method and system for identifying same user among different platforms
CN104484566A (en) Big data analysis system and big data analysis method
CN102034102B (en) Image-based significant object extraction method as well as complementary significance graph learning method and system
CN113742495B (en) Rating feature weight determining method and device based on prediction model and electronic equipment
CN106156256A (en) A kind of user profile classification transmitting method and system
CN112488312B (en) Construction method of tensor-based automatic coding machine for network exchange data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150401