CN117331989A

CN117331989A - Big data mining method and mining system based on cloud computing

Info

Publication number: CN117331989A
Application number: CN202311344555.3A
Authority: CN
Inventors: 李明芹
Original assignee: Hubei Wanze Hongtong Information Technology Co., Ltd.
Current assignee: Hubei Wanze Hongtong Information Technology Co., Ltd.
Priority date: 2023-10-16
Filing date: 2023-10-16
Publication date: 2024-01-02

Abstract

The invention discloses a big data mining method and mining system based on cloud computing, relating to the technical field of cloud computing. In terms of hardware architecture, the system is divided into an acquisition access layer, a data storage layer, a real-time processing layer, a business service layer and a user interaction layer. A cloud computing module is arranged in the acquisition access layer and merges the acquired big data to form a data set; a data storage module is arranged in the data storage layer and provides a distributed big data storage function; the real-time processing layer comprises at least a data mining module and an evaluation comparison module. The key technical points are as follows: after a large amount of data has been preliminarily classified and trained, prediction result data are obtained, and an association coefficient Gsx is calculated from the acquired parameters, so that the degree of matching between the prediction result data and the user requirements can be effectively predicted and the accuracy of the mined data is improved; combined with an allocation engine, fast and accurate big data mining tasks are achieved.

Description

Big data mining method and mining system based on cloud computing

Technical Field

The invention relates to the technical field of cloud computing, in particular to a big data mining method and a mining system based on cloud computing.

Background

Cloud computing is a mode of providing computing resources and services over the internet. Based on virtualization technology, it separates and pools computing, storage and network resources and provides them to users elastically and on demand. It generally involves the following key concepts and components: virtualization, service models and deployment modes. Virtualization: cloud computing uses virtualization technology to abstract physical computing resources (such as servers, storage devices and networks) into virtual resources, giving users flexible ways to use them. Service models: according to the level of service offered, cloud computing is divided into three main models, namely infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS); IaaS provides virtualized infrastructure resources, PaaS provides a higher-level development platform, and SaaS is application software delivered on a cloud platform. Deployment modes: these include public cloud, private cloud and hybrid cloud; a public cloud is a cloud computing service offered by a cloud service provider to the general public, a private cloud is cloud computing infrastructure built and managed by a single organization or enterprise, and a hybrid cloud is a combination of the two.

An existing Chinese patent, publication number CN114780620B, entitled "Cloud computing service analysis method, device and system based on big data mining performance", discloses a scheme that mainly comprises: acquiring a query characteristic value for each user according to the user's historical queries of different services; classifying users by the query characteristic values of all users who accessed the system within a preset duration, and obtaining a habitual service sequence for each class of users; obtaining the relevance between any two services in any query round according to the difference in how often different services are queried in adjacent query rounds by users of the same class and the correlation among the services; and predicting the user's next queried service according to the user's current queried service, the user's habitual sequence and the relevance of each service in the current query round, and placing the predicted service in a cache in advance for the user to query.

Although the above application can predict query services in a targeted manner for different users, re-determine the relevance among services and deploy storage resources in advance, it cannot effectively control the degree of matching between the queried data and the user requirements. As a result, some data that does not meet the user requirements enters the query data during data mining, which reduces both the efficiency of the mining process and the accuracy of the mined data.

Disclosure of Invention

(I) Technical problem to be solved

In view of the deficiencies of the prior art, the invention provides a big data mining method and mining system based on cloud computing. After a large amount of data has been preliminarily classified and trained, prediction result data are obtained, and an association coefficient Gsx is calculated from the acquired parameters, so that the degree of matching between the prediction result data and the user requirements can be effectively predicted and the accuracy of the mined data is improved; combined with an allocation engine, fast and accurate big data mining tasks are achieved, which solves the problems described in the background.

(II) Technical solution

To achieve the above purpose, the invention adopts the following technical solution:

A big data mining system based on cloud computing is divided, in terms of hardware architecture, into an acquisition access layer, a data storage layer, a real-time processing layer, a business service layer and a user interaction layer;

a cloud computing module is arranged in the acquisition access layer and merges the acquired big data to form a data set;

a data storage module is arranged in the data storage layer and provides a distributed big data storage function; the data set obtained by acquisition and processing is entered into a database as a data source, providing data support for the other layers;

the real-time processing layer comprises at least a data mining module and an evaluation comparison module. The data mining module builds a parallel algorithm training model in a parallel environment, executes parallel computing tasks in a parallel computing framework on the data sets held by the respective nodes, and obtains prediction result data after training and optimization. The evaluation comparison module calculates an association coefficient Gsx from the acquired Pearson correlation coefficient Pr, Spearman correlation coefficient Sp and covariance Fc, and compares Gsx with a standard threshold; the comparison result reflects the degree of matching between the prediction result data and the user requirements, so that data with a low degree of matching is removed and data with a high degree of matching is retained;

the parallel environment is a parallel computing cluster configured in an initial stage, during which the data undergoes a secondary cleaning; the parallel computing framework may be Apache Hadoop or Apache Spark, and the parallel computing tasks are set according to requirements; in this application, Apache Hadoop may be selected as the main computing framework for the subsequent processing;

a task scheduling module is arranged in the business service layer, and the task scheduling module carries an allocation engine to perform task allocation and task scheduling;

the user interaction layer combines a client and a visualization tool; the client gives users access to the system services and is used to perform the various operations, and the visualization tool displays the query results in real time.

Further, the cloud computing module comprises an acquisition unit and a computing unit. The acquisition unit acquires big data from sources including at least RFID radio frequency data, sensor data, social network interaction data and mobile internet data; the computing unit preprocesses the data and then groups data having the same attribute to obtain the corresponding data sets. The specific steps of the computing unit are as follows:

S101, preprocessing the big data using a data cleaning technique to complete preliminary processing of the data;

S102, forming the data obtained by the different methods into a set {K₁, K₂, ..., Kₙ}, where n is the total number of methods used to acquire the big data;

S103, fusing data having the same attribute through a configured data fusion subunit to obtain a new data set {F₁, F₂, ..., Fₘ}, where m is the number of distinct attributes.
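Steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`attribute`, `value`), the cleaning rule (drop incomplete records and exact duplicates) and the function names `clean` and `fuse_by_attribute` are all assumptions made for the example.

```python
from collections import defaultdict


def clean(records):
    """S101 (assumed rule): drop incomplete records and exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("attribute") is None or rec.get("value") is None:
            continue  # incomplete record
        key = (rec["attribute"], rec["value"])
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def fuse_by_attribute(source_sets):
    """S103: merge values that share an attribute across all sets K_1..K_n."""
    fused = defaultdict(list)
    for records in source_sets:
        for rec in records:
            fused[rec["attribute"]].append(rec["value"])
    return dict(fused)


# S102: one cleaned set per acquisition method (n = 2 in this toy example).
k_sensor = clean([
    {"attribute": "temperature", "value": 21.5},
    {"attribute": "temperature", "value": 21.5},  # duplicate, dropped
    {"attribute": "humidity", "value": None},     # incomplete, dropped
])
k_mobile = clean([
    {"attribute": "temperature", "value": 22.0},
    {"attribute": "humidity", "value": 40.0},
])

# Fused sets F_1..F_m, keyed by attribute (m = 2 here).
fused_sets = fuse_by_attribute([k_sensor, k_mobile])
```

Keying the fused sets by attribute makes m simply the number of keys in `fused_sets`.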

Further, the storage forms of the data storage module include hard disk storage and cloud storage, and the hard disk storage uses hard disk encryption; this application adopts hard disk storage, which, with hard disk encryption, is regarded as more secure than cloud storage.

Further, the specific steps of using the evaluation comparison module are as follows:

S201, the association coefficient Gsx is obtained as follows:

the Pearson correlation coefficient Pr, the Spearman correlation coefficient Sp and the covariance Fc are acquired and combined to form the association coefficient Gsx,

where K₁, K₂ and K₃ are preset proportionality coefficients for the Pearson correlation coefficient Pr, the Spearman correlation coefficient Sp and the covariance Fc, respectively, K₁, K₂ and K₃ are all greater than 0, and G₁ is a constant correction coefficient; the absolute value |K₃ * Fc| is used in the formula so that a negative value does not affect the accuracy of the association coefficient Gsx.

S202, comparing the association coefficient Gsx with a standard threshold;

if the association coefficient Gsx is greater than the standard threshold, the degree of matching between the prediction result data and the user requirements is low, the prediction result data cannot serve as associated data for the user requirements, and a first execution strategy is applied; if the association coefficient Gsx is less than or equal to the standard threshold, the degree of matching is high, the prediction result data can serve as associated data for the user requirements, and a second execution strategy is applied. The first execution strategy removes the data with a low degree of matching, and the second execution strategy saves the data with a high degree of matching. The user requirements can be determined from a preset logical relationship, in which the data value or data set corresponding to each user requirement is specified.
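The three inputs to S201 can be computed with the standard library alone, as sketched below. The formula that combines them into Gsx is not reproduced in this text, so `association_coefficient` uses an assumed placeholder form (a weighted sum that includes the |K₃ * Fc| term mentioned above); it should not be read as the formula of the invention. All function names and the sample values are hypothetical.

```python
import math


def covariance(xs, ys):
    """Sample covariance Fc of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson correlation Pr (inputs must not be constant)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation Sp: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def association_coefficient(pr, sp, fc, k1, k2, k3, g1):
    """ASSUMED placeholder form; the source formula is not available here."""
    return k1 * pr + k2 * sp + abs(k3 * fc) + g1


xs = [1, 2, 3, 4, 5]    # hypothetical prediction result values
ys = [2, 4, 6, 8, 10]   # hypothetical user requirement values
pr, sp, fc = pearson(xs, ys), spearman(xs, ys), covariance(xs, ys)
gsx = association_coefficient(pr, sp, fc, k1=0.4, k2=0.4, k3=0.2, g1=0.0)
```

Because the combining form is only a placeholder, any standard threshold chosen for `gsx` in this sketch would be illustrative as well.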

Further, the allocation engine carried by the task scheduling module splits a single task into fragment tasks, allocates the fragment tasks to different data mining nodes, loads the generated computing tasks into a database in the form of a single data set, and performs subsequent task scheduling as required.
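The splitting and allocation performed by the allocation engine might look like the sketch below. The fixed fragment size, the round-robin allocation rule and the node names are assumptions chosen for brevity; the source does not specify how fragments are sized or assigned.

```python
def split_task(items, fragment_size):
    """Split one task (a list of work items) into fixed-size fragment tasks."""
    return [items[i:i + fragment_size] for i in range(0, len(items), fragment_size)]


def allocate(fragments, nodes):
    """Assign fragments to data mining nodes in round-robin order (assumed rule)."""
    plan = {node: [] for node in nodes}
    for index, fragment in enumerate(fragments):
        plan[nodes[index % len(nodes)]].append(fragment)
    return plan


fragments = split_task(list(range(10)), fragment_size=3)
plan = allocate(fragments, ["node-a", "node-b", "node-c"])
```

Here `plan` plays the role of the single data set of generated computing tasks that is later loaded into the database for scheduling.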

Further, the user interaction layer serves as the window connecting the system with the user; the operations performed on the client include querying and storing related data results, and a liquid crystal display may be used as the visualization tool for the query results.

A big data mining method based on cloud computing comprises the following steps:

step one, acquiring big data from RFID radio frequency data, sensor data, social network interaction data and mobile internet data, preprocessing the data, and grouping data having the same attribute to obtain the corresponding data sets;

step two, running the distributed big data storage function and entering the data set obtained by acquisition and processing into a database as a data source; the distributed big data storage function is the capability of storing and managing large-scale data in a distributed manner so as to meet the needs of big data processing and analysis; distributed storage generally divides the data into multiple shards or partitions and stores them on different nodes of a cluster, which improves read and write concurrency, while load balancing distributes the data and query requests evenly across the nodes;

step three, building a parallel algorithm training model in a parallel environment, executing parallel computing tasks in a parallel computing framework on the data sets held by the respective nodes, obtaining prediction result data after training and optimization, and calculating the association coefficient Gsx from the acquired Pearson correlation coefficient Pr, Spearman correlation coefficient Sp and covariance Fc; Pr and Sp are correlation coefficients that characterize the strength and direction of the relationship between two variables (linear for Pr, rank-based for Sp), and the covariance Fc characterizes how the two variables vary together; the association coefficient Gsx is obtained as follows:

the acquired Pearson correlation coefficient Pr, Spearman correlation coefficient Sp and covariance Fc are combined to form the association coefficient Gsx,

where K₁, K₂ and K₃ are preset proportionality coefficients for the Pearson correlation coefficient Pr, the Spearman correlation coefficient Sp and the covariance Fc, respectively, K₁, K₂ and K₃ are all greater than 0, and G₁ is a constant correction coefficient;

then, the association coefficient Gsx is compared with the standard threshold, and the comparison result reflects the degree of matching between the prediction result data and the user requirements: if Gsx is greater than the standard threshold, the data, having a low degree of matching, is removed; if Gsx is less than or equal to the standard threshold, the data, having a high degree of matching, is retained;

step four, splitting a single task into fragment tasks, allocating the fragment tasks to different data mining nodes, loading the generated computing tasks into a database in the form of a single data set, and performing subsequent task scheduling as required;

step five, the user queries and stores related data results through the client, and the query results are displayed in real time on the liquid crystal display, so that the user obtains the queried information in real time.

(III) Beneficial effects

The invention provides a big data mining method and a mining system based on cloud computing, which have the following beneficial effects:

a data mining module and an evaluation comparison module are arranged in the real-time processing layer and combined with the cloud computing module to realize cloud-based data mining; after a large amount of data has been preliminarily classified and trained, prediction result data are obtained, and the association coefficient Gsx is calculated from the acquired parameters, so that the degree of matching between the prediction result data and the user requirements can be effectively predicted and the accuracy of the mined data is improved;

in addition, the system is combined with a task scheduling module whose allocation engine splits a single task and allocates the resulting fragment tasks to different data mining nodes, achieving efficient processing.

Drawings

Fig. 1 is a schematic structural diagram of the cloud computing-based big data mining system of the present invention, shown on the basis of an existing hardware architecture;

Fig. 2 is a schematic diagram of the overall module structure of the cloud computing-based big data mining system.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1: referring to fig. 1-2, the present invention provides a big data mining system based on cloud computing, which is divided into an acquisition access layer, a data storage layer, a real-time processing layer, a business service layer and a user interaction layer from the hardware architecture;

the cloud computing module is arranged in the acquisition access layer and comprises an acquisition unit and a computing unit;

the cloud computing module is used for merging the collected big data and forming a data set;

the acquisition unit acquires big data from sources including at least RFID radio frequency data, sensor data, social network interaction data and mobile internet data, and the computing unit preprocesses the data and groups data having the same attribute to obtain the corresponding data sets;

the specific steps of the computing unit are as follows:

S101, preprocessing the big data using a data cleaning technique; this preprocessing is the first cleaning of the data, where data cleaning means processing and transforming the raw data to remove or correct erroneous, duplicated, incomplete or inconsistent parts so as to improve data quality and usability; the specific cleaning operations are not described in detail here;

S103, fusing data having the same attribute through the configured data fusion subunit to obtain a new data set {F₁, F₂, ..., Fₘ}, where m is the number of distinct attributes; the new data set is the data source that is subsequently entered into the database.

A data storage module is arranged in the data storage layer and provides a distributed big data storage function; the data set obtained by acquisition and processing serves as a data source and provides data support for the other layers;

it should be noted that the distributed big data storage function is the capability of storing and managing large-scale data in a distributed manner so as to meet the needs of big data processing and analysis; distributed storage generally divides the data into multiple shards or partitions and stores them on different nodes of a cluster, which improves read and write concurrency, while load balancing distributes the data and query requests evenly across the nodes.
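One common way to divide data into shards is hash partitioning, sketched below as an illustration only; the source does not state which partitioning scheme is used. The key format, the shard count and the function names are assumptions.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(records, num_shards):
    """Distribute (key, value) records over num_shards in-memory shards."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards


# Hypothetical records; each shard would live on a different cluster node.
records = [(f"sensor-{i}", i * 1.5) for i in range(20)]
shards = partition(records, num_shards=4)
```

A stable hash (rather than Python's per-process `hash`) means the same key always lands on the same shard, so a later query for that key can be routed to a single node.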

The real-time processing layer comprises at least a data mining module and an evaluation comparison module and is part of a computing module cluster; it builds a parallel algorithm training model in a parallel environment, executes parallel computing tasks in a parallel computing framework on the data sets held by the respective nodes, and obtains prediction result data after training and optimization;

it should be noted that the parallel environment is a parallel computing cluster configured in an initial stage, during which the data undergoes a secondary cleaning; the parallel computing framework may be Apache Hadoop or Apache Spark, and the parallel computing tasks are set according to requirements;

Apache Hadoop is an open-source distributed computing framework for the distributed storage and computation of large-scale data sets, comprising two core components: the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce computing framework; HDFS is a reliable, highly fault-tolerant file system for storing large-scale data sets, and Hadoop MapReduce is a computing framework for processing and analyzing large-scale data sets that supports parallel processing and distributed computation;

Apache Spark is likewise an open-source distributed computing framework for large-scale data processing and analysis; compared with Hadoop it offers faster data processing and a richer computing model, supports higher-level data abstractions and operations such as resilient distributed datasets (RDDs) and structured data processing (for example Spark SQL), and provides extensions such as a machine learning library and a graph computing library; because Spark uses in-memory computing and data partitioning, it performs well when processing large-scale data;

whether Hadoop or Spark is used, the framework provides strong distributed storage and computing capability, enabling users to process and analyze large-scale data sets effectively; both frameworks are widely used in the big data field and support a variety of data processing tasks and application scenarios.
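The map, shuffle and reduce phases that such frameworks run across a cluster can be illustrated in a single process. The sketch below imitates the MapReduce programming model with a thread pool; it does not use the Hadoop or Spark APIs, and the partition contents and function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(partition):
    """Map: emit an (attribute, 1) pair for every record in one partition."""
    return [(attribute, 1) for attribute in partition]


def shuffle(mapped_partitions):
    """Shuffle: group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Each inner list stands in for the data set held by one node.
partitions = [
    ["temperature", "humidity"],
    ["temperature"],
    ["humidity", "temperature"],
]

# The map phase runs once per partition, in parallel worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_phase, partitions))

counts = reduce_phase(shuffle(mapped))
```

In a real cluster the partitions would reside on different nodes and the shuffle would move data over the network; only the shape of the computation carries over from this sketch.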

The evaluation comparison module calculates the association coefficient Gsx from the acquired parameters and compares it with a standard threshold; the comparison result reflects the degree of matching between the prediction result data and the user requirements, where the evaluation data represents one attribute, the user requirements represent another attribute, and the two attributes correspond to each other in the data set;

the parameters include pearson correlation coefficient Pr, spearman correlation coefficient Sp, and covariance Fc;

the specific steps of using the evaluation comparison module are as follows:

S201, the association coefficient Gsx is obtained as follows:

the Pearson correlation coefficient Pr and the Spearman correlation coefficient Sp both characterize the strength and direction of the relationship between two variables (linear for Pr, rank-based for Sp), and the covariance Fc characterizes how the two variables vary together; Pr, Sp and Fc are obtained by conventional technical means and are not described in detail here.

S202, comparing the association coefficient Gsx with the standard threshold: if Gsx is greater than the standard threshold, the degree of matching between the prediction result data and the user requirements is low, the prediction result data cannot serve as associated data for the user requirements, and the first strategy, deletion, is applied; if Gsx is less than or equal to the standard threshold, the degree of matching is high, the prediction result data can serve as associated data for the user requirements, and the second strategy, retention, is applied.
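The S202 decision rule can be written as a short filter. The sketch assumes that a Gsx value has already been computed for each candidate result; the candidate names, the Gsx values and the threshold are hypothetical.

```python
def apply_strategies(candidates, threshold):
    """Split (name, gsx) pairs into retained and deleted results per S202."""
    retained, deleted = [], []
    for name, gsx in candidates:
        if gsx > threshold:
            deleted.append(name)    # first strategy: low matching, delete
        else:
            retained.append(name)   # second strategy: high matching, retain
    return retained, deleted


candidates = [("result-a", 0.3), ("result-b", 0.8), ("result-c", 0.5)]
retained, deleted = apply_strategies(candidates, threshold=0.5)
```

Note that a value exactly equal to the threshold is retained, matching the "less than or equal to" condition in the text.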

By adopting the technical scheme: the data mining module and the evaluation comparison module are designed in the real-time processing layer and are combined with the cloud computing module to realize data mining processing based on cloud computing, after a large amount of data is subjected to preliminary classification and training, predicted result data can be obtained, and the association coefficient Gsx is calculated according to the obtained parameters, so that the matching degree between the predicted result data and the user requirements can be effectively predicted, and the accuracy after data mining is further improved.

A task scheduling module is arranged in the business service layer and carries an allocation engine, which splits a single task, allocates the resulting fragment tasks to different data mining nodes, loads the multiple generated computing tasks into a distributed database in the form of a single data set, and performs subsequent task scheduling as required, thereby providing computing services for big data and accomplishing big data mining tasks;

by adopting the above technical solution, the system is combined with the task scheduling module, whose allocation engine splits a single task and allocates the resulting fragment tasks to different data mining nodes for efficient processing; because the multiple generated computing tasks are loaded into a distributed database in the form of a single data set, the efficiency of the whole system is further improved, computing services are provided for large and complex volumes of big data, and fast and accurate big data mining is achieved.

Specifically, the engine may be composed of JobTrackers and TaskTrackers.
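As a rough illustration of a job tracker handing work to task trackers, the sketch below assigns each fragment task to the tracker with the smallest accumulated load. The least-loaded rule, the cost figures and all names are assumptions for the example; this is not the scheduling logic of Hadoop's JobTracker and does not use any Hadoop API.

```python
import heapq


def schedule(task_costs, trackers):
    """Assign each (task_id, cost) to the currently least-loaded tracker."""
    heap = [(0, name) for name in trackers]  # (accumulated load, tracker name)
    heapq.heapify(heap)
    assignment = {name: [] for name in trackers}
    for task_id, cost in task_costs:
        load, name = heapq.heappop(heap)
        assignment[name].append(task_id)
        heapq.heappush(heap, (load + cost, name))
    return assignment


tasks = [("t1", 5), ("t2", 3), ("t3", 2), ("t4", 4)]
assignment = schedule(tasks, ["tracker-1", "tracker-2"])
```

Ties in load are broken by tracker name, because the heap compares the `(load, name)` tuples element by element.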

The user interaction layer consists of a client and a visualization tool; the client gives users access to the system services and is used for queries and other operations, and the query results are displayed in real time by the visualization tool, which may be a liquid crystal display.

Example 2: the invention provides a big data mining method based on cloud computing, which comprises the following steps:

specifically, radio frequency identification (RFID) is a wireless communication technology for automatically identifying and tracking objects; an RFID system consists of RFID tags, a reader-writer and a back-end data management system, and when a tag and the reader-writer communicate via radio frequency signals, the required RFID data can be obtained in real time; sensor data includes data acquired directly by various types of sensors, for example temperature data detected by a temperature sensor and humidity data detected by a humidity sensor; social network interaction data needs to be accessed and acquired through an API or an open platform; mobile internet data may also be obtained through an API.

step three, building a parallel algorithm training model in a parallel environment, executing parallel computing tasks in a parallel computing framework on the data sets held by the respective nodes, obtaining prediction result data after training and optimization, and calculating the association coefficient Gsx from the acquired Pearson correlation coefficient Pr, Spearman correlation coefficient Sp and covariance Fc, the association coefficient Gsx being obtained as follows:

step five, the user queries and stores related data results through the client, and the query results are displayed in real time on the liquid crystal display.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application.

Claims

1. A big data mining system based on cloud computing, the system being divided, in terms of hardware architecture, into an acquisition access layer, a data storage layer, a real-time processing layer, a business service layer and a user interaction layer, characterized in that:

the data storage layer is internally provided with a data storage module, a distributed big data storage function is provided, and a data set obtained through acquisition and processing is used as a data source and is input into a database;

2. The cloud computing-based big data mining system of claim 1, wherein: the cloud computing module comprises an acquisition unit and a computing unit; the acquisition unit acquires big data from sources including at least RFID radio frequency data, sensor data, social network interaction data and mobile internet data, and the computing unit preprocesses the data and then groups data having the same attribute to obtain the corresponding data sets.

3. The cloud computing-based big data mining system of claim 2, wherein: the specific steps of the computing unit are as follows:

S101, preprocessing the big data using a data cleaning technique;

S103, fusing data having the same attribute through the configured data fusion subunit to obtain a new data set {F₁, F₂, ..., Fₘ}, where m is the number of distinct attributes.

4. The cloud computing-based big data mining system of claim 1, wherein: the storage form of the data storage module comprises hard disk storage and cloud storage, and the hard disk storage adopts a hard disk encryption technology.

5. The cloud computing-based big data mining system of claim 1, wherein: the specific steps of using the evaluation comparison module are as follows:

S201, the association coefficient Gsx is obtained as follows:

where K₁, K₂ and K₃ are preset proportionality coefficients for the Pearson correlation coefficient Pr, the Spearman correlation coefficient Sp and the covariance Fc, respectively, K₁, K₂ and K₃ are all greater than 0, and G₁ is a constant correction coefficient.

S202, comparing the association coefficient Gsx with a standard threshold;

if the association coefficient Gsx is greater than the standard threshold, the matching degree between the predicted result data and the user demand is lower, the predicted result data cannot be used as the associated data under the user demand, a first execution strategy is made, if the association coefficient Gsx is less than or equal to the standard threshold, the matching degree between the predicted result data and the user demand is higher, and the predicted result data can be used as the associated data under the user demand, a second execution strategy is made.

6. The cloud computing-based big data mining system of claim 5, wherein: the first execution strategy is to remove data with low matching degree, and the second execution strategy is to save the data with high matching degree.

7. The cloud computing-based big data mining system of claim 1, wherein: the allocation engine splits a single task into fragment tasks, allocates the fragment tasks to different data mining nodes, loads the generated computing tasks into a database in the form of a single data set, and performs subsequent task scheduling as required.

8. The cloud computing-based big data mining system of claim 1, wherein: the user interaction layer serves as the window connecting the system with the user; the operations performed on the client include querying and storing related data results, and a liquid crystal display may be used as the visualization tool for the query results.

9. A big data mining method based on cloud computing, using the system of any of claims 1 to 8, characterized in that: the method comprises the following steps:

step two, running a distributed big data storage function, taking a data set obtained by acquisition and processing as a data source, and inputting the data into a database;