CN118474166A - Data sharing and cooperative processing platform based on cloud computing - Google Patents

Data sharing and cooperative processing platform based on cloud computing

Info

Publication number
CN118474166A
CN118474166A
Authority
CN
China
Prior art keywords
data
aggregation
cleaning
configuration
service
Prior art date
2024-04-10
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410425496.0A
Other languages
Chinese (zh)
Inventor
吴敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongzhi Digital Technology Co ltd
Original Assignee
Wuhan Zhongzhi Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2024-04-10
Filing date
2024-04-10
Publication date
Application filed by Wuhan Zhongzhi Digital Technology Co ltd
Priority to CN202410425496.0A
Publication of CN118474166A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A cloud-computing-based data sharing and collaborative processing platform, comprising: a system architecture construction unit, a data acquisition unit, a data cleaning unit, a data aggregation unit, and a data storage and management unit. The invention is designed specifically for the data of relevant departments and aims to realize efficient collection, processing, storage, analysis, sharing, and collaborative case handling of that data, thereby greatly improving the departments' operational efficiency, optimizing resource allocation, and markedly improving the public's satisfaction with security.

Description

Data sharing and cooperative processing platform based on cloud computing
Technical Field
The invention relates to the technical field of data sharing, in particular to a data sharing and collaborative processing platform based on cloud computing.
Background
With the rapid development of information technology, the work of relevant departments faces unprecedented data challenges. Traditional data processing and sharing methods cannot meet the stringent requirements of modern departments for real-time availability, accuracy, comprehensiveness, and security of data. An innovative solution is therefore urgently needed, one that fully exploits the advantages of cloud computing, big data, and information security technology to provide a powerful, reliable, and efficient data sharing and collaborative case handling platform for the relevant department systems.
Disclosure of Invention
The present invention has been made in view of the above problems, and it is an object of the present invention to provide a data sharing and co-processing platform based on cloud computing, which overcomes or at least partially solves the above problems.
To solve the above technical problems, the embodiments of the present application disclose the following technical solution:
The embodiment of the invention discloses a cloud-computing-based data sharing and collaborative processing platform, comprising: a system architecture construction unit, a data acquisition unit, a data cleaning unit, a data aggregation unit, and a data storage and management unit; wherein:
The system architecture construction unit is used for completing cloud computing platform selection and configuration, microservice splitting and containerized deployment, API interface design and implementation, and message queue selection and configuration;
the data acquisition unit is used for completing data source docking and configuration, the data acquisition strategy and scheduling, and the implementation of data temporary storage and logging;
the data cleaning unit is used for completing data cleaning strategy configuration and parsing, automatic data cleaning and manual auditing, and the implementation of cleaning result feedback and logging;
the data aggregation unit is used for completing data aggregation rule configuration and parsing, real-time and offline data aggregation, and the visual display of aggregation results;
the data storage and management unit is used for completing storage medium selection and configuration, data encryption and access control, and vulnerability and attack protection.
Further, the system architecture construction unit is configured to complete cloud computing platform selection and configuration, which specifically includes:
AWS cloud computing platform: AWS is selected as the infrastructure provider, and the system is built on its rich set of cloud services;
EC2 instance configuration: according to system requirements, EC2 instances of different types and sizes are configured, covering CPU, memory, storage, and network settings;
VPC and subnet configuration: a VPC is created in AWS and divided into several subnets to isolate different service components, the components at least including the front end, the back end, and the database;
Security groups and network access control: a security group is configured for each subnet, defining the allowed and denied inbound and outbound traffic rules to ensure network security;
the system architecture construction unit is also used for completing microservice splitting and containerized deployment, which specifically includes:
Service splitting: the system is split into five microservices (data acquisition, cleaning, aggregation, storage, and security), each responsible for a specific function;
Docker containerization: a Dockerfile is written for each microservice, defining the container image build process, which at least covers base image selection, dependency installation, and code copying;
Local development and testing: Docker Compose is used to build a local development environment that simulates the operation and interaction of the multiple microservices;
ECS cluster configuration: an ECS cluster is created in AWS, with the cluster's capacity providers and auto-scaling policies configured for container management and orchestration in the production environment;
ECR image repository: an ECR repository is created for storing and managing the Docker images, and image push and pull permissions are configured.
Further, the system architecture construction unit is further configured to design and implement the API interfaces, which specifically includes:
RESTful API design: a RESTful API interface specification is defined, covering at least URL design, request methods, parameter passing, and response formats;
Spring Boot framework: Spring Boot is used to rapidly build the microservice applications, which run in the embedded Tomcat container;
Swagger document generation: the Swagger framework is integrated to generate API documentation automatically and to provide an interface-testing function;
API Gateway configuration: an API Gateway instance is created in AWS, with API endpoints, authentication, and request throttling rules configured, so that all API requests are managed through a unified entry point;
the system architecture construction unit is further configured to complete message queue selection and configuration, which specifically includes:
SQS queue creation: an SQS queue is created in AWS and its attributes configured, at least including the visibility timeout and the message retention period;
Data acquisition and transmission: a data acquisition script or plug-in is written to send the collected raw data to the SQS queue;
Data cleaning and consumption: a data cleaning service is written to consume data from the SQS queue for cleaning, sending the cleaned data back to the queue or storing it in the database.
Further, the data acquisition unit is configured to complete data source docking and configuration, which specifically includes:
Database docking: a JDBC connection pool is configured with the database connection parameters, database tables are queried periodically, and the data is sent to the SQS queue;
API interface docking: third-party API interfaces are called to acquire data, with the API's authentication, request parameters, and response results handled, and the acquired data is sent to the SQS queue;
Web page data capture: Selenium or Puppeteer tools are used to simulate browser behavior, capture web page content, extract the required data, and send it to the SQS queue;
the data acquisition unit is also used for completing the data acquisition strategy and scheduling, which specifically includes:
Acquisition interval and data volume limits: appropriate acquisition intervals and data volume limits are set according to the importance and update frequency of each data source, avoiding excessive pressure on the source;
Real-time data acquisition: long polling or WebSocket technology is used to acquire and transmit data in real time;
Timed task scheduling: timed tasks are configured using Cron expressions or a timed-task framework to schedule the execution of data acquisition tasks.
Further, the data acquisition unit is further configured to complete data temporary storage and logging, which specifically includes:
Temporary storage of data in the SQS queue: the collected raw data is temporarily stored in the SQS queue in JSON format, awaiting subsequent processing; the message visibility timeout and retry strategy are set to ensure reliable data transmission;
Logging to CloudWatch Logs: a logger is configured to send data acquisition log information to the AWS CloudWatch Logs service, at least including the acquisition time, data source, data volume, and acquisition result fields; the log retention period and access rights are set to meet compliance requirements.
Further, the data cleaning unit is configured to complete data cleaning strategy configuration and parsing, which specifically includes:
Web interface for configuring cleaning rules: a Web interface is developed that allows users to customize data cleaning rules by dragging components and configuring parameters;
Rule parsing and execution: the back-end service parses the user-configured cleaning rules, generates executable data cleaning logic, and applies it to the data cleaning service;
the data cleaning unit is also used for completing automatic data cleaning and manual auditing, which specifically includes:
Automatic cleaning: the data cleaning service fetches raw data from the SQS queue and cleans it automatically according to the loaded cleaning rules, at least including removing duplicate data, filling missing values, and converting data types;
Manual auditing mechanism: data that cannot be processed automatically, or whose processing result is uncertain, is marked as pending review; a Web interface is provided for specialists to judge and process it manually, and the audit results are fed back to the data cleaning service to refine the automatic cleaning logic;
the data cleaning unit is also used for completing cleaning result feedback and logging, which specifically includes:
Sending cleaning results: the cleaned data is sent back to the SQS queue in JSON format or stored in the database for subsequent data aggregation and analysis;
Logging and monitoring: data cleaning log information is recorded in the CloudWatch Logs service, at least including the cleaning time, cleaning rules, and cleaning result fields; alarm rules are set to monitor abnormal conditions and report them in real time.
Further, the data aggregation unit is configured to complete data aggregation rule configuration and parsing, which specifically includes:
Web interface for configuring aggregation rules: a Web interface is developed that allows users to customize data aggregation rules by dragging components and configuring parameters;
Rule parsing and execution: the back-end service parses the user-configured aggregation rules, generates executable data aggregation logic, and applies it to the data aggregation service;
the data aggregation unit is also used for completing real-time and offline data aggregation, which specifically includes:
Real-time data stream processing: the Flink stream processing framework is used to perform aggregation analysis on real-time data streams, with the aggregation results displayed on the Web interface in real time or pushed to clients via WebSocket;
Offline batch processing: the Spark batch processing framework is used to perform batch aggregation analysis on the historical data stored in the database, with the aggregation results saved to the database for subsequent querying and analysis, and timed and periodic tasks are set up to schedule the execution of offline aggregation tasks.
Further, the data aggregation unit is further configured to complete the visual display of aggregation results, which specifically includes:
Front-end visualization library integration: the ECharts or Highcharts front-end visualization library is integrated to display the aggregation results on the Web interface in chart form, providing rich interactive functions that at least include data filtering, sorting, and drill-down;
Back-end data interface: the back-end service provides a data interface for the aggregation results, which the front-end visualization library calls to display the data; the interface should support paging, sorting, and filtering query parameters to meet the needs of front-end presentation.
Further, the data storage and management unit is configured to complete storage medium selection and configuration, which specifically includes:
SSD and HDD storage medium selection: according to the access frequency and importance of the data, hot data is stored on high-performance SSD storage media and cold data on lower-cost HDD storage media;
Data backup and recovery strategy: a regular data backup strategy is implemented, backing data up to the S3 object storage service or other reliable storage media; a data recovery process is configured to ensure that data can be recovered promptly in the event of a failure;
the data storage and management unit is also used for completing data encryption and access control, which specifically includes:
Encrypted data storage and transmission: sensitive data is encrypted in storage and in transit, with the encryption keys managed by the AWS KMS service; for data stored in the database, the database's own encryption features or field-level encryption techniques protect its security;
Identity authentication and authorized access: the AWS IAM service is used for user authentication and authorized access control, with appropriate permissions and roles assigned to each user so that only authorized users can access sensitive data and perform sensitive operations, and multi-factor authentication is implemented to improve account security.
Further, the data storage and management unit is further configured to complete vulnerability and attack protection, which specifically includes:
Security service integration and configuration: the AWS Shield and GuardDuty security services are integrated to defend against DDoS attacks and detect potential security threats; security groups and network access control lists are configured to limit inbound and outbound traffic and prevent unauthorized access and attacks;
Periodic security assessment and vulnerability scanning: the system is assessed for security at regular intervals, at least including code review, configuration checks, and vulnerability scanning; automated tools such as AWS Inspector are used to provide vulnerability scanning and remediation suggestions; a security incident response mechanism is established, with a detailed incident handling process formulated and drills conducted regularly to improve response speed and handling capacity; and security event log information is recorded in the CloudWatch Logs service for subsequent auditing and analysis.
The technical solution provided by the embodiments of the invention has at least the following beneficial effects:
The invention constructs a cloud-computing-based data sharing and collaborative processing platform, comprising: a system architecture construction unit, a data acquisition unit, a data cleaning unit, a data aggregation unit, and a data storage and management unit. The invention is designed specifically for the data of relevant departments and aims to realize efficient collection, processing, storage, analysis, sharing, and collaborative case handling of that data, thereby greatly improving the departments' operational efficiency, optimizing resource allocation, and markedly improving the public's satisfaction with security.
The technical solution of the invention is described in further detail below through the drawings and embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
Fig. 1 is a block diagram of the cloud-computing-based data sharing and collaborative processing platform in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of data acquisition and processing in Embodiment 1 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
To solve the problems in the prior art, the embodiment of the invention provides a cloud-computing-based data sharing and collaborative processing platform.
Example 1
The embodiment of the invention discloses a cloud-computing-based data sharing and collaborative processing platform which, as shown in Fig. 1, comprises: a system architecture construction unit, a data acquisition unit, a data cleaning unit, a data aggregation unit, and a data storage and management unit; wherein:
The system architecture construction unit is used for completing cloud computing platform selection and configuration, microservice splitting and containerized deployment, API interface design and implementation, and message queue selection and configuration;
in this embodiment, the system architecture construction unit is configured to complete cloud computing platform selection and configuration, which specifically includes:
AWS cloud computing platform: AWS is selected as the infrastructure provider, and the system is built on its rich set of cloud services;
EC2 instance configuration: according to system requirements, EC2 instances of different types and sizes are configured, covering CPU, memory, storage, and network settings;
VPC and subnet configuration: a VPC is created in AWS and divided into several subnets to isolate different service components, the components at least including the front end, the back end, and the database;
Security groups and network access control: a security group is configured for each subnet, defining the allowed and denied inbound and outbound traffic rules to ensure network security;
the system architecture construction unit is also used for completing microservice splitting and containerized deployment, which specifically includes:
Service splitting: the system is split into five microservices (data acquisition, cleaning, aggregation, storage, and security), each responsible for a specific function;
Docker containerization: a Dockerfile is written for each microservice, defining the container image build process, which at least covers base image selection, dependency installation, and code copying;
Local development and testing: Docker Compose is used to build a local development environment that simulates the operation and interaction of the multiple microservices;
ECS cluster configuration: an ECS cluster is created in AWS, with the cluster's capacity providers and auto-scaling policies configured for container management and orchestration in the production environment;
ECR image repository: an ECR repository is created for storing and managing the Docker images, and image push and pull permissions are configured.
In this embodiment, the system architecture construction unit is further configured to design and implement the API interfaces, which specifically includes:
RESTful API design: a RESTful API interface specification is defined, covering at least URL design, request methods, parameter passing, and response formats;
Spring Boot framework: Spring Boot is used to rapidly build the microservice applications, which run in the embedded Tomcat container;
Swagger document generation: the Swagger framework is integrated to generate API documentation automatically and to provide an interface-testing function;
API Gateway configuration: an API Gateway instance is created in AWS, with API endpoints, authentication, and request throttling rules configured, so that all API requests are managed through a unified entry point;
the system architecture construction unit is further configured to complete message queue selection and configuration, which specifically includes:
SQS queue creation: an SQS queue is created in AWS and its attributes configured, at least including the visibility timeout and the message retention period;
Data acquisition and transmission: a data acquisition script or plug-in is written to send the collected raw data to the SQS queue;
Data cleaning and consumption: a data cleaning service is written to consume data from the SQS queue for cleaning, sending the cleaned data back to the queue or storing it in the database (a minimal consumer sketch follows below).
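As an illustration of the produce-and-consume pattern just described, the following minimal Python sketch uses boto3 (the AWS SDK for Python) to consume raw records from one SQS queue, apply a placeholder cleaning step, and forward the results to a second queue. The queue names, the region, and the cleaning logic itself are assumptions made for illustration; the patent fixes none of these.

```python
import json
import boto3

# Hypothetical queue names; the patent does not fix concrete identifiers.
RAW_QUEUE = "raw-data-queue"
CLEAN_QUEUE = "clean-data-queue"

sqs = boto3.client("sqs", region_name="us-east-1")
raw_url = sqs.get_queue_url(QueueName=RAW_QUEUE)["QueueUrl"]
clean_url = sqs.get_queue_url(QueueName=CLEAN_QUEUE)["QueueUrl"]

def clean(record: dict) -> dict:
    """Placeholder cleaning step: strip whitespace and drop empty fields."""
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items() if v not in ("", None)}

while True:
    # Long polling (up to 20 s) reduces empty responses and request cost.
    resp = sqs.receive_message(QueueUrl=raw_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        cleaned = clean(json.loads(msg["Body"]))
        sqs.send_message(QueueUrl=clean_url, MessageBody=json.dumps(cleaned))
        # Delete only after successful forwarding, so a crashed worker lets
        # the message reappear after the visibility timeout.
        sqs.delete_message(QueueUrl=raw_url, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting a message only after it has been forwarded is what makes the visibility timeout and retry attributes configured above provide the reliable transmission the embodiment calls for.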
The data acquisition unit is used for completing data source docking and configuration, the data acquisition strategy and scheduling, and the implementation of data temporary storage and logging. Specifically, as shown in Fig. 2, data acquisition covers data source adaptation plus data extraction and transmission. Data source adaptation: for different types of data sources (such as databases, APIs, and log files), corresponding adapters or connectors are developed for unified collection. Data extraction and transmission: message queues such as RabbitMQ or Kafka are used to realize asynchronous extraction and reliable transmission of the data, ensuring its stability and integrity.
In this embodiment, the data acquisition unit is configured to complete data source docking and configuration, which specifically includes:
Database docking: a JDBC connection pool is configured with the database connection parameters, database tables are queried periodically, and the data is sent to the SQS queue;
API interface docking: third-party API interfaces are called to acquire data, with the API's authentication, request parameters, and response results handled, and the acquired data is sent to the SQS queue;
Web page data capture: Selenium or Puppeteer tools are used to simulate browser behavior, capture web page content, extract the required data, and send it to the SQS queue;
the data acquisition unit is also used for completing the data acquisition strategy and scheduling, which specifically includes:
Acquisition interval and data volume limits: appropriate acquisition intervals and data volume limits are set according to the importance and update frequency of each data source, avoiding excessive pressure on the source;
Real-time data acquisition: long polling or WebSocket technology is used to acquire and transmit data in real time;
Timed task scheduling: timed tasks are configured using Cron expressions or a timed-task framework to schedule the execution of data acquisition tasks (a minimal polling-and-scheduling sketch follows below).
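A minimal sketch of the database-docking and scheduling strategy above might look like the following. The `incidents` table, the five-minute interval, and the batch limit are illustrative assumptions; sqlite3 stands in for the JDBC-pooled production database, and a fixed-interval loop stands in for a full Cron framework.

```python
import json
import sqlite3
import time

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="raw-data-queue")["QueueUrl"]

POLL_INTERVAL_S = 300   # acquisition interval set per data-source policy
BATCH_LIMIT = 500       # data volume limit per poll

last_seen_id = 0
while True:
    # Stand-in for the JDBC-pooled source database named in the text.
    conn = sqlite3.connect("source.db")
    rows = conn.execute(
        "SELECT id, payload FROM incidents WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, BATCH_LIMIT)).fetchall()
    conn.close()
    for row_id, payload in rows:
        sqs.send_message(QueueUrl=queue_url,
                         MessageBody=json.dumps({"id": row_id, "data": payload}))
        last_seen_id = row_id
    time.sleep(POLL_INTERVAL_S)
```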
In some preferred embodiments, the data acquisition unit is further configured to complete data temporary storage and logging, which specifically includes:
Temporary storage of data in the SQS queue: the collected raw data is temporarily stored in the SQS queue in JSON format, awaiting subsequent processing; the message visibility timeout and retry strategy are set to ensure reliable data transmission;
Logging to CloudWatch Logs: a logger is configured to send data acquisition log information to the AWS CloudWatch Logs service, at least including the acquisition time, data source, data volume, and acquisition result fields; the log retention period and access rights are set to meet compliance requirements (a logging sketch follows below).
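The logging step might be sketched with boto3 as below. The log group and stream names and the 90-day retention period are illustrative assumptions; only the event fields come from the text.

```python
import json
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")
GROUP, STREAM = "/platform/acquisition", "collector-1"  # hypothetical names

# Idempotent setup; both calls raise ResourceAlreadyExistsException on reruns.
for fn, kwargs in ((logs.create_log_group, {"logGroupName": GROUP}),
                   (logs.create_log_stream, {"logGroupName": GROUP,
                                             "logStreamName": STREAM})):
    try:
        fn(**kwargs)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass

# Retention set per compliance requirements (90 days assumed here).
logs.put_retention_policy(logGroupName=GROUP, retentionInDays=90)

# The four fields named in the embodiment: time, source, volume, result.
event = {"acquisition_time": "2024-04-10T08:00:00Z",
         "data_source": "db:incidents", "data_volume": 500, "result": "ok"}
logs.put_log_events(logGroupName=GROUP, logStreamName=STREAM,
                    logEvents=[{"timestamp": int(time.time() * 1000),
                                "message": json.dumps(event)}])
```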
The data cleaning unit is used for completing data cleaning strategy configuration and parsing, automatic data cleaning and manual auditing, and the implementation of cleaning result feedback and logging.
In this embodiment, the data cleaning includes:
Missing value processing: in addition to common filling strategies (such as the mean, median, or mode), filling rules may need to be customized according to the business meaning and distribution of each field; for example, a missing value in certain critical fields may require tracing back to the data source or performing additional data acquisition.
Outlier detection: besides automatically detecting outliers with statistical methods and machine learning algorithms, manual auditing combined with business knowledge is required, since some apparent outliers may be reasonable in particular business scenarios.
Data formatting and normalization: this step ensures the consistency and comparability of the data; for example, dates and times are converted into a standard format, and numerical data in different units is converted into a uniform unit of measurement.
Data deduplication: duplicate records are removed according to business-defined unique keys; handling near-duplicate records may require fuzzy matching algorithms combined with manual auditing to ensure the accuracy of the data.
Text cleaning: for text data, beyond removing irrelevant characters and stop words, natural language processing operations such as part-of-speech tagging and entity recognition may be needed to extract useful information (a sketch of these cleaning operations follows this list).
In this embodiment, the data cleaning unit is configured to complete data cleaning strategy configuration and parsing, which specifically includes:
Web interface for configuring cleaning rules: a Web interface is developed that allows users to customize data cleaning rules by dragging components and configuring parameters;
Rule parsing and execution: the back-end service parses the user-configured cleaning rules, generates executable data cleaning logic, and applies it to the data cleaning service (a minimal rule-parsing sketch follows this passage);
the data cleaning unit is also used for completing automatic data cleaning and manual auditing, which specifically includes:
Automatic cleaning: the data cleaning service fetches raw data from the SQS queue and cleans it automatically according to the loaded cleaning rules, at least including removing duplicate data, filling missing values, and converting data types;
Manual auditing mechanism: data that cannot be processed automatically, or whose processing result is uncertain, is marked as pending review; a Web interface is provided for specialists to judge and process it manually, and the audit results are fed back to the data cleaning service to refine the automatic cleaning logic;
the data cleaning unit is also used for completing cleaning result feedback and logging, which specifically includes:
Sending cleaning results: the cleaned data is sent back to the SQS queue in JSON format or stored in the database for subsequent data aggregation and analysis;
Logging and monitoring: data cleaning log information is recorded in the CloudWatch Logs service, at least including the cleaning time, cleaning rules, and cleaning result fields; alarm rules are set to monitor abnormal conditions and report them in real time.
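One plausible shape for the rule parsing and execution step is to compile the Web-configured rule list into a chain of callables. The rule schema below is a hypothetical stand-in for whatever the drag-and-drop interface would actually emit.

```python
from typing import Callable

# Hypothetical rule schema; the patent only says rules arrive from a Web UI
# as components plus parameters.
RULES = [
    {"op": "strip",      "field": "name"},
    {"op": "default",    "field": "status", "value": "unknown"},
    {"op": "cast_float", "field": "amount"},
]

def compile_rule(rule: dict) -> Callable[[dict], dict]:
    """Turn one configured rule into an executable cleaning step."""
    field = rule["field"]
    def strip(rec):      rec[field] = rec.get(field, "").strip(); return rec
    def default(rec):    rec.setdefault(field, rule["value"]);    return rec
    def cast_float(rec): rec[field] = float(rec[field]);          return rec
    return {"strip": strip, "default": default, "cast_float": cast_float}[rule["op"]]

def apply_rules(record: dict, rules: list) -> dict:
    for step in map(compile_rule, rules):
        record = step(record)
    return record

print(apply_rules({"name": " Wu ", "amount": "12.5"}, RULES))
# {'name': 'Wu', 'amount': 12.5, 'status': 'unknown'}
```

A record that raises in `cast_float` (an unparseable amount, say) is exactly the kind of case the text routes to the pending-review state for manual auditing.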
The data aggregation unit is used for completing data aggregation rule configuration and parsing, real-time and offline data aggregation, and the visual display of aggregation results.
In this embodiment, the data aggregation includes:
Grouping and aggregation: data is grouped by one or more fields according to business demand, and statistical indicators (such as counts, averages, and standard deviations) are calculated for each group; these indicators provide strong support for subsequent data analysis and visualization.
Time-window aggregation: for time-series data, aggregation analysis is performed over a specified time window (such as a day, week, or month), which helps reveal trends and periodicity in the data over time.
Hierarchical aggregation: aggregation is performed across multiple dimensions to meet complex business requirements; for example, grouping by region first and then aggregating within time windows reveals the data characteristics of different regions in different periods.
Custom aggregation: custom aggregation functions or scripts are written for specific business requirements, handling complex scenarios that standard aggregation functions cannot cover directly (a sketch of these aggregation patterns follows this list).
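The four aggregation patterns map naturally onto pandas operations; the following sketch, with illustrative column names, shows one minimal form of each.

```python
import pandas as pd

df = pd.DataFrame({
    "ts":     pd.date_range("2024-04-01", periods=6, freq="12h"),
    "region": ["east", "west", "east", "west", "east", "west"],
    "value":  [3, 5, 2, 8, 4, 6],
})

# Grouping and aggregation: count, mean, and standard deviation per region.
by_region = df.groupby("region")["value"].agg(["count", "mean", "std"])

# Time-window aggregation: daily totals over the time series.
daily = df.set_index("ts")["value"].resample("1D").sum()

# Hierarchical aggregation: group by region first, then a daily window.
layered = (df.set_index("ts")
             .groupby("region")["value"]
             .resample("1D").sum())

# Custom aggregation: for example, the range (max - min) within each group.
spread = df.groupby("region")["value"].agg(lambda s: s.max() - s.min())
```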
In this embodiment, the data aggregation unit is configured to complete data aggregation rule configuration and parsing, which specifically includes:
Web interface for configuring aggregation rules: a Web interface is developed that allows users to customize data aggregation rules by dragging components and configuring parameters;
Rule parsing and execution: the back-end service parses the user-configured aggregation rules, generates executable data aggregation logic, and applies it to the data aggregation service;
the data aggregation unit is also used for completing real-time and offline data aggregation, which specifically includes:
Real-time data stream processing: the Flink stream processing framework is used to perform aggregation analysis on real-time data streams, with the aggregation results displayed on the Web interface in real time or pushed to clients via WebSocket;
Offline batch processing: the Spark batch processing framework is used to perform batch aggregation analysis on the historical data stored in the database, with the aggregation results saved to the database for subsequent querying and analysis, and timed and periodic tasks are set up to schedule the execution of offline aggregation tasks (a PySpark sketch follows below).
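An offline batch aggregation in Spark might be sketched as follows in PySpark; the inline DataFrame replaces the JDBC read from the history database so the example stays self-contained.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("offline-aggregation").getOrCreate()

# In production this would be a JDBC read from the history database,
# e.g. spark.read.jdbc(...); inline rows keep the sketch runnable.
df = spark.createDataFrame(
    [("east", "2024-04-01", 3), ("west", "2024-04-01", 5),
     ("east", "2024-04-02", 2), ("west", "2024-04-02", 8)],
    ["region", "day", "value"])

result = (df.groupBy("region", "day")
            .agg(F.sum("value").alias("total"),
                 F.avg("value").alias("average")))

# The result would be written back for later querying, e.g. with
# result.write.jdbc(...) or result.write.parquet(...).
result.show()
spark.stop()
```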
In some preferred embodiments, the data aggregation unit is further configured to complete the visual display of aggregation results, which specifically includes:
Front-end visualization library integration: the ECharts or Highcharts front-end visualization library is integrated to display the aggregation results on the Web interface in chart form, providing rich interactive functions that at least include data filtering, sorting, and drill-down;
Back-end data interface: the back-end service provides a data interface for the aggregation results, which the front-end visualization library calls to display the data; the interface should support paging, sorting, and filtering query parameters to meet the needs of front-end presentation (a minimal endpoint sketch follows below).
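The patent builds its services with Spring Boot; purely for brevity, this sketch of a paging, sorting, and filtering aggregation-result endpoint uses Python's Flask, with in-memory stand-in data in place of the aggregation store.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for aggregation results already computed and stored.
RESULTS = [{"region": r, "total": t} for r, t in
           [("east", 42), ("west", 57), ("north", 13), ("south", 28)]]

@app.get("/api/aggregations")
def aggregations():
    page = int(request.args.get("page", 1))     # paging parameters
    size = int(request.args.get("size", 2))
    sort = request.args.get("sort", "total")    # sorting parameter
    region = request.args.get("region")         # optional filter

    rows = [r for r in RESULTS if region is None or r["region"] == region]
    rows = sorted(rows, key=lambda r: r[sort], reverse=True)
    start = (page - 1) * size
    return jsonify({"page": page, "total": len(rows),
                    "items": rows[start:start + size]})

if __name__ == "__main__":
    app.run(port=8080)
```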
In some preferred embodiments, the K-Means algorithm may be used in data aggregation. K-Means is a simple and widely used clustering algorithm that attempts to partition the data into a predetermined number K of clusters, minimizing the within-cluster sum of squared distances by iteratively optimizing the cluster centers (centroids).
The basic steps of the K-Means algorithm are as follows:
Initialization: K data points are randomly selected as the initial cluster centers. The centers may be chosen at random from the samples or selected based on prior knowledge or experience.
Assigning samples: for each data point in the dataset, its distance (for example, the Euclidean distance) to each cluster center is computed, and the point is assigned to the cluster whose center is nearest.
Updating cluster centers: for each cluster, the mean (that is, the center position) of all data points in the cluster is recomputed and taken as the new cluster center.
Iterating: the sample-assignment and center-update steps are repeated until the cluster centers no longer change appreciably or a preset maximum number of iterations is reached. In each iteration the cluster centers move toward the centers of their assigned points, gradually refining the clustering.
Outputting results: when the termination condition is met, the algorithm stops iterating and outputs the final K cluster centers and the cluster to which each data point belongs (a sketch follows below).
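These steps translate directly into a short NumPy implementation. This is a generic K-Means sketch, not code from the patent; the toy two-blob dataset at the end is purely illustrative.

```python
import numpy as np

def k_means(points: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct samples as the starting centroids.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment: each point joins the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        # Termination: stop once the centers no longer move appreciably.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

pts = np.vstack([np.random.default_rng(1).normal(m, 0.3, (20, 2))
                 for m in (0.0, 3.0)])
centers, labels = k_means(pts, k=2)
```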
The data storage and management unit is used for completing storage medium selection and configuration, data encryption and access control, and vulnerability and attack protection.
In this embodiment, data storage and management includes:
Data storage strategy: a reasonable storage strategy is designed according to factors such as the access frequency, importance, and size of the data. For example, hot data is stored in a high-performance database for fast access, while cold data is migrated to a low-cost storage tier to save costs. A data backup and recovery strategy is also implemented to ensure the security and availability of the data.
Data encryption: sensitive data is stored and transmitted in encrypted form to prevent unauthorized access and leakage, with appropriate encryption algorithms and a key management scheme chosen to ensure the effectiveness and security of the encryption.
Access control and authentication: strict access control policies are enforced so that only authorized users can access sensitive data and functions, and multi-factor authentication (such as username and password, dynamic tokens, and biometrics) strengthens the security of user access. User access logs are audited periodically to uncover potential security risks.
Vulnerability and attack protection: network security facilities such as firewalls and intrusion detection systems are deployed to discover and defend against network attacks in time; system security vulnerabilities are scanned and patched periodically to ensure the security and stability of the system; and an emergency response mechanism is established to respond rapidly to sudden security events.
In this embodiment, the data storage and management unit is configured to complete storage medium selection and configuration, which specifically includes:
SSD and HDD storage medium selection: according to the access frequency and importance of the data, hot data is stored on high-performance SSD storage media (such as AWS EBS volumes) and cold data on lower-cost HDD-class storage (such as the AWS S3 standard storage class);
Data backup and recovery strategy: a regular data backup strategy is implemented, backing data up to the S3 object storage service or other reliable storage media; a data recovery process is configured to ensure that data can be recovered promptly in the event of a failure;
the data storage and management unit is also used for completing data encryption and access control, which specifically includes:
Encrypted data storage and transmission: sensitive data is encrypted in storage and in transit, with the encryption keys managed by the AWS KMS service; for data stored in the database, the database's own encryption features or field-level encryption techniques protect its security (a KMS encryption sketch follows this passage);
Identity authentication and authorized access: the AWS IAM service is used for user authentication and authorized access control, with appropriate permissions and roles assigned to each user so that only authorized users can access sensitive data and perform sensitive operations, and multi-factor authentication is implemented to improve account security.
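Field encryption through AWS KMS could be sketched with boto3 as below. The key alias and the sample plaintext are illustrative assumptions; note that `kms.encrypt` is limited to 4 KB of plaintext, which suits individual sensitive fields.

```python
import boto3

kms = boto3.client("kms", region_name="us-east-1")

# Hypothetical key alias; in practice the key is created and rotated in KMS.
KEY_ID = "alias/platform-data-key"

ciphertext = kms.encrypt(KeyId=KEY_ID,
                         Plaintext=b"sensitive-field-value")["CiphertextBlob"]

# The ciphertext blob embeds the key reference, so decrypt needs no KeyId.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
assert plaintext == b"sensitive-field-value"
```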
In some preferred embodiments, the data storage and management unit is further configured to complete vulnerability and attack protection, which specifically includes:
Security service integration and configuration: the AWS Shield and GuardDuty security services are integrated to defend against DDoS attacks and detect potential security threats; security groups and network access control lists are configured to limit inbound and outbound traffic and prevent unauthorized access and attacks;
Periodic security assessment and vulnerability scanning: the system is assessed for security at regular intervals, at least including code review, configuration checks, and vulnerability scanning; automated tools such as AWS Inspector are used to provide vulnerability scanning and remediation suggestions; a security incident response mechanism is established, with a detailed incident handling process formulated and drills conducted regularly to improve response speed and handling capacity; and security event log information is recorded in the CloudWatch Logs service for subsequent auditing and analysis.
The embodiment of the invention constructs a cloud-computing-based data sharing and collaborative processing platform, comprising: a system architecture construction unit, a data acquisition unit, a data cleaning unit, a data aggregation unit, and a data storage and management unit. The invention is designed specifically for the data of relevant departments and aims to realize efficient collection, processing, storage, analysis, sharing, and collaborative case handling of that data, thereby greatly improving the departments' operational efficiency, optimizing resource allocation, and markedly improving the public's satisfaction with security.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. The processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described in this disclosure may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. These software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "includes" is intended to be inclusive in a manner similar to the term "comprising," as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a non-exclusive "or".

Claims (10)

1. A cloud-computing-based data sharing and collaborative processing platform, characterized in that it comprises: a system architecture construction unit, a data acquisition unit, a data cleaning unit, a data aggregation unit, and a data storage and management unit; wherein:
The system architecture construction unit is used for completing cloud computing platform selection and configuration, microservice splitting and containerized deployment, API interface design and implementation, and message queue selection and configuration;
the data acquisition unit is used for completing data source docking and configuration, the data acquisition strategy and scheduling, and the implementation of data temporary storage and logging;
the data cleaning unit is used for completing data cleaning strategy configuration and parsing, automatic data cleaning and manual auditing, and the implementation of cleaning result feedback and logging;
the data aggregation unit is used for completing data aggregation rule configuration and parsing, real-time and offline data aggregation, and the visual display of aggregation results;
the data storage and management unit is used for completing storage medium selection and configuration, data encryption and access control, and vulnerability and attack protection.
2. The cloud-computing-based data sharing and collaborative processing platform as claimed in claim 1, wherein the system architecture construction unit is configured to complete cloud computing platform selection and configuration, which specifically includes:
AWS cloud computing platform: AWS is selected as the infrastructure provider, and the system is built on its rich set of cloud services;
EC2 instance configuration: according to system requirements, EC2 instances of different types and sizes are configured, covering CPU, memory, storage, and network settings;
VPC and subnet configuration: a VPC is created in AWS and divided into several subnets to isolate different service components, the components at least including the front end, the back end, and the database;
Security groups and network access control: a security group is configured for each subnet, defining the allowed and denied inbound and outbound traffic rules to ensure network security;
the system architecture construction unit is also used for completing microservice splitting and containerized deployment, which specifically includes:
Service splitting: the system is split into five microservices (data acquisition, cleaning, aggregation, storage, and security), each responsible for a specific function;
Docker containerization: a Dockerfile is written for each microservice, defining the container image build process, which at least covers base image selection, dependency installation, and code copying;
Local development and testing: Docker Compose is used to build a local development environment that simulates the operation and interaction of the multiple microservices;
ECS cluster configuration: an ECS cluster is created in AWS, with the cluster's capacity providers and auto-scaling policies configured for container management and orchestration in the production environment;
ECR image repository: an ECR repository is created for storing and managing the Docker images, and image push and pull permissions are configured.
3. The cloud-computing-based data sharing and collaborative processing platform as claimed in claim 1, wherein the system architecture construction unit is further configured to design and implement the API interfaces, which specifically includes:
RESTful API design: a RESTful API interface specification is defined, covering at least URL design, request methods, parameter passing, and response formats;
Spring Boot framework: Spring Boot is used to rapidly build the microservice applications, which run in the embedded Tomcat container;
Swagger document generation: the Swagger framework is integrated to generate API documentation automatically and to provide an interface-testing function;
API Gateway configuration: an API Gateway instance is created in AWS, with API endpoints, authentication, and request throttling rules configured, so that all API requests are managed through a unified entry point;
the system architecture construction unit is further configured to complete message queue selection and configuration, which specifically includes:
SQS queue creation: an SQS queue is created in AWS and its attributes configured, at least including the visibility timeout and the message retention period;
Data acquisition and transmission: a data acquisition script or plug-in is written to send the collected raw data to the SQS queue;
Data cleaning and consumption: a data cleaning service is written to consume data from the SQS queue for cleaning, sending the cleaned data back to the queue or storing it in the database.
4. The cloud-computing-based data sharing and collaborative processing platform as claimed in claim 1, wherein the data acquisition unit is configured to complete data source docking and configuration, which specifically includes:
Database docking: a JDBC connection pool is configured with the database connection parameters, database tables are queried periodically, and the data is sent to the SQS queue;
API interface docking: third-party API interfaces are called to acquire data, with the API's authentication, request parameters, and response results handled, and the acquired data is sent to the SQS queue;
Web page data capture: Selenium or Puppeteer tools are used to simulate browser behavior, capture web page content, extract the required data, and send it to the SQS queue;
the data acquisition unit is also used for completing the data acquisition strategy and scheduling, which specifically includes:
Acquisition interval and data volume limits: appropriate acquisition intervals and data volume limits are set according to the importance and update frequency of each data source, avoiding excessive pressure on the source;
Real-time data acquisition: long polling or WebSocket technology is used to acquire and transmit data in real time;
Timed task scheduling: timed tasks are configured using Cron expressions or a timed-task framework to schedule the execution of data acquisition tasks.
5. The cloud-computing-based data sharing and collaborative processing platform as claimed in claim 1, wherein the data acquisition unit is further configured to complete data temporary storage and logging, which specifically includes:
Temporary storage of data in the SQS queue: the collected raw data is temporarily stored in the SQS queue in JSON format, awaiting subsequent processing; the message visibility timeout and retry strategy are set to ensure reliable data transmission;
Logging to CloudWatch Logs: a logger is configured to send data acquisition log information to the AWS CloudWatch Logs service, at least including the acquisition time, data source, data volume, and acquisition result fields; the log retention period and access rights are set to meet compliance requirements.
6. The cloud-computing-based data sharing and collaborative processing platform as claimed in claim 1, wherein the data cleaning unit is configured to complete data cleaning strategy configuration and parsing, which specifically includes:
Web interface for configuring cleaning rules: a Web interface is developed that allows users to customize data cleaning rules by dragging components and configuring parameters;
Rule parsing and execution: the back-end service parses the user-configured cleaning rules, generates executable data cleaning logic, and applies it to the data cleaning service;
the data cleaning unit is also used for completing automatic data cleaning and manual auditing, which specifically includes:
Automatic cleaning: the data cleaning service fetches raw data from the SQS queue and cleans it automatically according to the loaded cleaning rules, at least including removing duplicate data, filling missing values, and converting data types;
Manual auditing mechanism: data that cannot be processed automatically, or whose processing result is uncertain, is marked as pending review; a Web interface is provided for specialists to judge and process it manually, and the audit results are fed back to the data cleaning service to refine the automatic cleaning logic;
the data cleaning unit is also used for completing cleaning result feedback and logging, which specifically includes:
Sending cleaning results: the cleaned data is sent back to the SQS queue in JSON format or stored in the database for subsequent data aggregation and analysis;
Logging and monitoring: data cleaning log information is recorded in the CloudWatch Logs service, at least including the cleaning time, cleaning rules, and cleaning result fields; alarm rules are set to monitor abnormal conditions and report them in real time.
7. The cloud-computing-based data sharing and collaborative processing platform as claimed in claim 1, wherein the data aggregation unit is configured to complete data aggregation rule configuration and parsing, which specifically includes:
Web interface for configuring aggregation rules: a Web interface is developed that allows users to customize data aggregation rules by dragging components and configuring parameters;
Rule parsing and execution: the back-end service parses the user-configured aggregation rules, generates executable data aggregation logic, and applies it to the data aggregation service;
the data aggregation unit is also used for completing real-time and offline data aggregation, which specifically includes:
Real-time data stream processing: the Flink stream processing framework is used to perform aggregation analysis on real-time data streams, with the aggregation results displayed on the Web interface in real time or pushed to clients via WebSocket;
Offline batch processing: the Spark batch processing framework is used to perform batch aggregation analysis on the historical data stored in the database, with the aggregation results saved to the database for subsequent querying and analysis, and timed and periodic tasks are set up to schedule the execution of offline aggregation tasks.
8. The data sharing and co-processing platform based on cloud computing as claimed in claim 1, wherein the data aggregation unit is further configured to implement the visual display of aggregation results, specifically comprising:
Front-end visualization library integration: the ECharts or Highcharts front-end visualization library is integrated to display the aggregation results on the Web interface in chart form and to provide rich interactive functions, at least comprising data filtering, sorting, and drill-down;
Back-end data interface provision: the back-end service exposes a data interface for the aggregation results, which the front-end visualization library calls to retrieve and display data; the interface supports paging, sorting, and filtering query parameters to meet the requirements of front-end presentation (an interface sketch follows).
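A minimal sketch of such an interface, assuming Flask as the back-end framework (the disclosure does not name one); query_aggregates is a hypothetical database helper stubbed here so the example runs:

```python
# Sketch of the aggregation-result interface with paging, sorting and
# filtering query parameters, as the front-end charts would call it.
from flask import Flask, jsonify, request

app = Flask(__name__)


def query_aggregates(category=None):
    """Hypothetical DB helper; replace with a real query."""
    rows = [{"day": "2024-04-10", "category": "a", "records": 12}]
    return [r for r in rows if category in (None, r["category"])]


@app.get("/api/aggregates")
def aggregates():
    page = int(request.args.get("page", 1))        # paging parameters
    size = int(request.args.get("size", 20))
    sort = request.args.get("sort", "day")         # sorting parameter
    category = request.args.get("category")        # filtering parameter

    rows = query_aggregates(category=category)
    rows.sort(key=lambda r: r.get(sort))
    start = (page - 1) * size
    return jsonify({"total": len(rows),
                    "items": rows[start:start + size]})
```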
9. The data sharing and co-processing platform based on cloud computing as claimed in claim 1, wherein the data storage and management unit is configured to implement storage medium selection and configuration, specifically comprising:
SSD and HDD storage medium selection: according to the access frequency and importance of the data, hot data is stored on high-performance SSD storage media and cold data is stored on lower-cost HDD storage media;
Data backup and recovery strategy: a regular data backup strategy is implemented to back up data to an object storage service or other reliable storage media; a data recovery procedure is configured to ensure that data can be restored promptly in the event of an incident (a backup sketch follows);
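A hedged sketch of the backup and recovery calls, assuming Amazon S3 as the object storage service (the claim does not name one) and a hypothetical bucket:

```python
# Minimal backup/recovery sketch against S3; bucket name is illustrative.
import datetime

import boto3

s3 = boto3.client("s3")
BUCKET = "platform-backups"  # hypothetical bucket


def backup_file(path: str):
    """Upload a dump file under a date-stamped key for later recovery."""
    key = f"{datetime.date.today():%Y/%m/%d}/{path.rsplit('/', 1)[-1]}"
    s3.upload_file(path, BUCKET, key)


def restore_file(key: str, dest: str):
    """Recovery path: fetch a backup object back to local storage."""
    s3.download_file(BUCKET, key, dest)
```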
the data storage and management unit is further configured to implement data encryption and access control, specifically comprising:
Encrypted data storage and transmission: sensitive data is encrypted for storage and transmission, with encryption keys managed by the AWS KMS service; for data stored in the database, security is protected using the database's built-in encryption features or field-level encryption (a field-level sketch follows);
Identity authentication and authorized access: the AWS IAM service is used for user authentication and access authorization; appropriate permissions and roles are assigned to each user so that only authorized users can access sensitive data and perform sensitive operations, and multi-factor authentication is enforced to improve account security.
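A minimal field-level encryption sketch using the AWS KMS API via boto3; the key alias is hypothetical, and for large payloads envelope encryption with generate_data_key would normally replace direct encrypt calls:

```python
# Hedged sketch: encrypt/decrypt a sensitive field with a KMS-managed key.
import boto3

kms = boto3.client("kms", region_name="us-east-1")
KEY_ID = "alias/platform-data-key"  # hypothetical key alias


def encrypt_field(plaintext: str) -> bytes:
    resp = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext.encode("utf-8"))
    return resp["CiphertextBlob"]   # store this in the database column


def decrypt_field(blob: bytes) -> str:
    resp = kms.decrypt(CiphertextBlob=blob)
    return resp["Plaintext"].decode("utf-8")
```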
10. The data sharing and co-processing platform based on cloud computing as claimed in claim 1, wherein the data storage and management unit is further configured to implement vulnerability and attack protection, specifically comprising:
Security service integration and configuration: the AWS Shield and GuardDuty security services are integrated to defend against DDoS attacks and detect potential security threats; security groups and network access control lists are configured to restrict inbound and outbound traffic and prevent unauthorized access and attacks;
Periodic security assessment and vulnerability scanning: security assessments of the system are carried out regularly, at least comprising code review, configuration checks, and vulnerability scanning; automated tools such as AWS Inspector are used to provide vulnerability scanning and remediation suggestions; a security incident response mechanism is established, a detailed security incident handling procedure is formulated, and drills are performed regularly to improve response speed and handling capacity; security event log information is recorded to the CloudWatch Logs service for subsequent auditing and analysis (a findings-retrieval sketch follows).
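As one hedged example of feeding such a response mechanism, the sketch below pulls high-severity GuardDuty findings with boto3; the severity cutoff and region are assumptions, and it presumes a GuardDuty detector is already enabled in the account:

```python
# Illustrative sketch: retrieve high-severity GuardDuty findings so they can
# drive the security-incident response flow and be logged for auditing.
import boto3

gd = boto3.client("guardduty", region_name="us-east-1")


def high_severity_findings():
    detector_id = gd.list_detectors()["DetectorIds"][0]  # assumes one detector
    ids = gd.list_findings(
        DetectorId=detector_id,
        FindingCriteria={"Criterion": {
            "severity": {"Gte": 7},  # high severity only (assumed cutoff)
        }},
    )["FindingIds"]
    if not ids:
        return []
    return gd.get_findings(DetectorId=detector_id,
                           FindingIds=ids)["Findings"]
```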
CN202410425496.0A 2024-04-10 2024-04-10 Data sharing and cooperative processing platform based on cloud computing Pending CN118474166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410425496.0A CN118474166A (en) 2024-04-10 2024-04-10 Data sharing and cooperative processing platform based on cloud computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410425496.0A CN118474166A (en) 2024-04-10 2024-04-10 Data sharing and cooperative processing platform based on cloud computing

Publications (1)

Publication Number Publication Date
CN118474166A 2024-08-09

Family

ID=92170014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410425496.0A Pending CN118474166A (en) 2024-04-10 2024-04-10 Data sharing and cooperative processing platform based on cloud computing

Country Status (1)

Country Link
CN (1) CN118474166A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118626541A (en) * 2024-08-14 2024-09-10 中共山东省委组织部党员教育中心 Structured data processing system and method in a unidirectional network
CN119046362A (en) * 2024-08-21 2024-11-29 广东粤财征信有限公司 Credit data integrated acquisition processing display system, method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination