CN117272077A - Data processing method, device, computer equipment and storage medium - Google Patents
- Publication number
- CN117272077A (application CN202311193229.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- serialized
- processed
- processing
- task
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the application belong to the fields of data processing and digital medical treatment and relate to a data processing method comprising: obtaining data to be processed, splitting it into a plurality of subtasks according to a preset splitting strategy, and adding each subtask to a RocketMQ queue; using an asynchronous thread to obtain a subtask to be executed from the RocketMQ queue and querying an original database for that subtask to obtain a result set; serializing the result set according to a preset protocol to obtain serialized data, and storing the serialized data; clustering the stored serialized data according to a pre-configured matching library to obtain clustered data; and performing data standardization, data combination and data cleaning on the clustered data to obtain target data. The application also provides a data processing device, computer equipment and a storage medium. The application improves the system's ability to respond quickly when a large amount of data is processed with multiple threads, and improves data processing efficiency.
Description
Technical Field
The present invention relates to the field of data processing and the field of digital medical treatment, and is applied to a scenario of processing a large amount of medical service settlement data generated in insurance business, and in particular, to a data processing method, a data processing device, a computer device and a storage medium.
Background
With the rapid development of digital medical treatment, the insurance industry has placed higher demands on the information management of medical service settlement data. Enterprises want not only to automate the medical service settlement process, but also to collect and analyze medical service settlement data more effectively so that these data resources can be put to good use.
However, in the insurance industry, medical services generate a large amount of business data and settlement data every day, and because these data must be retained for a long time, they accumulate continuously. The mass of data accumulated over a long period therefore has to be collected and analyzed so that dirty data of various kinds can be removed.
At present, a common solution is to divide the data into packets and then process the packets in parallel over multithreaded channels. In this process, however, factors such as excessive data volume, incomplete data and inconsistent data not only put heavy pressure on the system but also prevent it from responding quickly, which reduces data processing efficiency.
Disclosure of Invention
An embodiment of the application aims to provide a data processing method, a data processing device, computer equipment and a storage medium, so as to solve the technical problem of insufficient quick response capability of a system when a large amount of data generated by medical service is processed through multiple threads.
In order to solve the above technical problems, the embodiments of the present application provide a data processing method, which adopts the following technical schemes:
obtaining data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting strategy, and adding each subtask to a RocketMQ queue;
using an asynchronous thread to obtain a subtask to be executed from the RocketMQ queue, and querying an original database for the subtask to be executed to obtain a result set;
serializing the result set according to a preset protocol to obtain serialized data, and storing the serialized data;
clustering the stored serialized data according to a pre-configured matching library to obtain clustered data;
and performing data standardization, data combination and data cleaning on the clustered data to obtain target data.
Further, the step of serializing the result set according to a preset protocol to obtain and store serialized data specifically includes:
Converting the result set into a transmissible byte sequence as the serialized data according to the preset protocol;
storing the serialized data in a result database.
Further, the step of clustering the stored serialized data according to a pre-configured matching library to obtain clustered data specifically includes:
reading the serialized data from the result database by adopting an asynchronous thread;
and determining associated data of the serialized data according to the matching library, and combining the serialized data with the associated data to obtain the clustered data.
Further, the step of determining associated data of the serialized data according to the matching library and combining the serialized data with the associated data to obtain the clustered data specifically includes:
comparing the serialized data with each piece of preset configuration information in the matching library, and determining target configuration information matched with the serialized data as the associated data;
and extracting the associated data from the matching library, and writing the associated data into the serialized data to obtain the clustered data.
Further, the step of obtaining the data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting policy, and adding each subtask into a RocketMQ queue specifically includes:
acquiring data to be processed of a current batch in a batch processing task pool;
dividing the data to be processed into different main tasks according to different task dimensions, and splitting each main task to obtain each sub task;
and adding each subtask into the RocketMQ queue, and constructing a task relation table corresponding to the current batch according to each main task and each subtask.
Further, after the step of performing data standardization, data combination and data cleaning on the clustered data to obtain target data, the method further includes:
obtaining batch information corresponding to the target data, and determining a target task relation table according to the batch information;
performing deserialization on the target data to obtain detail data corresponding to the target data;
and executing a preset processing flow aiming at the detail data to obtain a processing result, and writing the processing result into the target task relation table.
Further, before the step of obtaining the data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting policy, and adding each subtask into a RocketMQ queue, the method further includes:
collecting original data from a plurality of data sources at regular time;
and adding the original data into a batch processing task pool and taking the original data as the data to be processed.
In order to solve the above technical problems, the embodiments of the present application further provide a data processing apparatus, which adopts the following technical schemes:
a data processing apparatus comprising:
an acquisition module, configured to obtain data to be processed, split the data to be processed into a plurality of subtasks according to a preset splitting strategy, and add each subtask to a RocketMQ queue;
a query module, configured to use an asynchronous thread to obtain a subtask to be executed from the RocketMQ queue, and to query an original database for the subtask to be executed to obtain a result set;
a serialization module, configured to serialize the result set according to a preset protocol to obtain serialized data, and to store the serialized data;
a clustering module, configured to cluster the stored serialized data according to a pre-configured matching library to obtain clustered data;
and a processing module, configured to perform data standardization, data combination and data cleaning on the clustered data to obtain target data.
In order to solve the above technical problems, the embodiments of the present application further provide a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the data processing method as described above.
In order to solve the above technical problems, embodiments of the present application further provide a computer readable storage medium, which adopts the following technical solutions:
a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of a data processing method as described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
In the data processing method disclosed in the application, data to be processed is obtained and split into a plurality of subtasks according to a preset splitting strategy, and each subtask is added to a RocketMQ queue; an asynchronous thread then obtains a subtask to be executed from the RocketMQ queue, and an original database is queried for that subtask to obtain a result set; the result set is serialized according to a preset protocol to obtain serialized data, which is stored; the stored serialized data is clustered according to a pre-configured matching library to obtain clustered data; finally, data standardization, data combination and data cleaning are performed on the clustered data to obtain target data. Because the data to be processed is distributed through message queue middleware as subtasks to be executed, and the query results are serialized and clustered by asynchronous threads, a large amount of data can be preprocessed, and the data produced by this flow is easier to summarize and analyze in subsequent operations. This improves the system's ability to respond quickly when a large amount of data generated by medical services is processed with multiple threads, and improves data processing efficiency for large data volumes.
Drawings
For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a data processing method according to the present application;
FIG. 3 is a schematic diagram of one embodiment of a data processing apparatus according to the present application;
FIG. 4 is a schematic diagram of an embodiment of the clustering module shown in FIG. 3;
FIG. 5 is a schematic structural diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (MPEG Audio Layer III) players, MP4 (MPEG-4 Part 14) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the data processing method provided in the embodiments of the present application is generally executed by a server, and accordingly, the data processing device is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a data processing method according to the present application is shown. The data processing method comprises the following steps:
step S201, obtaining data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting strategy, and adding each subtask into a RocketMQ queue;
In this embodiment, the server on which the data processing method runs (the server shown in FIG. 1) may receive data and requests related to the data processing method through a wired or wireless connection. The wireless connection may include, but is not limited to, a 3G/4G/5G connection, a Wi-Fi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (Ultra-Wideband) connection, and other wireless connection methods now known or developed in the future.
In this embodiment, when the data processing method runs, the data to be processed is obtained first, then split into a plurality of subtasks according to a preset splitting strategy, and each resulting subtask is added to a RocketMQ queue. Specifically, for insurance business, the business data and settlement data generated by medical services are stored in an original database. These data can be added, batch by batch, to a batch processing task pool, where they serve as the data to be processed. The data to be processed of the current batch is obtained from the batch processing task pool and divided into different main tasks according to different task dimensions, which include but are not limited to the underwriting institution of the policy, the settlement time interval, the settlement staff, the settlement expense type and the settlement type. Each main task is then split into a plurality of minimum units, which serve as subtasks. Data processing can then follow the producer-consumer pattern: the producer adds each subtask to the RocketMQ queue so that consumers can subsequently read subtasks from the queue and process them.
Note that the producer-consumer pattern separates the party that produces data from the party that consumes it, decoupling data production from data consumption so that processing can be asynchronous. In addition, RocketMQ, as message middleware based on a distributed queue model, can handle the large data volumes of this embodiment and ensures that data is delivered in strict order.
Optionally, each main task and each subtask obtained by splitting the data to be processed carries a state identifier, which divides the task state into pending, in processing, processing completed, processing failed, and so on. When a subtask is added to the RocketMQ queue, it is marked as pending; during subsequent processing, the state identifier is updated according to the processing progress so that it reflects the real-time state of the task.
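The splitting and enqueueing of step S201 can be illustrated with a minimal Python sketch. It is not the implementation of the application: a standard-library `queue.Queue` stands in for the RocketMQ queue, and the names `SubTask`, `split_batch`, `enqueue`, the `settlement_type` field and the chunk size are hypothetical choices made only for illustration.

```python
import queue
from dataclasses import dataclass
from itertools import groupby

PENDING = "pending"  # hypothetical label for the "to be processed" state


@dataclass
class SubTask:
    batch_id: str
    main_task: str      # value of the task dimension, e.g. a settlement type
    record_ids: list
    state: str = PENDING


def split_batch(batch_id, records, dimension, chunk_size):
    """Group records by one task dimension (main tasks), then cut each
    group into fixed-size chunks (subtasks)."""
    keyed = sorted(records, key=lambda r: r[dimension])
    subtasks = []
    for key, group in groupby(keyed, key=lambda r: r[dimension]):
        ids = [r["id"] for r in group]
        for i in range(0, len(ids), chunk_size):
            subtasks.append(SubTask(batch_id, key, ids[i:i + chunk_size]))
    return subtasks


def enqueue(subtasks, q):
    """Producer side: put every subtask on the queue and build a simple
    task relation table mapping each main task to its subtask indexes."""
    relation = {}
    for index, subtask in enumerate(subtasks):
        q.put(subtask)
        relation.setdefault(subtask.main_task, []).append(index)
    return relation


records = [
    {"id": 1, "settlement_type": "outpatient"},
    {"id": 2, "settlement_type": "inpatient"},
    {"id": 3, "settlement_type": "outpatient"},
    {"id": 4, "settlement_type": "outpatient"},
]
task_queue = queue.Queue()  # stand-in for the RocketMQ queue
subtasks = split_batch("batch-001", records, "settlement_type", chunk_size=2)
relation_table = enqueue(subtasks, task_queue)
```

In a real deployment the producer would publish each subtask as a message through a RocketMQ client instead of calling `q.put`, and the relation table would typically live in a database so that the state identifier of each task can be updated as processing progresses.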
Step S202, an asynchronous thread is adopted, a subtask to be executed is obtained from the RocketMQ queue, and result inquiry is carried out on the subtask to be executed through an original database, so that a result set is obtained;
In this embodiment, once the subtasks have been added to the RocketMQ queue, an asynchronous thread may be used to obtain a subtask to be executed from the RocketMQ queue and to query the original database for that subtask, yielding a result set. Specifically, under the producer-consumer pattern, after the producer has added subtasks to the RocketMQ queue, the consumer uses an asynchronous thread to obtain a subtask to be executed from the queue and calls the original database to run the query for it, obtaining the result set corresponding to that subtask. The result set is the object returned by the data query; it contains all data related to the queried subtask, and the subtask can subsequently be processed on the basis of this result set.
It can be understood that processing with asynchronous threads decouples the threads from one another: the consumer obtains subtasks to be executed directly from the RocketMQ queue and neither affects nor waits for the producer, which splits the data to be processed into subtasks and adds them to the queue. This improves processing efficiency.
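The consumer side of step S202 might look like the following sketch, given here only under stated assumptions: `queue.Queue` again stands in for the RocketMQ queue, an in-memory SQLite database stands in for the original database, and the table name `settlement`, its columns and the subtask names are hypothetical.

```python
import queue
import sqlite3
import threading

# Stand-in for the original database (hypothetical schema).
db = sqlite3.connect(":memory:", check_same_thread=False)
db_lock = threading.Lock()  # one shared connection, so serialize access
db.execute("CREATE TABLE settlement (id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO settlement VALUES (?, ?)",
               [(1, 120.0), (2, 80.5), (3, 42.0)])
db.commit()

task_queue = queue.Queue()   # stand-in for the RocketMQ queue
result_sets = {}             # subtask name -> rows returned by the query
results_lock = threading.Lock()


def consumer():
    """Take subtasks off the queue and query the database for each one."""
    while True:
        task = task_queue.get()
        if task is None:      # sentinel: no more subtasks for this worker
            break
        name, ids = task
        placeholders = ",".join("?" for _ in ids)
        sql = ("SELECT id, amount FROM settlement "
               f"WHERE id IN ({placeholders}) ORDER BY id")
        with db_lock:
            rows = db.execute(sql, ids).fetchall()
        with results_lock:
            result_sets[name] = rows


workers = [threading.Thread(target=consumer) for _ in range(2)]
for worker in workers:
    worker.start()

# The producer can keep adding subtasks while the consumers are running.
task_queue.put(("sub-1", [1, 2]))
task_queue.put(("sub-2", [3]))
for _ in workers:
    task_queue.put(None)
for worker in workers:
    worker.join()
```

The consumers never call into the producer; they only block on the queue, which is the decoupling the paragraph above describes.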
Step S203, serializing the result set according to a preset protocol to obtain and store serialized data;
In this embodiment, after the result set corresponding to the subtask to be executed has been obtained by querying the original database, the result set may be serialized according to a preset protocol, and the resulting serialized data stored. Specifically, so that the result set can be stored across platforms and transmitted over a network, it is serialized, which converts unstructured data into structured data. The preset protocol is a pre-configured communication protocol that defines the format used by data units, the information an information unit contains and its meaning, the connection mode, and the timing of sending and receiving information. According to the preset protocol, the result set is converted into a transmissible byte sequence, which is the serialized data; the serialized data can be stored independently in a result database for subsequent processing.
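As an illustration of step S203, the sketch below serializes a result set to a byte sequence and stores it. The application does not specify the preset protocol, so the format used here (a 4-byte big-endian length header followed by UTF-8 JSON) is purely an assumption, as are the function names, the `serialized` table and the in-memory SQLite database standing in for the result database.

```python
import json
import sqlite3
import struct


def serialize(result_set):
    """Encode rows as a 4-byte big-endian length header plus UTF-8 JSON
    (an assumed format, not the protocol of the application)."""
    payload = json.dumps(result_set, sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def deserialize(data):
    """Inverse of serialize: read the header, then decode the JSON body."""
    (length,) = struct.unpack(">I", data[:4])
    return json.loads(data[4:4 + length].decode("utf-8"))


# Stand-in for the result database (hypothetical schema).
result_db = sqlite3.connect(":memory:")
result_db.execute(
    "CREATE TABLE serialized (subtask TEXT PRIMARY KEY, payload BLOB)")

rows = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 80.5}]
blob = serialize(rows)
result_db.execute("INSERT INTO serialized VALUES (?, ?)", ("sub-1", blob))
result_db.commit()

stored = result_db.execute(
    "SELECT payload FROM serialized WHERE subtask = ?", ("sub-1",)
).fetchone()[0]
restored = deserialize(stored)
```

Storing the bytes under the subtask name lets a later, independent thread read them back for clustering without touching the original database again.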
Step S204, clustering the stored serialization data according to a pre-configured matching library to obtain clustered data;
In this embodiment, after the result set has been serialized and stored, the stored serialized data can be clustered according to a pre-configured matching library to obtain clustered data. Specifically, the data is clustered to preserve data integrity. To decouple this step from the serialization of the result set described above, an asynchronous thread may be used to read the stored serialized data from the result database; the associated data of the serialized data is then determined according to the pre-configured matching library and combined with the serialized data that was read, yielding the clustered data.
Step S205, performing data standardization, data combination and data cleaning on the clustered data to obtain target data.
In this embodiment, after the serialized data has been clustered to obtain clustered data, data standardization, data combination and data cleaning may be performed on the clustered data to obtain target data. Specifically, data standardization may use dispersion (min-max) standardization or standard deviation (z-score) standardization, and removes differences in scale among the clustered data; data combination can define several different data types as one class so that the data share the same section of memory, which saves space; and data cleaning detects and deletes duplicate values and abnormal values so that the data is more valid and accurate. Processing the data to be processed into target data in this way serves as a preprocessing stage for a large amount of data; summarizing and analyzing on the basis of the target data afterwards improves data processing efficiency and makes effective use of data resources.
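The two standardization modes and the cleaning step named above can be sketched as follows. Only standardization and cleaning are shown; data combination is left out. The function names, the sample records and the z-score threshold used to flag abnormal values are assumptions for illustration, not values taken from the application.

```python
import statistics


def min_max(values):
    """Dispersion (min-max) standardization: rescale values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standard deviation standardization: zero mean, unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def clean(records, key, field, max_abs_z):
    """Drop repeated keys (keeping the first occurrence), then drop
    records whose field value is further than max_abs_z from the mean."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    scores = z_score([r[field] for r in unique])
    return [r for r, z in zip(unique, scores) if abs(z) <= max_abs_z]


records = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 110.0},
    {"id": 2, "amount": 110.0},    # duplicate record
    {"id": 3, "amount": 105.0},
    {"id": 4, "amount": 95.0},
    {"id": 5, "amount": 10000.0},  # abnormal value
]
cleaned = clean(records, "id", "amount", max_abs_z=1.5)
scaled = min_max([r["amount"] for r in cleaned])
```

A z-score threshold is only one way to flag abnormal values; with very small samples the largest attainable z-score is bounded, so the threshold has to be chosen with the sample size in mind.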
In the application, the data to be processed is distributed through message queue middleware as subtasks to be executed, and the query results are serialized and clustered by asynchronous threads. A large amount of data can therefore be preprocessed, and the data produced by this flow is easier to summarize and analyze in subsequent operations. This improves the system's ability to respond quickly when a large amount of data generated by medical services is processed with multiple threads, and improves data processing efficiency for large data volumes.
In some optional implementations of this embodiment, the step of serializing the result set according to a preset protocol to obtain serialized data and storing the serialized data includes:
converting the result set into a transmissible byte sequence as the serialized data according to the preset protocol;
storing the serialized data in a result database.
In this embodiment, after the original database has been queried for the subtask to be executed and the result set obtained, the result set can be converted into a transmissible byte sequence according to the preset protocol; this byte sequence serves as the serialized data and is stored in the result database. Specifically, serializing data means converting it into a standardized format, which in this embodiment is specified by the preset protocol, so that the data can be used across programs and platforms while its original content and specification are preserved. The result set is thus converted into a transmissible byte sequence according to the preset protocol and stored as serialized data in the result database, from which it can later be extracted asynchronously for clustering.
In the application, the unstructured query result is converted into structured data so that the query result can be clustered by asynchronous threads and a large amount of data can be preprocessed. This improves the system's ability to respond quickly when a large amount of data generated by medical services is processed with multiple threads, and improves data processing efficiency for large data volumes.
In some optional implementations of this embodiment, the step of clustering the stored serialized data according to the pre-configured matching library to obtain clustered data includes:
reading the serialized data from the result database by adopting an asynchronous thread;
and determining associated data of the serialized data according to the matching library, and combining the serialized data with the associated data to obtain the clustered data.
In this embodiment, after the result set has been converted into serialized data and stored in the result database, an asynchronous thread may be used to read the stored serialized data from the result database; the associated data of the serialized data is then determined according to the pre-configured matching library and combined with the serialized data to obtain clustered data. Specifically, the pre-configured matching library contains pieces of preset configuration information. The serialized data is compared with each piece of preset configuration information, the target configuration information that matches the serialized data is determined and used as the associated data, and the associated data is extracted from the matching library and written into the serialized data, yielding the clustered data.
According to the method and the device, the serialization process and the clustering process of the data are decoupled through asynchronous processing, and the serialized result set is clustered, so that the similarity of the data in the result set is ensured, the quick response capability of a system during subsequent data processing is improved, and the data processing efficiency is improved.
In some optional implementations of this embodiment, the step of determining the association data of the serialized data according to the matching library, and combining the serialized data with the association data to obtain the cluster data includes:
comparing the serialized data with each piece of preset configuration information in the matching library, and determining target configuration information matched with the serialized data as the associated data;
and extracting the associated data from the matching library, and writing the associated data into the serialized data to obtain the clustered data.
In this embodiment, after the stored serialized data is read from the result database, the serialized data may be compared with each piece of preset configuration information in the matching library, and the target configuration information that matches the serialized data is determined and used as the associated data. The associated data is then extracted from the matching library and written into the serialized data, thereby obtaining the clustered data. Specifically, the matching library may be a fuzzy string matching toolkit, for example the FuzzyWuzzy library, and each piece of preset configuration information in the matching library may be a preset byte sequence. Matching the serialized data against the matching library means determining, according to the differences between byte sequences, the byte sequence that matches the serialized data and using it as the associated data. The associated data is extracted from the matching library and written into the serialized data by editing the serialized data, which achieves the purpose of clustering and yields the clustered data.
In the present application, the serialized result set is clustered by querying the matching library, which ensures the similarity of the data within the result set, improves the accuracy of the clustered data, and improves the efficiency of subsequent data processing.
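The matching step described above can be illustrated with a short sketch. To stay dependency-free it uses `difflib.SequenceMatcher` from the Python standard library in place of FuzzyWuzzy, and it compares decoded text rather than raw byte sequences; both choices are assumptions made for illustration. The field name `org_name`, the output key `cluster_key`, and the 0.6 threshold are likewise hypothetical.

```python
import json
from difflib import SequenceMatcher


def best_match(text, presets, threshold=0.6):
    """Return the preset entry most similar to text, or None if below threshold."""
    best, best_score = None, 0.0
    for preset in presets:
        # ratio() is a similarity score in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, text.lower(), preset.lower()).ratio()
        if score > best_score:
            best, best_score = preset, score
    return best if best_score >= threshold else None


def cluster_serialized(payload, presets, field="org_name"):
    """Write the matched preset (the associated data) into each serialized record."""
    records = json.loads(payload.decode("utf-8"))
    for record in records:
        record["cluster_key"] = best_match(str(record.get(field, "")), presets)
    # Re-serialize so the clustered data keeps the same byte-sequence form.
    return json.dumps(records, sort_keys=True).encode("utf-8")
```

Records whose field values are spelled slightly differently end up with the same `cluster_key`, which is the sense in which writing the associated data into the serialized data groups similar records together.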
In some optional implementations of this embodiment, the step of obtaining the data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting policy, and adding each subtask to a RocketMQ queue includes:
acquiring data to be processed of a current batch in a batch processing task pool;
dividing the data to be processed into different main tasks according to different task dimensions, and splitting each main task to obtain each sub task;
and adding each subtask into the RocketMQ queue, and constructing a task relation table corresponding to the current batch according to each main task and each subtask.
In this embodiment, when the data processing method starts to run, the data to be processed of the current batch may first be obtained from the batch processing task pool. The data to be processed is divided into different main tasks according to different task dimensions, and each main task is split to obtain the subtasks. Each subtask is then added to the RocketMQ queue, and a task relation table corresponding to the current batch is constructed from the main tasks and the subtasks. Specifically, in order to process a large amount of data in batches, different batches of data to be processed are added to the batch processing task pool, and the data to be processed of the current batch is obtained from that pool. The current batch is divided into different main tasks according to task dimensions of a policy such as the underwriting institution, the settlement time interval, the settlement staff, the settlement expense type, and the settlement type. Each main task is split into a plurality of subtasks so that data processing can subsequently be performed asynchronously in a producer-consumer mode. At the same time, the task relation table corresponding to the current batch is constructed from the main tasks and subtasks, so that the processing results obtained by subsequently processing the final target data can be filled into the task relation table to complete data summarization and analysis.
In the present application, the batch processing task pool temporarily caches the data to be processed and processes it in batches, which avoids interference caused by an excessive data volume. Splitting the data to be processed into minimum units for processing ensures that the system can respond in time and improves the accuracy of subsequent data processing.
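The splitting step described above can be sketched as follows. This is an illustration under stated assumptions: an in-process `queue.Queue` stands in for the RocketMQ queue, subtasks are formed by fixed-size chunking (the application does not specify the splitting policy), and the identifiers such as `split_batch` and the `-M`/`-S` naming scheme are hypothetical.

```python
import queue
from collections import defaultdict


def split_batch(batch_id, records, dimensions, chunk_size, mq):
    """Split one batch into main tasks and subtasks, enqueue the subtasks,
    and return the task relation table for the batch."""
    # Group records into main tasks by the values of the chosen task dimensions.
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        groups[key].append(record)

    relation_table = []
    ordered = sorted(groups.items(), key=lambda kv: kv[0])
    for main_no, (key, items) in enumerate(ordered, start=1):
        main_id = f"{batch_id}-M{main_no}"
        # Split each main task into subtasks of at most chunk_size records.
        starts = range(0, len(items), chunk_size)
        for sub_no, start in enumerate(starts, start=1):
            sub_id = f"{main_id}-S{sub_no}"
            mq.put({"subtask_id": sub_id,
                    "records": items[start:start + chunk_size]})
            # One row per subtask; "result" is filled in after processing.
            relation_table.append({
                "batch_id": batch_id,
                "main_task": main_id,
                "dimensions": dict(zip(dimensions, key)),
                "subtask": sub_id,
                "result": None,
            })
    return relation_table
```

Keeping an empty `result` slot per subtask in the relation table is one simple way to let later stages write processing results back to the row that produced them.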
In some optional implementations of this embodiment, after the step of obtaining the target data after performing data normalization, data combination, and data cleaning on the clustered data, the method further includes:
obtaining batch information corresponding to the target data, and determining a target task relation table according to the batch information;
performing deserialization on the target data to obtain detail data corresponding to the target data;
and executing a preset processing flow aiming at the detail data to obtain a processing result, and writing the processing result into the target task relation table.
In this embodiment, after the target data is obtained, the batch information corresponding to the target data may be obtained and the target task relation table determined according to the batch information. The target data is then deserialized to obtain the detail data corresponding to the target data, a preset processing flow is executed on the detail data to obtain a processing result, and the processing result is written into the target task relation table. Specifically, because a task relation table corresponding to the current batch is constructed from the main tasks and subtasks when the data to be processed is split, the task relation table that corresponds to the batch information of the target data can be identified and used as the target task relation table once the target data is obtained. Deserializing the target data yields the corresponding detail data. Multiple threads can then be started through CompletableFuture to generate the required data from the detail data, and once the processing result, such as financial data, is obtained, it is written back to the target task relation table to complete the summarization and analysis of the data.
It should be noted that Future is an interface introduced in Java 5 that supports asynchronous parallel computation. CompletableFuture is an implementation class of the Future interface. It can run tasks asynchronously on multiple threads and provides a mechanism similar to the observer pattern, so that a listener can be notified after task execution is completed.
In the present application, the target data obtained after preprocessing is processed accordingly, and the processing result is written back to the previously generated task relation table, so that the data to be processed is summarized and analyzed and data processing efficiency is improved.
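The write-back flow described above can be sketched as follows. The embodiment names Java's CompletableFuture; this sketch instead uses Python's `concurrent.futures`, whose `add_done_callback` offers a comparable completion notification, so it should be read as an analogy rather than the implementation of the application. The `settle` function, the payload layout (`subtask`, `rows`, `amount`), and the JSON encoding are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def settle(detail):
    """Hypothetical processing flow: total the amounts of one subtask's rows."""
    return sum(row["amount"] for row in detail["rows"])


def _write_back(entry):
    """Build a completion callback that fills one relation-table row."""
    def callback(future):
        entry["result"] = future.result()
    return callback


def process_target_data(payloads, relation_tables, batch_id, flow=settle, workers=4):
    # Determine the target task relation table from the batch information.
    table = relation_tables[batch_id]
    by_subtask = {entry["subtask"]: entry for entry in table}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for payload in payloads:
            # Deserialize the target data into detail data.
            detail = json.loads(payload.decode("utf-8"))
            future = pool.submit(flow, detail)
            # Observer-style notification: runs when the task completes.
            future.add_done_callback(_write_back(by_subtask[detail["subtask"]]))
    # Leaving the with-block waits for all tasks and their callbacks.
    return table
```

Each callback touches only its own row of the relation table, so no additional locking is needed in this simplified form.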
In some optional implementations of this embodiment, before the step of obtaining the data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting policy, and adding each subtask to a RocketMQ queue, the method further includes:
collecting original data from a plurality of data sources at regular time;
and adding the original data into a batch processing task pool and taking the original data as the data to be processed.
In this embodiment, before the data processing method is run, raw data may be collected from multiple data sources at scheduled times and added to the batch processing task pool as the data to be processed. Specifically, the insurance industry generates a large amount of service data every day. The system can collect this service data from multiple data sources at the configured times, treat it as raw data, and add it to the batch processing task pool as the data to be processed, so that the data to be processed can be batch-processed asynchronously through the batch processing task pool.
In the present application, data is collected at scheduled times and added to the batch processing task pool as the data to be processed for asynchronous processing. This decouples the data collection process from the data processing process, which improves the quick-response capability of the system and improves data processing efficiency.
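The timed collection described above can be sketched as follows. The application does not describe how the schedule or the pool is implemented, so everything here is an assumption for illustration: data sources are modeled as zero-argument callables, the pool is an in-memory list guarded by a lock, and `BatchTaskPool`, `collect_once`, and `start_timed_collection` are hypothetical names.

```python
import threading


class BatchTaskPool:
    """Thread-safe pool holding batches of data to be processed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._batches = []

    def add_batch(self, records):
        with self._lock:
            self._batches.append(list(records))

    def next_batch(self):
        """Return the oldest batch, or None if the pool is empty."""
        with self._lock:
            return self._batches.pop(0) if self._batches else None


def collect_once(sources, pool):
    """Pull raw data from every source and add it to the pool as one batch."""
    raw = []
    for fetch in sources:
        raw.extend(fetch())
    if raw:
        pool.add_batch(raw)
    return len(raw)


def start_timed_collection(sources, pool, interval_seconds, stop_event):
    """Run collect_once every interval_seconds on a background thread
    until stop_event is set."""
    def loop():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval_seconds):
            collect_once(sources, pool)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The collector only ever appends to the pool, while the processing side only calls `next_batch`, which mirrors the decoupling of collection from processing.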
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by computer readable instructions stored in a computer readable storage medium which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (Read-Only Memory, ROM), or may be a random access memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a data processing apparatus, where an embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 3, the data processing apparatus 300 according to the present embodiment includes: an acquisition module 301, a query module 302, a serialization module 303, a clustering module 304, and a processing module 305. Wherein:
the acquiring module 301 is configured to acquire data to be processed, split the data to be processed into a plurality of subtasks according to a preset splitting policy, and add each subtask into a RocketMQ queue;
the query module 302 is configured to obtain a subtask to be executed from the RocketMQ queue by using an asynchronous thread, and perform result query on the subtask to be executed through an original database to obtain a result set;
the serialization module 303 is configured to serialize the result set according to a preset protocol, obtain serialized data, and store the serialized data;
the clustering module 304 is configured to cluster the stored serialized data according to a preconfigured matching library to obtain clustered data;
and the processing module 305 is used for obtaining target data after data normalization, data combination and data cleaning of the clustered data.
The data processing apparatus distributes the data to be processed as subtasks to be executed through message queue middleware, and serializes and clusters the query results by means of asynchronous threads. This enables preprocessing of a large amount of data and makes subsequent operations such as summarization and analysis more convenient to perform on the data obtained from this flow. It improves the quick-response capability of the system when the large amount of data generated by medical services is processed by multiple threads, and improves data processing efficiency for large data volumes.
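How the five modules fit together can be illustrated with a compact sketch. It is deliberately simplified and every concrete detail is an assumption: an in-process `queue.Queue` replaces the RocketMQ queue, a dict replaces the original database, the matching library is an exact lookup table rather than a fuzzy matcher, and the normalization, combination, and cleaning rules in `process` are invented for the example.

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor


def acquire(keys, chunk_size, mq):
    """Acquisition module: split the data into subtasks and enqueue them."""
    for start in range(0, len(keys), chunk_size):
        mq.put(keys[start:start + chunk_size])


def query(subtask, source_db):
    """Query module: look up each key of the subtask in the original database."""
    return [source_db[key] for key in subtask if key in source_db]


def serialize(result_set):
    """Serialization module: result set to byte sequence (JSON assumed)."""
    return json.dumps(result_set, sort_keys=True).encode("utf-8")


def cluster(payload, matching_library):
    """Clustering module: attach the matched group to every row."""
    rows = json.loads(payload.decode("utf-8"))
    for row in rows:
        row["group"] = matching_library.get(row["org"].strip().lower())
    return rows


def process(clustered_rows):
    """Processing module: normalize, combine, and clean the clustered rows."""
    combined = {}
    for row in clustered_rows:
        # Cleaning: drop rows without a group or an amount.
        if row["group"] is None or row["amount"] is None:
            continue
        amount = round(float(row["amount"]), 2)  # normalization to a number
        # Combination: total the amounts per group.
        combined[row["group"]] = round(combined.get(row["group"], 0.0) + amount, 2)
    return combined


def run_pipeline(keys, source_db, matching_library, chunk_size=2, workers=2):
    mq = queue.Queue()
    acquire(keys, chunk_size, mq)
    subtasks = []
    while not mq.empty():
        subtasks.append(mq.get())
    # Worker threads query and serialize the subtasks concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        payloads = list(pool.map(lambda s: serialize(query(s, source_db)), subtasks))
    clustered = [row for p in payloads for row in cluster(p, matching_library)]
    return process(clustered)
```

Each function maps to one module of the apparatus, so any of them could be replaced, for example by a real message queue consumer, without changing the others.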
In some optional implementations of this embodiment, the serialization module 303 is further configured to:
converting the result set into a transmissible byte sequence as the serialized data according to the preset protocol;
storing the serialized data in a result database.
The data processing apparatus converts the unstructured query result into structured data so that the query result can be clustered by asynchronous threads. This enables preprocessing of a large amount of data, improves the quick-response capability of the system when the large amount of data generated by medical services is processed by multiple threads, and improves data processing efficiency for large data volumes.
In some optional implementations of this embodiment, the clustering module 304 further includes:
a reading unit 3041, configured to read the serialized data from the result database by using an asynchronous thread;
and the determining unit 3042 is configured to determine association data of the serialized data according to the matching library, and combine the serialized data with the association data to obtain the cluster data.
According to the data processing device, the serialization process and the clustering process of the data are decoupled through asynchronous processing, and the serialized result sets are clustered, so that the similarity of the data in the result sets is ensured, the quick response capability of a system during subsequent data processing is improved, and the data processing efficiency is improved.
In some optional implementations of this embodiment, the determining unit 3042 is further configured to:
comparing the serialized data with each piece of preset configuration information in the matching library, and determining target configuration information matched with the serialized data as the associated data;
and extracting the associated data from the matching library, and writing the associated data into the serialized data to obtain the clustered data.
According to the data processing device, the serialized result set is clustered through the means of matching library inquiry, so that the similarity of data in the result set is ensured, the accuracy of clustered data is improved, and the efficiency of subsequent data processing is improved.
In some optional implementations of this embodiment, the acquiring module 301 is further configured to:
acquiring data to be processed of a current batch in a batch processing task pool;
dividing the data to be processed into different main tasks according to different task dimensions, and splitting each main task to obtain each sub task;
and adding each subtask into the RocketMQ queue, and constructing a task relation table corresponding to the current batch according to each main task and each subtask.
According to the data processing device, temporary caching and batch processing can be carried out on the data to be processed through the batch processing task pool, interference caused by overlarge data size is avoided, the data to be processed is split into the minimum units for processing, the system is ensured to respond in time, and the accuracy of subsequent data processing is improved.
In some optional implementations of this embodiment, the data processing apparatus 300 is further configured to:
obtaining batch information corresponding to the target data, and determining a target task relation table according to the batch information;
performing deserialization on the target data to obtain detail data corresponding to the target data;
and executing a preset processing flow aiming at the detail data to obtain a processing result, and writing the processing result into the target task relation table.
The data processing apparatus processes the target data obtained after preprocessing and writes the processing result back to the previously generated task relation table, so that the data to be processed is summarized and analyzed and data processing efficiency is improved.
In some optional implementations of this embodiment, the data processing apparatus 300 is further configured to:
collecting original data from a plurality of data sources at regular time;
and adding the original data into a batch processing task pool and taking the original data as the data to be processed.
According to the data processing device, the data is collected at regular time and added into the batch processing task pool as data to be processed for asynchronous processing, and the data collecting process and the data processing process are decoupled, so that the quick response capability of the system is improved, and the data processing efficiency is improved.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 5, fig. 5 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 5 comprises a memory 51, a processor 52, and a network interface 53, which are communicatively connected to each other via a system bus. It should be noted that only the computer device 5 with components 51-53 is shown in the figure, but it should be understood that not all of the illustrated components are required and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculations and/or information processing in accordance with predetermined or stored instructions, the hardware of which includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 51 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 51 may be an internal storage unit of the computer device 5, such as a hard disk or an internal memory of the computer device 5. In other embodiments, the memory 51 may also be an external storage device of the computer device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 5. Of course, the memory 51 may also comprise both an internal storage unit of the computer device 5 and an external storage device. In this embodiment, the memory 51 is typically used to store an operating system and various application software installed on the computer device 5, such as the computer readable instructions of the data processing method. Further, the memory 51 may be used to temporarily store various types of data that have been output or are to be output.
The processor 52 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 52 is typically used to control the overall operation of the computer device 5. In this embodiment, the processor 52 is configured to execute computer readable instructions stored in the memory 51 or process data, such as computer readable instructions for executing the data processing method.
The network interface 53 may comprise a wireless network interface or a wired network interface, which network interface 53 is typically used to establish communication connections between the computer device 5 and other electronic devices.
The computer device distributes the data to be processed as subtasks to be executed through message queue middleware, and serializes and clusters the query results by means of asynchronous threads. This enables preprocessing of a large amount of data and makes subsequent operations such as summarization and analysis more convenient to perform on the data obtained from this flow. It improves the quick-response capability of the system when the large amount of data generated by medical services is processed by multiple threads, and improves data processing efficiency for large data volumes.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the data processing method as described above.
With the computer-readable storage medium, the data to be processed is distributed as subtasks to be executed through message queue middleware, and the query results are serialized and clustered by means of asynchronous threads. This enables preprocessing of a large amount of data and makes subsequent operations such as summarization and analysis more convenient to perform on the data obtained from this flow. It improves the quick-response capability of the system when the large amount of data generated by medical services is processed by multiple threads, and improves data processing efficiency for large data volumes.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
It is apparent that the embodiments described above are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the present application but do not limit its patent scope. This application may be embodied in many different forms; the embodiments are provided so that the present disclosure will be understood more thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their technical features. All equivalent structures made using the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of the present application.
Claims (10)
1. A method of data processing comprising the steps of:
the method comprises the steps of obtaining data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting strategy, and adding each subtask into a RocketMQ queue;
an asynchronous thread is adopted, a subtask to be executed is obtained from the RocketMQ queue, and the subtask to be executed is subjected to result query through an original database, so that a result set is obtained;
serializing the result set according to a preset protocol to obtain serialized data and storing the serialized data;
clustering the stored serialized data according to a pre-configured matching library to obtain clustered data;
and carrying out data standardization, data combination and data cleaning on the clustered data to obtain target data.
2. The data processing method according to claim 1, wherein the step of serializing the result set according to a preset protocol to obtain serialized data and storing the serialized data specifically comprises:
converting the result set into a transmissible byte sequence as the serialized data according to the preset protocol;
storing the serialized data in a result database.
3. The data processing method according to claim 2, wherein the step of clustering the stored serialized data according to a pre-configured matching library to obtain clustered data specifically includes:
reading the serialized data from the result database by adopting an asynchronous thread;
and determining associated data of the serialized data according to the matching library, and combining the serialized data with the associated data to obtain the clustered data.
4. A data processing method according to claim 3, wherein the step of determining associated data of the serialized data according to the matching library and combining the serialized data with the associated data to obtain the clustered data specifically comprises:
comparing the serialized data with each piece of preset configuration information in the matching library, and determining target configuration information matched with the serialized data as the associated data;
and extracting the associated data from the matching library, and writing the associated data into the serialized data to obtain the clustered data.
5. The data processing method according to claim 1, wherein the step of obtaining the data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting policy, and adding each subtask into a RocketMQ queue specifically includes:
acquiring data to be processed of a current batch in a batch processing task pool;
dividing the data to be processed into different main tasks according to different task dimensions, and splitting each main task to obtain each sub task;
and adding each subtask into the RocketMQ queue, and constructing a task relation table corresponding to the current batch according to each main task and each subtask.
6. The data processing method according to claim 5, further comprising, after the step of performing data standardization, data combination and data cleaning on the clustered data to obtain target data:
obtaining batch information corresponding to the target data, and determining a target task relation table according to the batch information;
performing deserialization on the target data to obtain detail data corresponding to the target data;
and executing a preset processing flow aiming at the detail data to obtain a processing result, and writing the processing result into the target task relation table.
7. The data processing method according to any one of claims 1 to 6, further comprising, before the step of obtaining the data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting policy, and adding each of the subtasks to a RocketMQ queue:
collecting original data from a plurality of data sources at regular time;
and adding the original data into a batch processing task pool and taking the original data as the data to be processed.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring data to be processed, splitting the data to be processed into a plurality of subtasks according to a preset splitting strategy, and adding each subtask into a RocketMQ queue;
the query module is used for acquiring a sub-task to be executed from the RocketMQ queue by adopting an asynchronous thread, and querying the result of the sub-task to be executed through an original database to obtain a result set;
the serialization module is used for serializing the result set according to a preset protocol to obtain and store serialized data;
the clustering module is used for clustering the stored serialized data according to a pre-configured matching library to obtain clustered data;
and the processing module is used for carrying out data standardization, data combination and data cleaning on the clustered data to obtain target data.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the data processing method of any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon computer-readable instructions which, when executed by a processor, implement the steps of the data processing method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311193229.7A CN117272077A (en) | 2023-09-14 | 2023-09-14 | Data processing method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311193229.7A CN117272077A (en) | 2023-09-14 | 2023-09-14 | Data processing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117272077A true CN117272077A (en) | 2023-12-22 |
Family
ID=89211541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311193229.7A Pending CN117272077A (en) | 2023-09-14 | 2023-09-14 | Data processing method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117272077A (en) |
- 2023-09-14: CN application CN202311193229.7A, published as CN117272077A, status: active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112162965B (en) | Log data processing method, device, computer equipment and storage medium | |
CN113010542B (en) | Service data processing method, device, computer equipment and storage medium | |
CN113282611B (en) | Method, device, computer equipment and storage medium for synchronizing stream data | |
CN112860662B (en) | Automatic production data blood relationship establishment method, device, computer equipment and storage medium | |
CN114564294A (en) | Intelligent service arranging method and device, computer equipment and storage medium | |
CN113111078B (en) | Resource data processing method and device, computer equipment and storage medium | |
CN111198916A (en) | Data transmission method and device, electronic equipment and storage medium | |
CN117094729A (en) | Request processing method, device, computer equipment and storage medium | |
CN117271121A (en) | Task processing progress control method, device, equipment and storage medium thereof | |
CN117215867A (en) | Service monitoring method, device, computer equipment and storage medium | |
CN117251490A (en) | Data query method, device, computer equipment and storage medium | |
CN117271122A (en) | Task processing method, device, equipment and storage medium based on separation of CPU and GPU | |
CN117272077A (en) | Data processing method, device, computer equipment and storage medium | |
CN116450723A (en) | Data extraction method, device, computer equipment and storage medium | |
CN114238335A (en) | Buried point data generation method and related equipment thereof | |
CN113590372A (en) | Log-based link tracking method and device, computer equipment and storage medium | |
CN114663073B (en) | Abnormal node discovery method and related equipment thereof | |
CN113806372B (en) | New data information construction method, device, computer equipment and storage medium | |
CN117827988A (en) | Data warehouse optimization method, device, equipment and storage medium thereof | |
CN116680263A (en) | Data cleaning method, device, computer equipment and storage medium | |
CN117251468A (en) | Query processing method, device, computer equipment and storage medium | |
CN117390119A (en) | Task processing method, device, computer equipment and storage medium | |
CN116795882A (en) | Data acquisition method, device, computer equipment and storage medium | |
CN116842011A (en) | Blood relationship analysis method, device, computer equipment and storage medium | |
CN117273635A (en) | Business system task regulation and control method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||