
CN112597200A - Batch and streaming combined data processing method and device - Google Patents

Batch and streaming combined data processing method and device

Info

Publication number
CN112597200A
Authority
CN
China
Prior art keywords
data
batch
streaming
nodes
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011529842.8A
Other languages
Chinese (zh)
Other versions
CN112597200B (en)
Inventor
陈卓
孙启明
汪利鹏
李延明
李侃
郭显宽
胡鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Three Eye Spirit Information Technology Co ltd
Original Assignee
Nanjing Three Eye Spirit Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Three Eye Spirit Information Technology Co ltd
Priority to CN202011529842.8A
Publication of CN112597200A
Application granted
Publication of CN112597200B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

Embodiments of the application provide a batch and streaming combined data processing method and device. The method comprises: determining a corresponding calculation model according to the type and the number of data nodes on which calculation is to be performed, wherein the data node types comprise batch data nodes and streaming data nodes, a single-source calculation model being executed if there is a single data node and a multi-source calculation model being executed otherwise; and processing data according to the calculation model and a preset execution mode. The method and device can effectively combine batch data and streaming data and improve data processing efficiency.

Description

Batch and streaming combined data processing method and device
Technical Field
The present application relates to the field of data processing, and in particular, to a batch and streaming combined data processing method and apparatus.
Background
Since the arrival of the digital era, the value of data has been continuously explored. In particular, the emergence of big data technology has made data analysis a required course for further development in every field, and data analysis modeling has become a popular research direction.
In the field of data processing, batch and streaming are the most common forms of data, and over the years a set of targeted data computing tools has emerged: for batch data, from the SQL tools of traditional relational databases to batch calculation engines of big data platforms such as Hive and Impala; for streaming data, from message middleware such as Kafka and RabbitMQ to stream processing frameworks such as Storm and Flink. Each has its own characteristics, and together they provide powerful means of calculation for batch and streaming data analysis across many angles and scenarios.
The inventors have identified the following shortcomings in the prior art:
(1) Streaming data is poorly utilized in data modeling
Considering factors such as business needs and efficiency, most data held in stream storage in engineering practice has a single business attribute and simple content, and is mainly used in a single scenario to implement a specific, targeted function. Much of it is merely collected and written directly into storage, becoming batch data and losing its valuable timeliness. In addition, streaming data is less flexible to present and use than batch data and is harder to control, so it is applied less often in data modeling.
(2) Lack of integration of batch and streaming computing
The difference between batch and streaming data is significant: batch data is stored in bulk in various databases and file systems and is accessed and used as a whole, in batches; streaming data is mostly held in message middleware or even in memory, and is processed one item at a time or in small batches. The data formats of the two also differ greatly, so format conversion is usually required beforehand for them to be calculated together. Together with the differences in process control, this means that batch and streaming data computation lack a suitable model for integration.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a batch and streaming combined data processing method and device, which can effectively combine batch data and streaming data and improve data processing efficiency.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
in a first aspect, the present application provides a batch and streaming combined data processing method, including:
determining a corresponding calculation model according to the type of data nodes and the number of the data nodes needing to be calculated, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise executing a multi-source calculation model;
and processing data according to the calculation model and a preset execution mode.
Further, the performing data processing according to the calculation model and a preset execution mode includes:
if the calculation model is a single-source calculation model and the data node type is a batch data node, acquiring target data of the batch data node in batch at one time and processing data according to a preset execution mode;
and if the calculation model is a single-source calculation model and the data node type is a streaming data node, sequentially acquiring each piece of target data of the streaming data node or acquiring the target data of the streaming data node within a set time period, and performing data processing according to a preset execution mode.
Further, the performing data processing according to the calculation model and a preset execution mode includes:
if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation are a plurality of batch data nodes, performing data processing on target data of the batch data nodes through a preset calculation engine or a preset data index rule;
if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation comprise batch data nodes and streaming data nodes, data indexing is carried out on the batch data nodes in advance, target data matched with the batch data nodes in the streaming data nodes are determined according to a preset rule, and data processing is carried out;
and if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation are a plurality of streaming data nodes, performing data processing through a preset calculation engine or a preset data index rule according to target data of the streaming data nodes in a set time period and target data of the batch data nodes.
Further, the preset execution mode includes: at least one of a one-time execution mode, a continuous execution mode for setting a time period, and a timing update execution mode.
In a second aspect, the present application provides a batch and streaming combined data processing apparatus, comprising:
the computing node determining module is used for determining a corresponding computing model according to the type and the number of data nodes which need to perform computing, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, a single-source computing model is executed, otherwise, a multi-source computing model is executed;
and the target data calculation module is used for carrying out data processing according to the calculation model and a preset execution mode.
Further, the computing node determination module includes:
the batch data node single-source computing unit is used for acquiring target data of the batch data nodes in batch at one time and processing the data according to a preset execution mode if the computing model is a single-source computing model and the data node type is a batch data node;
and the streaming data node single-source computing unit is used for sequentially acquiring each piece of target data of the streaming data node or acquiring the target data of the streaming data node within a set time period if the computing model is a single-source computing model and the type of the data node is a streaming data node, and processing the data according to a preset execution mode.
Further, the computing node determination module includes:
the multi-batch data node multi-source calculation unit is used for performing data processing on target data of each batch of data nodes through a preset calculation engine or a preset data index rule if the calculation model is a multi-source calculation model and the types of the data nodes needing to perform calculation are multiple batch data nodes;
the combined computing unit of the batch data nodes and the streaming data nodes is used for pre-indexing the batch data nodes if the computing model is a multi-source computing model and the types of the data nodes needing to be computed comprise the batch data nodes and the streaming data nodes, determining target data matched with the batch data nodes in the streaming data nodes according to a preset rule and performing data processing;
and the multi-source computing unit of the multi-stream data nodes is used for processing data through a preset computing engine or a preset data index rule according to target data of the stream data nodes in a set time period and target data of the batch data nodes if the computing model is the multi-source computing model and the types of the data nodes needing to execute computing are multiple stream data nodes.
Further, the preset execution mode includes: at least one of a one-time execution mode, a continuous execution mode for setting a time period, and a timing update execution mode.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the batch and streaming combined data processing method.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the combined batch and streaming data processing method.
According to the above technical solution, the batch and streaming combined data processing method and device provided by the application determine a corresponding calculation model according to the type and the number of data nodes on which calculation is to be performed, wherein the data node types comprise batch data nodes and streaming data nodes; if there is a single data node, a single-source calculation model is executed, and otherwise a multi-source calculation model is executed. Data is then processed according to the calculation model and a preset execution mode. The method and device can effectively combine batch data and streaming data and improve data processing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating a batch-and-stream combined data processing method according to an embodiment of the present application;
FIG. 2 is a block diagram of a batch and streaming combined data processing apparatus according to an embodiment of the present application;
FIG. 3 is a block diagram of an embodiment of a batch and streaming data processing apparatus;
FIG. 4 is a single-source computation diagram of a batch data node in an embodiment of the present application;
FIG. 5 is a schematic diagram of single-source computation of a streaming data node in an embodiment of the present application;
FIG. 6 is a schematic diagram of multi-source computing of multi-batch data nodes in an embodiment of the present application;
FIG. 7 is a schematic diagram of combined calculation of a batch data node and a streaming data node in an embodiment of the present application;
FIG. 8 is a schematic diagram of multi-source computing of a multi-stream data node in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In view of the problems in the prior art, the application provides a batch and streaming combined data processing method and device, wherein a corresponding calculation model is determined according to a data node type and a data node number which need to be calculated, wherein the data node type comprises a batch data node and a streaming data node, if the data node number is single, a single-source calculation model is executed, otherwise, a multi-source calculation model is executed; processing data according to the calculation model and a preset execution mode; the method and the device can effectively combine the batch data and the streaming data, and improve the data processing efficiency.
In order to effectively combine batch data and streaming data and improve data processing efficiency, the present application provides an embodiment of a batch and streaming combined data processing method, and referring to fig. 1, the batch and streaming combined data processing method specifically includes the following contents:
step S101: determining a corresponding calculation model according to the type of data nodes and the number of the data nodes needing to be calculated, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise executing a multi-source calculation model;
step S102: and processing data according to the calculation model and a preset execution mode.
As can be seen from the above description, the batch and streaming combined data processing method provided in the embodiment of the present application determines a corresponding calculation model according to the type and the number of data nodes on which calculation is to be performed, where the data node types include batch data nodes and streaming data nodes; if there is a single data node, a single-source calculation model is executed, and otherwise a multi-source calculation model is executed. Data is then processed according to the calculation model and a preset execution mode. The method can effectively combine batch data and streaming data and improve data processing efficiency.
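The selection rule of step S101 can be sketched as follows. This is a minimal illustration only: the names `NodeType`, `Model`, and `choose_model` are hypothetical and do not appear in the application, and the sketch assumes the model is chosen purely from the number of data nodes, as the step describes.

```python
from enum import Enum


class NodeType(Enum):
    BATCH = "batch"
    STREAMING = "streaming"


class Model(Enum):
    SINGLE_SOURCE = "single-source"
    MULTI_SOURCE = "multi-source"


def choose_model(node_types):
    """Pick the calculation model from the data nodes that need calculation.

    node_types: list of NodeType, one entry per data node.
    """
    if not node_types:
        raise ValueError("at least one data node is required")
    # One data node -> single-source model; anything else -> multi-source model.
    if len(node_types) == 1:
        return Model.SINGLE_SOURCE
    return Model.MULTI_SOURCE
```

The node types themselves do not influence this first decision; they matter in the next step, where the chosen model is specialized for batch or streaming input.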
In an embodiment of the batch-and-stream combined data processing method according to the present application, the performing data processing according to the calculation model and a preset execution mode includes:
step S201: if the calculation model is a single-source calculation model and the data node type is a batch data node, acquiring target data of the batch data node in batch at one time and processing data according to a preset execution mode;
step S202: and if the calculation model is a single-source calculation model and the data node type is a streaming data node, sequentially acquiring each piece of target data of the streaming data node or acquiring the target data of the streaming data node within a set time period, and performing data processing according to a preset execution mode.
In an embodiment of the batch-and-stream combined data processing method according to the present application, the performing data processing according to the calculation model and a preset execution mode includes:
step S301: if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation are a plurality of batch data nodes, performing data processing on target data of the batch data nodes through a preset calculation engine or a preset data index rule;
step S302: if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation comprise batch data nodes and streaming data nodes, data indexing is carried out on the batch data nodes in advance, target data matched with the batch data nodes in the streaming data nodes are determined according to a preset rule, and data processing is carried out;
step S303: and if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation are a plurality of streaming data nodes, performing data processing through a preset calculation engine or a preset data index rule according to target data of the streaming data nodes in a set time period and target data of the batch data nodes.
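The three multi-source cases of steps S301 to S303 can be told apart from the node types alone. The sketch below assumes node types are passed as the plain strings "batch" and "streaming"; the function name and the returned labels are hypothetical and only illustrate the branching.

```python
def multi_source_strategy(node_types):
    """Map the node types of a multi-source model to steps S301-S303.

    node_types: list of "batch" / "streaming" strings, one per data node.
    """
    if len(node_types) < 2:
        raise ValueError("a multi-source model needs at least two data nodes")
    kinds = set(node_types)
    if not kinds <= {"batch", "streaming"}:
        raise ValueError(f"unknown node types: {sorted(kinds)}")
    if kinds == {"batch"}:
        return "batch_vs_batch"    # S301: engine or index rule over batch nodes
    if kinds == {"batch", "streaming"}:
        return "batch_vs_stream"   # S302: pre-index batch nodes, match stream data
    return "stream_vs_stream"      # S303: only streaming nodes
```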
In an embodiment of the batch-and-stream combined data processing method of the present application, the preset execution mode includes: at least one of a one-time execution mode, a continuous execution mode for setting a time period, and a timing update execution mode.
In order to effectively combine batch data and streaming data and improve data processing efficiency, the present application provides an embodiment of a batch-and-streaming combined data processing apparatus for implementing all or part of the contents of the batch-and-streaming combined data processing method, and referring to fig. 2, the batch-and-streaming combined data processing apparatus specifically includes the following contents:
the calculation node determination module 10 is configured to determine a corresponding calculation model according to a data node type and a data node number that require calculation, where the data node type includes batch data nodes and streaming data nodes, and if the data node number is single, execute a single-source calculation model, otherwise execute a multi-source calculation model;
and the target data calculation module 20 is configured to perform data processing according to the calculation model and a preset execution mode.
As can be seen from the above description, the batch and streaming combined data processing apparatus provided in the embodiment of the present application determines a corresponding calculation model according to the type and the number of data nodes on which calculation is to be performed, where the data node types include batch data nodes and streaming data nodes; if there is a single data node, a single-source calculation model is executed, and otherwise a multi-source calculation model is executed. Data is then processed according to the calculation model and a preset execution mode. The apparatus can effectively combine batch data and streaming data and improve data processing efficiency.
In an embodiment of the batch and streaming combined data processing apparatus of the present application, the computing node determining module 10 includes:
the batch data node single-source computing unit 11 is configured to obtain target data of the batch data nodes in batch at one time and perform data processing according to a preset execution mode if the computing model is a single-source computing model and the data node type is a batch data node;
and the streaming data node single-source computing unit 12 is configured to, if the computing model is a single-source computing model and the data node type is a streaming data node, sequentially obtain each piece of target data of the streaming data node or obtain target data of the streaming data node within a set time period, and perform data processing according to a preset execution mode.
In an embodiment of the batch and streaming combined data processing apparatus of the present application, the computing node determining module 10 includes:
the multi-batch data node multi-source calculation unit 13 is used for performing data processing on target data of each batch of data nodes through a preset calculation engine or a preset data index rule if the calculation model is a multi-source calculation model and the types of the data nodes needing to perform calculation are multiple batch data nodes;
the batch data node and stream data node combined computing unit 14 is configured to perform data indexing on the batch data nodes in advance if the computing model is a multi-source computing model and the types of the data nodes needing to perform computing include a batch data node and a stream data node, determine target data matched with the batch data node in the stream data node according to a preset rule, and perform data processing;
and the multi-stream data node multi-source computing unit 15 is configured to, if the computing model is a multi-source computing model and the types of the data nodes that need to perform computing are multiple streaming data nodes, perform data processing according to target data of the streaming data nodes in a set time period and target data of the batch data nodes by using a preset computing engine or a preset data index rule.
In an embodiment of the batch and streaming combined data processing apparatus of the present application, the preset execution mode includes: at least one of a one-time execution mode, a continuous execution mode for setting a time period, and a timing update execution mode.
To further illustrate the present solution, the present application further provides a specific application example of the method for implementing batch and streaming combined data processing by using the above batch and streaming combined data processing apparatus, which specifically includes the following contents:
Referring to FIG. 3, the element nodes and their calculation relationships in FIG. 3 form a directed acyclic graph (DAG). We refer to such a graph, together with the information it represents, as a calculation model. The model supports calculation relationships across multiple levels of nodes (with any number of nodes), and constructing the multi-level nodes is the same as constructing the calculation model. Once constructed, the model runs on its own, following the order and direction shown in the graph, so that data calculation proceeds step by step.
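Running a model in the order and direction of its graph amounts to evaluating the nodes in topological order. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); the node names, the `run_model` function, and the two toy operations are assumptions made for illustration, not part of the application.

```python
from graphlib import TopologicalSorter


def run_model(dependencies, operations, sources):
    """Evaluate a calculation model node by node.

    dependencies: computing node -> tuple of upstream node names
    operations:   computing node -> callable taking one argument per upstream node
    sources:      data node -> its data
    """
    results = dict(sources)
    # static_order() yields every node after all of its upstream nodes
    # and raises graphlib.CycleError if the graph is not acyclic.
    for node in TopologicalSorter(dependencies).static_order():
        if node in results:
            continue  # data nodes already hold their data
        inputs = [results[upstream] for upstream in dependencies[node]]
        results[node] = operations[node](*inputs)
    return results


# Hypothetical two-level model: clean one source, then collide it with another.
dependencies = {"cleaned": ("orders",), "collided": ("cleaned", "clicks")}
operations = {
    "cleaned": lambda rows: [r for r in rows if r is not None],
    "collided": lambda a, b: sorted(set(a) & set(b)),
}
sources = {"orders": [1, None, 2, 3], "clicks": [2, 3, 4]}
results = run_model(dependencies, operations, sources)
```

Here the connecting lines are implied by `dependencies`; everything needed to run a step is held with the node itself, which matches the role the description gives to nodes.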
The specific design scheme is as follows:
I. Element composition
The scheme divides the elements of the model into the following parts:
(1) Data node
Data nodes are the initial data sources in the model and are divided into two types:
Batch data node:
Represents batch data and its metadata. The node's data is updated in batches at irregular, business-dependent intervals. Its structure is as follows:
Data format: a two-dimensional table made up of rows (data) and columns (fields), a hierarchically structured entity, a list of custom structures, and the like.
Metadata information: data description and structure description (data structure description, field names, field meanings, field types, etc.).
Carrier: can be stored in a relational database, a distributed file system, etc.
Streaming data node:
Represents streaming data and its metadata. The node's data flows continuously and carries a time attribute. Its structure is as follows:
Data format: message ID + message body. The message ID is unique, and the message body contains all data values of the current message (row storage; various formats are possible, such as a Key/Value structure or a JSON structure, and this scheme takes the Key/Value structure as the default format).
Metadata information: data description and message body parsing structure (schema, including key names, data types, parsing method, etc.).
Carrier: various message middleware or memory.
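The two data node types can be pictured as simple records. The sketch below is one possible representation under stated assumptions: all class and attribute names are hypothetical, the message body uses the Key/Value default format, and `parse` stands in for the message body parsing structure by keeping only the keys named in the schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class BatchDataNode:
    description: str
    fields: Dict[str, str]          # field name -> field type
    carrier: str                    # e.g. "relational database"
    rows: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class StreamMessage:
    message_id: str                 # unique per message
    body: Dict[str, Any]            # Key/Value message body


@dataclass
class StreamingDataNode:
    description: str
    schema: Dict[str, str]          # key name -> data type
    carrier: str                    # e.g. "message middleware"

    def parse(self, message: StreamMessage) -> Dict[str, Any]:
        """Return the values of the schema keys found in the message body."""
        missing = [key for key in self.schema if key not in message.body]
        if missing:
            raise KeyError(f"message {message.message_id} lacks keys {missing}")
        return {key: message.body[key] for key in self.schema}


node = StreamingDataNode(
    description="vehicle pass events",
    schema={"plate": "str", "speed": "int"},
    carrier="message middleware",
)
parsed = node.parse(StreamMessage("m1", {"plate": "A1", "speed": 80, "lane": 2}))
```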
(2) Computing node
A computing node carries two meanings: first, the calculation operation performed on its source nodes (data nodes or other computing nodes); second, the data set produced by that calculation, which can be carried in either batch or streaming form.
The types of calculation supported include:
Data cleaning: cleaning the data using basic processing logic (e.g., merging, truncating).
Algorithm prediction: running a pre-trained algorithm model on the source node data to generate a prediction result.
Operator calculation: processing the source node data with an external operator (an encapsulated calculation process) to generate a processing result.
Data collision: matching the data of two or more nodes against each other according to a rule (e.g., intersection, union) to obtain the collision result.
Relationship mining: calculating the relationships among the data of two or more nodes to obtain relationship data.
Other extension types: any other calculation type that can be called independently, such as external interfaces and SDKs.
The node stores the basic information each calculation requires, such as the field matching relationships for data collision, the feature selection for algorithm prediction, and the source fields for operator calculation.
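As an illustration of the data collision type, the sketch below treats the collision rule as a set intersection or union over one matched field per node. That reading of the rule, the function name `collide`, and its parameters are assumptions; the field names passed in play the role of the field matching relationship stored in the computing node.

```python
def collide(left_rows, right_rows, left_field, right_field, rule="intersection"):
    """Collide two node data sets on a matched field and return the key values."""
    left_keys = {row[left_field] for row in left_rows}
    right_keys = {row[right_field] for row in right_rows}
    if rule == "intersection":
        return left_keys & right_keys   # values present in both nodes
    if rule == "union":
        return left_keys | right_keys   # values present in either node
    raise ValueError(f"unsupported collision rule: {rule}")


hotel_guests = [{"id_no": "A"}, {"id_no": "B"}]
train_riders = [{"passenger": "B"}, {"passenger": "C"}]
both = collide(hotel_guests, train_riders, "id_no", "passenger")
either = collide(hotel_guests, train_riders, "id_no", "passenger", rule="union")
```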
In this scheme, both data nodes and computing nodes can use different physical platforms to carry their data, provided the data can be accessed while the model is running.
(3) Connecting line
Connecting lines express the logic of the graph: they represent the node relationships of the data calculation and the direction in which the model runs, and can be used for interface display (all information needed to actually run the model is held in the nodes).
II. Calculation modes
The node calculation type is closely tied to the calculation mode. The scheme provides the following modes, each corresponding to how data is used under different scenarios and calculation types (the internal details of any specific calculation are not covered):
Single-source calculation:
Calculation on the data set of a single node, mostly of the data cleaning, algorithm prediction, and operator calculation types.
(1) Batch data calculation (Batch)
Referring to FIG. 4, the node data is acquired, transmitted, and fed into the calculation in one batch, and the calculation result is written to the carrier container as a whole.
(2) Streaming data calculation (Stream)
Referring to FIG. 5, depending on the real-time requirement, this can be divided into:
Real-time mode: each piece of data in the streaming carrier is processed individually, passing through acquisition, transmission, and calculation in sequence, and is written to the result carrier immediately. This mode places higher demands on data processing performance.
Quasi-real-time mode: when the data in the streaming carrier has accumulated to a certain amount, or a certain time has elapsed, that small batch of data is calculated together and written to the result carrier. This mode sacrifices some real-time behavior in exchange for calculation efficiency close to that of batch processing.
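The difference between the two streaming modes comes down to how many messages are grouped before a calculation runs. The generator below is a minimal sketch with hypothetical names: `max_size` and `max_wait_s` stand for the "certain amount" and "certain time" thresholds, and setting `max_size=1` degenerates to per-message (real-time) processing. As a simplification, elapsed time is only checked when a new message arrives, so an idle stream does not flush a partial batch until the next message or the end of the stream.

```python
import time


def micro_batches(stream, max_size, max_wait_s, clock=time.monotonic):
    """Group messages from an iterable into small batches.

    A batch is emitted once it holds max_size messages, or once max_wait_s
    seconds have passed since its first message (checked on message arrival).
    """
    batch = []
    started = None
    for message in stream:
        if not batch:
            started = clock()       # the batch window opens with its first message
        batch.append(message)
        if len(batch) >= max_size or clock() - started >= max_wait_s:
            yield batch
            batch = []
    if batch:
        yield batch                 # flush whatever is left when the stream ends


quasi_real_time = list(micro_batches(range(5), max_size=2, max_wait_s=3600))
real_time = list(micro_batches(range(3), max_size=1, max_wait_s=3600))
```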
Multi-source calculation:
Calculation across the data sets of multiple nodes, in which the data of several nodes must be called on at the same time; mostly of the relationship mining and data collision types.
(1) Batch-to-batch calculation (Batch VS Batch)
Referring to FIG. 6, calculation between batch nodes can be implemented in two ways:
Engine mode
The data is mapped using a calculation engine that suits its format (for example, an SQL engine for structured two-dimensional tables or a NoSQL engine for unstructured data), and the calculation among the batch data is converted into operations the engine supports to obtain the result.
Index mode
The batch data is indexed, with the indexes built on the fields/key values involved in the calculation. The calculation among the batch data is converted into matching operations between indexes, and the original data is then looked up from the matches to generate the next result set (calculation node).
Both approaches support calculating multiple nodes (two or more) at the same time, which reduces the number of calculation steps and the amount of intermediate temporary data. In this mode, the generated result data is of the batch type.
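As an illustration of the index mode, the following Python sketch builds one index per batch node, matches the indexes on their key values, and then looks up the original rows. It is only a sketch under stated assumptions: the names `build_index` and `index_join` are hypothetical, rows are plain dictionaries, and matching is by key equality, which is one possible rule among those the text leaves open.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

Row = Dict[str, object]


def build_index(rows: List[Row], key: str) -> Dict[Hashable, List[int]]:
    """Map each key value to the positions of the rows that carry it."""
    index: Dict[Hashable, List[int]] = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[key]].append(pos)
    return dict(index)


def index_join(batches: List[List[Row]], key: str) -> List[Row]:
    """Match two or more batch nodes on `key` through their indexes."""
    indexes = [build_index(rows, key) for rows in batches]
    # Index matching: keep only key values present in every batch.
    common = set(indexes[0])
    for idx in indexes[1:]:
        common &= set(idx)
    results: List[Row] = []
    for value in sorted(common, key=repr):
        # Look up the original rows and combine them per key value.
        combos: List[Row] = [{}]
        for rows, idx in zip(batches, indexes):
            combos = [{**acc, **rows[pos]} for acc in combos for pos in idx[value]]
        results.extend(combos)
    return results
```

Because all indexes are intersected in one pass, three or more batch nodes can be matched without materializing pairwise intermediate results, which is the benefit the text attributes to multi-node calculation.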
(2) Batch-to-stream calculation (Batch VS Stream)
Referring to fig. 7, to improve calculation efficiency, the batch nodes are indexed in advance, with the indexes built on the fields/key values involved in the calculation. Data is then read continuously from the data carrier of the streaming node; for each piece of data, the index rows that match it are found according to the specific rules set for the current calculation type, and the retained field data is then looked up from the original data and stored in the result node.
This mode supports calculating multiple batch nodes together with one streaming node: the batch nodes are first calculated among themselves to form one integrated batch node, which is then calculated against the streaming node. In this mode, the resulting data is of the streaming type.
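A minimal Python sketch of the Batch VS Stream case follows. The batch node is indexed once, and each stream record is then matched against the index as it arrives. The function name `stream_lookup` and the parameter `keep_fields` (standing in for the "retained fields") are hypothetical, and key equality is assumed as the matching rule.

```python
from typing import Dict, Hashable, Iterable, Iterator, List

Row = Dict[str, object]


def stream_lookup(
    stream: Iterable[Row],
    batch_rows: List[Row],
    key: str,
    keep_fields: List[str],
) -> Iterator[Row]:
    """Match each stream record against a pre-built index of a batch node."""
    # Build the index on the batch node once, before reading the stream.
    index: Dict[Hashable, List[Row]] = {}
    for row in batch_rows:
        index.setdefault(row[key], []).append(row)
    # Read the stream continuously; emit one result per matching batch row.
    for event in stream:
        for row in index.get(event[key], []):
            result = dict(event)
            result.update({field: row[field] for field in keep_fields})
            yield result
```

The function is a generator, so results are produced one at a time as stream records arrive, consistent with the streaming type of the result node.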
(3) Stream-to-stream calculation (Stream VS Stream)
Referring to fig. 8, because streaming data inherently carries a time attribute, a full calculation over the entire time range of the streaming data is not considered. Instead, from a specified point in time onward, the calculation can be converted into a batch-to-stream calculation (one of the streams is converted into continuously growing batch data).
The specific implementation process comprises the following steps:
One of the streams, DATA 1, is selected. Starting at the calculation start time T0, data is continuously read from it and written to a batch data carrier, forming an incrementally growing intermediate node DATA X. If efficiency is a concern, a timeout or quantity-threshold strategy may be adopted, so that the data is persisted in batches at the subsequent time points T1, T2, ..., Tn.
The other stream, DATA 2, is calculated against DATA X using the batch-to-stream strategy described above; the calculation objects are each piece of data in DATA 2 and all data/indexes of DATA X at the current time.
To control the storage space of the intermediate node, old data can be cleared under a timeout-eviction policy to release space (provided the remaining data still meets the timeliness requirement). In this mode, the resulting data is of the streaming type.
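The conversion of a Stream VS Stream calculation into a Batch VS Stream calculation can be sketched in Python as follows. This is an illustrative sketch only: the class `GrowingBatch` (playing the role of DATA X), the function `stream_join`, and the `ttl` parameter are hypothetical names, the two streams are assumed to arrive merged in time order as `(timestamp, source, row)` tuples, and matching is by key equality. Batched persistence at T1, T2, ..., Tn is not modeled.

```python
from collections import deque
from typing import Deque, Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


class GrowingBatch:
    """DATA X: stream 1 materialized as a growing, time-bounded batch."""

    def __init__(self, key: str, ttl: float) -> None:
        self.key = key
        self.ttl = ttl
        self._rows: Deque[Tuple[float, Row]] = deque()  # arrival order
        self._index: Dict[Hashable, List[Row]] = {}

    def add(self, ts: float, row: Row) -> None:
        self._rows.append((ts, row))
        self._index.setdefault(row[self.key], []).append(row)

    def evict(self, now: float) -> None:
        """Timeout eviction: drop rows older than `ttl` to free space."""
        while self._rows and now - self._rows[0][0] > self.ttl:
            _, row = self._rows.popleft()
            bucket = self._index[row[self.key]]
            bucket.pop(0)  # rows per key are stored oldest first
            if not bucket:
                del self._index[row[self.key]]

    def match(self, value: Hashable) -> List[Row]:
        return list(self._index.get(value, []))


def stream_join(
    merged: Iterable[Tuple[float, int, Row]], key: str, ttl: float
) -> Iterator[Row]:
    """Join stream 2 against the growing batch built from stream 1.

    `merged` yields (timestamp, source, row) in time order, where
    source 1 is DATA 1 and source 2 is DATA 2.
    """
    data_x = GrowingBatch(key, ttl)
    for ts, source, row in merged:
        data_x.evict(ts)
        if source == 1:
            data_x.add(ts, row)
        else:
            for left in data_x.match(row[key]):
                yield {**left, **row}
```

With `ttl=5`, a DATA 2 record arriving ten time units after its DATA 1 counterpart finds no match, which shows how eviction bounds both the storage used by the intermediate node and the time range of the calculation.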
Third, execution mode
After the model is built, a corresponding execution mode is needed to start the model and begin data calculation. The scheme provides the following execution modes:
(1) One-time mode
This mode applies only to models that contain nothing but batch data. The model is executed once as a whole from the initial node, and the data of each calculation node is the final result.
(2) Continuous mode
This mode requires the model to contain streaming data. The model is executed continuously from the initial node with no end condition, as follows:
Batch data calculation nodes: calculated only once, and the result is final.
Streaming data calculation nodes: calculated continuously, with the calculation results updated into the result set.
(3) Timed update mode
This mode places no requirement on the data type. Like the continuous mode, the model is executed continuously once started, with no end condition, but a data update interval is added, as follows:
Batch data calculation nodes: executed once immediately after the model starts, producing the current result, and re-executed each time the data update interval elapses; new results (those produced by data source updates between two executions) are updated into the current result set.
Streaming data calculation nodes: calculated continuously; if the batch data they depend on changes, the latest updated data set is used for the calculation.
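The constraints and batch-node behavior of the three execution modes can be summarized in a short Python sketch. The names `ExecutionMode`, `validate_mode`, and `batch_runs_due` are hypothetical, and time is modeled as plain numbers; the sketch captures only the rules stated above and is not a scheduler.

```python
from enum import Enum


class ExecutionMode(Enum):
    ONE_TIME = "one-time"
    CONTINUOUS = "continuous"
    TIMED_UPDATE = "timed-update"


def validate_mode(mode: ExecutionMode, has_stream: bool) -> None:
    """Reject mode/model combinations that the text rules out."""
    if mode is ExecutionMode.ONE_TIME and has_stream:
        raise ValueError("one-time mode applies only to batch-only models")
    if mode is ExecutionMode.CONTINUOUS and not has_stream:
        raise ValueError("continuous mode requires streaming data")
    # Timed update mode places no requirement on the data type.


def batch_runs_due(
    mode: ExecutionMode, started_at: float, now: float, interval: float
) -> int:
    """Number of times a batch node should have executed by `now`."""
    if now < started_at:
        return 0
    if mode is ExecutionMode.TIMED_UPDATE:
        # One run at start, plus one per elapsed update interval.
        return 1 + int((now - started_at) // interval)
    # One-time and continuous modes calculate batch nodes exactly once.
    return 1
```

For example, with a start time of 0 and an update interval of 10, a batch node in timed update mode should have executed four times by time 35, while the same node in continuous mode executes once.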
As can be seen from the above, the present application can achieve at least the following technical effects:
1. The model construction method provided supports introducing and using both batch data and streaming data, so that streaming data can be applied more widely in data modeling and its unique timeliness advantage can be exploited.
2. The method implements several fusion calculation methods for batch data and streaming data, so that the two can be combined to handle increasingly complex practical problems, broadening the application fields of each and improving the practical effectiveness of both.
In order to effectively combine batch data and streaming data and improve data processing efficiency from a hardware level, the present application provides an embodiment of an electronic device for implementing all or part of contents in the batch and streaming combined data processing method, where the electronic device specifically includes the following contents:
a processor, a memory, a communications interface, and a bus; the processor, the memory, and the communications interface communicate with one another through the bus; the communications interface is used for information transmission between the batch and stream combined data processing device and related equipment such as a core service system, a user terminal, and a related database; the logic controller may be a desktop computer, a tablet computer, a mobile terminal, or the like, but the embodiment is not limited thereto. In this embodiment, the logic controller may be implemented with reference to the embodiments of the batch and stream combined data processing method and of the batch and stream combined data processing apparatus, the contents of which are incorporated herein; repeated details are omitted.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. The smart wearable device may include smart glasses, a smart watch, a smart bracelet, and the like.
In practical applications, part of the batch and streaming combined data processing method may be executed on the electronic device side as described above, or all operations may be completed in the client device. The selection may be specifically performed according to the processing capability of the client device, the limitation of the user usage scenario, and the like. This is not a limitation of the present application. The client device may further include a processor if all operations are performed in the client device.
The client device may have a communication module (i.e., a communication unit), and may be communicatively connected to a remote server to implement data transmission with the server. The server may include a server on the task scheduling center side, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 9 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 9, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Note that fig. 9 is exemplary; other types of structures may also be used in addition to or in place of this structure to implement telecommunications or other functions.
In one embodiment, the functions of the batch and streaming combined data processing method may be integrated into the central processor 9100. The central processor 9100 may be configured to perform the following control:
step S101: determining a corresponding calculation model according to the type of data nodes and the number of the data nodes needing to be calculated, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise executing a multi-source calculation model;
step S102: and processing data according to the calculation model and a preset execution mode.
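Step S101 amounts to a dispatch on the number and types of the participating data nodes. A minimal Python sketch of that selection is given below; the names `NodeType` and `select_model` and the returned labels are hypothetical, chosen only to mirror the calculation modes described earlier.

```python
from enum import Enum
from typing import List


class NodeType(Enum):
    BATCH = "batch"
    STREAM = "stream"


def select_model(node_types: List[NodeType]) -> str:
    """Pick a calculation model from the node count and node types (S101)."""
    if not node_types:
        raise ValueError("at least one data node is required")
    # A single node selects the single-source model.
    if len(node_types) == 1:
        if node_types[0] is NodeType.BATCH:
            return "single-source/batch"
        return "single-source/stream"
    # Two or more nodes select the multi-source model.
    kinds = set(node_types)
    if kinds == {NodeType.BATCH}:
        return "multi-source/batch-vs-batch"
    if kinds == {NodeType.STREAM}:
        return "multi-source/stream-vs-stream"
    # Mixed case; the description covers several batch nodes plus one stream.
    return "multi-source/batch-vs-stream"
```

Step S102 would then run the selected model under the preset execution mode.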
As can be seen from the above description, in the electronic device provided in the embodiment of the present application, the corresponding calculation model is determined according to the data node type and the number of data nodes that need to perform calculation, where the data node type includes batch data nodes and streaming data nodes, if the number of the data nodes is single, a single-source calculation model is performed, and otherwise, a multi-source calculation model is performed; processing data according to the calculation model and a preset execution mode; the method and the device can effectively combine the batch data and the streaming data, and improve the data processing efficiency.
In another embodiment, the batch and streaming combined data processing apparatus may be configured separately from the central processor 9100, for example, the batch and streaming combined data processing apparatus may be configured as a chip connected to the central processor 9100, and the functions of the batch and streaming combined data processing method may be realized by the control of the central processor.
As shown in fig. 9, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. Note that the electronic device 9600 does not necessarily include all of the components shown in fig. 9; in addition, the electronic device 9600 may include components not shown in fig. 9, for which reference may be made to the prior art.
As shown in fig. 9, a central processor 9100, sometimes referred to as a controller or operational control, can include a microprocessor or other processor device and/or logic device, which central processor 9100 receives input and controls the operation of the various components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, removable media, a volatile memory, a non-volatile memory, or another suitable device. It may store the failure-related information, and may also store a program for processing that information. The central processor 9100 can execute the program stored in the memory 9140 to implement information storage, processing, and the like.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid state memory, e.g., Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. It may also be a memory that holds information even when power is off, can be selectively erased, and can be provided with more data; an example of such a memory is sometimes called an EPROM or the like. The memory 9140 could also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application/function storage portion 9142, which is used for storing application programs and function programs, or for storing the flow by which the central processor 9100 executes operations of the electronic device 9600.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps in the batch-and-stream combined data processing method in which the execution subject is the server or the client in the above embodiment, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements all the steps in the batch-and-stream combined data processing method in which the execution subject is the server or the client, for example, when the processor executes the computer program, the processor implements the following steps:
step S101: determining a corresponding calculation model according to the type of data nodes and the number of the data nodes needing to be calculated, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise executing a multi-source calculation model;
step S102: and processing data according to the calculation model and a preset execution mode.
As can be seen from the above description, in the computer-readable storage medium provided in the embodiment of the present application, a corresponding computation model is determined according to a data node type and a data node number that need to perform computation, where the data node type includes batch data nodes and streaming data nodes, and if the data node number is single, a single-source computation model is executed, otherwise, a multi-source computation model is executed; processing data according to the calculation model and a preset execution mode; the method and the device can effectively combine the batch data and the streaming data, and improve the data processing efficiency.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of batch and streaming combined data processing, the method comprising:
determining a corresponding calculation model according to the type of data nodes and the number of the data nodes needing to be calculated, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise executing a multi-source calculation model;
and processing data according to the calculation model and a preset execution mode.
2. The batch and streaming combined data processing method according to claim 1, wherein the performing data processing according to the calculation model and a preset execution mode comprises:
if the calculation model is a single-source calculation model and the data node type is a batch data node, acquiring target data of the batch data node in batch at one time and processing data according to a preset execution mode;
and if the calculation model is a multi-source calculation model and the data node type is a streaming data node, sequentially acquiring each piece of target data of the streaming data node or acquiring the target data of the streaming data node within a set time period, and performing data processing according to a preset execution mode.
3. The batch and streaming combined data processing method according to claim 1, wherein the performing data processing according to the calculation model and a preset execution mode comprises:
if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation are a plurality of batch data nodes, performing data processing on target data of the batch data nodes through a preset calculation engine or a preset data index rule;
if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation comprise batch data nodes and streaming data nodes, data indexing is carried out on the batch data nodes in advance, target data matched with the batch data nodes in the streaming data nodes are determined according to a preset rule, and data processing is carried out;
and if the calculation model is a multi-source calculation model and the types of the data nodes needing to execute calculation are a plurality of streaming data nodes, performing data processing through a preset calculation engine or a preset data index rule according to target data of the streaming data nodes in a set time period and target data of the batch data nodes.
4. The batch and streaming combined data processing method of claim 1, wherein the preset execution mode comprises: at least one of a one-time execution mode, a continuous execution mode for setting a time period, and a timing update execution mode.
5. A batch and streaming combined data processing apparatus, comprising:
the computing node determining module is used for determining a corresponding computing model according to the type and the number of data nodes which need to perform computing, wherein the type of the data nodes comprises batch data nodes and streaming data nodes, if the number of the data nodes is single, a single-source computing model is executed, otherwise, a multi-source computing model is executed;
and the target data calculation module is used for carrying out data processing according to the calculation model and a preset execution mode.
6. The combined batch and streaming data processing apparatus of claim 5, wherein the compute node determination module comprises:
the batch data node single-source computing unit is used for acquiring target data of the batch data nodes in batch at one time and processing the data according to a preset execution mode if the computing model is a single-source computing model and the data node type is a batch data node;
and the streaming data node single-source computing unit is used for sequentially acquiring each piece of target data of the streaming data node or acquiring the target data of the streaming data node within a set time period if the computing model is a single-source computing model and the type of the data node is a streaming data node, and processing the data according to a preset execution mode.
7. The combined batch and streaming data processing apparatus of claim 5, wherein the compute node determination module comprises:
the multi-batch data node multi-source calculation unit is used for performing data processing on target data of each batch of data nodes through a preset calculation engine or a preset data index rule if the calculation model is a multi-source calculation model and the types of the data nodes needing to perform calculation are multiple batch data nodes;
the combined computing unit of the batch data nodes and the streaming data nodes is used for pre-indexing the batch data nodes if the computing model is a multi-source computing model and the types of the data nodes needing to be computed comprise the batch data nodes and the streaming data nodes, determining target data matched with the batch data nodes in the streaming data nodes according to a preset rule and performing data processing;
and the multi-source computing unit of the multi-stream data nodes is used for processing data through a preset computing engine or a preset data index rule according to target data of the stream data nodes in a set time period and target data of the batch data nodes if the computing model is the multi-source computing model and the types of the data nodes needing to execute computing are multiple stream data nodes.
8. The batch and streaming combined data processing device of claim 5, wherein the preset execution mode comprises: at least one of a one-time execution mode, a continuous execution mode for setting a time period, and a timing update execution mode.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the combined batch and streaming data processing method of any of claims 1 to 4 are implemented by the processor when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the combined batch and streaming data processing method of any one of claims 1 to 4.
CN202011529842.8A 2020-12-22 2020-12-22 Batch and stream combined data processing method and device Active CN112597200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011529842.8A CN112597200B (en) 2020-12-22 2020-12-22 Batch and stream combined data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011529842.8A CN112597200B (en) 2020-12-22 2020-12-22 Batch and stream combined data processing method and device

Publications (2)

Publication Number Publication Date
CN112597200A true CN112597200A (en) 2021-04-02
CN112597200B CN112597200B (en) 2024-01-12

Family

ID=75200746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011529842.8A Active CN112597200B (en) 2020-12-22 2020-12-22 Batch and stream combined data processing method and device

Country Status (1)

Country Link
CN (1) CN112597200B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599524A (en) * 2022-10-27 2023-01-13 中国兵器工业计算机应用技术研究所(Cn) Data lake system based on cooperative scheduling processing of streaming data and batch data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761309A (en) * 2014-01-23 2014-04-30 中国移动(深圳)有限公司 Operation data processing method and system
US20140280766A1 (en) * 2013-03-12 2014-09-18 Yahoo! Inc. Method and System for Event State Management in Stream Processing
CN105677752A (en) * 2015-12-30 2016-06-15 深圳先进技术研究院 Streaming computing and batch computing combined processing system and method
CN105701161A (en) * 2015-12-31 2016-06-22 深圳先进技术研究院 Real-time big data user label system
CN106873945A (en) * 2016-12-29 2017-06-20 中山大学 Data processing architecture and data processing method based on batch processing and Stream Processing
CN107330238A (en) * 2016-08-12 2017-11-07 中国科学院上海技术物理研究所 Medical information collection, processing, storage and display methods and device
CN109889575A (en) * 2019-01-15 2019-06-14 北京航空航天大学 Cooperated computing plateform system and method under a kind of peripheral surroundings

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINYAN WANG 等: "Two privacy-preserving approaches for publishing transaction data systems", 《IEEE ACCESS》, vol. 6, pages 1 - 2 *
张瞩熹 等: "多源异构航班航迹数据流实时融合方法研究", 《物联网学报》, vol. 4, no. 3, pages 60 - 68 *

Also Published As

Publication number Publication date
CN112597200B (en) 2024-01-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant