
CN113268505B - Offline batch processing method and system for multi-source multi-mode ocean big data - Google Patents


Info

Publication number
CN113268505B
CN113268505B (application CN202110476164.1A)
Authority
CN
China
Prior art keywords
node
data
nodes
batch
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110476164.1A
Other languages
Chinese (zh)
Other versions
CN113268505A (en)
Inventor
李昭
沈金伟
彭小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ocean University filed Critical Guangdong Ocean University
Priority to CN202110476164.1A priority Critical patent/CN113268505B/en
Publication of CN113268505A publication Critical patent/CN113268505A/en
Application granted granted Critical
Publication of CN113268505B publication Critical patent/CN113268505B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/524 Deadlock detection or avoidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an offline batch processing method and system for multi-source multi-modal ocean big data, comprising the steps of: collecting stream data; performing data normalization on the stream data; partitioning the stream data for processing; constructing a scheduling assignment model; and inputting the stream data into computing nodes and performing task scheduling on the computing nodes through the scheduling assignment model. The advantages are that under severely and repeatedly skewed data, faulty nodes can be quickly detected and isolated, new nodes are dynamically scheduled to take over the faulty nodes' computing tasks, and processing time is shortened; each computing node is scheduled intelligently according to its trend time, which avoids the deadlock that can arise when nodes that are frequently revived/killed are repeatedly re-invoked after revival.

Description

Offline batch processing method and system for multi-source multi-mode ocean big data
Technical Field
The disclosure belongs to the fields of marine big data processing, batch data processing and data transmission, and particularly relates to an offline batch processing method and system for multi-source multi-modal marine big data.
Background
Ocean big data is collected from sensors such as Argo floats, buoys and surveying equipment, and covers seabed topography data, ocean remote-sensing data, ship survey data and buoy data; it keeps growing as ocean monitoring equipment develops. Because these data are collected from different sources and have different data structures (multi-source heterogeneous data gathered by different acquisition terminals), current big data processing methods usually require offline batch processing of non-real-time business data when massive data are stored as data sources.
Batch systems that process such business data, also commonly called offline systems, take a large amount of input data, run a job to process it, and produce some output data. A job usually takes a long time, and batch jobs are typically run periodically. Offline batch processing of big data has low requirements on processing delay, but the amount of data processed is large and occupies substantial computing and storage resources; it is generally implemented with the Spark or Hadoop framework. For massive data, the Spark or Hadoop frameworks are often employed to provide bandwidth, memory, storage and other resources when fast response is not required (e.g., minute-scale or hour-scale delay). However, since marine big data is usually massive multi-modal big data, it is difficult to obtain good results in environments that demand fast and timely processing. The Spark and Hadoop frameworks adopt MapReduce, where a MapReduce job is the unit of work a client needs to execute and comprises input data, a MapReduce program and configuration information.
Hadoop divides a job into several tasks for execution, each divided into Map task nodes and Reduce task nodes. In a MapReduce computing cluster, node hardware errors (host, disk, memory, etc.) and software errors are the norm. The conventional scheduling method for MapReduce computing nodes (cluster nodes) is as follows: MapReduce achieves reliability by distributing large-scale operations on a data set to the nodes of the network; each node periodically reports its completed work and latest state, and if a node stays silent for longer than a preset time interval, the master node records that node as dead and sends the data assigned to it to other nodes. Although MapReduce can thus detect and isolate a faulty node and schedule a new node to take over its computing tasks, under severely skewed data the current node scheduling method easily leads to overlong processing times, and nodes that are frequently revived/killed may be repeatedly re-invoked after revival, causing deadlock.
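The heartbeat-based failure handling described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual implementation; all names and the timeout value are hypothetical:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is declared dead (assumed value)

class MasterNode:
    """Tracks worker heartbeats and reassigns tasks away from silent workers."""

    def __init__(self):
        self.last_seen = {}  # worker id -> timestamp of last heartbeat
        self.tasks = {}      # worker id -> list of assigned task ids

    def heartbeat(self, worker_id, now=None):
        # A worker periodically reports in; record when we last heard from it.
        self.last_seen[worker_id] = time.time() if now is None else now

    def assign(self, worker_id, task_id):
        self.tasks.setdefault(worker_id, []).append(task_id)
        self.last_seen.setdefault(worker_id, time.time())

    def reap_dead(self, now=None):
        """Mark workers silent past the timeout as dead and hand their
        orphaned tasks to the live worker with the fewest tasks."""
        now = time.time() if now is None else now
        dead = [w for w, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
        for w in dead:
            orphaned = self.tasks.pop(w, [])
            del self.last_seen[w]
            if self.last_seen:  # any live workers left to take over?
                target = min(self.last_seen, key=lambda x: len(self.tasks.get(x, [])))
                self.tasks.setdefault(target, []).extend(orphaned)
        return dead
```

As the Background notes, this simple takeover rule is exactly what can loop under repeated data skew; the patent's scheduling model below adds workload-aware scoring on top of it.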
Disclosure of Invention
The invention aims to provide an offline batch processing method and system for multi-source multi-modal marine big data, to solve one or more technical problems in the prior art and at least provide a beneficial alternative or create favorable conditions.
In order to achieve the above object, according to an aspect of the present disclosure, there is provided an offline batch processing method of multi-source multi-modal marine big data, the method comprising the steps of:
S100, collecting stream data;
Further, the method for collecting the stream data is as follows: a data sequence of physical-quantity data acquired by sensor devices such as Argo floats and surveying instruments, measuring any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of moving objects, is used as the stream data.
S200, performing data normalization on the stream data, where the data normalization comprises any one or more of time formatting, field completion, data cleaning, data integration and data reduction;
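Step S200's normalization can be illustrated with a minimal sketch. The record shape and field names are hypothetical, not taken from the patent:

```python
from datetime import datetime, timezone

def normalize_record(rec, required_fields=("sensor_id", "quantity", "value", "ts")):
    """Time-format, field-complete and clean one raw stream record (illustrative only)."""
    out = dict(rec)
    # Field completion: make sure every expected field exists
    for f in required_fields:
        out.setdefault(f, None)
    # Time formatting: accept epoch seconds, emit ISO-8601 UTC
    ts = out["ts"]
    if isinstance(ts, (int, float)):
        out["ts"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    # Data cleaning: drop records that carry no measured value
    return out if out["value"] is not None else None
```

Data integration and data reduction would be further passes over the normalized records; they are omitted here for brevity.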
S300, partitioning the stream data for processing by the MapReduce method;
Further, in S300, the stream data is partitioned for processing by the MapReduce method as follows: the key-value pair of the stream data is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of a moving object; the stream data is divided into several data streams by the MapReduce algorithm.
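The keyed partitioning in S300 can be sketched as a hash-partition over the <sensor number, physical quantity, acquisition time> key, the way a MapReduce partitioner would split records into streams. A minimal sketch with hypothetical field names:

```python
def partition_stream(records, n_streams):
    """Split normalized records into n_streams data streams by hashing the
    (sensor number, physical quantity) part of the key, so all records of
    one sensor/quantity land in the same stream."""
    streams = [[] for _ in range(n_streams)]
    for rec in records:
        # <sensor number, physical quantity, acquisition time>
        key = (rec["sensor_id"], rec["quantity"], rec["ts"])
        idx = hash(key[:2]) % n_streams
        streams[idx].append((key, rec["value"]))
    return streams
```

Hashing only the first two key components is a design choice here: it keeps one sensor's series together while still spreading sensors across streams.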
S400, constructing a scheduling distribution model;
further, in S400, the method for constructing the scheduling assignment model includes the following steps:
S401, MapReduce computing nodes (comprising Map nodes or Reduce nodes, established on a built Map/Reduce framework and the Hadoop distributed file system) are abbreviated as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of computing nodes. Each Node_i in Node has a corresponding set of partitioned stream-data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M] and M is the number of stream-data tasks to be batched; Bath_i,j is the j-th partitioned stream-data processing task to be batch processed of the i-th Node_i, with M > N;
The corresponding partitioned stream-data processing tasks to be batch processed include, but are not limited to, any one or more of data compression, clustering, sampling, dimensionality reduction and data transformation;
S402, sequentially input the 1st to N-th partitioned stream-data tasks to be batched, Bath_i,j, of each Bath_i correspondingly into each Node_i for batch processing (i.e., for the first round, input Bath_i,1 to Bath_i,N into the nodes Node_i in turn); the node among the nodes that finishes its batch processing task first is recorded as the reference node, i.e., the Node_i that first completes a partitioned stream-data task Bath_i,j;
S403, calculating the reference processing amount R of the reference node,
Figure BDA0003047444590000021
or
Figure BDA0003047444590000022
Wherein, K1i1The average number of threads or processes that batch process the task for the ith 1 time of reference node; k2i1The number of tasks for which the reference node performed batch processing tasks at the i1 th time; p is the total times of the batch processing tasks of the reference nodes;
S404, let Cu_i denote Node_i's current total batch workload, i.e., Cu_i is the average number of threads or processes with which Node_i is currently performing batch tasks, or the number of tasks Node_i is currently batch processing; from the reference node's reference throughput R and Cu_i, compute Ab(Cu_i), the capability value of Node_i when its workload is Cu_i, where Ab(Cu_i) = exp(−(Cu_i ÷ R − 1)^2);
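Steps S403 and S404 can be written out directly. A minimal sketch under the stated definitions: R is the mean per-batch workload of the reference node, and Ab peaks at 1 exactly when a node's current workload equals R:

```python
import math

def reference_throughput(workloads):
    """R: mean workload (thread/process count or task count) over the
    reference node's P completed batch runs, per S403."""
    return sum(workloads) / len(workloads)

def ability(cu, r):
    """Ab(Cu_i) = exp(-(Cu_i / R - 1)^2), per S404.
    Equals 1 when cu == r and decays as the workload deviates from R."""
    return math.exp(-((cu / r - 1.0) ** 2))
```

Note the Gaussian shape: a node is penalized both for being overloaded and for being underloaded relative to the reference throughput.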
S405, at every set time interval T, detect each Node_i's Ab(Cu_i) and compute the average ABV of the capability values Ab(Cu_i) of all nodes. When the i-th Node_i's Ab(Cu_i) is greater than or equal to ABV, add the newly acquired k partitioned stream-data processing tasks to be batched into that Node_i's corresponding Bath_i: increase the value of j by k, appending the k new tasks in sequence as Bath_i,j+1 to Bath_i,j+k to form a new Bath_i. The set time interval T is typically set to [500, 2000] milliseconds.
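The periodic replenishment in S405 can be sketched as follows. Container shapes are hypothetical, and routing all k new tasks to the strongest eligible node is a design choice in this sketch (the patent only requires the receiving node's Ab(Cu_i) to be at or above the average ABV):

```python
def replenish(baths, abilities, new_tasks):
    """Run once per interval T: compute the mean ability ABV, then append
    the newly arrived tasks to the Bath_i of a node whose Ab(Cu_i) >= ABV."""
    abv = sum(abilities.values()) / len(abilities)
    eligible = [i for i, ab in abilities.items() if ab >= abv]
    if eligible and new_tasks:
        target = max(eligible, key=lambda i: abilities[i])  # strongest eligible node
        baths[target].extend(new_tasks)
    return abv
```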
And S500, inputting the stream data into the computing nodes, and performing task scheduling processing on the computing nodes through a scheduling distribution model.
Further, in S500, the method for inputting stream data into a computing node and performing task scheduling processing on the computing node through a scheduling assignment model includes the following steps:
In order to avoid the node deadlock caused by repeatedly skewed data, each Map node or Reduce node needs to be intelligently scheduled according to the following pre-computed trend times;
S501, let Time1_i be the length of time for the i-th Node_i to complete a batch task; compute in sequence, for each Node_i, the longest trend time Time1_i with which the node with the smallest capability value and slowest processing speed completes its batch task:
[equation image not reproduced in the text extraction; per the surrounding text, Time1_i is computed from Max(T_i), Min(T_i), Max(Ab(Cu_i)) and Min(Ab(Cu_i))]
where T_i is the longest processing time among the processed stream-data tasks of the i-th Node_i executing its corresponding Bath_i, and Time1_i corresponds to the node with the smallest capability value and slowest processing speed. The function Min selects the minimum among its values and the function Max selects the maximum; e.g., Max(T_i) means selecting the maximum among the longest processing times obtained over all nodes Node_i executing their corresponding Bath_i, Min(T_i) means selecting the minimum among those longest processing times, Max(Ab(Cu_i)) means selecting the maximum capability value among the nodes, and Min(Ab(Cu_i)) means selecting the minimum capability value among the nodes;
S502, compute in sequence, for each Node_i, the shortest trend time Time2_i with which the node with the largest capability value and fastest processing speed completes its batch task:
[equation image not reproduced in the text extraction; by symmetry with Time1_i it is computed from the same Max/Min quantities]
Time2_i corresponds to the node with the largest capability value and fastest processing speed;
S503, when the i-th Node_i's longest trend time Time1_i is greater than the delay threshold, let that node be the N1-th node Node_N1, and let the node with the largest capability value and fastest processing speed corresponding to Time2_i be Node_N2; append Node_N1's corresponding stream-data processing task set Bath_N1 to Node_N2's corresponding task set Bath_N2, and empty Bath_N1. The delay threshold is set, for all nodes Node_i, to 1.5 times Time1_i.
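Step S503's takeover can be sketched as follows; Time1/Time2 are taken as precomputed inputs, and the container shapes are hypothetical:

```python
def rebalance(baths, time1, time2, delay_threshold):
    """If a node's longest trend time Time1_i exceeds the delay threshold,
    move its whole task set Bath_N1 to the node N2 with the smallest Time2
    (the node with the largest capability value and fastest processing),
    then empty Bath_N1, per S503."""
    n2 = min(time2, key=time2.get)  # fastest node
    moved = []
    for n1, t in time1.items():
        if t > delay_threshold and n1 != n2:
            baths[n2].extend(baths[n1])
            baths[n1] = []
            moved.append(n1)
    return moved
```

Because the whole Bath_N1 is handed over and then emptied, a slow node is never re-invoked for those tasks, which is how the method avoids the revive/kill deadlock described in the Background.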
The invention also provides an offline batch processing system for multi-source multi-modal ocean big data, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to run the units of the following system:
a data acquisition unit, used for collecting stream data;
a data normalization unit, used for performing data normalization on the stream data;
a data partitioning unit, used for partitioning the stream data for processing;
the model building unit is used for building a scheduling distribution model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model.
The beneficial effects of this disclosure are: the invention provides an offline batch processing method and system for multi-source multi-modal marine big data that can quickly detect and isolate faulty nodes under repeatedly skewed data, dynamically schedule new nodes to take over the faulty nodes' computing tasks, shorten processing time, and intelligently schedule each computing node according to its trend time, avoiding the deadlock that can arise when frequently revived/killed nodes are repeatedly re-invoked after revival.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments shown in conjunction with the drawings in which like reference characters designate the same or similar elements throughout the several views, and it is apparent that the drawings in the following description are merely some examples of the present disclosure and that other drawings may be derived therefrom by those skilled in the art without the benefit of any inventive faculty, and in which:
FIG. 1 is a flow chart of a method for offline batch processing of multi-source multi-modal marine big data;
FIG. 2 is a block diagram of an offline batch processing system for multi-source multi-modal marine big data.
Detailed Description
The conception, specific structure and technical effects of the present disclosure will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, aspects and effects of the present disclosure. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a flowchart of an offline batch processing method of multi-source multi-modal marine big data, and the offline batch processing method of multi-source multi-modal marine big data according to an embodiment of the present invention is described below with reference to fig. 1, where the method includes the following steps:
S100, collecting stream data;
Further, the method for collecting the stream data is as follows: a data sequence of physical-quantity data acquired by sensor devices such as Argo floats and surveying instruments, measuring any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of moving objects, is used as the stream data.
S200, performing data normalization on the stream data, where the data normalization comprises any one or more of time formatting, field completion, data cleaning, data integration and data reduction;
S300, partitioning the stream data for processing by the MapReduce method;
Further, in S300, the key-value pair of the stream data partitioned for processing by the MapReduce method is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of a moving object; the stream data is divided into several data streams by the MapReduce algorithm.
S400, constructing a scheduling distribution model;
further, in S400, the method for constructing the scheduling assignment model includes the following steps:
S401, MapReduce computing nodes (comprising Map nodes or Reduce nodes, established on a built Map/Reduce framework and the Hadoop distributed file system) are abbreviated as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of computing nodes. Each Node_i in Node has a corresponding set of partitioned stream-data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M] and M is the number of stream-data tasks to be batched; Bath_i,j is the j-th partitioned stream-data processing task to be batch processed of the i-th Node_i, with M > N;
The corresponding partitioned stream-data processing tasks to be batch processed include, but are not limited to, any one or more of data compression, clustering, sampling, dimensionality reduction and data transformation;
S402, sequentially input the 1st to N-th partitioned stream-data tasks to be batched, Bath_i,j, of each Bath_i correspondingly into each Node_i for batch processing (i.e., for the first round, input Bath_i,1 to Bath_i,N into the nodes Node_i in turn), and record the node among the nodes that finishes its batch processing task first as the reference node;
S403, calculate the reference throughput R of the reference node:
R = (1/P) · Σ_{i1=1}^{P} K1_{i1}
or
R = (1/P) · Σ_{i1=1}^{P} K2_{i1}
where K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task; K2_{i1} is the number of tasks of the reference node's i1-th batch processing task; and P is the total number of batch processing runs of the reference node;
S404, let Cu_i denote Node_i's current total batch workload, i.e., Cu_i is the average number of threads or processes with which Node_i is currently performing batch tasks, or the number of tasks Node_i is currently batch processing; from the reference node's reference throughput R and Cu_i, compute Ab(Cu_i), the capability value of Node_i when its workload is Cu_i, where Ab(Cu_i) = exp(−(Cu_i ÷ R − 1)^2);
S405, at every set time interval T, detect each Node_i's Ab(Cu_i) and compute the average ABV of the capability values Ab(Cu_i) of all nodes. When the i-th Node_i's Ab(Cu_i) is greater than or equal to ABV, add the newly acquired k partitioned stream-data processing tasks to be batched into that Node_i's corresponding Bath_i: increase the value of j by k, appending the k new tasks in sequence as Bath_i,j+1 to Bath_i,j+k to form a new Bath_i; the set time interval T is typically set to [500, 2000] milliseconds;
S500, performing task scheduling processing on the MapReduce computing nodes through the scheduling assignment model;
Further, in S500, the method for performing task scheduling processing on the MapReduce computing nodes through the scheduling assignment model comprises the following steps:
In order to avoid the node deadlock caused by repeatedly skewed data, each Map node or Reduce node needs to be intelligently scheduled according to the following pre-computed trend times;
S501, let Time1_i be the length of time for the i-th Node_i to complete a batch task; compute in sequence, for each Node_i, the longest trend time Time1_i with which the node with the smallest capability value and slowest processing speed completes its batch task:
[equation image not reproduced in the text extraction; per the surrounding text, Time1_i is computed from Max(T_i), Min(T_i), Max(Ab(Cu_i)) and Min(Ab(Cu_i))]
where T_i is the longest processing time among the processed stream-data tasks of the i-th Node_i executing its corresponding Bath_i, and Time1_i corresponds to the node with the smallest capability value and slowest processing speed. The function Min selects the minimum among its values and the function Max selects the maximum; e.g., Max(T_i) means selecting the maximum among the longest processing times obtained over all nodes Node_i executing their corresponding Bath_i, Min(T_i) means selecting the minimum among those longest processing times, Max(Ab(Cu_i)) means selecting the maximum capability value among the nodes, and Min(Ab(Cu_i)) means selecting the minimum capability value among the nodes;
S502, compute in sequence, for each Node_i, the shortest trend time Time2_i with which the node with the largest capability value and fastest processing speed completes its batch task:
[equation image not reproduced in the text extraction; by symmetry with Time1_i it is computed from the same Max/Min quantities]
Time2_i corresponds to the node with the largest capability value and fastest processing speed;
S503, when the i-th Node_i's longest trend time Time1_i is greater than the delay threshold, let that node be the N1-th node Node_N1, and let the node with the largest capability value and fastest processing speed corresponding to Time2_i be Node_N2; append Node_N1's corresponding stream-data processing task set Bath_N1 to Node_N2's corresponding task set Bath_N2, and empty Bath_N1. The delay threshold is set, for all nodes Node_i, to 1.5 times Time1_i.
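Putting the detailed steps together, one scheduling cycle can be sketched end-to-end; all names are hypothetical, the ability score follows S404, and the trend-time takeover follows S503 (with the trend times taken as precomputed inputs):

```python
import math

def schedule_cycle(baths, workloads, r, delay_threshold, trend_times):
    """One cycle of the scheduling assignment model: score every node by
    Ab(Cu_i) = exp(-(Cu_i/R - 1)^2), then move the task set of any node
    whose trend time exceeds the delay threshold to the best-scoring node."""
    ability = {i: math.exp(-((cu / r - 1.0) ** 2)) for i, cu in workloads.items()}
    best = max(ability, key=ability.get)  # largest capability value
    for i, t in trend_times.items():
        if t > delay_threshold and i != best:
            baths[best].extend(baths[i])  # takeover, per S503
            baths[i] = []
    return ability
```

In a real deployment this cycle would run once per interval T alongside the replenishment of S405; here it only shows how the ability score and the takeover rule fit together.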
An embodiment of the present disclosure provides an offline batch processing system for multi-source multi-modal marine big data. Fig. 2 is a structure diagram of this system. The system of this embodiment comprises: a processor, a memory, and a computer program stored in the memory and runnable on the processor, where the processor executes the computer program to realize the steps in the embodiment of the offline batch processing method of multi-source multi-modal ocean big data described above.
The system comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to run the units of the following system:
a data acquisition unit, used for collecting stream data;
a data normalization unit, used for performing data normalization on the stream data;
a data partitioning unit, used for partitioning the stream data for processing;
the model building unit is used for building a scheduling distribution model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model.
The offline batch processing system for multi-source multi-modal ocean big data can run on computing devices such as desktop computers, notebook computers, palmtop computers and cloud servers. The runnable system may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above is merely an example of an offline batch processing system for multi-source multi-modal marine big data and does not constitute a limitation of such a system; it may include more or fewer components than shown, or combine certain components, or use different components; for example, the system may further include input and output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the runnable system of the offline batch processing system of multi-source multi-modal marine big data, and uses various interfaces and lines to connect all parts of the whole runnable system.
The memory can be used to store the computer programs and/or modules, and the processor realizes the various functions of the multi-source multi-modal marine big data offline batch processing system by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to use (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
Although the description of the present disclosure has been rather exhaustive and particularly described with respect to several illustrated embodiments, it is not intended to be limited to any such details or embodiments or any particular embodiments, so as to effectively encompass the intended scope of the present disclosure. Furthermore, the foregoing describes the disclosure in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the disclosure, not presently foreseen, may nonetheless represent equivalent modifications thereto.

Claims (7)

1. An off-line batch processing method for multi-source multi-modal marine big data is characterized by comprising the following steps:
s100, collecting flow data;
s200, carrying out data normalization on the stream data;
s300, dividing the processing stream data;
s400, constructing a scheduling distribution model;
s500, inputting the stream data into the computing nodes, and performing task scheduling processing on the computing nodes through a scheduling distribution model;
the method for constructing the scheduling distribution model comprises the following steps:
S401, the MapReduce computing nodes are abbreviated as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of nodes; each Node_i in Node has a corresponding set of partitioned stream-data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M] and M is the number of stream-data tasks to be batched; Bath_i,j is the j-th partitioned stream-data processing task to be batch processed of the i-th Node_i;
S402, sequentially inputting the 1st to N-th divided stream data tasks Bath_{i,j} to be batch processed into the corresponding nodes Node_i to perform batch processing tasks, and recording as the reference node the node that first completes its batch processing task, that is, the Node_i that first finishes processing its divided stream data task Bath_{i,j};
calculating a reference throughput R of the reference node:
R = (1/P) × Σ_{i1=1}^{P} K1_{i1}
or
R = (1/P) × Σ_{i1=1}^{P} K2_{i1}
wherein K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task; K2_{i1} is the number of tasks of the reference node's i1-th batch processing task; and P is the total number of times the reference node has performed batch processing tasks;
S403, letting Cu_i denote the current total batch task load of Node_i, that is, Cu_i is the average number of threads or processes with which Node_i is currently performing batch processing tasks, or the number of tasks Node_i is currently batch processing; calculating Ab(Cu_i) from the reference throughput R of the reference node and Cu_i, wherein Ab(Cu_i) denotes the capability value of Node_i when its processing task load is Cu_i, and Ab(Cu_i) = exp(-(Cu_i ÷ R - 1)^2);
S404, at a set time interval T, detecting Ab(Cu_i) of each Node_i and calculating the average value ABV of the capability values Ab(Cu_i) of all nodes; when Ab(Cu_i) of the i-th node Node_i is greater than or equal to ABV, adding k newly collected divided stream data processing tasks to be batch processed to the Bath_i corresponding to Node_i: the value of j is increased by k, and the k newly collected tasks are appended in sequence as Bath_{i,j+1} to Bath_{i,j+k} to form a new Bath_i.
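The capability value of S403 and the average-threshold assignment of S404 can be sketched as follows. This is a minimal illustration, not the patented implementation: the reference throughput R, the node records, and the task names are assumed values for the example.

```python
import math

R = 8.0  # reference throughput from S402 (assumed value for illustration)

def ability(cu, r=R):
    # Ab(Cu_i) = exp(-(Cu_i / R - 1)^2): peaks at 1 when the node's
    # current load Cu_i equals the reference throughput R, and decays
    # as the node becomes under- or over-loaded.
    return math.exp(-((cu / r - 1.0) ** 2))

def assign_new_tasks(nodes, new_tasks):
    # S404 sketch: compute the cluster average ABV of the capability
    # values, then append the k newly collected tasks to the Bath_i of
    # the first node whose Ab(Cu_i) is at or above ABV.
    abv = sum(ability(n["cu"]) for n in nodes.values()) / len(nodes)
    for node in nodes.values():
        if ability(node["cu"]) >= abv:
            node["bath"].extend(new_tasks)  # Bath_{i,j+1} .. Bath_{i,j+k}
            node["cu"] += len(new_tasks)
            break
    return nodes

nodes = {1: {"cu": 8.0, "bath": []}, 2: {"cu": 14.0, "bath": []}}
assign_new_tasks(nodes, ["task_a", "task_b"])
print(nodes[1]["bath"])  # ['task_a', 'task_b']
```

Node 1's load equals R, so its capability value is 1 and it receives the new tasks; node 2 is overloaded relative to R and is skipped.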
2. The offline batch processing method for multi-source multi-modal marine big data according to claim 1, wherein the method for collecting stream data comprises: taking as stream data a data sequence of physical quantity data collected by a buoy or a sensor mapping device, the physical quantity being any one or more of sonar data, wind power, seismic, electromagnetic, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed, and direction of a moving object.
3. The method of claim 1, wherein the data normalization includes any one or more of time formatting, field completion, data cleaning, data integration, and data reduction.
4. The offline batch processing method for multi-source multi-modal ocean big data according to claim 1, wherein in S300, the method for dividing the stream data for processing is to process the stream data by a MapReduce method: the key-value pair of the stream data is <sensor number, physical quantity, acquisition time>, the physical quantity being any one or more of sonar data, wind power, seismic, electromagnetic, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed, and direction of a moving object; the stream data is divided into a plurality of data streams by the MapReduce algorithm.
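As an illustration of the division described in claim 4, the map and shuffle steps below group sensor records keyed by sensor number so that each sensor's readings form a separate data stream. The record layout, sensor ids, and values are assumptions for the example, not taken from the patent.

```python
from collections import defaultdict

# Hypothetical records in the claim's <sensor number, physical quantity,
# acquisition time> form; sensor ids and readings are made up.
records = [
    ("S1", ("temperature", 18.2), "2021-04-29T08:00:00"),
    ("S2", ("pressure", 101.3), "2021-04-29T08:00:01"),
    ("S1", ("temperature", 18.4), "2021-04-29T08:00:05"),
]

def map_phase(recs):
    # Map: emit one <sensor number, (physical quantity, time)> pair per record.
    for sensor, quantity, ts in recs:
        yield sensor, (quantity, ts)

def shuffle(pairs):
    # Shuffle: group the pairs by key, so each key becomes one data
    # stream that can be handed to a compute node as a batch task.
    streams = defaultdict(list)
    for key, value in pairs:
        streams[key].append(value)
    return dict(streams)

streams = shuffle(map_phase(records))
print(sorted(streams))  # ['S1', 'S2']
```

After the shuffle, sensor S1's two readings form one stream and sensor S2's reading forms another, mirroring how the stream data is split into a plurality of data streams.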
5. The offline batch processing method for multi-source multi-modal ocean big data according to claim 1, wherein in S500, the method for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model comprises the following steps:
S501, letting Time1_i be the length of time for the i-th node Node_i to complete a batch processing task; calculating in sequence, for each Node_i, the longest trend time Time1_i with which the node having the smallest capability value and the slowest processing speed among the nodes completes the batch processing task:
(formula for Time1_i, given only as an image in the source)
wherein T_i is the longest processing time among the processed stream data tasks of the i-th node Node_i executing its corresponding Bath_i, Time1_i corresponds to the node with the smallest capability value and the slowest processing speed, the function Min selects the minimum among its arguments, and the function Max selects the maximum among its arguments;
S502, calculating in sequence, for each Node_i, the shortest trend time Time2_i with which the node having the largest capability value and the fastest processing speed completes the batch processing task:
(formula for Time2_i, given only as an image in the source)
wherein Time2_i corresponds to the node with the largest capability value and the fastest processing speed;
S503, when the longest trend time Time1_i of the i-th node Node_i is greater than the delay threshold, denoting that node as the N1-th node Node_{N1}, and denoting the node with the largest capability value and the fastest processing speed corresponding to Time2_i as Node_{N2}; adding the stream data processing task set Bath_{N1} corresponding to Node_{N1} to the stream data processing task set Bath_{N2} corresponding to Node_{N2}, and emptying Bath_{N1}.
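The migration step S503 can be sketched as a single rebalancing pass. This is a simplified illustration: the node records, the capability and trend-time values, and the delay threshold are assumptions for the example.

```python
def rebalance(nodes, delay_threshold):
    # S503 sketch: if the node with the longest trend time Time1 exceeds
    # the delay threshold, move its whole task set Bath to the node with
    # the largest capability value, then empty the slow node's queue.
    slow = max(nodes, key=lambda k: nodes[k]["time1"])
    if nodes[slow]["time1"] > delay_threshold:
        fast = max(nodes, key=lambda k: nodes[k]["ability"])
        if fast != slow:
            nodes[fast]["bath"].extend(nodes[slow]["bath"])  # Bath_N1 -> Bath_N2
            nodes[slow]["bath"].clear()                      # empty Bath_N1
    return nodes

nodes = {
    "n1": {"ability": 0.40, "time1": 9.0, "bath": ["a", "b"]},  # slow Node_N1
    "n2": {"ability": 0.95, "time1": 3.0, "bath": ["c"]},       # fast Node_N2
}
rebalance(nodes, delay_threshold=6.0)
print(nodes["n2"]["bath"])  # ['c', 'a', 'b']
```

Node n1's trend time of 9.0 exceeds the threshold, so its entire queue is handed to n2 and its own queue is emptied.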
6. The method of claim 5, wherein the delay threshold is set to 1.5 times the average value of Time1_i over all nodes Node_i.
7. An offline batch processing system for multi-source multi-modal marine big data, the system comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to operate in the units of the following system:
a data collection unit, used for collecting stream data;
a data normalization unit, used for performing data normalization on the stream data;
a data division unit, used for dividing the stream data for processing;
a model construction unit, used for constructing a scheduling distribution model;
a scheduling processing unit, used for inputting the stream data into computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model;
the method for constructing the scheduling distribution model comprises the following steps:
S401, abbreviating the MapReduce computing nodes as nodes and grouping them into a set Node, wherein Node = {Node_i}, i has a value range of [1, N], and N is the number of nodes; each Node_i in Node has a corresponding set Bath_i = {Bath_{i,j}} of divided stream data processing tasks to be batch processed, wherein j has a value range of [1, M], M is the number of stream data to be batch processed, and Bath_{i,j} is the j-th divided stream data processing task to be batch processed of the i-th node Node_i;
S402, sequentially inputting the 1st to N-th divided stream data tasks Bath_{i,j} to be batch processed into the corresponding nodes Node_i to perform batch processing tasks, and recording as the reference node the node that first completes its batch processing task, that is, the Node_i that first finishes processing its divided stream data task Bath_{i,j};
calculating a reference throughput R of the reference node:
R = (1/P) × Σ_{i1=1}^{P} K1_{i1}
or
R = (1/P) × Σ_{i1=1}^{P} K2_{i1}
wherein K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task; K2_{i1} is the number of tasks of the reference node's i1-th batch processing task; and P is the total number of times the reference node has performed batch processing tasks;
S403, letting Cu_i denote the current total batch task load of Node_i, that is, Cu_i is the average number of threads or processes with which Node_i is currently performing batch processing tasks, or the number of tasks Node_i is currently batch processing; calculating Ab(Cu_i) from the reference throughput R of the reference node and Cu_i, wherein Ab(Cu_i) denotes the capability value of Node_i when its processing task load is Cu_i, and Ab(Cu_i) = exp(-(Cu_i ÷ R - 1)^2);
S404, at a set time interval T, detecting Ab(Cu_i) of each Node_i and calculating the average value ABV of the capability values Ab(Cu_i) of all nodes; when Ab(Cu_i) of the i-th node Node_i is greater than or equal to ABV, adding k newly collected divided stream data processing tasks to be batch processed to the Bath_i corresponding to Node_i: the value of j is increased by k, and the k newly collected tasks are appended in sequence as Bath_{i,j+1} to Bath_{i,j+k} to form a new Bath_i.
CN202110476164.1A 2021-04-29 2021-04-29 Offline batch processing method and system for multi-source multi-mode ocean big data Active CN113268505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110476164.1A CN113268505B (en) 2021-04-29 2021-04-29 Offline batch processing method and system for multi-source multi-mode ocean big data


Publications (2)

Publication Number Publication Date
CN113268505A CN113268505A (en) 2021-08-17
CN113268505B true CN113268505B (en) 2021-11-30

Family

ID=77230023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110476164.1A Active CN113268505B (en) 2021-04-29 2021-04-29 Offline batch processing method and system for multi-source multi-mode ocean big data

Country Status (1)

Country Link
CN (1) CN113268505B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519914A (en) * 2018-04-09 2018-09-11 腾讯科技(深圳)有限公司 Big data computational methods, system and computer equipment
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN110321223A (en) * 2019-07-03 2019-10-11 湖南大学 The data flow division methods and device of Coflow work compound stream scheduling perception
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
US10713257B2 (en) * 2017-09-29 2020-07-14 International Business Machines Corporation Data-centric reduction network for cluster monitoring




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant