
CN113268505B - Offline batch processing method and system for multi-source multi-mode ocean big data - Google Patents


Info

Publication number
CN113268505B
CN113268505B (application CN202110476164.1A)
Authority
CN
China
Prior art keywords
node
data
nodes
batch
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110476164.1A
Other languages
Chinese (zh)
Other versions
CN113268505A (en)
Inventor
李昭
沈金伟
彭小红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Ocean University filed Critical Guangdong Ocean University
Priority to CN202110476164.1A priority Critical patent/CN113268505B/en
Publication of CN113268505A publication Critical patent/CN113268505A/en
Application granted granted Critical
Publication of CN113268505B publication Critical patent/CN113268505B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/524 Deadlock detection or avoidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an offline batch processing method and system for multi-source multi-modal ocean big data, comprising the steps of: collecting stream data; performing data normalization on the stream data; partitioning the stream data for processing; constructing a scheduling assignment model; and inputting the stream data into computing nodes and performing task scheduling on the computing nodes through the scheduling assignment model. The advantages are that under severely and repeatedly skewed data, faulty nodes can be quickly detected and isolated, new nodes are dynamically scheduled to take over the faulty nodes' computing tasks, and processing time is shortened; each computing node is scheduled intelligently according to its trend time, which avoids the deadlock that can arise when nodes that are frequently revived/killed are repeatedly re-invoked after revival.

Description

Offline batch processing method and system for multi-source multi-mode ocean big data
Technical Field
The disclosure belongs to the fields of marine big data processing, batch data processing and data transmission, and particularly relates to an offline batch processing method and system for multi-source multi-modal marine big data.
Background
Ocean big data is collected from sensors such as Argo floats, buoys and surveying equipment, and covers seabed topography data, ocean remote-sensing data, ship survey data and buoy data; it keeps growing as ocean monitoring equipment develops. Because these data are collected from different sources and have different data structures (multi-source heterogeneous data gathered by different acquisition terminals), current big data processing methods usually require offline batch processing of non-real-time business data when massive data are stored as data sources.
Batch systems that process such business data, also commonly called offline systems, take a large amount of input data, run a job to process it, and produce some output data. A job usually takes a long time, and batch jobs are typically run periodically. Offline batch processing of big data has low requirements on processing delay, but the amount of data processed is large and occupies substantial computing and storage resources; it is generally implemented with the Spark or Hadoop framework. For massive data, the Spark or Hadoop frameworks are often employed to provide bandwidth, memory, storage and other resources when fast response is not required (e.g., minute-scale or hour-scale delay). However, since marine big data is usually massive multi-modal big data, it is difficult to obtain good results in environments that demand fast and timely processing. The Spark and Hadoop frameworks adopt MapReduce, where a MapReduce job is the unit of work a client needs to execute and comprises input data, a MapReduce program and configuration information.
Hadoop divides a job into several tasks for execution, each divided into Map task nodes and Reduce task nodes. In a MapReduce computing cluster, node hardware errors (host, disk, memory, etc.) and software errors are the norm. The conventional scheduling method for MapReduce computing nodes (cluster nodes) is as follows: MapReduce achieves reliability by distributing large-scale operations on a data set to the nodes of the network; each node periodically reports its completed work and latest state, and if a node stays silent for longer than a preset time interval, the master node records that node as dead and sends the data assigned to it to other nodes. Although MapReduce can thus detect and isolate a faulty node and schedule a new node to take over its computing tasks, under severely skewed data the current node scheduling method easily leads to overlong processing times, and nodes that are frequently revived/killed may be repeatedly re-invoked after revival, causing deadlock.
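The heartbeat-based failure handling described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual implementation; all names and the timeout value are hypothetical:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is declared dead (assumed value)

class MasterNode:
    """Tracks worker heartbeats and reassigns tasks away from silent workers."""

    def __init__(self):
        self.last_seen = {}  # worker id -> timestamp of last heartbeat
        self.tasks = {}      # worker id -> list of assigned task ids

    def heartbeat(self, worker_id, now=None):
        # A worker periodically reports in; record when we last heard from it.
        self.last_seen[worker_id] = time.time() if now is None else now

    def assign(self, worker_id, task_id):
        self.tasks.setdefault(worker_id, []).append(task_id)
        self.last_seen.setdefault(worker_id, time.time())

    def reap_dead(self, now=None):
        """Mark workers silent past the timeout as dead and hand their
        orphaned tasks to the live worker with the fewest tasks."""
        now = time.time() if now is None else now
        dead = [w for w, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
        for w in dead:
            orphaned = self.tasks.pop(w, [])
            del self.last_seen[w]
            if self.last_seen:  # any live workers left to take over?
                target = min(self.last_seen, key=lambda x: len(self.tasks.get(x, [])))
                self.tasks.setdefault(target, []).extend(orphaned)
        return dead
```

As the Background notes, this simple takeover rule is exactly what can loop under repeated data skew; the patent's scheduling model below adds workload-aware scoring on top of it.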
Disclosure of Invention
The invention aims to provide an offline batch processing method and system for multi-source multi-modal marine big data, to solve one or more technical problems in the prior art and at least provide a beneficial alternative or create favorable conditions.
In order to achieve the above object, according to an aspect of the present disclosure, there is provided an offline batch processing method of multi-source multi-modal marine big data, the method comprising the steps of:
S100, collecting stream data;
Further, the method for collecting the stream data is as follows: a data sequence of physical-quantity data acquired by sensor devices such as Argo floats and surveying instruments, measuring any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of moving objects, is used as the stream data.
S200, performing data normalization on the stream data, where the data normalization comprises any one or more of time formatting, field completion, data cleaning, data integration and data reduction;
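Step S200's normalization can be illustrated with a minimal sketch. The record shape and field names are hypothetical, not taken from the patent:

```python
from datetime import datetime, timezone

def normalize_record(rec, required_fields=("sensor_id", "quantity", "value", "ts")):
    """Time-format, field-complete and clean one raw stream record (illustrative only)."""
    out = dict(rec)
    # Field completion: make sure every expected field exists
    for f in required_fields:
        out.setdefault(f, None)
    # Time formatting: accept epoch seconds, emit ISO-8601 UTC
    ts = out["ts"]
    if isinstance(ts, (int, float)):
        out["ts"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    # Data cleaning: drop records that carry no measured value
    return out if out["value"] is not None else None
```

Data integration and data reduction would be further passes over the normalized records; they are omitted here for brevity.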
S300, partitioning the stream data for processing by the MapReduce method;
Further, in S300, the stream data is partitioned for processing by the MapReduce method as follows: the key-value pair of the stream data is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of a moving object; the stream data is divided into several data streams by the MapReduce algorithm.
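The keyed partitioning in S300 can be sketched as a hash-partition over the <sensor number, physical quantity, acquisition time> key, the way a MapReduce partitioner would split records into streams. A minimal sketch with hypothetical field names:

```python
def partition_stream(records, n_streams):
    """Split normalized records into n_streams data streams by hashing the
    (sensor number, physical quantity) part of the key, so all records of
    one sensor/quantity land in the same stream."""
    streams = [[] for _ in range(n_streams)]
    for rec in records:
        # <sensor number, physical quantity, acquisition time>
        key = (rec["sensor_id"], rec["quantity"], rec["ts"])
        idx = hash(key[:2]) % n_streams
        streams[idx].append((key, rec["value"]))
    return streams
```

Hashing only the first two key components is a design choice here: it keeps one sensor's series together while still spreading sensors across streams.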
S400, constructing a scheduling distribution model;
further, in S400, the method for constructing the scheduling assignment model includes the following steps:
S401, MapReduce computing nodes (comprising Map nodes or Reduce nodes, established on a built Map/Reduce framework and the Hadoop distributed file system) are abbreviated as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of computing nodes. Each Node_i in Node has a corresponding set of partitioned stream-data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M] and M is the number of stream-data tasks to be batched; Bath_i,j is the j-th partitioned stream-data processing task to be batch processed of the i-th Node_i, with M > N;
The corresponding partitioned stream-data processing tasks to be batch processed include, but are not limited to, any one or more of data compression, clustering, sampling, dimensionality reduction and data transformation;
S402, sequentially input the 1st to N-th partitioned stream-data tasks to be batched, Bath_i,j, of each Bath_i correspondingly into each Node_i for batch processing (i.e., for the first round, input Bath_i,1 to Bath_i,N into the nodes Node_i in turn); the node among the nodes that finishes its batch processing task first is recorded as the reference node, i.e., the Node_i that first completes a partitioned stream-data task Bath_i,j;
S403, calculating the reference processing amount R of the reference node,
Figure BDA0003047444590000021
or
Figure BDA0003047444590000022
Wherein, K1i1The average number of threads or processes that batch process the task for the ith 1 time of reference node; k2i1The number of tasks for which the reference node performed batch processing tasks at the i1 th time; p is the total times of the batch processing tasks of the reference nodes;
S404, let Cu_i denote Node_i's current total batch workload, i.e., Cu_i is the average number of threads or processes with which Node_i is currently performing batch tasks, or the number of tasks Node_i is currently batch processing; from the reference node's reference throughput R and Cu_i, compute Ab(Cu_i), the capability value of Node_i when its workload is Cu_i, where Ab(Cu_i) = exp(−(Cu_i ÷ R − 1)^2);
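Steps S403 and S404 can be written out directly. A minimal sketch under the stated definitions: R is the mean per-batch workload of the reference node, and Ab peaks at 1 exactly when a node's current workload equals R:

```python
import math

def reference_throughput(workloads):
    """R: mean workload (thread/process count or task count) over the
    reference node's P completed batch runs, per S403."""
    return sum(workloads) / len(workloads)

def ability(cu, r):
    """Ab(Cu_i) = exp(-(Cu_i / R - 1)^2), per S404.
    Equals 1 when cu == r and decays as the workload deviates from R."""
    return math.exp(-((cu / r - 1.0) ** 2))
```

Note the Gaussian shape: a node is penalized both for being overloaded and for being underloaded relative to the reference throughput.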
S405, at every set time interval T, detect each Node_i's Ab(Cu_i) and compute the average ABV of the capability values Ab(Cu_i) of all nodes. When the i-th Node_i's Ab(Cu_i) is greater than or equal to ABV, add the newly acquired k partitioned stream-data processing tasks to be batched into that Node_i's corresponding Bath_i: increase the value of j by k, appending the k new tasks in sequence as Bath_i,j+1 to Bath_i,j+k to form a new Bath_i. The set time interval T is typically set to [500, 2000] milliseconds.
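The periodic replenishment in S405 can be sketched as follows. Container shapes are hypothetical, and routing all k new tasks to the strongest eligible node is a design choice in this sketch (the patent only requires the receiving node's Ab(Cu_i) to be at or above the average ABV):

```python
def replenish(baths, abilities, new_tasks):
    """Run once per interval T: compute the mean ability ABV, then append
    the newly arrived tasks to the Bath_i of a node whose Ab(Cu_i) >= ABV."""
    abv = sum(abilities.values()) / len(abilities)
    eligible = [i for i, ab in abilities.items() if ab >= abv]
    if eligible and new_tasks:
        target = max(eligible, key=lambda i: abilities[i])  # strongest eligible node
        baths[target].extend(new_tasks)
    return abv
```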
And S500, inputting the stream data into the computing nodes, and performing task scheduling processing on the computing nodes through a scheduling distribution model.
Further, in S500, the method for inputting stream data into a computing node and performing task scheduling processing on the computing node through a scheduling assignment model includes the following steps:
In order to avoid the node deadlock caused by repeatedly skewed data, each Map node or Reduce node needs to be intelligently scheduled according to the following pre-computed trend times;
S501, let Time1_i be the length of time for the i-th Node_i to complete a batch task; compute in sequence, for each Node_i, the longest trend time Time1_i with which the node with the smallest capability value and slowest processing speed completes its batch task:
[equation image not reproduced in the text extraction; per the surrounding text, Time1_i is computed from Max(T_i), Min(T_i), Max(Ab(Cu_i)) and Min(Ab(Cu_i))]
where T_i is the longest processing time among the processed stream-data tasks of the i-th Node_i executing its corresponding Bath_i, and Time1_i corresponds to the node with the smallest capability value and slowest processing speed. The function Min selects the minimum among its values and the function Max selects the maximum; e.g., Max(T_i) means selecting the maximum among the longest processing times obtained over all nodes Node_i executing their corresponding Bath_i, Min(T_i) means selecting the minimum among those longest processing times, Max(Ab(Cu_i)) means selecting the maximum capability value among the nodes, and Min(Ab(Cu_i)) means selecting the minimum capability value among the nodes;
S502, compute in sequence, for each Node_i, the shortest trend time Time2_i with which the node with the largest capability value and fastest processing speed completes its batch task:
[equation image not reproduced in the text extraction; by symmetry with Time1_i it is computed from the same Max/Min quantities]
Time2_i corresponds to the node with the largest capability value and fastest processing speed;
S503, when the i-th Node_i's longest trend time Time1_i is greater than the delay threshold, let that node be the N1-th node Node_N1, and let the node with the largest capability value and fastest processing speed corresponding to Time2_i be Node_N2; append Node_N1's corresponding stream-data processing task set Bath_N1 to Node_N2's corresponding task set Bath_N2, and empty Bath_N1. The delay threshold is set, for all nodes Node_i, to 1.5 times Time1_i.
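Step S503's takeover can be sketched as follows; Time1/Time2 are taken as precomputed inputs, and the container shapes are hypothetical:

```python
def rebalance(baths, time1, time2, delay_threshold):
    """If a node's longest trend time Time1_i exceeds the delay threshold,
    move its whole task set Bath_N1 to the node N2 with the smallest Time2
    (the node with the largest capability value and fastest processing),
    then empty Bath_N1, per S503."""
    n2 = min(time2, key=time2.get)  # fastest node
    moved = []
    for n1, t in time1.items():
        if t > delay_threshold and n1 != n2:
            baths[n2].extend(baths[n1])
            baths[n1] = []
            moved.append(n1)
    return moved
```

Because the whole Bath_N1 is handed over and then emptied, a slow node is never re-invoked for those tasks, which is how the method avoids the revive/kill deadlock described in the Background.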
The invention also provides an offline batch processing system for multi-source multi-modal ocean big data, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to run the units of the following system:
a data acquisition unit, used for collecting stream data;
a data normalization unit, used for performing data normalization on the stream data;
a data partitioning unit, used for partitioning the stream data for processing;
the model building unit is used for building a scheduling distribution model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model.
The beneficial effects of this disclosure are: the invention provides an offline batch processing method and system for multi-source multi-modal marine big data that can quickly detect and isolate faulty nodes under repeatedly skewed data, dynamically schedule new nodes to take over the faulty nodes' computing tasks, shorten processing time, and intelligently schedule each computing node according to its trend time, avoiding the deadlock that can arise when frequently revived/killed nodes are repeatedly re-invoked after revival.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments shown in conjunction with the drawings in which like reference characters designate the same or similar elements throughout the several views, and it is apparent that the drawings in the following description are merely some examples of the present disclosure and that other drawings may be derived therefrom by those skilled in the art without the benefit of any inventive faculty, and in which:
FIG. 1 is a flow chart of a method for offline batch processing of multi-source multi-modal marine big data;
FIG. 2 is a block diagram of an offline batch processing system for multi-source multi-modal marine big data.
Detailed Description
The conception, specific structure and technical effects of the present disclosure will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, aspects and effects of the present disclosure. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a flowchart of an offline batch processing method of multi-source multi-modal marine big data, and the offline batch processing method of multi-source multi-modal marine big data according to an embodiment of the present invention is described below with reference to fig. 1, where the method includes the following steps:
S100, collecting stream data;
Further, the method for collecting the stream data is as follows: a data sequence of physical-quantity data acquired by sensor devices such as Argo floats and surveying instruments, measuring any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of moving objects, is used as the stream data.
S200, performing data normalization on the stream data, where the data normalization comprises any one or more of time formatting, field completion, data cleaning, data integration and data reduction;
S300, partitioning the stream data for processing by the MapReduce method;
Further, in S300, the key-value pair of the stream data partitioned for processing by the MapReduce method is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic activity, electromagnetism, temperature, humidity, noise, light intensity, pressure, water-quality components, and the size, speed and direction of a moving object; the stream data is divided into several data streams by the MapReduce algorithm.
S400, constructing a scheduling distribution model;
further, in S400, the method for constructing the scheduling assignment model includes the following steps:
S401, MapReduce computing nodes (comprising Map nodes or Reduce nodes, established on a built Map/Reduce framework and the Hadoop distributed file system) are abbreviated as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of computing nodes. Each Node_i in Node has a corresponding set of partitioned stream-data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M] and M is the number of stream-data tasks to be batched; Bath_i,j is the j-th partitioned stream-data processing task to be batch processed of the i-th Node_i, with M > N;
The corresponding partitioned stream-data processing tasks to be batch processed include, but are not limited to, any one or more of data compression, clustering, sampling, dimensionality reduction and data transformation;
S402, sequentially input the 1st to N-th partitioned stream-data tasks to be batched, Bath_i,j, of each Bath_i correspondingly into each Node_i for batch processing (i.e., for the first round, input Bath_i,1 to Bath_i,N into the nodes Node_i in turn), and record the node among the nodes that finishes its batch processing task first as the reference node;
S403, calculate the reference throughput R of the reference node:
R = (1/P) · Σ_{i1=1}^{P} K1_{i1}
or
R = (1/P) · Σ_{i1=1}^{P} K2_{i1}
where K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task; K2_{i1} is the number of tasks of the reference node's i1-th batch processing task; and P is the total number of batch processing runs of the reference node;
S404, let Cu_i denote Node_i's current total batch workload, i.e., Cu_i is the average number of threads or processes with which Node_i is currently performing batch tasks, or the number of tasks Node_i is currently batch processing; from the reference node's reference throughput R and Cu_i, compute Ab(Cu_i), the capability value of Node_i when its workload is Cu_i, where Ab(Cu_i) = exp(−(Cu_i ÷ R − 1)^2);
S405, at every set time interval T, detect each Node_i's Ab(Cu_i) and compute the average ABV of the capability values Ab(Cu_i) of all nodes. When the i-th Node_i's Ab(Cu_i) is greater than or equal to ABV, add the newly acquired k partitioned stream-data processing tasks to be batched into that Node_i's corresponding Bath_i: increase the value of j by k, appending the k new tasks in sequence as Bath_i,j+1 to Bath_i,j+k to form a new Bath_i; the set time interval T is typically set to [500, 2000] milliseconds;
S500, performing task scheduling processing on the MapReduce computing nodes through the scheduling assignment model;
Further, in S500, the method for performing task scheduling processing on the MapReduce computing nodes through the scheduling assignment model comprises the following steps:
In order to avoid the node deadlock caused by repeatedly skewed data, each Map node or Reduce node needs to be intelligently scheduled according to the following pre-computed trend times;
S501, let Time1_i be the length of time for the i-th Node_i to complete a batch task; compute in sequence, for each Node_i, the longest trend time Time1_i with which the node with the smallest capability value and slowest processing speed completes its batch task:
[equation image not reproduced in the text extraction; per the surrounding text, Time1_i is computed from Max(T_i), Min(T_i), Max(Ab(Cu_i)) and Min(Ab(Cu_i))]
where T_i is the longest processing time among the processed stream-data tasks of the i-th Node_i executing its corresponding Bath_i, and Time1_i corresponds to the node with the smallest capability value and slowest processing speed. The function Min selects the minimum among its values and the function Max selects the maximum; e.g., Max(T_i) means selecting the maximum among the longest processing times obtained over all nodes Node_i executing their corresponding Bath_i, Min(T_i) means selecting the minimum among those longest processing times, Max(Ab(Cu_i)) means selecting the maximum capability value among the nodes, and Min(Ab(Cu_i)) means selecting the minimum capability value among the nodes;
S502, compute in sequence, for each Node_i, the shortest trend time Time2_i with which the node with the largest capability value and fastest processing speed completes its batch task:
[equation image not reproduced in the text extraction; by symmetry with Time1_i it is computed from the same Max/Min quantities]
Time2_i corresponds to the node with the largest capability value and fastest processing speed;
S503, when the i-th Node_i's longest trend time Time1_i is greater than the delay threshold, let that node be the N1-th node Node_N1, and let the node with the largest capability value and fastest processing speed corresponding to Time2_i be Node_N2; append Node_N1's corresponding stream-data processing task set Bath_N1 to Node_N2's corresponding task set Bath_N2, and empty Bath_N1. The delay threshold is set, for all nodes Node_i, to 1.5 times Time1_i.
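Putting the detailed steps together, one scheduling cycle can be sketched end-to-end; all names are hypothetical, the ability score follows S404, and the trend-time takeover follows S503 (with the trend times taken as precomputed inputs):

```python
import math

def schedule_cycle(baths, workloads, r, delay_threshold, trend_times):
    """One cycle of the scheduling assignment model: score every node by
    Ab(Cu_i) = exp(-(Cu_i/R - 1)^2), then move the task set of any node
    whose trend time exceeds the delay threshold to the best-scoring node."""
    ability = {i: math.exp(-((cu / r - 1.0) ** 2)) for i, cu in workloads.items()}
    best = max(ability, key=ability.get)  # largest capability value
    for i, t in trend_times.items():
        if t > delay_threshold and i != best:
            baths[best].extend(baths[i])  # takeover, per S503
            baths[i] = []
    return ability
```

In a real deployment this cycle would run once per interval T alongside the replenishment of S405; here it only shows how the ability score and the takeover rule fit together.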
An embodiment of the present disclosure provides an offline batch processing system for multi-source multi-modal marine big data. Fig. 2 is a structure diagram of this system. The system of this embodiment comprises: a processor, a memory, and a computer program stored in the memory and runnable on the processor, where the processor executes the computer program to realize the steps in the embodiment of the offline batch processing method of multi-source multi-modal ocean big data described above.
The system comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to run the units of the following system:
a data acquisition unit, used for collecting stream data;
a data normalization unit, used for performing data normalization on the stream data;
a data partitioning unit, used for partitioning the stream data for processing;
the model building unit is used for building a scheduling distribution model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model.
The offline batch processing system for multi-source multi-modal ocean big data can run on computing devices such as desktop computers, notebook computers, palmtop computers and cloud servers. The runnable system may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above is merely an example of an offline batch processing system for multi-source multi-modal marine big data and does not constitute a limitation of such a system; it may include more or fewer components than shown, or combine certain components, or use different components; for example, the system may further include input and output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the runnable system of the offline batch processing system of multi-source multi-modal marine big data, and uses various interfaces and lines to connect all parts of the whole runnable system.
The memory can be used to store the computer programs and/or modules, and the processor realizes the various functions of the multi-source multi-modal marine big data offline batch processing system by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to use (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
Although the description of the present disclosure has been rather exhaustive and particularly described with respect to several illustrated embodiments, it is not intended to be limited to any such details or embodiments or any particular embodiments, so as to effectively encompass the intended scope of the present disclosure. Furthermore, the foregoing describes the disclosure in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the disclosure, not presently foreseen, may nonetheless represent equivalent modifications thereto.

Claims (7)

1. An off-line batch processing method for multi-source multi-modal marine big data is characterized by comprising the following steps:
s100, collecting flow data;
s200, carrying out data normalization on the stream data;
s300, dividing the processing stream data;
s400, constructing a scheduling distribution model;
s500, inputting the stream data into the computing nodes, and performing task scheduling processing on the computing nodes through a scheduling distribution model;
the method for constructing the scheduling distribution model comprises the following steps:
S401, the MapReduce computing nodes are abbreviated as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of nodes; each Node_i in Node has a corresponding set of partitioned stream-data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M] and M is the number of stream-data tasks to be batched; Bath_i,j is the j-th partitioned stream-data processing task to be batch processed of the i-th Node_i;
S402, sequentially inputting the 1st to N-th divided stream data tasks Bath_{i,j} to be batch processed into the corresponding nodes Node_i to perform batch processing tasks, and recording as the reference node the node that first completes its batch processing task, that is, the Node_i that first finishes processing its divided stream data task Bath_{i,j};
calculating a reference throughput R of the reference node:
R = (1/P) × Σ_{i1=1}^{P} K1_{i1}
or
R = (1/P) × Σ_{i1=1}^{P} K2_{i1}
wherein K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task; K2_{i1} is the number of tasks of the reference node's i1-th batch processing task; and P is the total number of times the reference node has performed batch processing tasks;
S403, letting Cu_i denote the current total batch task load of Node_i, that is, Cu_i is the average number of threads or processes with which Node_i is currently performing batch processing tasks, or the number of tasks Node_i is currently batch processing; calculating Ab(Cu_i) from the reference throughput R of the reference node and Cu_i, wherein Ab(Cu_i) denotes the capability value of Node_i when its processing task load is Cu_i, and Ab(Cu_i) = exp(-(Cu_i ÷ R - 1)^2);
S404, at a set time interval T, detecting Ab(Cu_i) of each Node_i and calculating the average value ABV of the capability values Ab(Cu_i) of all nodes; when Ab(Cu_i) of the i-th node Node_i is greater than or equal to ABV, adding k newly collected divided stream data processing tasks to be batch processed to the Bath_i corresponding to Node_i: the value of j is increased by k, and the k newly collected tasks are appended in sequence as Bath_{i,j+1} to Bath_{i,j+k} to form a new Bath_i.
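The capability value of S403 and the average-threshold assignment of S404 can be sketched as follows. This is a minimal illustration, not the patented implementation: the reference throughput R, the node records, and the task names are assumed values for the example.

```python
import math

R = 8.0  # reference throughput from S402 (assumed value for illustration)

def ability(cu, r=R):
    # Ab(Cu_i) = exp(-(Cu_i / R - 1)^2): peaks at 1 when the node's
    # current load Cu_i equals the reference throughput R, and decays
    # as the node becomes under- or over-loaded.
    return math.exp(-((cu / r - 1.0) ** 2))

def assign_new_tasks(nodes, new_tasks):
    # S404 sketch: compute the cluster average ABV of the capability
    # values, then append the k newly collected tasks to the Bath_i of
    # the first node whose Ab(Cu_i) is at or above ABV.
    abv = sum(ability(n["cu"]) for n in nodes.values()) / len(nodes)
    for node in nodes.values():
        if ability(node["cu"]) >= abv:
            node["bath"].extend(new_tasks)  # Bath_{i,j+1} .. Bath_{i,j+k}
            node["cu"] += len(new_tasks)
            break
    return nodes

nodes = {1: {"cu": 8.0, "bath": []}, 2: {"cu": 14.0, "bath": []}}
assign_new_tasks(nodes, ["task_a", "task_b"])
print(nodes[1]["bath"])  # ['task_a', 'task_b']
```

Node 1's load equals R, so its capability value is 1 and it receives the new tasks; node 2 is overloaded relative to R and is skipped.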
2. The offline batch processing method for multi-source multi-modal marine big data according to claim 1, wherein the method for collecting stream data comprises: taking as stream data a data sequence of physical quantity data collected by a buoy or a sensor mapping device, the physical quantity being any one or more of sonar data, wind power, seismic, electromagnetic, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed, and direction of a moving object.
3. The method of claim 1, wherein the data normalization includes any one or more of time formatting, field completion, data cleaning, data integration, and data reduction.
4. The offline batch processing method for multi-source multi-modal ocean big data according to claim 1, wherein in S300, the method for dividing the stream data for processing is to process the stream data by a MapReduce method: the key-value pair of the stream data is <sensor number, physical quantity, acquisition time>, the physical quantity being any one or more of sonar data, wind power, seismic, electromagnetic, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed, and direction of a moving object; the stream data is divided into a plurality of data streams by the MapReduce algorithm.
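As an illustration of the division described in claim 4, the map and shuffle steps below group sensor records keyed by sensor number so that each sensor's readings form a separate data stream. The record layout, sensor ids, and values are assumptions for the example, not taken from the patent.

```python
from collections import defaultdict

# Hypothetical records in the claim's <sensor number, physical quantity,
# acquisition time> form; sensor ids and readings are made up.
records = [
    ("S1", ("temperature", 18.2), "2021-04-29T08:00:00"),
    ("S2", ("pressure", 101.3), "2021-04-29T08:00:01"),
    ("S1", ("temperature", 18.4), "2021-04-29T08:00:05"),
]

def map_phase(recs):
    # Map: emit one <sensor number, (physical quantity, time)> pair per record.
    for sensor, quantity, ts in recs:
        yield sensor, (quantity, ts)

def shuffle(pairs):
    # Shuffle: group the pairs by key, so each key becomes one data
    # stream that can be handed to a compute node as a batch task.
    streams = defaultdict(list)
    for key, value in pairs:
        streams[key].append(value)
    return dict(streams)

streams = shuffle(map_phase(records))
print(sorted(streams))  # ['S1', 'S2']
```

After the shuffle, sensor S1's two readings form one stream and sensor S2's reading forms another, mirroring how the stream data is split into a plurality of data streams.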
5. The offline batch processing method for multi-source multi-modal ocean big data according to claim 1, wherein in S500, the method for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model comprises the following steps:
S501, letting Time1_i be the length of time for the i-th node Node_i to complete a batch processing task; calculating in sequence, for each Node_i, the longest trend time Time1_i with which the node having the smallest capability value and the slowest processing speed among the nodes completes the batch processing task:
(formula for Time1_i, given only as an image in the source)
wherein T_i is the longest processing time among the processed stream data tasks of the i-th node Node_i executing its corresponding Bath_i, Time1_i corresponds to the node with the smallest capability value and the slowest processing speed, the function Min selects the minimum among its arguments, and the function Max selects the maximum among its arguments;
S502, calculating in sequence, for each Node_i, the shortest trend time Time2_i with which the node having the largest capability value and the fastest processing speed completes the batch processing task:
(formula for Time2_i, given only as an image in the source)
wherein Time2_i corresponds to the node with the largest capability value and the fastest processing speed;
S503, when the longest trend time Time1_i of the i-th node Node_i is greater than the delay threshold, denoting that node as the N1-th node Node_{N1}, and denoting the node with the largest capability value and the fastest processing speed corresponding to Time2_i as Node_{N2}; adding the stream data processing task set Bath_{N1} corresponding to Node_{N1} to the stream data processing task set Bath_{N2} corresponding to Node_{N2}, and emptying Bath_{N1}.
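The migration step S503 can be sketched as a single rebalancing pass. This is a simplified illustration: the node records, the capability and trend-time values, and the delay threshold are assumptions for the example.

```python
def rebalance(nodes, delay_threshold):
    # S503 sketch: if the node with the longest trend time Time1 exceeds
    # the delay threshold, move its whole task set Bath to the node with
    # the largest capability value, then empty the slow node's queue.
    slow = max(nodes, key=lambda k: nodes[k]["time1"])
    if nodes[slow]["time1"] > delay_threshold:
        fast = max(nodes, key=lambda k: nodes[k]["ability"])
        if fast != slow:
            nodes[fast]["bath"].extend(nodes[slow]["bath"])  # Bath_N1 -> Bath_N2
            nodes[slow]["bath"].clear()                      # empty Bath_N1
    return nodes

nodes = {
    "n1": {"ability": 0.40, "time1": 9.0, "bath": ["a", "b"]},  # slow Node_N1
    "n2": {"ability": 0.95, "time1": 3.0, "bath": ["c"]},       # fast Node_N2
}
rebalance(nodes, delay_threshold=6.0)
print(nodes["n2"]["bath"])  # ['c', 'a', 'b']
```

Node n1's trend time of 9.0 exceeds the threshold, so its entire queue is handed to n2 and its own queue is emptied.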
6. The method of claim 5, wherein the delay threshold is set to 1.5 times the average value of Time1_i over all nodes Node_i.
7. An offline batch processing system for multi-source multi-modal marine big data, the system comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to operate in the units of the following system:
a data collection unit, used for collecting stream data;
a data normalization unit, used for performing data normalization on the stream data;
a data division unit, used for dividing the stream data for processing;
a model construction unit, used for constructing a scheduling distribution model;
a scheduling processing unit, used for inputting the stream data into computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model;
the method for constructing the scheduling distribution model comprises the following steps:
S401, abbreviating the MapReduce computing nodes as nodes and grouping them into a set Node, wherein Node = {Node_i}, i has a value range of [1, N], and N is the number of nodes; each Node_i in Node has a corresponding set Bath_i = {Bath_{i,j}} of divided stream data processing tasks to be batch processed, wherein j has a value range of [1, M], M is the number of stream data to be batch processed, and Bath_{i,j} is the j-th divided stream data processing task to be batch processed of the i-th node Node_i;
S402, sequentially inputting the 1st to N-th divided stream data tasks Bath_{i,j} to be batch processed into the corresponding nodes Node_i to perform batch processing tasks, and recording as the reference node the node that first completes its batch processing task, that is, the Node_i that first finishes processing its divided stream data task Bath_{i,j};
calculating a reference throughput R of the reference node:
R = (1/P) × Σ_{i1=1}^{P} K1_{i1}
or
R = (1/P) × Σ_{i1=1}^{P} K2_{i1}
wherein K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task; K2_{i1} is the number of tasks of the reference node's i1-th batch processing task; and P is the total number of times the reference node has performed batch processing tasks;
S403, letting Cu_i denote the current total batch task load of Node_i, that is, Cu_i is the average number of threads or processes with which Node_i is currently performing batch processing tasks, or the number of tasks Node_i is currently batch processing; calculating Ab(Cu_i) from the reference throughput R of the reference node and Cu_i, wherein Ab(Cu_i) denotes the capability value of Node_i when its processing task load is Cu_i, and Ab(Cu_i) = exp(-(Cu_i ÷ R - 1)^2);
S404, at a set time interval T, detecting Ab(Cu_i) of each Node_i and calculating the average value ABV of the capability values Ab(Cu_i) of all nodes; when Ab(Cu_i) of the i-th node Node_i is greater than or equal to ABV, adding k newly collected divided stream data processing tasks to be batch processed to the Bath_i corresponding to Node_i: the value of j is increased by k, and the k newly collected tasks are appended in sequence as Bath_{i,j+1} to Bath_{i,j+k} to form a new Bath_i.
CN202110476164.1A 2021-04-29 2021-04-29 Offline batch processing method and system for multi-source multi-mode ocean big data Active CN113268505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110476164.1A CN113268505B (en) 2021-04-29 2021-04-29 Offline batch processing method and system for multi-source multi-mode ocean big data


Publications (2)

Publication Number Publication Date
CN113268505A CN113268505A (en) 2021-08-17
CN113268505B true CN113268505B (en) 2021-11-30

Family

ID=77230023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110476164.1A Active CN113268505B (en) 2021-04-29 2021-04-29 Offline batch processing method and system for multi-source multi-mode ocean big data

Country Status (1)

Country Link
CN (1) CN113268505B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519914A (en) * 2018-04-09 2018-09-11 腾讯科技(深圳)有限公司 Big data computational methods, system and computer equipment
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN110321223A (en) * 2019-07-03 2019-10-11 湖南大学 The data flow division methods and device of Coflow work compound stream scheduling perception
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
US10713257B2 (en) * 2017-09-29 2020-07-14 International Business Machines Corporation Data-centric reduction network for cluster monitoring




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant