CN113268505A - Offline batch processing method and system for multi-source multi-mode ocean big data - Google Patents
- Publication number
- CN113268505A (application CN202110476164.1A)
- Authority
- CN
- China
- Prior art keywords
- node
- data
- processing
- nodes
- batch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
- G06F9/524—Deadlock detection or avoidance
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an offline batch processing method and system for multi-source multi-modal ocean big data, comprising the steps of: collecting stream data; performing data normalization on the stream data; dividing the stream data for processing; constructing a scheduling assignment model; and inputting the stream data into the computing nodes and performing task scheduling on the computing nodes through the scheduling assignment model. The method can quickly detect and isolate faulty nodes under severely skewed data, dynamically schedule and assign new nodes to take over the computing tasks of the faulty nodes, and shorten the processing time; each computing node is intelligently scheduled according to its trend time, avoiding the deadlock that can arise when nodes that frequently fail and restart are repeatedly called after recovery.
Description
Technical Field
The disclosure belongs to the fields of marine big data processing, batch data processing and data transmission, and particularly relates to an offline batch processing method and system for multi-source multi-modal marine big data.
Background
Ocean big data is collected from sensors such as Argo floats, buoys and surveying and mapping equipment, covers submarine topography data, ocean remote sensing data, ship survey data and buoy data, and keeps growing as ocean monitoring equipment develops. Because the data come from different sources with different data structures—multi-source heterogeneous data collected by different acquisition terminals—current big data processing methods usually require offline batch processing of the non-real-time business data when massive data are stored as data sources.
Batch systems that process such business data, also commonly called offline systems, take a large amount of input data, run a job to process it, and produce some output data. A job usually takes a long time to run, and batch jobs are typically run periodically. Offline batch processing of big data tolerates higher latency, but the processed data volume is large and occupies substantial computing and storage resources; it is generally implemented with the Spark or Hadoop framework. For massive data, the Spark or Hadoop framework is often employed to provide bandwidth, memory, storage and other resources when fast response is not required (e.g., minute-scale or hour-scale delays). However, since ocean big data is usually massive multi-modal big data, it is difficult to obtain good results in environments that require fast, timely processing. The Spark and Hadoop frameworks adopt MapReduce, in which a MapReduce job is the unit of work a client needs to execute, comprising the input data, the MapReduce program and configuration information.
Hadoop divides a job into multiple tasks for execution, each split into Map task nodes and Reduce task nodes. In a MapReduce computing cluster, node hardware errors (host, disk, memory, etc.) and software errors are normal. The conventional scheduling method for MapReduce computing nodes (cluster nodes) is as follows: MapReduce achieves reliability by distributing large-scale operations on a data set to each node in the network; each node periodically reports its completed work and latest state, and if a node stays silent longer than a preset time interval, the master node records the node as dead and sends the data allocated to it to other nodes. Although MapReduce can detect and isolate a faulty node and schedule a new node to take over its computing task, under severely skewed data the current node scheduling method easily causes overlong processing times, and even after a node is reactivated, nodes that frequently restart and fail may be repeatedly called, leading to deadlock.
Disclosure of Invention
The invention aims to provide an offline batch processing method and system for multi-source multi-modal marine big data, to solve one or more technical problems in the prior art and at least to provide a beneficial alternative or creation condition.
In order to achieve the above object, according to an aspect of the present disclosure, there is provided an offline batch processing method of multi-source multi-modal marine big data, the method comprising the steps of:
S100, collecting stream data;
Further, the method for collecting the stream data is as follows: a data sequence of physical quantity data acquired by sensor devices such as Argo floats and surveying instruments—covering any one or more of sonar data, wind force, seismic and electromagnetic signals, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed and direction of moving objects—is used as the stream data.
S200, carrying out data normalization on the stream data, wherein the data normalization comprises any one or more of time formatting, field completion, data cleaning, data integration and data reduction;
S300, dividing the stream data for processing by the MapReduce method;
Further, in S300, the method of dividing the stream data for processing is to divide it by the MapReduce method: the key-value pair of the stream data is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic and electromagnetic signals, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed and direction of moving objects; the stream data is divided into a plurality of data streams by the MapReduce algorithm.
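As a concrete illustration of this division step, the following Python sketch keys each stream record by the <sensor number, physical quantity, acquisition time> tuple described above and hashes the sensor number into a fixed number of data streams. The record field names (`sensor_id`, `quantity`, `timestamp`, `value`) and the CRC32 bucketing are illustrative assumptions, not part of the disclosure.

```python
import zlib
from collections import defaultdict

def map_record(record):
    """Emit the <sensor number, physical quantity, acquisition time> key-value
    pair described in S300; the record field names are assumptions."""
    key = (record["sensor_id"], record["quantity"], record["timestamp"])
    return key, record["value"]

def partition(records, n_streams):
    """Divide the keyed stream data into n_streams data streams by hashing the
    sensor number (a stable CRC32 hash keeps the split deterministic)."""
    streams = defaultdict(list)
    for rec in records:
        key, value = map_record(rec)
        bucket = zlib.crc32(str(key[0]).encode()) % n_streams
        streams[bucket].append((key, value))
    return streams

records = [
    {"sensor_id": "argo-01", "quantity": "temperature", "timestamp": 1, "value": 12.5},
    {"sensor_id": "buoy-07", "quantity": "pressure", "timestamp": 2, "value": 101.3},
]
streams = partition(records, n_streams=2)
```

Hashing by sensor number keeps all records of one sensor in the same data stream, which is one natural way to realize the per-key grouping that MapReduce performs.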
S400, constructing a scheduling distribution model;
Further, in S400, the method for constructing the scheduling assignment model includes the following steps:
S401, MapReduce computing nodes (comprising Map nodes or Reduce nodes, built on the constructed Map/Reduce framework and the Hadoop distributed file system) are referred to simply as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of computing nodes. Each Node_i in Node has a corresponding set of divided stream data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M], M is the number of stream data tasks to be batched, and Bath_i,j is the j-th divided stream data processing task to be batch processed on the i-th node Node_i, with M > N;
The corresponding stream data processing tasks to be batch processed after being divided include, but are not limited to: any one or more of data compression, clustering, sampling, dimensionality reduction, and data transformation;
S402, the 1st to N-th divided stream data tasks Bath_i,j in Bath_i are input in turn to the corresponding nodes Node_i for batch processing (i.e., Bath_i,1 to Bath_i,N are input in turn to each Node_i); the node that first completes its batch processing task is recorded as the reference node, i.e., the Node_i that first completes its divided stream data task Bath_i,j;
S403, calculate the reference throughput R of the reference node, R = (Σ_{i1=1}^{P} K1_{i1}) / P or R = (Σ_{i1=1}^{P} K2_{i1}) / P, where K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task, K2_{i1} is the number of tasks in the reference node's i1-th batch processing task, and P is the total number of batch processing tasks completed by the reference node;
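Since the formula image for R is absent from the source, the averaging form below is an assumption consistent with the surrounding definitions of K1_{i1}, K2_{i1} and P — a minimal sketch rather than the definitive computation:

```python
def reference_throughput(batch_loads):
    """Reference throughput R of the reference node: the mean of K1_{i1}
    (average threads/processes per batch) over its P completed batches.
    Using K2_{i1} (task count per batch) instead works the same way."""
    P = len(batch_loads)
    return sum(batch_loads) / P

# three completed batches (P = 3) with average thread counts K1
R = reference_throughput([4, 6, 5])
```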
S404, let Cu_i denote the current total batch task amount of Node_i, i.e., Cu_i is the average number of threads or processes with which Node_i is currently performing batch processing tasks, or the number of tasks Node_i is currently batch processing; from the reference throughput R of the reference node and Cu_i, compute Ab(Cu_i), the capability value of Node_i when its processing task amount is Cu_i, where Ab(Cu_i) = exp(-(Cu_i ÷ R - 1)^2);
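The capability value can be implemented directly from the formula above; this sketch shows that Ab(Cu_i) equals 1.0 when the node's current load matches the reference throughput R and decays toward 0 as the node becomes under- or over-loaded:

```python
import math

def ability(cu, R):
    """Capability value Ab(Cu_i) = exp(-(Cu_i / R - 1)^2): peaks at 1.0 when
    the current load Cu_i equals the reference throughput R, and decays
    symmetrically as the node is under- or over-loaded."""
    return math.exp(-(cu / R - 1.0) ** 2)

ab_balanced = ability(cu=5, R=5.0)    # load equals reference throughput
ab_overload = ability(cu=10, R=5.0)   # double the reference load
```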
S405, at every set time interval T, detect Ab(Cu_i) of each Node_i and calculate the average value ABV of the capability values Ab(Cu_i) of all nodes; when Ab(Cu_i) of the i-th node Node_i is greater than or equal to ABV, add the newly acquired k divided stream data processing tasks to be batch processed into the Bath_i corresponding to Node_i: the value of j is increased by k, and the k new tasks are appended in turn as Bath_i,j+1 to Bath_i,j+k, forming a new Bath_i. The set time interval T is typically set within [500, 2000] milliseconds.
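The S405 replenishment step might look like the following sketch, which computes the average capability ABV and appends the k new tasks to the Bath of a node whose capability meets the average. Assigning to the first qualifying node is one reading of the step, and the node dictionary layout (`ab`, `bath`) is assumed for illustration.

```python
def replenish(nodes, new_tasks):
    """S405 sketch: compute the average capability ABV across all nodes and
    append the newly acquired tasks (Bath_{i,j+1} .. Bath_{i,j+k}) to the
    Bath of the first node whose Ab(Cu_i) >= ABV. `nodes` maps a node id to
    a dict with keys "ab" (capability value) and "bath" (pending tasks)."""
    abv = sum(n["ab"] for n in nodes.values()) / len(nodes)
    for node in nodes.values():
        if node["ab"] >= abv:
            node["bath"].extend(new_tasks)
            return

nodes = {
    1: {"ab": 0.9, "bath": ["t1"]},   # close to the reference load
    2: {"ab": 0.5, "bath": []},       # below the average capability
}
replenish(nodes, ["t2", "t3"])        # ABV = 0.7; node 1 qualifies
```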
S500, inputting the stream data into the computing nodes, and performing task scheduling on the computing nodes through the scheduling assignment model.
Further, in S500, the method for inputting stream data into a computing node and performing task scheduling processing on the computing node through a scheduling assignment model includes the following steps:
In order to avoid node deadlock caused by severe data skew, each Map node or Reduce node is intelligently scheduled according to the trend times pre-computed as follows:
S501, let Time1_i be the length of time for the i-th node Node_i to complete a batch processing task; compute in turn, over the nodes Node_i, the longest trend time Time1_i for the node with the smallest capability value and the slowest processing speed to complete its batch processing task, where T_i is the longest processing time among the stream data tasks in Bath_i executed by the i-th node Node_i, and Time1_i corresponds to the node with the smallest capability value and the slowest processing speed. The function Min selects the minimum among its values and the function Max selects the maximum; for example, Max(T_i) means selecting the maximum among the longest processing times obtained by all nodes Node_i executing their corresponding Bath_i, Min(T_i) means selecting the minimum among those longest processing times, Max(Ab(Cu_i)) denotes the maximum capability value among the nodes, and Min(Ab(Cu_i)) denotes the minimum capability value among the nodes;
S502, compute in turn, over the nodes Node_i, the shortest trend time Time2_i for the node with the largest capability value and the fastest processing speed to complete its batch processing task, where Time2_i corresponds to the node with the largest capability value and the fastest processing speed;
S503, when the longest trend time Time1_i of the i-th node Node_i is greater than the delay threshold, denote that node as the N1-th node Node_N1, and denote the node with the largest capability value and the fastest processing speed corresponding to Time2_i as Node_N2; add the stream data processing task set Bath_N1 corresponding to Node_N1 into the stream data processing task set Bath_N2 corresponding to Node_N2, and empty Bath_N1. The delay threshold is set to the average time for all nodes Node_i to complete their batch processing tasks, or to 1.5 times Time1_i.
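The S503 takeover can be sketched as follows: any node whose longest trend time exceeds the delay threshold hands its entire task set to the node with the largest capability value, and its own Bath is emptied. The dictionary layout (`time1`, `ab`, `bath`) is assumed for illustration.

```python
def reassign_on_delay(nodes, delay_threshold):
    """S503 sketch: when a node's longest trend time Time1 exceeds the delay
    threshold, append its whole task set Bath_N1 to the Bath_N2 of the node
    with the largest capability value, then empty Bath_N1."""
    n2 = max(nodes, key=lambda k: nodes[k]["ab"])   # Node_N2: fastest node
    for n1, node in nodes.items():
        if n1 != n2 and node["time1"] > delay_threshold:
            nodes[n2]["bath"].extend(node["bath"])  # take over the tasks
            node["bath"].clear()                    # empty Bath_N1

nodes = {
    "n1": {"time1": 10.0, "ab": 0.2, "bath": ["a", "b"]},  # lagging node
    "n2": {"time1": 2.0, "ab": 0.95, "bath": ["c"]},       # fastest node
}
reassign_on_delay(nodes, delay_threshold=5.0)
```

Because the lagging node's queue is emptied after the handover, the same tasks cannot be re-dispatched to it on the next scheduling pass, which is what breaks the restart/re-call cycle the disclosure describes.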
The invention also provides an offline batch processing system for multi-source multi-modal ocean big data, which comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run the units of the following system:
the data acquisition unit is used for collecting stream data;
the data normalization unit is used for performing data normalization on the stream data;
the data dividing unit is used for dividing the stream data for processing;
the model building unit is used for building a scheduling assignment model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling on the computing nodes through the scheduling assignment model.
The beneficial effects of this disclosure are: the invention provides an offline batch processing method and system for multi-source multi-modal marine big data, which can quickly detect and isolate faulty nodes under severely skewed data, dynamically schedule and assign new nodes to take over the computing tasks of the faulty nodes, shorten the processing time, and intelligently schedule each computing node according to its trend time, avoiding the deadlock that can arise when nodes that frequently fail and restart are repeatedly called after recovery.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the detailed description of the embodiments given in conjunction with the drawings, in which like reference characters designate the same or similar elements throughout the several views. It is apparent that the drawings in the following description are merely some examples of the present disclosure, and that other drawings may be derived from them by those skilled in the art without inventive effort. In the drawings:
FIG. 1 is a flow chart of a method for offline batch processing of multi-source multi-modal marine big data;
FIG. 2 is a block diagram of an offline batch processing system for multi-source multi-modal marine big data.
Detailed Description
The conception, specific structure and technical effects of the present disclosure will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, aspects and effects of the present disclosure. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a flowchart of an offline batch processing method of multi-source multi-modal marine big data, and the offline batch processing method of multi-source multi-modal marine big data according to an embodiment of the present invention is described below with reference to fig. 1, where the method includes the following steps:
S100, collecting stream data;
Further, the method for collecting the stream data is as follows: a data sequence of physical quantity data acquired by sensor devices such as Argo floats and surveying instruments—covering any one or more of sonar data, wind force, seismic and electromagnetic signals, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed and direction of moving objects—is used as the stream data.
S200, carrying out data normalization on the stream data, wherein the data normalization comprises any one or more of time formatting, field completion, data cleaning, data integration and data reduction;
S300, dividing the stream data for processing by the MapReduce method;
Further, in S300, the key-value pair of the stream data divided by the MapReduce method is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic and electromagnetic signals, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed and direction of moving objects; the stream data is divided into a plurality of data streams by the MapReduce algorithm.
S400, constructing a scheduling distribution model;
further, in S400, the method for constructing the scheduling assignment model includes the following steps:
S401, MapReduce computing nodes (comprising Map nodes or Reduce nodes, built on the constructed Map/Reduce framework and the Hadoop distributed file system) are referred to simply as nodes; the set of nodes is Node = {Node_i}, where i ranges over [1, N] and N is the number of computing nodes. Each Node_i in Node has a corresponding set of divided stream data processing tasks to be batch processed, Bath_i = {Bath_i,j}, where j ranges over [1, M], M is the number of stream data tasks to be batched, and Bath_i,j is the j-th divided stream data processing task to be batch processed on the i-th node Node_i, with M > N;
The corresponding stream data processing tasks to be batch processed after being divided include, but are not limited to: any one or more of data compression, clustering, sampling, dimensionality reduction, and data transformation;
S402, the 1st to N-th divided stream data tasks Bath_i,j in Bath_i are input in turn to the corresponding nodes Node_i for batch processing (i.e., Bath_i,1 to Bath_i,N are input in turn to each Node_i); the node that first completes its batch processing task is recorded as the reference node;
S403, calculate the reference throughput R of the reference node, R = (Σ_{i1=1}^{P} K1_{i1}) / P or R = (Σ_{i1=1}^{P} K2_{i1}) / P, where K1_{i1} is the average number of threads or processes of the reference node's i1-th batch processing task, K2_{i1} is the number of tasks in the reference node's i1-th batch processing task, and P is the total number of batch processing tasks completed by the reference node;
S404, let Cu_i denote the current total batch task amount of Node_i, i.e., Cu_i is the average number of threads or processes with which Node_i is currently performing batch processing tasks, or the number of tasks Node_i is currently batch processing; from the reference throughput R of the reference node and Cu_i, compute Ab(Cu_i), the capability value of Node_i when its processing task amount is Cu_i, where Ab(Cu_i) = exp(-(Cu_i ÷ R - 1)^2);
S405, at every set time interval T, detect Ab(Cu_i) of each Node_i and calculate the average value ABV of the capability values Ab(Cu_i) of all nodes; when Ab(Cu_i) of the i-th node Node_i is greater than or equal to ABV, add the newly acquired k divided stream data processing tasks to be batch processed into the Bath_i corresponding to Node_i: the value of j is increased by k, and the k new tasks are appended in turn as Bath_i,j+1 to Bath_i,j+k, forming a new Bath_i. The set time interval T is typically set within [500, 2000] milliseconds;
S500, performing task scheduling on the MapReduce computing nodes through the scheduling assignment model;
further, in S500, the method for performing task scheduling processing on the MapReduce computing node through the scheduling assignment model includes the following steps:
In order to avoid node deadlock caused by severe data skew, each Map node or Reduce node is intelligently scheduled according to the trend times pre-computed as follows:
S501, let Time1_i be the length of time for the i-th node Node_i to complete a batch processing task; compute in turn, over the nodes Node_i, the longest trend time Time1_i for the node with the smallest capability value and the slowest processing speed to complete its batch processing task, where T_i is the longest processing time among the stream data tasks in Bath_i executed by the i-th node Node_i, and Time1_i corresponds to the node with the smallest capability value and the slowest processing speed. The function Min selects the minimum among its values and the function Max selects the maximum; for example, Max(T_i) means selecting the maximum among the longest processing times obtained by all nodes Node_i executing their corresponding Bath_i, Min(T_i) means selecting the minimum among those longest processing times, Max(Ab(Cu_i)) denotes the maximum capability value among the nodes, and Min(Ab(Cu_i)) denotes the minimum capability value among the nodes;
S502, compute in turn, over the nodes Node_i, the shortest trend time Time2_i for the node with the largest capability value and the fastest processing speed to complete its batch processing task, where Time2_i corresponds to the node with the largest capability value and the fastest processing speed;
S503, when the longest trend time Time1_i of the i-th node Node_i is greater than the delay threshold, denote that node as the N1-th node Node_N1, and denote the node with the largest capability value and the fastest processing speed corresponding to Time2_i as Node_N2; add the stream data processing task set Bath_N1 corresponding to Node_N1 into the stream data processing task set Bath_N2 corresponding to Node_N2, and empty Bath_N1. The delay threshold is set to the average time for all nodes Node_i to complete their batch processing tasks, or to 1.5 times Time1_i.
An embodiment of the present disclosure provides an offline batch processing system for multi-source multi-modal marine big data; FIG. 2 shows its structure. The offline batch processing system of this embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to implement the steps of the above embodiment of the offline batch processing method for multi-source multi-modal marine big data.
The system comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to run in the units of the following system:
the data acquisition unit is used for collecting stream data;
the data normalization unit is used for performing data normalization on the stream data;
the data dividing unit is used for dividing the stream data for processing;
the model building unit is used for building a scheduling assignment model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling on the computing nodes through the scheduling assignment model.
The offline batch processing system for multi-source multi-modal ocean big data can run on computing equipment such as desktop computers, notebook computers, palmtop computers and cloud servers. The system can be run on equipment including, but not limited to, a processor and a memory. Those skilled in the art will appreciate that this example is merely illustrative of an offline batch processing system for multi-source multi-modal marine big data and does not limit it; the system may include more or fewer components than shown, combine certain components, or use different components—for example, it may further include input and output devices, network access devices, buses, and the like.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the system running the offline batch processing system for multi-source multi-modal marine big data, and connects all parts of the whole system using various interfaces and lines.
The memory can be used to store the computer programs and/or modules, and the processor realizes the various functions of the multi-source multi-modal marine big data offline batch processing system by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
Although the description of the present disclosure has been rather exhaustive and particularly described with respect to several illustrated embodiments, it is not intended to be limited to any such details or embodiments or any particular embodiments, so as to effectively encompass the intended scope of the present disclosure. Furthermore, the foregoing describes the disclosure in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial modifications of the disclosure, not presently foreseen, may nonetheless represent equivalent modifications thereto.
Claims (8)
1. An off-line batch processing method for multi-source multi-modal marine big data is characterized by comprising the following steps:
S100, collecting stream data;
S200, performing data normalization on the stream data;
S300, dividing the stream data for processing;
S400, constructing a scheduling assignment model;
and S500, inputting the stream data into the computing nodes, and performing task scheduling on the computing nodes through the scheduling assignment model.
2. The offline batch processing method for multi-source multi-modal marine big data according to claim 1, wherein the method for collecting the stream data comprises: using as the stream data a data sequence of physical quantity data acquired by sensor devices such as Argo floats and surveying instruments, covering any one or more of sonar data, wind force, seismic and electromagnetic signals, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed and direction of moving objects.
3. The method of claim 1, wherein the data normalization includes any one or more of time formatting, field completion, data cleaning, data integration, and data reduction.
4. The offline batch processing method for multi-source multi-modal ocean big data according to claim 1, wherein in S300, the method of dividing the stream data for processing is to divide it by the MapReduce method: the key-value pair of the stream data is <sensor number, physical quantity, acquisition time>, where the physical quantity is any one or more of sonar data, wind force, seismic and electromagnetic signals, temperature, humidity, noise, light intensity, pressure, water quality composition, and the size, speed and direction of moving objects; the stream data is divided into a plurality of data streams by the MapReduce algorithm.
5. The offline batch processing method for multi-source multi-modal marine big data according to claim 1, wherein in S400, the method for constructing the scheduling assignment model comprises the following steps:
s401, calculating nodes of MapReduce, namely simply called nodes, and collecting the nodes as nodes, wherein the Node is { Node ═ Node }iI has a value range of [1, N ]]N is the number of nodes, each Node in the NodeiAll have a corresponding stream data processing task set Bath to be processed in batch after being dividedi={Bathi,jJ has the value range of [1, M ]]M is the number of stream data to be batched, Bathi,jFor the ith NodeiThe jth stream data processing task to be batch processed after the segmentation;
s402, sequentially mixing the BathiThe 1 st to the N th streaming data tasks to be processed in batches after being dividedi,jRespectively correspondingly input to each NodeiThe method comprises the steps of carrying out batch processing tasks, recording a node which finishes the batch processing tasks for the first time in the nodes as a reference node, and recording a node which finishes the batch processing tasks for the first time in the nodes as a streaming data task Bath which finishes the batch processing after being divided for the first timei,jNode ofi;
S402, calculating the reference processing amount R of the reference node,orWherein, K1i1The average number of threads or processes that batch process the task for the ith 1 time of reference node; k2i1The number of tasks for which the reference node performed batch processing tasks at the i1 th time; p is processed as reference node in batchThe total number of management tasks;
s403, making CuiRepresenting a NodeiCurrent batch task total of (1), i.e. CuiNode for NodeiNumber of average threads or processes currently performing batch processing tasks or NodeiThe number of tasks currently performing batch processing tasks is determined according to the reference throughput R and Cu of the reference nodesiAb (Cu) was obtained by calculationi),Ab(Cui) Representing a NodeiThe amount of processing tasks is CuiAbility value of (i) in which Ab (Cu)i)=exp(Cui÷R-1)2);
S404, detecting each Node at set time interval TiAb (Cu) ofi) Calculating the capability values Ab (Cu) of all nodesi) When the ith Node is Node, the average value ABV ofiAb (Cu) ofi) When the number of the acquired flow data processing tasks is larger than or equal to the ABV, adding the newly acquired k flow data processing tasks to be processed in batch after being divided into the nodes NodeiCorresponding BathiIn the method, the value of j is increased by k, and newly acquired k divided stream data processing tasks to be processed in batch are sequentially supplemented and added into Bathi,j+1To Bathi,j+kForm a new Bathi。
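A sketch, under assumptions, of the ability-value step: it takes the formula Ab(Cu_i) = exp((Cu_i ÷ R − 1)^2) as stated, represents each node's pending task set as a plain list, and assigns new tasks to the first node whose ability value reaches the mean ABV. Function names and data layout are hypothetical:

```python
import math

def ability(cu, r):
    # Ab(Cu_i) = exp((Cu_i / R - 1)^2): equals 1 when the node's current
    # load Cu_i matches the reference throughput R, and grows as it deviates
    return math.exp((cu / r - 1.0) ** 2)

def assign_new_tasks(baths, loads, r, new_tasks):
    # baths[i]: Bath_i, the pending task list of Node_i (hypothetical layout)
    # loads[i]: Cu_i, the current batch task amount of Node_i
    abv = sum(ability(cu, r) for cu in loads) / len(loads)  # mean ability value
    for i, cu in enumerate(loads):
        if ability(cu, r) >= abv:
            baths[i].extend(new_tasks)  # append Bath_{i,j+1} .. Bath_{i,j+k}
            break
    return baths
```

Note the formula penalizes deviation from R in both directions, so "ability value" here measures distance from the reference load rather than raw speed.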
6. The offline batch processing method for multi-source multi-modal ocean big data according to claim 1, wherein in S500, the method for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model comprises the following steps:
S501, letting Time1_i be the length of time the i-th node Node_i takes to complete a batch processing task; calculating in turn, for each node Node_i, the longest trend time Time1_i with which the node having the smallest ability value and the slowest processing speed among the nodes completes its batch processing task, wherein T_i is the maximum processing time among the completed stream data tasks in the Bath_i executed by the i-th node Node_i, Time1_i corresponds to the node with the smallest ability value and the slowest processing speed, the function Min selects the minimum of its arguments, and the function Max selects the maximum of its arguments;
S502, calculating in turn, for each node Node_i, the shortest trend time Time2_i with which the node having the largest ability value and the fastest processing speed completes its batch processing task, wherein Time2_i corresponds to the node with the largest ability value and the fastest processing speed;
S503, when the longest trend time Time1_i of the i-th node Node_i is greater than the delay threshold, denoting that node as the N1-th node Node_N1, and denoting the node with the largest ability value and the fastest processing speed corresponding to Time2_i as Node_N2; adding the stream data processing task set Bath_N1 corresponding to Node_N1 into the stream data processing task set Bath_N2 corresponding to Node_N2, and emptying Bath_N1.
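A minimal sketch of the S503 rebalancing rule, assuming per-node trend times are already known; task sets are plain lists and the function name is hypothetical:

```python
def rebalance(baths, trend_times, delay_threshold):
    """Move the task set of an over-delayed node to the fastest node.

    baths[i]       -- Bath_i, the pending task list of Node_i (hypothetical)
    trend_times[i] -- Time1_i, the longest trend completion time of Node_i
    """
    n1 = max(range(len(trend_times)), key=lambda i: trend_times[i])  # slowest node
    n2 = min(range(len(trend_times)), key=lambda i: trend_times[i])  # fastest node
    if trend_times[n1] > delay_threshold and n1 != n2:
        baths[n2].extend(baths[n1])  # add Bath_N1 into Bath_N2
        baths[n1].clear()            # empty Bath_N1
    return baths
```

If no node exceeds the delay threshold, the assignment is left unchanged.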
7. The method of claim 1, wherein the delay threshold is set to 1.5 times the Time1_i of all the nodes Node_i.
8. An offline batch processing system for multi-source multi-modal ocean big data, the system comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, runs the units of the following system:
the data acquisition unit is used for collecting stream data;
the data normalization unit is used for performing data normalization on the stream data;
the data dividing unit is used for splitting and processing the stream data;
the model building unit is used for building a scheduling distribution model;
and the scheduling processing unit is used for inputting the stream data into the computing nodes and performing task scheduling processing on the computing nodes through the scheduling distribution model.
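The five units can be sketched as a single pipeline class; all method bodies below are placeholders standing in for the claimed processing steps, and the class/method names are assumptions for illustration:

```python
class OfflineBatchSystem:
    """Sketch of the five claimed units wired in order:
    acquisition -> normalization -> dividing -> model building -> scheduling."""

    def collect(self):               # data acquisition unit
        return [(1, "temperature", 100, 12.5)]

    def normalize(self, data):       # data normalization unit
        return data                  # e.g. time formatting, field completion

    def split(self, data):           # data dividing unit
        # Group records into streams keyed by (sensor number, physical quantity)
        return {(r[0], r[1]): [r] for r in data}

    def build_model(self, streams):  # model building unit
        return {"streams": streams}  # placeholder for the scheduling model

    def schedule(self, model):       # scheduling processing unit
        return model["streams"]

    def run(self):
        return self.schedule(self.build_model(
            self.split(self.normalize(self.collect()))))
```

The point of the sketch is only the unit boundaries and data flow, not the internal algorithms of the claims.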
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110476164.1A CN113268505B (en) | 2021-04-29 | 2021-04-29 | Offline batch processing method and system for multi-source multi-mode ocean big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113268505A true CN113268505A (en) | 2021-08-17 |
CN113268505B CN113268505B (en) | 2021-11-30 |
Family
ID=77230023
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110476164.1A Active CN113268505B (en) | 2021-04-29 | 2021-04-29 | Offline batch processing method and system for multi-source multi-mode ocean big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113268505B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108519914A (en) * | 2018-04-09 | 2018-09-11 | 腾讯科技(深圳)有限公司 | Big data computational methods, system and computer equipment |
CN110119421A (en) * | 2019-04-03 | 2019-08-13 | 昆明理工大学 | A kind of electric power stealing user identification method based on Spark flow sorter |
CN110321223A (en) * | 2019-07-03 | 2019-10-11 | 湖南大学 | The data flow division methods and device of Coflow work compound stream scheduling perception |
CN111259933A (en) * | 2020-01-09 | 2020-06-09 | 中国科学院计算技术研究所 | High-dimensional feature data classification method and system based on distributed parallel decision tree |
US10713257B2 (en) * | 2017-09-29 | 2020-07-14 | International Business Machines Corporation | Data-centric reduction network for cluster monitoring |
Also Published As
Publication number | Publication date |
---|---|
CN113268505B (en) | 2021-11-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134636B (en) | Model training method, server, and computer-readable storage medium | |
US11755452B2 (en) | Log data collection method based on log data generated by container in application container environment, log data collection device, storage medium, and log data collection system | |
CN111104459A (en) | Storage device, distributed storage system, and data processing method | |
CN114356587B (en) | Calculation power task cross-region scheduling method, system and equipment | |
CN107977167B (en) | Erasure code based degeneration reading optimization method for distributed storage system | |
US20210390405A1 (en) | Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof | |
US20190377655A1 (en) | Two-stage distributed estimation system | |
CN108776589B (en) | Deployment method of radar signal processing software component | |
Moaddeli et al. | The power of d choices in scheduling for data centers with heterogeneous servers | |
CN112579287B (en) | Cloud arrangement system and method based on read-write separation and automatic expansion | |
CN110764705A (en) | Data reading and writing method, device, equipment and storage medium | |
CN114780244A (en) | Container cloud resource elastic allocation method and device, computer equipment and medium | |
CN110232073A (en) | A kind of Data Management Analysis system and method | |
JP2021121921A (en) | Method and apparatus for management of artificial intelligence development platform, and medium | |
CN118260053B (en) | Memory scheduling method of heterogeneous computing system, heterogeneous computing system and device | |
CN111858656A (en) | Static data query method and device based on distributed architecture | |
CN113268505B (en) | Offline batch processing method and system for multi-source multi-mode ocean big data | |
CN112905596B (en) | Data processing method, device, computer equipment and storage medium | |
Amannejad et al. | Fast and lightweight execution time predictions for spark applications | |
CN116701001B (en) | Target task allocation method and device, electronic equipment and storage medium | |
CN112883110A (en) | Terminal big data distribution method, storage medium and system based on NIFI | |
TWI778924B (en) | Method for big data retrieval and system thereof | |
CN110209631A (en) | Big data processing method and its processing system | |
CN110764886B (en) | Batch job cooperative scheduling method and system supporting multi-partition processing | |
CN110415162B (en) | Adaptive graph partitioning method facing heterogeneous fusion processor in big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||