CN104391929A - Transmission method of data stream in ETL - Google Patents
Transmission method of data stream in ETL
- Publication number
- CN104391929A (application CN201410671540.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to a method for transmitting a data stream in ETL, comprising the following steps. Step one: determine the data outflow mode of each link according to the defined ETL process. Step two: determine the number of execution threads of each link. Step three: determine the maximum capacity of the input and output queues of each link. Step four: determine the number of data records that each link in the ETL process takes from or puts into a queue at one time. Step five: complete the initialization of the ETL process and the creation and initialization of the ETL links. Step six: all links in the ETL process start executing in parallel, and data begins to flow normally. Step seven: after each link of the ETL process has finished processing its data, the links stop executing in turn and the whole process ends. By splitting large data, using in-memory queues between links as data buffers, and processing the links in parallel, the invention allows a large amount of data to flow efficiently among the links of the ETL process.
Description
Technical field
The present invention relates to the field of data integration, and in particular to a method for transmitting a data stream in ETL (Extract-Transform-Load, i.e. data extraction, transformation, and loading).
Background technology
With the development of science and technology, every industry is becoming more information-driven, and data volumes in every industry are growing toward massive scale. In the field of data integration, data volumes and performance requirements keep rising, and the demands placed on data integration tools rise with them. Common data integration tools mainly share data between steps through a database or through memory. Because the data is not split and each step executes sequentially in a single thread, memory becomes an obvious bottleneck when handling massive data, and the CPU resources of existing servers are not fully used. This wastes system resources and lowers data integration performance.
Therefore, in view of the problems in the prior art, there is a need for a scheme that makes full use of system resources to transmit massive data efficiently between the steps of an ETL process, so that the extraction, transformation, and loading of massive data can be completed efficiently, saving system resources and improving data integration performance.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a method for transmitting a data stream in ETL that makes full use of system resources to transmit massive data efficiently between the steps of an ETL process, so that the extraction, transformation, and loading of massive data can be completed efficiently, saving system resources and improving data integration performance.
To achieve the above object, the technical solution of the present invention is as follows:
A method for transmitting a data stream in ETL comprises the following steps:
Step one: according to the defined ETL process, determine the data outflow mode of each link;
Step two: according to the defined ETL process, determine the number of execution threads of each link;
Step three: according to the defined ETL process, determine the maximum capacity of the input and output queues of each link;
Step four: determine the number of data records that each link in the ETL process takes from or puts into a queue at one time;
Step five: complete the initialization of the ETL process and the creation and initialization of the ETL links;
Step six: the links in the ETL process start executing in parallel, and data begins to flow normally;
Step seven: after each link of the ETL process has finished processing its data, the links stop executing in turn and the whole process ends.
Further, in step one, the data outflow mode of each link is set to copy or distribute according to the number of adjacent links and the amount of data each link needs to process. In copy mode, the data of the link is duplicated according to the number of its direct successor links, and one copy is put into the input queue of each successor. In distribute mode, all output data of the link is put into the input queues of the successor links in rotation, one record per queue in turn.
Further, in step three, the maximum queue capacity is set reasonably according to the processing performance of the link and the speed at which the next link processes data.
Further, step five specifically comprises the following steps:
According to the overall process definition, complete the initialization of the process and create all the queues of the process;
According to the definitions made when the ETL process was designed, create the entity of each link, which mainly carries the configuration of that link;
According to the process definition, create the executor corresponding to each link, which performs the actual execution of that link;
Assign the generated queues, according to rules, to the generated executors of the corresponding links.
Further, in step six, each link uses its own in-memory queues as the intermediate storage carrier for data, so the links do not interfere with one another; through the queue mechanism, massive data flows between the links in split form.
The method for transmitting a data stream in ETL of the present invention makes full use of system resources: by splitting large data, using in-memory queues between links as data buffers, and processing the links in parallel, it enables massive data to flow efficiently between the links of the ETL process.
The terms "first", "second", and the like in the description, the claims, and the accompanying drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the way in which objects with the same attributes are distinguished in the description of the embodiments of the invention. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to the process, method, product, or device.
A detailed description follows.
A method for transmitting a data stream in ETL according to the present invention comprises the following steps:
Step one: according to the defined ETL process, determine the data outflow mode of each link;
In step one, the data outflow mode of each link is set to copy or distribute according to the number of adjacent links and the amount of data each link needs to process. In copy mode, the data of the link is duplicated according to the number of its direct successor links, and one copy is put into the input queue of each successor. In distribute mode, all output data of the link is put into the input queues of the successor links in rotation, one record per queue in turn.
Step two: according to the defined ETL process, determine the number of execution threads of each link;
In step two, the number of threads of each link is set reasonably according to the available system resources and the processing capacity of the adjacent links in the ETL process.
Step three: according to the defined ETL process, determine the maximum capacity of the input and output queues of each link;
In step three, the maximum queue capacity is set reasonably according to the processing performance of the link and the speed at which the next link processes data. An unsuitable setting causes the queues to occupy too much memory and degrades the performance of the whole system; a reasonable setting makes full use of system resources and improves the performance of the whole process.
Step four: determine the number of data records that each link in the ETL process takes from or puts into a queue at one time;
In step four, the number is set flexibly to handle cases in which a link can only start processing after the previous link has finished processing everything. For example, in a transformation link, some aggregation scenarios must wait until all data has arrived before the transformation can start. In other cases, data is processed according to the data volume and the actual processing capacity of the link.
Step five: complete the initialization of the ETL process and the creation and initialization of the ETL links;
Step five specifically comprises the following steps:
According to the overall process definition, complete the initialization of the process and create all the queues of the process;
According to the definitions made when the ETL process was designed, create the entity of each link, which mainly carries the configuration of that link;
According to the process definition, create the executor corresponding to each link, which performs the actual execution of that link;
Assign the generated queues, according to rules, to the generated executors of the corresponding links.
Step six: the links in the ETL process start executing in parallel, and data begins to flow normally;
In step six, all the executors start at the same time and all execute in parallel with multiple threads; depending on the configuration, some executors run as multiple instances, each with multiple threads. This takes full advantage of the multiple CPUs of current computers and comprehensively improves the vertical scalability of the system;
In step six, each link uses its own in-memory queues as the intermediate storage carrier for data, so the links do not interfere with one another. Through the queue mechanism, massive data flows efficiently between the links in split form, which ensures that each link can handle large data volumes.
Step seven: after each link of the ETL process has finished processing its data, the links stop executing in turn and the whole process ends.
In step seven, each link decides whether to end execution by jointly considering whether the previous link has finished executing and whether its own input queue has remained empty for a specified time;
After a link ends, its execution threads are terminated; once all links have finished executing, one data processing run is complete.
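The end-of-execution test in step seven can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `run_link`, `upstream_done`, and `idle_timeout` are hypothetical, and the sketch assumes one consumer thread per queue and an upstream link that sets its flag only after its last `put`.

```python
import queue
import threading


def run_link(in_q, upstream_done, handle, idle_timeout=0.05):
    """Consume in_q until the upstream link is done and the queue stays empty.

    upstream_done is a threading.Event (an assumption of this sketch) that the
    previous link sets after putting its final record.
    """
    while True:
        try:
            item = in_q.get(timeout=idle_timeout)
        except queue.Empty:
            # The input queue stayed empty for the whole waiting period.
            if not upstream_done.is_set():
                continue  # upstream is still running, keep listening
            # The flag is set only after the last put, so one non-blocking
            # drain picks up anything that arrived just before the flag.
            while True:
                try:
                    item = in_q.get_nowait()
                except queue.Empty:
                    return
                handle(item)
        else:
            handle(item)
```

The final drain matters because a record can arrive between the timeout and the check of the flag; without it, that record would be lost.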
In an embodiment of the present invention, when a link has several different successor links and the system has ample memory and CPU resources, the data outflow mode of the preceding link is set to copy and each successor link is configured with multiple threads that execute simultaneously. This ensures that each link can process the data in one input queue with multiple instances at the same time. When a link takes a long time to execute, or executes on a remote server, several identical instances of that link can be created in the process and the data outflow mode of the preceding link set to distribute. This both ensures the correctness of the data and makes fuller use of system resources.
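The two outflow modes can be illustrated with a short sketch. The function names `make_copy_outflow` and `make_distribute_outflow` are hypothetical, and the sketch assumes a single emitting thread per link; a multi-threaded link would need a lock around the rotation.

```python
import copy
import itertools
import queue


def make_copy_outflow(successor_queues):
    """Copy mode: every successor link receives its own copy of each record."""
    def emit(record):
        for q in successor_queues:
            # Deep copy so successors cannot modify each other's data.
            q.put(copy.deepcopy(record))
    return emit


def make_distribute_outflow(successor_queues):
    """Distribute mode: records go to the successor queues in rotation."""
    ring = itertools.cycle(successor_queues)

    def emit(record):
        # One record per queue in turn (round robin).
        next(ring).put(record)
    return emit
```

With three successor queues, distribute mode sends records 0 and 3 to the first queue, 1 and 4 to the second, and 2 and 5 to the third, while copy mode sends every record to all of them.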
More threads are not always better: for different links and system resources, the number of threads of each link can be tuned by testing in the target environment. For two adjacent links, the output queue of the preceding link is the input queue of the following link. When the preceding link processes quickly, it fills that queue with a large amount of data very fast; if the following link processes more slowly, a large amount of data accumulates in the in-memory queue and memory usage grows. The maximum capacity of the input and output queues therefore needs to be set according to the processing capacity of each link in order to control memory usage.
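The effect of a capacity limit can be shown with a bounded queue between a fast and a slow link. This is a sketch under stated assumptions: `run_bounded`, the `END` marker, and the 1 ms delay are illustrative choices, not part of the patent.

```python
import queue
import threading
import time

END = None  # hypothetical end-of-stream marker; records here are integers


def fast_producer(out_q, n_records):
    for i in range(n_records):
        out_q.put(i)  # blocks while the queue already holds max_queue records
    out_q.put(END)


def slow_consumer(in_q, results, peak):
    while True:
        peak[0] = max(peak[0], in_q.qsize())  # track the observed backlog
        item = in_q.get()
        if item is END:
            return
        time.sleep(0.001)  # simulate a slower downstream link
        results.append(item)


def run_bounded(n_records, max_queue):
    """Run both links and return the consumed records and the peak backlog."""
    shared = queue.Queue(maxsize=max_queue)
    results, peak = [], [0]
    threads = [
        threading.Thread(target=fast_producer, args=(shared, n_records)),
        threading.Thread(target=slow_consumer, args=(shared, results, peak)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, peak[0]
```

Because `put` blocks when the queue is full, the fast link is throttled to the pace of the slow one and the backlog never exceeds `max_queue`, whatever the total data volume.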
The number of data records each link handles at a time is set according to the granularity at which the link processes data. If the link processes data record by record, the number can be set to 1. If a link must wait for the previous link to finish processing everything, for example because it aggregates all the data, a larger number can be set, or the link can be configured to wait until the previous link has finished before it starts processing.
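Taking a configurable number of records from a queue at one time might look like the following sketch. The helper `take_batch` and its parameters are hypothetical; it waits briefly for a first record and then takes whatever else is immediately available, up to the configured number.

```python
import queue


def take_batch(in_q, batch_size, timeout=0.1):
    """Return up to batch_size records from in_q, or [] if none arrive in time."""
    batch = []
    try:
        # Wait for the first record so an idle link does not spin.
        batch.append(in_q.get(timeout=timeout))
        # Then take further records only if they are already queued.
        while len(batch) < batch_size:
            batch.append(in_q.get_nowait())
    except queue.Empty:
        pass
    return batch
```

A record-by-record link would call `take_batch(q, 1)`, while an aggregating link would use a larger `batch_size` or keep collecting batches until the previous link reports that it has finished.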
In an embodiment of the present invention, the links execute in parallel as follows:
The links of the ETL process start executing, and the executor of each link starts listening on its input queue;
After the first link in the ETL process obtains data from the data source, it puts the data into its output queue (queue1), then obtains the next batch of data and puts it into the queue, repeating this until all the data has been obtained, at which point the link ends;
After the first link (step1) puts data into its output queue (queue1) for the first time, the link listening on that queue (step2) immediately detects the arrival of data, takes the data from the queue, and processes it. After processing, if there is a successor link, step2 puts the data into its own output queue (queue2), then continues listening on its input queue and repeats the above. The link ends once it has received the notification that the previous link has finished executing and its input queue has no more data to take.
The subsequent links in the ETL process all execute in the same way. When every link has finished executing, the whole process ends and one ETL run is complete.
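The flow through step1, queue1, step2, and queue2 can be sketched with one thread per link. This is a simplified illustration: the function names and the in-band `DONE` marker are assumptions, with the marker standing in for the notification that the previous link has finished.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker passed through the queues


def source_link(records, out_q):
    """step1: read from the data source and fill queue1."""
    for record in records:
        out_q.put(record)
    out_q.put(DONE)


def transform_link(in_q, out_q, fn):
    """step2: listen on queue1, process each record, and fill queue2."""
    while True:
        item = in_q.get()
        if item is DONE:
            out_q.put(DONE)  # tell the next link that this one has finished
            return
        out_q.put(fn(item))


def sink_link(in_q, loaded):
    """Final link: listen on queue2 and load the results."""
    while True:
        item = in_q.get()
        if item is DONE:
            return
        loaded.append(item)


def run_pipeline(records, fn, max_queue=8):
    queue1 = queue.Queue(maxsize=max_queue)
    queue2 = queue.Queue(maxsize=max_queue)
    loaded = []
    threads = [
        threading.Thread(target=source_link, args=(records, queue1)),
        threading.Thread(target=transform_link, args=(queue1, queue2, fn)),
        threading.Thread(target=sink_link, args=(queue2, loaded)),
    ]
    for t in threads:
        t.start()  # all links start listening at the same time
    for t in threads:
        t.join()   # the run is complete when every link has ended
    return loaded
```

For example, `run_pipeline(range(100), lambda x: x * 2)` returns the doubled values in their original order, because each link in this sketch runs on a single thread; with several threads per link the order would no longer be guaranteed.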
The present invention moves data between the links of an ETL process by splitting massive data, using in-memory queues as temporary storage for the data of each link, and making full use of system resources to execute the links in parallel. This enables the data stream to flow efficiently between the links of the system and also improves the vertical scalability of the ETL tool.
From the description of the above embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and the like. In general, any function performed by a computer program can readily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can take many forms, such as analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is the better embodiment in most cases. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
In summary, the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Fig. 2 is a schematic diagram of the data stream transmission flow of the method of the present invention.
Embodiments
The embodiments of the present invention provide a method for transmitting a data stream in ETL.
To make the objects, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
Claims (5)
1. A method for transmitting a data stream in ETL, characterized in that it comprises the following steps:
Step one: according to the defined ETL process, determine the data outflow mode of each link;
Step two: according to the defined ETL process, determine the number of execution threads of each link;
Step three: according to the defined ETL process, determine the maximum capacity of the input and output queues of each link;
Step four: determine the number of data records that each link in the ETL process takes from or puts into a queue at one time;
Step five: complete the initialization of the ETL process and the creation and initialization of the ETL links;
Step six: the links in the ETL process start executing in parallel, and data begins to flow normally;
Step seven: after each link of the ETL process has finished processing its data, the links stop executing in turn and the whole process ends.
2. The method for transmitting a data stream in ETL according to claim 1, characterized in that, in step one, the data outflow mode of each link is set to copy or distribute according to the number of adjacent links and the amount of data each link needs to process; in copy mode, the data of the link is duplicated according to the number of its direct successor links, and one copy is put into the input queue of each successor; in distribute mode, all output data of the link is put into the input queues of the successor links in rotation, one record per queue in turn.
3. The method for transmitting a data stream in ETL according to claim 1, characterized in that, in step three, the maximum queue capacity is set reasonably according to the processing performance of the link and the speed at which the next link processes data.
4. The method for transmitting a data stream in ETL according to claim 2 or 3, characterized in that step five specifically comprises the following steps:
According to the overall process definition, complete the initialization of the process and create all the queues of the process;
According to the definitions made when the ETL process was designed, create the entity of each link, which mainly carries the configuration of that link;
According to the process definition, create the executor corresponding to each link, which performs the actual execution of that link;
Assign the generated queues, according to rules, to the generated executors of the corresponding links.
5. The method for transmitting a data stream in ETL according to claim 4, characterized in that, in step six, each link uses its own in-memory queues as the intermediate storage carrier for data, so the links do not interfere with one another; through the queue mechanism, massive data flows between the links in split form.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410671540.2A CN104391929A (en) | 2014-11-21 | 2014-11-21 | Transmission method of data stream in ETL |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410671540.2A CN104391929A (en) | 2014-11-21 | 2014-11-21 | Transmission method of data stream in ETL |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104391929A true CN104391929A (en) | 2015-03-04 |
Family
ID=52609833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410671540.2A Pending CN104391929A (en) | 2014-11-21 | 2014-11-21 | Transmission method of data stream in ETL |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104391929A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106469065A (en) * | 2016-09-06 | 2017-03-01 | 广西科技大学第附属医院 | A kind of software secondary development method based on data drain |
CN114385136A (en) * | 2021-12-29 | 2022-04-22 | 武汉达梦数据库股份有限公司 | Flow decomposition method and device for running ETL (extract transform load) by Flink framework |
US12026005B2 (en) | 2022-10-18 | 2024-07-02 | Sap Se | Control mechanism of extract transfer and load (ETL) processes to improve memory usage |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050071842A1 (en) * | 2003-08-04 | 2005-03-31 | Totaletl, Inc. | Method and system for managing data using parallel processing in a clustered network |
US20080222634A1 (en) * | 2007-03-06 | 2008-09-11 | Yahoo! Inc. | Parallel processing for etl processes |
CN101388844A (en) * | 2008-11-07 | 2009-03-18 | 东软集团股份有限公司 | Data flow processing method and system |
CN101882165A (en) * | 2010-08-02 | 2010-11-10 | 山东中创软件工程股份有限公司 | Multithreading data processing method based on ETL (Extract Transform Loading) |
CN102722355A (en) * | 2012-06-04 | 2012-10-10 | 南京中兴软创科技股份有限公司 | Workflow mechanism-based concurrent ETL (Extract, Transform and Load) conversion method |
GB2505938A (en) * | 2012-09-17 | 2014-03-19 | Ibm | ETL debugging |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106469065A (en) * | 2016-09-06 | 2017-03-01 | 广西科技大学第附属医院 | A kind of software secondary development method based on data drain |
CN114385136A (en) * | 2021-12-29 | 2022-04-22 | 武汉达梦数据库股份有限公司 | Flow decomposition method and device for running ETL (extract transform load) by Flink framework |
CN114385136B (en) * | 2021-12-29 | 2022-11-22 | 武汉达梦数据库股份有限公司 | Flow decomposition method and device for running ETL (extract transform load) by Flink framework |
US12026005B2 (en) | 2022-10-18 | 2024-07-02 | Sap Se | Control mechanism of extract transfer and load (ETL) processes to improve memory usage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20150304 |