CN110287245A

CN110287245A - Method and system for scheduling and executing distributed ETL (extract transform load) tasks

Info

Publication number: CN110287245A
Application number: CN201910401322.XA
Authority: CN
Inventors: 杨冬菊; 徐晨阳
Original assignee: North China University of Technology
Current assignee: North China University of Technology
Priority date: 2019-05-15
Filing date: 2019-05-15
Publication date: 2019-09-27
Anticipated expiration: 2039-05-15
Also published as: CN110287245B

Abstract

Embodiments of the invention provide a method and a system for scheduling and executing distributed ETL tasks. For each ETL task to be scheduled and executed, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in the task are extracted from the target tables that the task loads. The scheduling priority of the ETL task is determined from a weight preset for each kind of association and the number of associations of each kind in the task, and the ETL tasks are distributed to the execution nodes in descending order of scheduling priority. In this technical scheme, ETL tasks are distributed to execution nodes according to weights that reflect factors such as the complexity of the business corresponding to each task and the importance of the business data to be integrated, so that core data is loaded in a timely manner, load is balanced among the nodes, and both the efficiency of data integration and the utilization of resources are improved.

Description

Method and system for scheduling and executing distributed ETL tasks

Technical field

The invention relates to data warehouses, and in particular to a method and system for scheduling and executing ETL tasks.

Background

At present, extract-transform-load (ETL) technology is one of the key steps in building a data warehouse in a big data environment: it is the process of extracting scattered, heterogeneous data, transforming it, and loading it into a unified standard repository. The extraction, transformation, and loading steps can be combined into a schedulable ETL script job (also called an ETL task). In a big data environment, anywhere from dozens to tens of thousands of ETL tasks often need to be executed, and scheduling them efficiently is an important part of building a data warehouse. ETL task scheduling currently relies mainly on distributed cluster scheduling, in which scheduling algorithms such as round-robin, first-come-first-served, and Min-Min assign ETL tasks to the execution nodes of the cluster. However, because ETL tasks differ in execution time and data volume, and execution nodes differ in current load, cluster load easily becomes unbalanced and resource utilization low, which makes data integration inefficient.

Summary of the invention

The inventors found that, during data integration, the business and business data involved in different ETL tasks differ in importance. If ETL tasks that integrate data related to the core business wait too long to be scheduled and executed, the efficiency of data integration suffers directly. Existing ETL task scheduling methods, however, consider neither the complexity of the business corresponding to an ETL task nor the importance of the business data to be integrated. The purpose of the embodiments of the present invention is therefore to overcome these defects of the prior art and to provide a new method and system for scheduling and executing distributed ETL tasks.

This purpose is achieved through the following technical solutions:

According to a first aspect of the embodiments of the present invention, a method for scheduling and executing distributed ETL tasks is provided. The method includes: for each acquired ETL task to be scheduled and executed, extracting, based on the target tables into which the task loads data, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in the task; determining the scheduling priority of the ETL task based on a weight preset for each kind of association and the number of associations of each kind in the task; and assigning the ETL tasks to the execution nodes in descending order of scheduling priority.

In some embodiments of the present invention, the method may further include, before the ETL tasks are assigned, querying the performance indicators of each execution node, determining the current load of each execution node from the indicators obtained, and selecting execution nodes for ETL task assignment in order of current load from low to high.

In some embodiments of the present invention, the scheduling priority of the ETL task can be calculated by the following formula:

where Wl1 is the weight of an association between an entity and a subsidiary table; Wl2 is the weight of an association between an entity and a dimension table; Wl3 is the weight of an association between entities; and ni is the number of associations of the i-th kind that appear in the ETL task.

In some embodiments of the present invention, assigning the ETL tasks to the execution nodes may include:

a) count the data volume of each ETL task to be scheduled and executed;

b) count the total data volume of all ETL tasks on each execution node;

c) from the ETL tasks to be scheduled and executed, select the task with the largest data volume;

d) select the execution node that has the smallest total data volume and is not currently marked as assigned;

e) assign the selected ETL task to the selected execution node, and mark that execution node as assigned;

f) repeat steps c) to e) until all ETL tasks to be scheduled and executed have been assigned or until all execution nodes are marked as assigned;

g) check whether any ETL tasks remain to be scheduled and executed; if so, re-mark all execution nodes as unassigned and repeat steps c) to g) until all ETL tasks to be scheduled and executed have been assigned.
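Steps a) to g) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the `Node` class, the `allocate` function, and the representation of tasks as a name-to-volume dictionary are hypothetical, and ties between tasks or nodes are broken arbitrarily by input order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical execution node; total_volume is the quantity counted in step b)."""
    name: str
    total_volume: int = 0
    tasks: list = field(default_factory=list)


def allocate(task_volumes, nodes):
    """Assign every task in task_volumes (name -> data volume) to a node."""
    if not nodes:
        raise ValueError("at least one execution node is required")
    # Steps a) and c): order pending tasks by data volume, largest first.
    pending = sorted(task_volumes, key=task_volumes.get, reverse=True)
    while pending:                      # step g): start a new round if tasks remain
        unassigned = list(nodes)        # every node is marked unassigned again
        while pending and unassigned:   # step f): loop until tasks or nodes run out
            task = pending.pop(0)       # step c): largest remaining task
            # Step d): unassigned node with the smallest total data volume.
            node = min(unassigned, key=lambda n: n.total_volume)
            node.tasks.append(task)     # step e): assign and mark as assigned
            node.total_volume += task_volumes[task]
            unassigned.remove(node)
```

Because each node receives at most one task per round, a node that starts with little data cannot absorb all of the large tasks at once; the totals are re-compared only when a new round begins.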

In some embodiments of the present invention, the method may further include: in response to an execution node receiving a new ETL task, storing the task in a task cache queue and recording its arrival time; estimating the execution time of the ETL task based on the amount of data in it; in response to the execution node finishing its current task, determining, for each ETL task waiting to be executed, an execution priority from the task's waiting time and estimated execution time; and selecting the waiting ETL task with the highest execution priority for execution.

In some embodiments of the present invention, estimating the execution time of an ETL task based on the amount of data in it may include: determining the amount of data in the ETL task; selecting, from the ETL tasks completed on the execution node within a recent period, a batch of tasks whose data volume is similar to that of the task to be executed; and averaging the execution times of this batch, the average being taken as the estimated execution time of the ETL task.
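The estimation step can be illustrated with a short Python sketch. The text does not define what counts as a "similar" data volume, so the relative `tolerance` parameter below is an assumption, as are the function name and the `(data_volume, execution_seconds)` layout of the history.

```python
from statistics import mean


def estimate_execution_time(volume, history, tolerance=0.2):
    """Estimate execution time from recently completed tasks on the same node.

    history: list of (data_volume, execution_seconds) pairs for recent tasks.
    tolerance: assumed similarity criterion; a task is "similar" when its
    volume is within this fraction of the new task's volume.
    """
    similar = [seconds for vol, seconds in history
               if abs(vol - volume) <= tolerance * volume]
    if not similar:
        return None  # no comparable task in the recent window
    return mean(similar)
```

Returning `None` when no similar task exists leaves the fallback policy to the caller, since the text does not say what to do in that case.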

In some embodiments of the present invention, the execution priority of an ETL task can be determined using the following formula:

where EP_i is the execution priority of the i-th ETL task ei; Tei is the execution time of ETL task ei; and Twi is the waiting time of ETL task ei, equal to the current time minus the time at which the task arrived at the execution node.
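The formula itself is not reproduced in this text, so the following Python sketch substitutes a response-ratio priority, (Twi + Tei) / Tei, purely as an assumption; it is one common way to combine a waiting time with an estimated execution time and may differ from the patent's formula. The function names and the dictionary layout of queued tasks are hypothetical.

```python
def execution_priority(wait_time, estimated_time):
    """Assumed stand-in for EP_i: the response ratio (Tw + Te) / Te."""
    return (wait_time + estimated_time) / estimated_time


def pick_next(queue, now):
    """Return the queued task with the highest assumed execution priority.

    queue: list of dicts with keys 'name', 'arrival', and 'estimated',
    where 'arrival' and now share the same time unit as 'estimated'.
    """
    return max(
        queue,
        key=lambda t: execution_priority(now - t["arrival"], t["estimated"]),
    )
```

Under this assumed form, short tasks are favoured, while a long task's priority still grows the longer it waits, so it is not starved indefinitely.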

According to a second aspect of the embodiments of the present invention, a system for scheduling and executing distributed ETL tasks is also provided. The system includes a scheduler and a plurality of executors: the scheduler assigns one or more ETL tasks to be scheduled and executed to the executors, and each executor executes the ETL tasks it receives. The scheduler includes a relationship analysis module, a priority determination module, and a scheduling module. The relationship analysis module extracts, for each acquired ETL task to be scheduled and executed and based on the target tables into which the task loads data, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in the task. The priority determination module determines the scheduling priority of the ETL task based on a weight preset for each kind of association and the number of associations of each kind in the task. The scheduling module assigns the ETL tasks to the executors in descending order of scheduling priority.

In some embodiments of the present invention, the scheduler may further include a load monitoring module, which queries the performance indicators of each executor and determines the current load of each executor from the indicators obtained; the scheduling module may further be configured to select executors for ETL task assignment in order of current load from low to high.

In some embodiments of the present invention, the executor may be configured to: in response to receiving a new ETL task, store the task in a task cache queue and record its arrival time; estimate the execution time of the ETL task based on the amount of data in it; in response to finishing its current task, determine, for each ETL task waiting to be executed, an execution priority from the task's waiting time and estimated execution time; and select the waiting ETL task with the highest execution priority for execution.

The technical solutions of the embodiments of the present invention may have the following beneficial effects:

ETL tasks are allocated among nodes according to factors such as the complexity of the business corresponding to each task, the importance of the business data to be integrated, and node performance; in addition, the execution order of ETL tasks on each execution node can be adjusted according to task execution time and the amount of data to be processed. This both ensures timely loading of core data and load balance among the execution nodes, and improves the overall efficiency of data integration and the utilization of resources.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the invention.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention. The drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 shows a schematic flowchart of a method for scheduling and executing distributed ETL tasks according to an embodiment of the present invention.

FIG. 2 shows a schematic diagram of the process of determining the weight of an ETL task according to an embodiment of the present invention.

FIG. 3 shows a schematic diagram of the ETL task execution process on an execution node according to an embodiment of the present invention.

FIG. 4 shows a schematic structural diagram of a system for scheduling and executing distributed ETL tasks according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below through specific embodiments and with reference to the accompanying drawings. It should be understood that the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the invention. Those skilled in the art will appreciate, however, that the technical solutions of the present invention may be practiced without one or more of these specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of the invention.

The block diagrams shown in the drawings represent functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are illustrative only; they need not include all contents and operations/steps, nor must the steps be performed in the order described. For example, some operations/steps can be split and others can be merged or partly merged, so the actual order of execution may change according to circumstances.

FIG. 1 shows a schematic flowchart of a method for scheduling and executing distributed ETL tasks according to an embodiment of the present invention. As shown in FIG. 1, the method mainly includes: step S101) for each acquired ETL task to be scheduled and executed, extracting, based on the target tables into which the task loads data, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the associations between entities involved in the task; step S102) determining the scheduling priority of the ETL task based on a weight preset for each kind of association and the number of associations of each kind in the task; and step S103) assigning the ETL tasks, in descending order of scheduling priority, to the execution nodes that execute ETL tasks.

More specifically, in step S101), a plurality of ETL tasks to be scheduled and executed can first be acquired from the ETL task repository. Once an ETL task has been built, information about it is usually stored in the ETL task repository as metadata, including descriptive fields such as the task's name, file name, directory, status, description, and extended description. The status of an ETL task indicates whether the task has been scheduled for execution, and its value can be set or changed according to the actual scheduling situation; for example, the status of a task that has already been scheduled is typically set to 1, and that of a task not yet scheduled is typically set to 0. In one embodiment, the ETL tasks to be scheduled and executed can be acquired from the ETL task repository according to the status and creation time of each task. The status shows whether a task is waiting to be scheduled, and the creation time gives the task's waiting time. In this way, each time scheduling is performed, a batch of ETL tasks that have not yet been scheduled can be selected from the ETL task repository according to how long they have waited. ETL tasks can be acquired on a request-response basis or periodically. For example, the ETL task repository may be read periodically to extract a batch of ETL tasks to be scheduled and executed. The period can be set or changed as needed, for example to 2 hours, 1 hour, 0.5 hours, or 10 minutes.
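The batch selection described above can be sketched as follows. The `TaskRecord` class, its field names, and the `next_batch` function are hypothetical, and serving the longest-waiting tasks first is an assumption: the text says only that the batch is selected according to waiting time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TaskRecord:
    """Hypothetical subset of the metadata kept in the ETL task repository."""
    name: str
    status: int        # 0 = not yet scheduled, 1 = already scheduled
    created: datetime


def next_batch(records, now, batch_size):
    """Select up to batch_size unscheduled tasks, longest-waiting first."""
    waiting = [r for r in records if r.status == 0]
    # Waiting time is derived from the creation time, as described in the text.
    waiting.sort(key=lambda r: now - r.created, reverse=True)
    return waiting[:batch_size]
```

A periodic scheduler would call `next_batch` once per cycle, for example every 10 minutes, and set `status` to 1 for each task it dispatches.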

In a data warehouse, the descriptions of entities and the relationships between entities are usually represented as tables of various kinds. An ETL task that extracts, transforms, and loads data mainly extracts the required data from the distributed data sources, transforms it, and loads it into the designated target tables. Each ETL task usually contains one or more target tables, for example target tables that describe entities and their attributes, target tables that describe one-to-many relationships between entities, and target tables that describe many-to-many relationships between entities (also called subsidiary tables). In addition, all possible values of an entity's attributes are usually stored in dimension tables in the data warehouse, so an ETL task that loads data for a given entity usually also sets one or more dimension tables associated with that entity as target tables to be loaded. ETL tasks related to the core business tend to involve more kinds of entities, and the relationships between those entities are more complex and varied. In the embodiments of the present invention, the importance of the business corresponding to an ETL task is measured by the entities involved in the task and the various associations among them, and the scheduling priority (also called the weight) of the ETL task is set accordingly.

In step S101), after the ETL tasks to be scheduled and executed have been acquired, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in each task can be extracted from the target tables into which the task loads data, and the number of associations of each kind in the task can be counted. For example, traversing the target tables of an ETL task yields the entities involved in the task and, at the same time, the relationships between them (both one-to-many and many-to-many). For two entities in a many-to-many relationship, the correspondence between them is usually stored as data records in a subsidiary table, and both entities are associated with that subsidiary table; when the associations between entities and subsidiary tables are counted, each of the two entities is counted once. Two entities in a one-to-many relationship are directly associated with each other; when the associations between entities are counted, such a pair is counted only once. Each entity may also have several attributes, and a dimension table stores all possible values of an attribute. The attributes of an entity that appear in the target tables therefore determine which dimension table or tables the entity is associated with; when the associations between entities and dimension tables are counted, each dimension table is counted once.
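The counting rules can be sketched in Python. The function name and input layout are hypothetical; the sketch assumes that the relationships and dimension tables have already been extracted from the target tables and that each many-to-many relationship is backed by exactly one subsidiary table.

```python
def count_associations(many_to_many, one_to_many, dimension_tables):
    """Return (n1, n2, n3), the number of associations of each kind.

    many_to_many: list of (entity_a, entity_b) pairs, one subsidiary table each
    one_to_many: list of (one_entity, many_entity) pairs
    dimension_tables: dict mapping an entity to its dimension table names
    """
    # Entity-to-subsidiary-table: both entities of a pair are counted once.
    n1 = 2 * len(many_to_many)
    # Entity-to-dimension-table: each dimension table is counted once.
    n2 = sum(len(tables) for tables in dimension_tables.values())
    # Entity-to-entity: each one-to-many pair is counted once.
    n3 = len(one_to_many)
    return n1, n2, n3
```

For instance, two many-to-many relationships, one one-to-many relationship, and three dimension tables in total give the counts (4, 3, 1).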

After the numbers of entity-to-subsidiary-table associations, entity-to-dimension-table associations, and entity-to-entity associations in each ETL task have been determined, step S102) determines, for each ETL task to be scheduled and executed, the scheduling priority (or weight) of the task based on the weight preset for each kind of association and the number of associations of each kind in the task. For example, it can be calculated by the following formula (1):

where Wl1 is the weight of an association between an entity and a subsidiary table; Wl2 is the weight of an association between an entity and a dimension table; Wl3 is the weight of an association between entities; and ni is the number of associations of the i-th kind that appear in the ETL task, which can also be understood as the number of times Wli occurs in the formula, i being a natural number. Wl1, Wl2, and Wl3 are weights preset according to the specific business requirements; their values usually range from 2 to 10 and can be changed as the business changes.

The ETL task weight is illustrated below using a science and technology management data integration business as an example. FIG. 2 shows a schematic diagram of the process of determining the weight of an ETL task according to an embodiment of the present invention. As shown in FIG. 2, the ETL task involves four entities: project, topic, organization, and person. In the target tables configured for loading when the task was built, project and topic are in a one-to-many relationship (indicated by "1.n" in the figure): a project can have several topics, but each topic corresponds to exactly one project and cannot belong to two projects at once. The relationships between project and person, project and organization, topic and organization, and topic and person are all many-to-many; for example, the same person can take part in several projects and several topics at the same time, and the same organization can correspond to several projects and several topics. In this ETL task, one dimension table is loaded for each entity. From FIG. 2 it can be counted that the task contains 8 associations between entities and subsidiary tables, 4 associations between entities and dimension tables, and 1 association between entities. Assuming Wl1, Wl2, and Wl3 are assigned the values 6, 5, and 10 respectively, the weight of the ETL task can be determined accordingly:
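The resulting value is not reproduced in this text. As a hedged illustration, the Python sketch below assumes that formula (1) is a weighted sum in which each weight Wli is counted ni times, which is consistent with the remark that ni is the number of occurrences of Wli in the formula; the function name is hypothetical.

```python
def scheduling_priority(counts, weights):
    """Assumed form of formula (1): the sum over i of Wl_i * n_i."""
    return sum(w * n for w, n in zip(weights, counts))


# Counts read from FIG. 2: 8 entity-to-subsidiary-table associations,
# 4 entity-to-dimension-table associations, 1 entity-to-entity association.
counts = (8, 4, 1)
weights = (6, 5, 10)  # Wl1, Wl2, Wl3 as assigned in the example
print(scheduling_priority(counts, weights))  # 6*8 + 5*4 + 10*1 = 78
```

Under this assumption the example task has a weight of 78; a different form of formula (1) would give a different value.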

Continuing with FIG. 1, step S103) sorts the ETL tasks to be scheduled and executed by the scheduling priorities determined in step S102). For example, if the weights of a new batch of ETL tasks are {2, 6, 8, 4, 10, 3, 9}, the sorted sequence is {10, 9, 8, 6, 4, 3, 2}. Sorting yields a sequence of ETL tasks in descending order of weight, and the tasks are then assigned in that order to the execution nodes of the distributed environment for execution.

In this embodiment, the associations between entities and subsidiary tables, between entities and dimension tables, and between entities are extracted from the target tables contained in the ETL task, which gives an effective quantitative assessment of the complexity of the business corresponding to the task and of the importance of the business data to be integrated. The result is a task sequence ordered by weight that represents the preferred scheduling order, which meets the need to load core business data in a timely manner and improves the efficiency of data integration.

In yet another embodiment, step S103) may further include acquiring the performance indicators of each execution node and assigning the ETL tasks to be scheduled and executed to the execution nodes based on those indicators. The reason is that, when ETL tasks are dispatched to the execution nodes of a distributed environment, the nodes differ in the number of tasks they are running and in the amount of data those tasks contain; that is, the performance and current load of the execution nodes differ at any given moment. If the number of tasks assigned to each execution node is controlled reasonably according to the node's performance, load balance can be maintained across the execution nodes and the whole distributed environment, and the overall efficiency of task execution improves as well. Therefore, before assigning ETL tasks in step S103), the performance indicators of each execution node can first be queried, the current load of each execution node can be graded according to the indicators obtained, the execution nodes can be sorted by current load from low to high, and execution nodes can be selected for ETL task assignment in that order. The current load of each execution node can be determined from its performance indicators. For example, taking CPU usage and memory usage as the performance indicators, the current load of an execution node can be determined by the following formula (2):

where C is the CPU usage of the execution node, R is its memory usage, and L indicates its current load: the larger L is, the lower the current load of the node, and the smaller L is, the higher the current load. The priority allocation sequence of the execution nodes can therefore be obtained by sorting them by L in descending order. In yet another embodiment, the current load of an execution node can instead be determined as a weighted average of the performance indicators, for example L = w1*C + w2*R, where w1 and w2 are weights set for the indicators C and R, each taking a value between 0 and 1. In this case a larger L indicates a higher current load and a smaller L a lower current load, so the priority allocation sequence is obtained by sorting the nodes by L in ascending order. It should be understood that using CPU usage and memory usage as the performance indicators for determining a node's current load is only an illustrative example and not a limitation; those skilled in the art may adjust or modify the indicators according to actual requirements.
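The weighted-average variant can be illustrated with a short sketch. The function names, the node names, and the equal default weights are assumptions made for illustration; only the form L = w1*C + w2*R and the ascending sort come from the text.

```python
def node_load(cpu, mem, w1=0.5, w2=0.5):
    """Weighted load L = w1*C + w2*R, with cpu and mem as usage ratios in [0, 1].

    The default weights are illustrative; the text only requires 0 <= w <= 1.
    """
    return w1 * cpu + w2 * mem


def allocation_order(metrics, w1=0.5, w2=0.5):
    """Return node names sorted from lowest to highest load (ascending L)."""
    return sorted(metrics, key=lambda n: node_load(*metrics[n], w1, w2))


# Hypothetical (CPU usage, memory usage) samples for three nodes.
metrics = {
    "node-a": (0.80, 0.60),
    "node-b": (0.20, 0.30),
    "node-c": (0.50, 0.50),
}
order = allocation_order(metrics)
```

With these sample values the loads are 0.70, 0.25 and 0.50, so tasks would be offered to node-b first, then node-c, then node-a.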

In yet another embodiment, the execution nodes can also be classified according to their determined current load. For example, using L as determined by formula (2) above, the execution nodes are divided into high-load nodes, medium-load nodes, and low-load nodes:

That is, the execution nodes in the distributed cluster environment are divided into three groups, each consisting of zero or more nodes, and the members of a group have similar loads. An execution node in the low-load group carries little load and currently has the greatest capacity to accept further tasks, so ETL tasks should preferentially be dispatched to low-load executor nodes. If the low-load group is empty, ETL tasks are assigned to the medium-load group, and so on. If both the low-load and medium-load groups are empty, the current load of every execution node in the distributed environment is high. If all execution nodes remain in the high-load group for a long time, an alarm mechanism should be provided to report that the distributed environment has been under sustained high load, prompting the system administrator to upgrade the performance of the environment or add executor nodes so as to increase its overall load capacity.
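The grouping and fallback order described above can be sketched as follows. The classification thresholds are not reproduced in this text, so `LOW_MAX` and `HIGH_MIN` are hypothetical values, and the load is assumed to be oriented so that a higher value means a busier node (as in the weighted-average form).

```python
# Hypothetical thresholds; the patent's own classification rule is not shown here.
LOW_MAX = 0.3   # load below this value -> low-load group
HIGH_MIN = 0.7  # load at or above this value -> high-load group


def classify(loads):
    """Split nodes into low, medium and high load groups by their load value."""
    groups = {"low": [], "medium": [], "high": []}
    for node, load in loads.items():
        if load < LOW_MAX:
            groups["low"].append(node)
        elif load < HIGH_MIN:
            groups["medium"].append(node)
        else:
            groups["high"].append(node)
    return groups


def candidate_nodes(loads):
    """Return (level, nodes) for the least-loaded non-empty group.

    Nodes within the group are ordered from lowest to highest load.
    Returns (None, []) when there are no nodes at all.
    """
    groups = classify(loads)
    for level in ("low", "medium", "high"):
        if groups[level]:
            return level, sorted(groups[level], key=loads.get)
    return None, []
```

A caller could raise the alarm mentioned in the text when the returned level stays at "high" over a sustained period.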

In the solution of the above embodiment, execution nodes are selected for ETL task allocation in order from low load to high load, so that ETL tasks with high scheduling priority are preferentially assigned to execution nodes whose current load is low. This not only favors load balance among the execution nodes but also improves the execution efficiency of the ETL tasks.

In yet another embodiment, in step S103) the ETL tasks participating in scheduling may be allocated to the execution nodes based on the data volume of the ETL tasks. Different ETL tasks involve different total amounts of data, and their execution times differ accordingly. If several ETL tasks with large data volumes are concentrated on one or a few execution nodes, the waiting time of the ETL tasks on those nodes grows and the resources of the execution nodes are not used in an effectively balanced way. In this embodiment, therefore, the data volume of the ETL task is introduced as a reference factor for allocation, and a greedy balance algorithm is used to allocate the ETL tasks. Assume that the execution nodes in the distributed cluster have the same initial processing capacity and that each node can work independently, that is, without the assistance of other nodes. E = {e1, e2, e3, ..., en} denotes a newly acquired batch of mutually independent ETL tasks to be scheduled, n tasks in total, where ei is the i-th task. D = {d1, d2, d3, ..., dn} denotes the set of data volumes of the n ETL tasks, where di is the data volume of the i-th task ei. N = {n1, n2, n3, ..., nj} denotes the set of execution nodes in the distributed cluster, j nodes in total, where ni is the i-th executor node; dni_pre denotes the data volume of the ETL tasks already present on the i-th execution node ni, and dni_aft denotes the data volume of all ETL tasks on ni once allocation is complete. Given the total data volume of the ETL tasks on all participating execution nodes, the optimal expected allocated task data volume Opt_i of the i-th execution node can be expressed as:

The data load index μ_i of an execution node is expressed by the variance of the data volume, computed by the formula below. The data load index μ_i of the i-th executor node ni can be expressed as:

μ_i = (dni_aft - dni_pre - Opt_i)²    (5)

The overall data load index μ of the execution nodes in the distributed cluster can be expressed as:

During the distribution of ETL tasks, the data load of the cluster resources should be kept as balanced as possible, that is, μ should stay relatively small. A threshold δ can be defined to bound the maximum value of μ during task distribution: if the index of a node exceeds δ, that node is considered too heavily loaded with data to accept new tasks. The value of μ is thus calculated in real time during task distribution, and each time only nodes with μ_i < δ are selected to receive tasks, which keeps the load of the cluster resources balanced. In one example, allocating ETL tasks with the greedy balance algorithm mainly includes the following steps:

(1) Initialize the ETL task set E = {e1, e2, e3, ..., en}, the set of task data volumes D = {d1, d2, d3, ..., dn}, and the executor node set N = {n1, n2, n3, ..., nj};

(2) Sort the ETL tasks by data volume from largest to smallest and store them in a queue Q = {q1, q2, q3, q4, ..., qn}, where q1 is (e1, d1), q2 is (e2, d2), ..., qn is (en, dn), and d1 ≥ d2 ≥ ... ≥ dn;

(3) Calculate in real time the data load indices μ1, μ2, μ3, ..., μj of all execution nodes in the node set, and reorder the nodes by data load index from smallest to largest, so that after reordering μ1 < μ2 < μ3 < ... < μj holds for the node order n1, n2, n3, ..., nj;

(4) Assign the number of nodes with μ_i < δ to the variable K, which represents the number of nodes available for allocation in this round; if K = 0, the load of the distributed environment is currently too high, and task distribution must be suspended temporarily or new execution nodes added;

(5) For the n tasks in Q: if n > K, take out K tasks, distribute them in turn to the K nodes, and set n = n - K; otherwise, if 0 < n ≤ K, take out all tasks and distribute them in turn to the first n execution nodes, for example e1 to n1 and e2 to n2. If n ≤ 0, all tasks in this batch have been distributed and the algorithm ends; otherwise return to step (3).
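The round-based greedy assignment can be sketched as follows. This is a minimal sketch rather than the procedure above verbatim: it ranks nodes by their accumulated data volume, as recited in claim 4, instead of by the load index μ_i, and it omits the threshold δ. The function name `greedy_balance` and the dictionary inputs are illustrative assumptions.

```python
def greedy_balance(task_sizes, node_loads):
    """Assign tasks to nodes in rounds, largest task to the lightest free node.

    task_sizes: {task_id: data volume}
    node_loads: {node_id: data volume already on the node}
    Returns (assignment, final_loads).
    """
    if task_sizes and not node_loads:
        raise ValueError("no execution nodes available")

    # Queue of tasks ordered from largest to smallest data volume.
    pending = sorted(task_sizes.items(), key=lambda kv: kv[1], reverse=True)
    loads = dict(node_loads)
    assignment = {}

    while pending:
        # Each round, every node may receive at most one task.
        unmarked = set(loads)
        while pending and unmarked:
            task, size = pending.pop(0)
            # Lightest node not yet used in this round; ties broken by name.
            node = min(unmarked, key=lambda n: (loads[n], n))
            assignment[task] = node
            loads[node] += size
            unmarked.remove(node)
    return assignment, loads


# Hypothetical batch: four tasks, two nodes, one node already holding data.
assignment, final_loads = greedy_balance(
    {"e1": 50, "e2": 30, "e3": 20, "e4": 10},
    {"n1": 0, "n2": 40},
)
```

In this example the first round places e1 on n1 and e2 on n2, and the second round places e3 on n1 and e4 on n2, leaving the nodes at 70 and 80 units of data.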

In yet another embodiment, step S103) may include: a) acquiring the performance indicators of each execution node and determining the current load of each execution node from them; b) dividing the execution nodes in the distributed environment into three groups based on their current load: a high-load node group, a medium-load node group, and a low-load node group; c) allocating tasks first within the low-load node group, by counting the data volume of each ETL task to be scheduled and the data volume of the ETL tasks already present on each execution node in the low-load group, and applying the greedy balance algorithm described above to the execution nodes in that group. If the low-load group is empty and ETL tasks remain to be allocated, the greedy balance algorithm described above is used to continue allocating the remaining ETL tasks to the execution nodes in the medium-load group, and so on. If both the low-load and medium-load groups are empty, the current load of every execution node in the distributed environment is high; an alarm mechanism can also be provided to report that the distributed environment has been under sustained high load, prompting the system administrator to upgrade the performance of the environment or add executor nodes so as to increase its overall load capacity. In yet another embodiment, when task distribution fails and the failure is caused by the destination execution node, the system can be configured to stop dispatching task execution requests to that execution node for a period of time (the penalty time). In this way the failure rate of task distribution can be reduced to some extent.
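The penalty-time mechanism can be sketched as a small bookkeeping class. The class name `PenaltyBox`, its methods, and the injectable clock are assumptions for illustration; the text specifies only that a node responsible for a failed distribution is skipped for a period of time.

```python
import time


class PenaltyBox:
    """Tracks nodes that are temporarily excluded after a failed dispatch."""

    def __init__(self, penalty_seconds, clock=time.monotonic):
        self.penalty_seconds = penalty_seconds
        self.clock = clock            # injectable so the behaviour can be tested
        self._blocked_until = {}      # node -> time at which the penalty ends

    def report_failure(self, node):
        """Start (or restart) the penalty period for a node."""
        self._blocked_until[node] = self.clock() + self.penalty_seconds

    def is_available(self, node):
        """True when the node has no active penalty."""
        return self.clock() >= self._blocked_until.get(node, float("-inf"))

    def available(self, nodes):
        """Filter a list of candidate nodes, keeping their order."""
        return [n for n in nodes if self.is_available(n)]
```

A scheduler would call `report_failure` when a dispatch fails because of the destination node and pass its candidate list through `available` before each allocation round.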

After the ETL tasks have been assigned to the execution nodes, each executor node has an execution queue that stores its tasks, and each task occupies one thread resource on that queue. Because ETL tasks contain different amounts of data, their execution times differ. In yet another embodiment, ETL task execution efficiency is improved by balancing the execution time and waiting time of the ETL tasks, which indirectly improves the efficiency of data integration across the distributed environment. In this embodiment, the execution priority of each ETL task is set based on its execution time and waiting time, so that the execution node executes ETL tasks in order of execution priority from high to low, and the priority of each ETL task is adjusted continuously as its execution time and waiting time change. The execution process of the ETL tasks on one execution node is described below with reference to FIG. 3.

As shown in FIG. 3, the process mainly includes the following steps. Step S301): in response to the execution node receiving a new ETL task, store the ETL task in the task cache queue and record its arrival time. Step S302): estimate the execution time of the ETL task based on the amount of data in it. The amount of data involved in the ETL task is obtained first; then, from the ETL tasks that have completed on this execution node within a recent period, a batch of ETL tasks with a data volume similar to that of the task to be executed is selected, and their execution times are used to estimate the execution time of the task not yet executed, for example by taking the average of their execution times as the estimate. Step S303): in response to the current task of the execution node finishing, determine for each ETL task to be executed its execution priority according to its waiting time and estimated execution time. Assume that n ETL tasks are currently waiting on the execution node, let Tei denote the execution time of the i-th ETL task ei (estimated from its data volume), and let Twi denote the waiting time of the i-th ETL task ei. The objective function TotalTime for executing the n ETL tasks on the execution node can then be expressed as:

The purpose of adjusting the execution order of ETL tasks by priority is to keep the time consumed by the whole execution flow on the execution node as low as possible (that is, to minimize TotalTime), which means bringing Tei and Twi into relative balance as far as possible. In an embodiment, for each ETL task waiting in the task cache queue, the execution priority is calculated from the task execution time estimated in step S302) and the waiting time of the ETL task. For example, the following formula is used to determine the execution priority EP_i of the i-th ETL task ei:

where Tei may be taken as the task execution time estimated in step S302) from the data volume of the ETL task, and Tti denotes the time at which task ei arrived at the execution node. The waiting time Twi of each ETL task can be calculated as Twi = Tni - Tti, that is, the current time Tni minus the time at which the ETL task arrived at the execution node. As formula (8) shows, EP_i is always greater than 1. When Twi is fixed, the smaller Tei is, the higher the priority EP_i, which resembles the shortest-job-first algorithm; when Tei is fixed, the larger Twi is, the higher the priority EP_i, which resembles the first-come-first-served algorithm. When both Twi and Tei vary, this priority setting combines the current task execution status on the execution node with the waiting time of the tasks, achieving an overall relative balance between ETL task execution time and waiting time. Continuing with FIG. 3, in step S304) the ETL task with the highest execution priority is selected from the ETL tasks to be executed and is executed.
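Steps S303) and S304) can be sketched as below. Formula (8) is not reproduced in this text, so the sketch assumes the response-ratio form EP = 1 + Tw/Te, which matches the stated properties (greater than 1 for a task that has waited, higher for shorter tasks, higher for longer waits) but may differ from the patent's exact expression. The function names and the tuple layout of the queue are illustrative.

```python
def execution_priority(wait_time, est_exec_time):
    """Assumed response-ratio priority: EP = 1 + Tw / Te."""
    if est_exec_time <= 0:
        raise ValueError("estimated execution time must be positive")
    return 1.0 + wait_time / est_exec_time


def pick_next(queue, now):
    """Return the id of the waiting task with the highest execution priority.

    queue: list of (task_id, arrival_time, estimated_execution_time).
    The waiting time is computed as now - arrival_time.
    """
    best = max(queue, key=lambda t: execution_priority(now - t[1], t[2]))
    return best[0]


# Hypothetical queue at time 100: the values are chosen so priorities differ.
queue = [
    ("short-new", 90, 10),   # waited 10,  EP = 2.0
    ("long-old", 0, 200),    # waited 100, EP = 1.5
    ("short-old", 0, 20),    # waited 100, EP = 6.0
]
next_task = pick_next(queue, now=100)
```

Priorities are recomputed each time the current task finishes, so a long task that keeps waiting eventually overtakes newly arrived short tasks.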

FIG. 4 is a schematic structural diagram of a system for scheduling and executing distributed ETL tasks according to an embodiment of the present invention. As shown in FIG. 4, the system includes a scheduler 401 and a plurality of executors 402a-n (collectively, 402). The scheduler 401 obtains one or more ETL tasks to be scheduled from the ETL task repository and distributes them to multiple executors in the distributed environment for execution. The executors 402 execute the ETL tasks they receive. Although the block diagram depicts the components as functionally separate, this depiction is for illustration only. The components shown in the figure may be arbitrarily combined or divided into separate software, firmware, and/or hardware components. Moreover, however such components are combined or divided, they may execute on the same computing device or on multiple computing devices, which may be connected by one or more networks.

The scheduler 401 includes a relationship analysis module, a priority determination module, and a scheduling module. The relationship analysis module is configured, for each acquired ETL task to be scheduled, to extract, based on the target table into which the ETL task loads data, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in the ETL task. The priority determination module is configured to determine the scheduling priority of the ETL task based on the weight preset for each kind of association and the number of associations of each kind in the ETL task. The scheduling module is configured to assign the ETL tasks to the executors 402 in order of scheduling priority from high to low.
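The work of the priority determination module and the scheduling module can be sketched as follows. The priority formula is not reproduced in this part of the text, so the sketch assumes a plain weighted sum of the association counts; the weight values, the key names, and the function names are all hypothetical.

```python
# Hypothetical preset weights for the three kinds of association.
WEIGHTS = {
    "subsidiary": 0.2,  # entity <-> subsidiary table
    "dimension": 0.3,   # entity <-> dimension table
    "entity": 0.5,      # one-to-many entity <-> entity
}


def scheduling_priority(counts, weights=WEIGHTS):
    """Assumed form: sum over association kinds of weight * count."""
    return sum(weights[kind] * counts.get(kind, 0) for kind in weights)


def dispatch_order(tasks):
    """Return task names sorted from highest to lowest scheduling priority.

    tasks: {task_name: {association_kind: count}}
    """
    return sorted(tasks, key=lambda t: scheduling_priority(tasks[t]), reverse=True)


# Hypothetical association counts extracted from two target tables.
tasks = {
    "load_logs": {"dimension": 1},
    "load_orders": {"subsidiary": 2, "dimension": 3, "entity": 1},
}
order = dispatch_order(tasks)
```

Here load_orders scores 1.8 and load_logs scores 0.3, so load_orders would be dispatched first.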

In yet another embodiment, the scheduler 401 may also include a load monitoring module configured to query the performance indicators of each executor and to determine the current load of each executor from the indicators obtained. The scheduling module may further be configured to select executors in order of current load from low to high for ETL task allocation. In yet another embodiment, the executor 402 may be configured to: in response to receiving a new ETL task, store the ETL task in the task cache queue and record its arrival time; estimate the execution time of the ETL task based on the amount of data in it; in response to the current task finishing, determine for each ETL task to be executed its execution priority according to its waiting time and estimated execution time, as described above; and select the ETL task with the highest execution priority from the ETL tasks to be executed and execute it.
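The executor's estimate of execution time from recently completed tasks of similar data volume could look like the sketch below. The text does not define "similar", so the relative `tolerance` parameter is an assumption, as are the function name and the shape of the history records.

```python
from statistics import mean


def estimate_execution_time(data_volume, history, tolerance=0.2):
    """Average the execution times of recent tasks with a similar data volume.

    history: iterable of (data_volume, execution_time) for tasks completed
             recently on this node.
    tolerance: assumed relative band; a past task counts as similar when its
               volume is within tolerance * data_volume of the new task.
    Returns None when no similar task is found.
    """
    similar = [
        exec_time
        for volume, exec_time in history
        if abs(volume - data_volume) <= tolerance * data_volume
    ]
    return mean(similar) if similar else None


# Hypothetical history of (data volume, execution time in seconds).
history = [(100, 10.0), (110, 12.0), (500, 60.0)]
estimate = estimate_execution_time(105, history)
```

Returning None for an empty match leaves the fallback policy (for example, a default estimate) to the caller, since the text does not specify one.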

In yet another embodiment of the present invention, a computer-readable storage medium is also provided, on which a computer program or executable instructions are stored; when executed, the computer program or executable instructions implement the technical solutions described in the foregoing embodiments. The implementation principles are similar and are not repeated here. In the embodiments of the present invention, the computer-readable storage medium may be any tangible medium that can store data and can be read by a computing device. Examples of computer-readable storage media include hard disk drives, network attached storage (NAS), read-only memory, random access memory, CD-ROM, CD-R, CD-RW, magnetic tape, and other optical or non-optical data storage devices. The computer-readable storage medium may also include computer-readable media distributed over network-coupled computer systems, so that the computer programs or instructions can be stored and executed in a distributed manner.

References in this specification to "various embodiments," "some embodiments," "one embodiment," or "an embodiment" mean that a particular feature, structure, or property described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment" in various places throughout this specification do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or properties may be combined in any suitable manner in one or more embodiments. A particular feature, structure, or property shown or described in connection with one embodiment may therefore be combined, in whole or in part and without limitation, with the features, structures, or properties of one or more other embodiments, provided the combination is not illogical or inoperative.

In this specification, "comprising," "having," and terms of similar meaning are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units not listed, or optionally also includes other steps or units inherent to such a process, method, product, or device. "A" or "an" does not exclude a plurality. In addition, the elements in the drawings of the present application are for illustration only and are not drawn to scale.

Although the present invention has been described through the above embodiments, it is not limited to the embodiments described here, and it also includes various changes and variations made without departing from the scope of the present invention.

Claims

1. A method for scheduling and executing distributed ETL tasks, comprising:

for each acquired ETL task to be scheduled and executed, extracting, based on the target table into which the ETL task loads data, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in the ETL task;

determining the scheduling priority of the ETL task based on a weight preset for each kind of association and the number of associations of each kind in the ETL task;

assigning the ETL tasks to execution nodes in order of scheduling priority from high to low.

2. The method according to claim 1, further comprising, before the ETL tasks are allocated, querying the performance indicators of each execution node; and determining the current load of each execution node according to the obtained performance indicators, and selecting execution nodes in order of current load from low to high for ETL task allocation.

3. The method according to claim 1, wherein the scheduling priority of the ETL task is calculated by the following formula:

where Wl1 denotes the weight of the association between an entity and a subsidiary table; Wl2 denotes the weight of the association between an entity and a dimension table; Wl3 denotes the weight of the association between an entity and an entity; and ni denotes the number of associations of the i-th kind occurring in the ETL task.

4. The method according to claim 2, wherein assigning the ETL tasks to the execution nodes comprises:

a) counting the data volume of each ETL task to be scheduled and executed;

b) counting the total data volume of all ETL tasks on each execution node;

c) selecting, from the ETL tasks to be scheduled and executed, the ETL task with the largest data volume;

d) selecting the execution node that has the smallest total data volume and has not yet been assigned an ETL task in the current round;

e) assigning the selected ETL task to the selected execution node, and marking that execution node as assigned;

f) repeating steps c)-e) until all ETL tasks to be scheduled and executed have been assigned or until all execution nodes have been marked as assigned;

g) detecting whether any ETL tasks remain to be scheduled and executed and, if so, re-marking all execution nodes as unassigned and repeating steps c)-g) until all ETL tasks to be scheduled and executed have been assigned.

5. The method according to claim 1, further comprising:

in response to an execution node receiving a new ETL task, storing the ETL task in a task cache queue and recording the arrival time of the ETL task;

estimating the execution time of the ETL task based on the amount of data in the ETL task;

in response to the current task of the execution node finishing, determining, for each ETL task to be executed, the execution priority of the ETL task according to its waiting time and estimated execution time;

selecting the ETL task with the highest execution priority from the ETL tasks to be executed and executing it.

6. The method according to claim 5, wherein estimating the execution time of the ETL task based on the amount of data in the ETL task comprises:

determining the amount of data in the ETL task;

selecting, from the ETL tasks that have completed execution on the execution node within a recent period of time, a batch of ETL tasks having a data volume similar to that of the ETL task to be executed;

averaging the execution times of this batch of ETL tasks, and taking the resulting average as the estimated execution time of the ETL task.

7. The method according to claim 6, wherein the execution priority of the ETL task is determined using the following formula:

where EP_i denotes the execution priority of the i-th ETL task ei; Tei denotes the execution time of ETL task ei; and Twi denotes the waiting time of ETL task ei, which equals the current time minus the time at which the ETL task arrived at the execution node.

8. A system for scheduling and executing distributed ETL tasks, comprising a scheduler and a plurality of executors, the scheduler being configured to assign one or more ETL tasks to be scheduled and executed to the plurality of executors, and the executors being configured to execute the ETL tasks received; wherein the scheduler comprises:

a relationship analysis module, configured, for each acquired ETL task to be scheduled and executed, to extract, based on the target table into which the ETL task loads data, the associations between entities and subsidiary tables, the associations between entities and dimension tables, and the one-to-many associations between entities involved in the ETL task;

a priority determination module, configured to determine the scheduling priority of the ETL task based on a weight preset for each kind of association and the number of associations of each kind in the ETL task;

a scheduling module, configured to assign the ETL tasks to the executors in order of scheduling priority from high to low.

9. The system according to claim 8, wherein the scheduler further comprises a load monitoring module configured to query the performance indicators of each executor and to determine the current load of each executor according to the obtained performance indicators; and the scheduling module is further configured to select executors in order of current load from low to high for ETL task allocation.

10. The system according to claim 8, wherein the executor is configured to:

in response to receiving a new ETL task, store the ETL task in a task cache queue and record the arrival time of the ETL task;

in response to the current task finishing, determine, for each ETL task to be executed, the execution priority of the ETL task according to its waiting time and estimated execution time;