CN115544055A

CN115544055A - Calculation engine determination method and device

Info

Publication number: CN115544055A
Application number: CN202211197950.9A
Authority: CN
Inventors: 任中涛
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: XFusion Digital Technologies Co Ltd
Priority date: 2022-09-29
Filing date: 2022-09-29
Publication date: 2022-12-30

Abstract

The invention provides a computing engine determination method and device. In an embodiment, the method includes: obtaining a data operation statement to be processed; determining a first memory consumption of a first engine executing the data operation statement, where the first engine is a memory-based engine used to process target data; determining the current remaining amount of the total memory allocated to the first engine; and, if the current remaining memory is greater than the first memory consumption, determining that the first engine processes the data operation statement. This realizes intelligent engine selection and balances efficiency and stability.

Description

Calculation engine determination method and device

Technical Field

The invention relates to the technical field of data query, and in particular to a computing engine determination method and device.

Background

Big data business scenarios usually require multiple computing engines to meet various business needs. This raises a problem: to obtain the best performance while still meeting stability requirements, different tasks need different computing engines. Worse, business personnel must choose a suitable computing engine based on personal experience, which increases their workload, and an experience-based choice cannot be guaranteed to be the best one.

Fig. 1 shows a schematic flowchart of computing engine selection for big data. As shown in Fig. 1, the specific process is as follows:

1. The Hive engine acts as the unified SQL entry point and receives SQL statements;

2. After the Hive engine parses the SQL, the data size of every table involved in the SQL is obtained from the generated AST;

3. The table data size is compared with a preset fixed threshold. If it is smaller than the threshold, the task's data volume is considered small and the task is submitted to Trino for execution to obtain the best performance; if it is larger than the threshold, the task's data volume is considered large and the SQL continues to be executed by Hive to ensure stability under large data volumes.
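As an illustrative, non-limiting sketch, the fixed-threshold flow of steps 1 to 3 can be written as follows. The function name `select_engine_fixed_threshold`, the threshold value, and the table names are hypothetical. Summing the sizes of all tables and sending the equal case to Hive are also assumptions, since the text above does not specify them.

```python
from typing import Dict, List

# Hypothetical fixed threshold in bytes; the source gives no concrete value.
FIXED_THRESHOLD_BYTES = 10 * 1024**3


def select_engine_fixed_threshold(
    table_sizes: Dict[str, int],
    tables_in_sql: List[str],
    threshold: int = FIXED_THRESHOLD_BYTES,
) -> str:
    """Return "trino" for small inputs and "hive" otherwise.

    table_sizes maps a table name to its physical storage size in bytes,
    standing in for the sizes read from the parsed statement.
    """
    # Assumption: the sizes of all tables referenced by the SQL are summed.
    total = sum(table_sizes[name] for name in tables_in_sql)
    # Assumption: a total equal to the threshold is routed to Hive.
    return "trino" if total < threshold else "hive"
```

For example, with `table_sizes = {"orders": 4, "users": 3}` and `threshold=10`, a statement touching both tables is routed to `"trino"`. The outcome depends entirely on the hand-picked threshold.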

However, in the above technical solution, a preset fixed threshold cannot suit clusters of different sizes, so developers must keep adjusting the threshold manually, based on experience, to improve the accuracy of engine selection. Intelligent selection of the computing engine is therefore not truly achieved.

Therefore, a method that can realize intelligent selection of the computing engine is urgently needed.

The information disclosed in this Background section is intended only to enhance understanding of the general background of the present invention, and should not be taken as an acknowledgment or any form of suggestion that this information constitutes prior art already known to a person of ordinary skill in the art.

Summary of the Invention

Embodiments of the present invention provide a computing engine determination method and a cluster. Based on a comparison between the remaining amount of the total memory allocated to a memory-based engine and the memory that the memory-based engine would consume to execute a data operation statement, the engine that executes the data operation statement is selected from an engine based on memory reads and writes and an engine based on disk reads and writes. This realizes intelligent engine selection and balances efficiency and stability.

In a first aspect, an embodiment of the present invention provides a computing engine determination method. The method includes: obtaining a data operation statement to be processed; determining a first memory consumption of a first engine executing the data operation statement, where the first engine is a memory-based engine used to process target data; determining the current remaining amount of the total memory allocated to the first engine; and, if the current remaining memory is greater than the first memory consumption, determining that the first engine processes the data operation statement.

In this solution, the engine that executes the data operation statement is selected from the first engine, which is based on memory reads and writes, and a second engine, which is based on disk reads and writes, according to a comparison between the remaining amount of the total memory allocated to the memory-based engine and the memory that engine would consume to execute the data operation statement. This realizes intelligent engine selection and balances efficiency and stability.

In a possible implementation, if the current remaining memory is less than or equal to the first memory consumption, it is determined that a second engine processes the data operation statement, where the second engine is a disk-based engine used to process the target data.

In this solution, when the first memory consumption is greater than the currently remaining memory, this indicates on the one hand that the current data volume is large, and on the other hand that the memory-based engine does not have enough memory to execute the data operation statement. In that case, the disk-based engine is invoked to execute the data operation statement.
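By way of illustration only, the selection rule of the first aspect and the implementation above reduces to a single comparison. The function name `choose_engine` and the string labels are hypothetical, and both quantities are assumed to be expressed in the same unit (bytes).

```python
def choose_engine(remaining_memory: int, first_memory_consumption: int) -> str:
    """Pick the engine for one data operation statement.

    remaining_memory: current remaining amount of the total memory
        allocated to the memory-based first engine.
    first_memory_consumption: estimated memory the first engine would
        consume to execute the statement.
    """
    if remaining_memory > first_memory_consumption:
        # Enough headroom: use the memory-based first engine.
        return "first_engine"
    # Remaining memory is less than or equal to the estimate:
    # fall back to the disk-based second engine.
    return "second_engine"
```

Because the remaining memory is read at decision time, the same statement may be routed to different engines on different clusters or at different moments, without any manually tuned threshold.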

Optionally, obtaining the data operation statement to be processed includes: using a data operation statement received by the second engine as the data operation statement to be processed.

In this solution, the second engine serves as the unified interface for data operation statements and can receive all kinds of data operation statements, thereby ensuring that the business runs normally.

In a possible implementation, determining the first memory consumption of the first engine executing the data operation statement includes: determining a logical execution tree of the data operation statement, where the logical execution tree indicates the logical flow of the data processing expressed by the data operation statement; and determining, based on metadata of the target data and the logical execution tree, the first memory consumption of the first engine executing the data operation statement.

In this solution, the analysis that the first engine performs when actually executing the data operation statement is simulated based on the metadata of the target data and the logical execution tree. This yields the resource consumption that the first engine would incur while actually executing the statement, so that the real memory consumption of executing the statement can be analyzed, which in turn ensures intelligent engine selection.

In one example, determining the first memory consumption of the first engine executing the data operation statement based on the metadata of the target data and the logical execution tree includes: for each node in the logical execution tree, determining the data volume corresponding to the node based on the metadata of the target data; determining a second memory consumption corresponding to the node based on the data volume corresponding to the node, where the second memory consumption indicates the memory consumed by the first engine in executing the task corresponding to the node; and determining the first memory consumption of the first engine executing the data operation statement based on the second memory consumption of each node in the logical execution tree.

In this solution, memory consumption is analyzed based on the metadata of the target data and the memory consumption of each node in the logical execution tree.

Optionally, determining the second memory consumption corresponding to the node based on the data volume corresponding to the node includes: taking the amount of data that the task corresponding to the node needs to process as the memory consumption, to obtain a third memory consumption corresponding to the node; determining the data processing operation corresponding to the node; determining, based on the data processing operation corresponding to the node, a correction value for the third memory consumption; and correcting the third memory consumption with the correction value to determine the second memory consumption corresponding to the node.

In this solution, memory consumption is analyzed using the correction value so that the resulting figure reflects actual memory consumption, which makes it possible to select the better-performing engine to execute the data operation statement.

In one implementation, the data processing operation is a scan, and the correction value is the reciprocal of the data parallelism.

In one implementation, the data processing operation is an operation based on network communication, and the correction value includes a first value and a second value. The first value indicates the number of times the first engine requests memory while executing the task corresponding to the node; the second value indicates the ratio between memory requested and memory released by the first engine while executing the task corresponding to the node.

In one implementation, the data processing operation is a local processing operation, and the correction value is the second value.

Optionally, determining the first memory consumption of executing the data operation statement based on the second memory consumption of each node in the logical execution tree includes: summing the second memory consumption of every node in the logical execution tree, and using the sum as the first memory consumption of executing the data operation statement.
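The per-node correction and the final summation described above can be sketched as follows, as a non-limiting illustration. The class and function names are hypothetical. The text above does not state how a correction value is applied, so this sketch assumes that it multiplies the third memory consumption, and that for a network-communication node the first and second values are multiplied together.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    operation: str   # "scan", "network", or "local"
    data_bytes: int  # data the node's task must process (third memory consumption)
    children: List["PlanNode"] = field(default_factory=list)


@dataclass
class Corrections:
    parallelism: int            # data parallelism used for scan nodes
    alloc_count: float          # "first value": number of memory requests
    alloc_release_ratio: float  # "second value": requested-to-released ratio


def node_memory(node: PlanNode, c: Corrections) -> float:
    """Second memory consumption of one node (multiplication is an assumption)."""
    if node.operation == "scan":
        factor = 1.0 / c.parallelism
    elif node.operation == "network":
        factor = c.alloc_count * c.alloc_release_ratio
    elif node.operation == "local":
        factor = c.alloc_release_ratio
    else:
        raise ValueError(f"unknown operation: {node.operation}")
    return node.data_bytes * factor


def statement_memory(root: PlanNode, c: Corrections) -> float:
    """First memory consumption: the sum over every node in the tree."""
    return node_memory(root, c) + sum(statement_memory(ch, c) for ch in root.children)
```

For a tree with a local root of 300 bytes over a network node of 100 bytes over a scan of 800 bytes, and `Corrections(parallelism=4, alloc_count=2, alloc_release_ratio=0.5)`, the per-node values are 150, 100, and 200, giving a total of 450 bytes.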

In one example, determining the logical execution tree of the data operation statement includes: performing syntax analysis on the data operation statement to determine an abstract syntax tree; and performing semantic analysis on the abstract syntax tree to determine the logical execution tree.

In a possible implementation, obtaining the data operation statement to be processed includes: using a data operation statement received from a terminal as the data operation statement to be processed.

In a second aspect, an embodiment of the present invention provides a computing engine determination apparatus. The apparatus includes several modules, each used to perform a step of the computing engine determination method provided in the first aspect of the embodiments of the present invention; the division into modules is not limited here. For the specific functions performed by each module of the apparatus and the beneficial effects achieved, refer to the functions of the corresponding steps of the method provided in the first aspect; details are not repeated here.

By way of example, the computing engine determination apparatus has a memory-based first engine installed, and the first engine is used to manage target data. The apparatus includes:

a statement determination module, configured to obtain a data operation statement to be processed;

a consumed resource determination module, configured to determine a first memory consumption of the first engine executing the data operation statement, where the first engine is a memory-based engine used to process target data;

a remaining resource determination module, configured to determine the current remaining amount of the total memory allocated to the first engine; and

an engine selection module, configured to determine that the first engine processes the data operation statement if the current remaining memory is greater than the first memory consumption.

In a possible implementation, the engine selection module is further configured to determine that a second engine processes the data operation statement if the current remaining memory is less than or equal to the first memory consumption, where the second engine is a disk-based engine used to process the target data.

Optionally, the statement determination module is configured to use a data operation statement received from a terminal, or a data operation statement received by the second engine, as the data operation statement to be processed.

In a possible implementation, the consumed resource determination module includes an execution tree determination unit and a consumption determination unit, where:

the execution tree determination unit is configured to determine a logical execution tree of the data operation statement, where the logical execution tree indicates the logical flow of the data processing expressed by the data operation statement; and

the consumption determination unit is configured to determine, based on metadata of the target data and the logical execution tree, the first memory consumption of the first engine executing the data operation statement.

In one example, the consumption determination unit includes a first consumption determination subunit and a second consumption determination subunit;

the first consumption determination subunit is configured to, for each node in the logical execution tree, determine the data volume corresponding to the node based on the metadata of the target data, and determine, based on the data volume corresponding to the node, a second memory consumption of the first engine executing the task corresponding to the node, where the second memory consumption indicates the memory consumed by the first engine in executing the task corresponding to the node; and

the second consumption determination subunit is configured to determine the first memory consumption of the first engine executing the data operation statement based on the second memory consumption of each node in the logical execution tree.

Optionally, the first consumption determination subunit specifically performs the following:

taking the amount of data that the task corresponding to the node needs to process as the memory consumption, to obtain a third memory consumption corresponding to the node; determining the data processing operation corresponding to the node; determining, based on the data processing operation corresponding to the node, a correction value for the third memory consumption; and correcting the third memory consumption with the correction value to determine the second memory consumption corresponding to the node.

In one implementation, the data processing operation is a local processing operation, and the correction value is a second value, where the second value indicates the ratio between memory requested and memory released by the first engine while executing the task corresponding to the node.

Optionally, the second consumption determination subunit is configured to sum the second memory consumption of every node in the logical execution tree and use the sum as the first memory consumption of executing the data operation statement.

In one example, the execution tree determination unit includes a syntax analysis unit and a semantic analysis unit, where:

the syntax analysis unit is configured to perform syntax analysis on the data operation statement to determine an abstract syntax tree; and

the semantic analysis unit is configured to perform semantic analysis on the abstract syntax tree to determine the logical execution tree.

In a possible implementation, the statement determination module is configured to use a data operation statement received from a terminal as the data operation statement to be processed.

In a third aspect, an embodiment of the present invention provides a computing engine determination device. The device includes a processor and a memory; the memory stores program instructions, and the processor is configured to execute the program instructions so that the device performs the method of the first aspect.

In practical applications, the first engine includes a scheduling node and several worker nodes, and the computing engine determination device includes the scheduling node. Specifically, the computing engine determination device is a device in a device cluster on which the first engine is installed. The device cluster includes at least one electronic device, and the first engine is installed on at least one electronic device.

Further, a second engine is also installed in the device cluster. In one example, the first engine and the second engine are installed on the same electronic device; for example, both the first engine and the second engine are installed on the computing engine determination device. In another example, the first engine and the second engine are installed on different electronic devices.

In a fourth aspect, an embodiment of the present invention provides a computing engine determination apparatus, including: at least one memory, configured to store a program; and at least one processor, configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor performs the method provided in the first aspect. By way of example, the program is a program of the first engine, and the computing engine determination apparatus may be a device.

In a fifth aspect, an embodiment of the present invention provides a computing engine determination apparatus, characterized in that the apparatus runs computer program instructions to perform the method provided in the first aspect. By way of example, the apparatus may be a chip or a processor, and the computer program instructions are a program of the first engine.

In one example, the apparatus may include a processor, which may be coupled to a memory, read instructions from the memory, and perform the method provided in the first aspect according to the instructions. The memory may be integrated in the chip or processor, or may be separate from the chip or processor.

In a sixth aspect, an embodiment of the present invention provides a computer storage medium storing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect. By way of example, the instructions are a program of the first engine.

In a seventh aspect, an embodiment of the present invention provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect. By way of example, the instructions are a program of the first engine.

Description of the Drawings

Fig. 1 is a schematic flowchart of computing engine determination;

Fig. 2 is a first schematic diagram of the framework of a computing engine determination system provided by an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a computing engine determination method provided by an embodiment of the present invention;

Fig. 4 is a schematic flowchart of step 320 in Fig. 3;

Fig. 5 is a schematic flowchart of step 322 in Fig. 4;

Fig. 6 is a schematic diagram of a logical execution tree provided by an embodiment of the present invention;

Fig. 7 is a schematic diagram of a method for selecting between the Trino engine and the Hive engine provided by an embodiment of the present invention;

Fig. 8 is a schematic flowchart of a computing engine determination apparatus provided by an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below with reference to the accompanying drawings.

In the description of the embodiments of the present invention, words such as "exemplary", "for example", or "for instance" are used to mean serving as an example, illustration, or explanation. Any embodiment or design described as "exemplary", "for example", or "for instance" in the embodiments of the present invention should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of such words is intended to present the related concepts in a concrete manner.

In the description of the embodiments of the present invention, the term "and/or" merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, B alone, or both A and B. In addition, unless otherwise specified, the term "multiple" means two or more. For example, multiple systems means two or more systems, and multiple terminals means two or more terminals.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. The terms "including", "comprising", "having", and their variations all mean "including but not limited to", unless specifically emphasized otherwise.

Since the concept of big data was proposed, enterprises have shown a growing demand to turn data into data assets. To meet the needs of different big data application scenarios, a variety of big data computing engine services have emerged, including Hive/Spark for offline analysis, Flink/Storm for real-time computing, and Trino/Impala for interactive queries. As the big data ecosystem matures and the integrated lakehouse architecture is widely adopted, the trend toward unified data lake storage at the bottom layer with multiple compatible computing engines at the upper layer has become clear. At present, Hive and Trino are the most commonly used computing engines for offline analysis and interactive queries, respectively.

The Trino engine avoids writing data to disk, which gives it excellent interactive analysis performance on small and medium data volumes, and it has gradually become the most commonly used interactive query engine. However, Trino's stability cannot be guaranteed for analysis tasks on large data volumes. Hive, the most commonly used offline analysis engine, provides powerful data warehouse capabilities and has become the de facto technical standard for offline data warehouses, but its poor offline analysis performance on small and medium data volumes has been a long-standing problem.

However, the technical solution described in the Background (Fig. 1) has the following two technical problems:

First, intelligent selection of the computing engine is not truly achieved: a preset fixed threshold cannot suit clusters of different sizes, so developers must keep adjusting the threshold manually, based on experience, to improve the accuracy of engine selection.

Second, the accuracy of automatic engine selection is insufficient: the table data size obtained from the generated AST for the tables involved in the SQL is the physical storage space of the tables. During actual computation, however, data is loaded into memory in batches, and memory is allocated and released at the same time. Judging a task's runtime resource consumption from the physical storage space of the tables is therefore inaccurate.
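The gap between physical table size and runtime memory can be illustrated with a toy model. The function `peak_resident_bytes` is hypothetical and is not part of the described method. It assumes that data is loaded in fixed-size batches and that the oldest batch is released once a fixed number of batches are resident.

```python
def peak_resident_bytes(table_bytes: int, batch_bytes: int, batches_held: int = 1) -> int:
    """Peak memory when at most `batches_held` batches are resident at once."""
    if batch_bytes <= 0 or batches_held <= 0:
        raise ValueError("batch_bytes and batches_held must be positive")
    held = []      # sizes of the batches currently in memory, oldest first
    resident = 0   # bytes currently in memory
    peak = 0
    loaded = 0     # bytes of the table loaded so far
    while loaded < table_bytes:
        size = min(batch_bytes, table_bytes - loaded)
        if len(held) == batches_held:
            # Release the oldest batch before loading the next one.
            resident -= held.pop(0)
        held.append(size)
        resident += size
        loaded += size
        peak = max(peak, resident)
    return peak
```

Under these assumptions, a 1000-byte table read in 100-byte batches with one batch resident peaks at 100 bytes, one tenth of its physical size, which is why the physical size alone is a poor proxy for runtime memory.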

To solve the above technical problems, an embodiment of the present invention provides a computing engine selection method for big data.

Fig. 2 shows an example architecture of a computing engine selection system provided by an embodiment of the present invention. The computing engine determination method provided by the embodiments of the present invention can be applied to the system architecture shown in Fig. 2. As shown in Fig. 2, the computing engine selection system includes a terminal device 101 and a device cluster 102.

The terminal device 101 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. Exemplary embodiments of the terminal device in this solution include, but are not limited to, electronic devices running iOS, Android, Windows, HarmonyOS, or another operating system. The embodiments of the present invention do not specifically limit the type of the electronic device.

The device cluster 102 may be implemented as a standalone electronic device or as a device cluster composed of multiple electronic devices. In some possible implementations, an electronic device in the device cluster 102 may be a terminal, a computer, or a server.

In one example, the server in this solution may be used to provide cloud services. It may be a server or a super terminal that can establish communication connections with other devices and provide computing and/or storage functions for them. The server in this solution may be a hardware server or may be embedded in a virtualized environment; for example, it may be a virtual machine running on a hardware server that hosts one or more other virtual machines.

The terminal device 101 communicates with the device cluster 102 through a network. The network may be a wired network or a wireless network. For example, the wired network may be a cable network, an optical fiber network, a digital data network (DDN), and so on; the wireless network may be a telecommunication network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a Global System for Mobile Communications (GSM) network, a code division multiple access (CDMA) network, a general packet radio service (GPRS) network, and so on, or any combination thereof. It can be understood that the network can use any known network communication protocol to implement communication between different client layers and gateways. The protocol may be any of various wired or wireless communication protocols, such as Ethernet, universal serial bus (USB), FireWire, GSM, GPRS, CDMA, wideband code division multiple access (WCDMA), time division-synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), new radio (NR), Bluetooth, and wireless fidelity (Wi-Fi).

In the embodiment of the present invention, the device cluster 102 can store big data (referred to as target data for ease of description) and can have multiple engines installed, each of which can read the target data and perform computation on it. The multiple engines include an engine that reads and writes based on memory (referred to as the first engine) and an engine that reads and writes based on disk (referred to as the second engine). The first engine may be a Trino engine or an Impala engine, and the second engine may be a Hive engine. Different engines have different performance trade-offs during a query. For example, the Hive engine performs multiple disk reads and writes during a query, each of which introduces latency, but it is suitable for queries over large data volumes. The Trino engine is a memory-based distributed real-time query engine; it avoids the latency of disk reads and writes and therefore queries quickly, but the amount of data it can query is limited. Therefore, a different engine needs to be selected according to the actual data processing task.

It should be noted that, in the device cluster 102, the electronic devices hosting the first engine, the second engine, and the target data may be different or may partially overlap. For example, an electronic device in the device cluster 102 may store part of the target data and also perform part of the functions of the first engine and part of the functions of the second engine. As another example, the device cluster 102 may be divided into three clusters: one stores the target data, one has the first engine installed, and the other has the second engine installed. The embodiment of the present invention does not specifically limit this; the servers hosting the first engine, the second engine, and the target data may be set according to actual conditions.

Based on this, after acquiring a query instruction triggered by a user, the terminal device 101 obtains a data operation statement and uploads it to the device cluster 102. The first engine in the device cluster 102 pre-calculates the total memory consumption of executing the data operation statement, obtains the current remaining amount of the memory allocated to the first engine, and compares the two. If the total memory consumption is less than the current remaining memory, the data operation statement is handed to the first engine, such as the Trino engine, for execution; if it is greater, the statement is handed to the second engine, such as the Hive engine. The best engine for the data processing is thereby selected.

It should be pointed out that, when multiple first engines are installed in the device cluster 102, the memory consumption of executing the data operation statement needs to be compared with the current remaining memory of each first engine. If the current remaining memory of every first engine is less than the memory consumption of executing the data operation statement, the data operation statement is handed to the second engine for processing.
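The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `MemoryEngine`, `choose_engine`, and the engine names are hypothetical, memory is measured in bytes, and when several first engines have enough memory the sketch simply picks the first one in the list, since the text does not say which one to prefer.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MemoryEngine:
    """A memory-based (first) engine and the memory allocated to it, in bytes."""
    name: str
    allocated_bytes: int
    used_bytes: int

    @property
    def remaining_bytes(self) -> int:
        # Current remaining amount of the memory allocated to this engine.
        return self.allocated_bytes - self.used_bytes


def choose_engine(estimated_bytes: int,
                  first_engines: Sequence[MemoryEngine],
                  second_engine_name: str = "hive") -> str:
    """Return the name of the engine that should execute the statement."""
    for engine in first_engines:
        # A first engine qualifies only if its remaining memory is strictly
        # greater than the estimated consumption of the statement.
        if engine.remaining_bytes > estimated_bytes:
            return engine.name
    # No memory-based engine has enough memory: fall back to the
    # disk-based (second) engine.
    return second_engine_name


engines = [MemoryEngine("trino", 100, 80), MemoryEngine("impala", 200, 50)]
print(choose_engine(10, engines))
print(choose_engine(100, engines))
print(choose_engine(1000, engines))
```

With the sample values, a 10-byte estimate fits the first engine in the list, a 100-byte estimate only fits the second one, and a 1000-byte estimate falls through to the disk-based engine.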

The architecture of the calculation engine selection system described above is only an example. In some possible implementations, the calculation engine selection system further includes an engine selection server; the terminal device 101 communicates with the engine selection server through a network, and the device cluster 102 communicates with the engine selection server through a network. The network is as described above and is not described again here. The engine selection server may be implemented as an independent server or as a cluster composed of multiple servers. In some possible implementations, after acquiring a query instruction triggered by a user, the terminal device 101 obtains a data operation statement and uploads it to the engine selection server. The engine selection server requests the first engine in the device cluster 102 to pre-calculate the total memory consumption of executing the data operation statement, obtains the current remaining amount of the memory allocated to the first engine, and compares the two. If the total memory consumption is less than the current remaining memory, the statement is handed to the first engine in the device cluster 102, such as the Trino engine, for execution; if it is greater, the statement is handed to the second engine in the device cluster 102, such as the Hive engine. The best engine for the data processing is thereby selected.

Next, a calculation engine determination method provided by an embodiment of the present invention is introduced. The calculation engines may include a first engine based on memory reads and writes and a second engine based on disk reads and writes, both installed in the device cluster 102 described above. The method can be executed by any apparatus, device, platform, or device cluster that has computing and processing capabilities, for example the device cluster 102 in Fig. 2. Because the technical solution involves parsing the data operation statement and estimating its consumption, and these functions require a specific engine, the method is preferably executed by the electronic device in the device cluster 102 on which the first engine is installed (referred to as the target device for ease of description). It should be pointed out that the first engine includes a scheduling node and multiple working nodes; the scheduling node converts the data operation statement into multiple tasks and schedules them to the working nodes for execution. The scheduling node may be a physical electronic device or a virtual one, such as a virtual server or a virtual machine, and the same applies to a working node. For example, one physical electronic device may host both a scheduling node and a working node, which share the physical resources of the device; in that case the scheduling node and the working node may be virtual machines. Since the scheduling node usually parses the data operation statement and estimates the consumption in order to schedule tasks, the target device is the electronic device on which the scheduling node resides. The following description takes the target device as the execution subject.

It should be pointed out that the device cluster 102 stores the target data. For example, the target data may be massive data stored in the storage device cluster 103, such as structured database tables, semi-structured text data, and unstructured data such as voice, pictures, and video. In addition, the device cluster 102 stores metadata of the target data and manages the target data based on it. Metadata, also called intermediary data or relay data, is data that describes data (data about data): it mainly describes the properties of the data, including its organization, its data fields, and their relationships, and supports functions such as indicating storage location, historical data, resource lookup, and file records. In practical applications, the metadata of the target data is stored in a distributed file system, such as HDFS (Hadoop Distributed File System) or NFS (Network File System), but is not limited thereto. The deployed distributed file system can manage the target data; for example, the device cluster 102 may be deployed with a distributed file system and manage the stored target data based on it.

Fig. 3 shows a schematic flowchart of a calculation engine determination method provided by an embodiment of the present invention. The method includes the following steps:

Step 310: receive the data operation statement to be processed.

In practical applications, in one example, the target device receives the data operation statement sent by the terminal device 101; this is the data operation statement to be processed. In another example, if the second engine in the device cluster 102 serves as the unified interface for receiving data operation statements, the second engine optionally receives the data operation statement sent by the terminal device 101 and forwards it to the target device on which the first engine is installed.

Here, the data operation statement expresses the logic for filtering the required data out of the target data and processing it. The data operation statement may be written in a query language such as SQL (Structured Query Language) or HQL (Hibernate Query Language, a query language). By way of example, the present invention mainly uses SQL statements to describe the method; however, given the diversity of data operation statements, the present invention does not limit their specific type. It should be pointed out that the first engine and the second engine installed in the device cluster 102 can both process the same data operation statement.

In one embodiment, the data operation statement may be generated from a user's query operation on the query interface provided by the terminal device 101.

For example, the query interface includes an input box; correspondingly, the query operation includes, but is not limited to, the query content that the user enters in the input box. The data operation statement may be generated from the entered content; optionally, the entered content may itself be a data operation statement.

As another example, the query interface includes data operation statement components. A data operation statement component is a reusable component that encapsulates a data operation statement. Correspondingly, the query operation is the user dragging at least one data operation statement component on the query interface to assemble a complete data operation statement.

Further, the query interface includes a data operation statement generation control. When the terminal device 101 detects a click on this control, it generates the data operation statement according to the query operation in either of the two example scenarios above and uploads the data operation statement to the device cluster 102.

Step 320: determine the first memory consumption of the first engine executing the data operation statement.

It is worth noting that the first memory consumption indicates the memory consumed by the first engine in executing the data operation statement. Because different engines do not process a statement with exactly the same logic, the estimate is usually made by the first engine in the device cluster 102 itself, so that the first memory consumption of the first engine executing the data operation statement can be evaluated accurately.

Step 330: determine the current remaining amount of the total memory allocated to the first engine.

In practical applications, the target device tracks the usage of the resources that the device cluster 102 allocates to the first engine. From this it determines the usage of the total memory allocated to the first engine and obtains the current remaining memory amount.

Next, the target device judges whether the first memory consumption is less than the current remaining memory amount. If so, the data operation statement is handed to the first engine for execution, that is, step 340a is performed; otherwise it is handed to the second engine, that is, step 340b is performed. As shown in Fig. 3, after step 330 either step 340a or step 340b is performed. The embodiment of the present invention mainly uses the Trino engine and the Hive engine processing SQL statements as an example to describe the method; however, given the diversity of calculation engines, the present invention does not limit the specific type of calculation engine.

Step 340a: if the current remaining memory amount is greater than the first memory consumption, determine that the first engine processes the data operation statement.

For example, the first engine may be the Trino engine, which works as follows. The Trino engine contains a scheduling node and multiple working nodes. After receiving a data operation statement, the scheduling node parses it, generates a logical execution plan, generates multiple actual tasks, and distributes the tasks to the working nodes. The working nodes execute the actual tasks; data can be transferred between working nodes, and each working node can interact with the distributed file system to read the data stored on the device cluster 102 corresponding to that file system. When the working nodes finish computing, they notify the scheduling node that the query is complete and send it the query result. It is worth noting that both the scheduling node and the working nodes here denote threads or processes.

In a specific implementation, for the Trino engine, a thread pool can be built in advance and multiple worker threads (created by the multiple working nodes) can be preset in it. After the Trino engine receives an SQL statement, it calls the SQL parser to parse the statement and obtain an AST. The logical execution plan component then converts the AST into a logical execution tree. The distributed plan component performs distributed analysis of the logical execution tree to obtain multiple plans, and each plan is converted into a corresponding task. A scheduling algorithm then dispatches the tasks to the worker threads for execution. The scheduling algorithm may be, for example, a random algorithm, a round-robin algorithm, or a weighted round-robin algorithm.
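The three scheduling strategies named above can be illustrated with a short sketch. This is not Trino's scheduler: the function names, the task and worker labels, and the simple "expand each worker by its weight" form of weighted round-robin are assumptions chosen for illustration.

```python
import itertools
import random
from typing import Dict, List, Optional, Sequence


def round_robin(tasks: Sequence[str], workers: Sequence[str]) -> Dict[str, List[str]]:
    """Assign tasks to workers in turn: w0, w1, ..., w0, w1, ..."""
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for task, worker in zip(tasks, itertools.cycle(workers)):
        assignment[worker].append(task)
    return assignment


def weighted_round_robin(tasks: Sequence[str], weights: Dict[str, int]) -> Dict[str, List[str]]:
    """Round-robin over a slot list in which each worker appears `weight` times."""
    slots = [w for w, weight in weights.items() for _ in range(weight)]
    assignment: Dict[str, List[str]] = {w: [] for w in weights}
    for task, worker in zip(tasks, itertools.cycle(slots)):
        assignment[worker].append(task)
    return assignment


def random_assign(tasks: Sequence[str], workers: Sequence[str],
                  seed: Optional[int] = None) -> Dict[str, List[str]]:
    """Assign each task to a uniformly random worker."""
    rng = random.Random(seed)
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for task in tasks:
        assignment[rng.choice(workers)].append(task)
    return assignment


tasks = ["t0", "t1", "t2", "t3", "t4"]
print(round_robin(tasks, ["w0", "w1"]))
print(weighted_round_robin(tasks, {"w0": 2, "w1": 1}))
print(random_assign(tasks, ["w0", "w1"], seed=0))
```

Plain round-robin spreads tasks evenly, the weighted variant gives a worker with weight 2 two consecutive turns per cycle, and the random variant makes no balance guarantee for a small number of tasks.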

In one embodiment, when the first memory consumption is less than the current remaining memory, this indicates both that the current data volume is not high and that the Trino engine has enough memory to execute the data operation statement. The Trino engine is then called to execute the data operation statement, which ensures execution efficiency.

It is worth noting that the logical execution tree is produced by the logical execution plan component of the Trino engine. Once it is determined that the Trino engine will execute the data operation statement, the scheduling node can directly analyze the logical execution tree it already has to obtain multiple tasks and schedule them to the working nodes for execution.

Step 340b: if the current remaining memory amount is less than or equal to the first memory consumption, determine that the second engine processes the data operation statement.

For example, the second engine may be the Hive engine. The Hive engine is a data warehouse tool based on Hadoop. It can map structured data files to database tables, provides simple SQL query functions, and can convert SQL statements into MapReduce (a programming model) tasks for execution.

In a specific implementation, after the Hive engine receives an SQL statement, it calls the SQL parser to parse the statement and obtain an AST. The logical execution plan component converts the AST into a logical execution tree. The distributed plan component performs distributed analysis of the logical execution tree to obtain multiple plans, and each plan is converted into a corresponding Map task. Each Map task reads data from disk, processes it, and writes its intermediate result back to disk. Because the Hive engine performs multiple disk reads and writes, the query has a relatively long latency. However, precisely because the Hive engine writes intermediate results to disk, it places few limits on the data volume. It is worth noting that, in practical applications, the Hive engine usually also contains a scheduling node and multiple working nodes; after receiving a data operation statement, the scheduling node parses it, generates a logical execution plan, generates multiple Map tasks, and distributes the Map tasks to the working nodes.

In one embodiment, when the first memory consumption is greater than the current remaining memory, this indicates both that the current data volume is high and that the Trino engine does not have enough memory to execute the data operation statement. The Hive engine is then called to execute the data operation statement.

Thus, in the embodiment of the present invention, the engine that executes the data operation statement is selected from the memory-based engine and the disk-based engine by comparing the remaining amount of the total memory allocated to the memory-based engine with the memory that engine would consume in executing the statement. The statement is either processed quickly by the memory-based engine or processed stably by the disk-based engine. The engine is thereby selected intelligently, and efficiency and stability are balanced.

Fig. 4 shows a schematic flowchart of step 320 in the embodiment shown in Fig. 3.

As shown in Fig. 4, on the basis of the embodiment shown in Fig. 3, in an exemplary embodiment of the present invention, step 320 may specifically include the following steps:

Step 321: determine the logical execution tree of the data operation statement, where the logical execution tree indicates the logical flow of the data processing expressed by the data operation statement.

First, the target device performs syntax analysis on the data operation statement to obtain an AST. Here, an AST (abstract syntax tree) is a tree representation of the abstract syntactic structure of source code, in which each node represents a construct in the source code. The syntax is called abstract because the tree does not represent every detail of the concrete syntax; for example, nested parentheses are implicit in the structure of the tree and do not appear as nodes. The abstract syntax tree does not depend on the grammar of the source language, that is, on the context-free grammar used in the parsing phase.
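The point that parentheses are implicit in an AST can be seen with Python's standard `ast` module. This is only an analogy: it parses Python arithmetic expressions, not SQL, and the variable names are illustrative.

```python
import ast

# Parse three expressions into abstract syntax trees.
with_parens = ast.parse("(1 + 2) * 3", mode="eval")
redundant = ast.parse("((1 + 2)) * 3", mode="eval")
without = ast.parse("1 + 2 * 3", mode="eval")

# Redundant parentheses leave no trace: both trees dump to the same text.
print(ast.dump(with_parens) == ast.dump(redundant))

# Grouping is encoded purely by tree shape. In "(1 + 2) * 3" the root is
# a multiplication whose left operand is the addition; in "1 + 2 * 3"
# the root is an addition whose right operand is the multiplication.
print(type(with_parens.body.op).__name__, type(with_parens.body.left).__name__)
print(type(without.body.op).__name__, type(without.body.right).__name__)
```

An SQL parser produces a tree in the same spirit, with clauses and expressions as nodes rather than Python operators.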

Next, semantic analysis is performed on the AST to obtain the logical execution tree.

The logical execution tree is a tree composed of operators that indicates the processing logic applied to the database tables. In practical applications, the logical execution tree is the tree after optimization. Optimizing the logical execution tree is a function that the first engine and the second engine themselves provide, and the specific optimization method depends on the design of the engine.

It should be noted that semantic analysis of the AST yields the names of the data tables processed by the SQL statement and the data processing operations, from which the logical execution tree is obtained. Fig. 5 is a schematic diagram of a logical execution tree provided by an embodiment of the present invention. As shown in Fig. 5, the data processing operations include data scanning (TableScan), filtering (Filter), projection (Project), hash join (HashJoin, which implements table joins), data aggregation (Aggregate), and so on.

Data scanning (TableScan) means going through all the data in the table before the result can be produced. In practical applications, when an index is used for the lookup, scanning only part of the data is enough to obtain the result.

Filtering (Filter) means selecting the data in the table that satisfies a condition.

Projection (Project) means converting the data in the table into another memory space for processing.

A table join (JOIN) can be understood as associating multiple tables through a join condition so that data can be obtained across them. Because different tables reside on different servers, the join process involves communication between servers in the device cluster 102. A hash join (HashJoin) is only one way of implementing a table join, one that can improve join efficiency; in practical applications there are other ways of joining tables.

Data aggregation (Aggregate) can be understood as the process of collecting the data of a table from different servers and presenting it in summarized form. Because the data of one table can be stored on different servers, data aggregation is required.
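A logical execution tree built from the operators described above can be sketched as a small data structure. The `PlanNode` class, the example query, and the table names `orders` and `users` are hypothetical; a real engine's plan nodes carry much more information (predicates, column lists, join keys).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class PlanNode:
    op: str                       # operator name, e.g. "TableScan"
    table: Optional[str] = None   # table read by a TableScan node
    children: List["PlanNode"] = field(default_factory=list)

    def walk(self) -> Iterator["PlanNode"]:
        """Yield nodes in post-order: inputs before the operator that consumes them."""
        for child in self.children:
            yield from child.walk()
        yield self


# Hypothetical plan for a query of the form:
#   SELECT o.region, SUM(o.amount)
#   FROM orders o JOIN users u ON o.uid = u.id
#   WHERE u.active GROUP BY o.region
plan = PlanNode("Aggregate", children=[
    PlanNode("Project", children=[
        PlanNode("HashJoin", children=[
            PlanNode("TableScan", table="orders"),
            PlanNode("Filter", children=[PlanNode("TableScan", table="users")]),
        ]),
    ]),
])

execution_order = [node.op for node in plan.walk()]
scanned_tables = [node.table for node in plan.walk() if node.op == "TableScan"]
print(execution_order)
print(scanned_tables)
```

Walking the tree in post-order visits each operator after its inputs, which is the order in which per-node data volumes would naturally be estimated.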

It should be noted that the logical execution trees generated by different engines do not follow exactly the same logic. Therefore, the logical execution tree is usually determined by the first engine in the device cluster 102, which ensures that the first memory consumption of the first engine executing the data operation statement can be evaluated accurately.

After that, the first memory consumption of the first engine executing the data operation statement can be determined based on the metadata of the target data and the logical execution tree.

From the metadata, information such as the size of a data table, its number of rows, and the storage space of each row can be obtained. This gives the data that has to be processed when the logical execution tree is executed, so the memory consumption of executing the data operation statement can be analyzed fairly accurately. As shown in Fig. 4, this specifically includes the following steps:

Step 322: for each node in the logical execution tree, determine the data volume corresponding to the node based on the metadata of the target data, and determine the second memory consumption corresponding to the node based on that data volume.

Here, the second memory consumption indicates the memory consumed by the first engine in executing the task corresponding to the node.

First, for each node in the logical execution tree, the target device determines the data volume corresponding to the node based on the metadata of the target data and the name of the data table the node processes, and then determines the node's second memory consumption based on that data volume. The second memory consumption indicates the memory consumed by the first engine in executing the task corresponding to the node. In practice, a task can be generated for each node from the logical execution tree; the task is the plan for carrying out the data operation that the node indicates. The metadata provides information such as the size of a data table, its number of rows, and the storage space of each row, from which the data volume processed by each node in the logical execution tree is obtained, and from that the second memory consumption of each node. Specifically, the data volume may be the product of the number of data rows and the storage space of each row.
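The data-volume rule above (row count multiplied by per-row storage) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `TableMetadata` record, its field names, and the `catalog` dictionary are hypothetical stand-ins for whatever metadata store the first engine actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    name: str
    row_count: int
    bytes_per_row: int


def node_data_volume(table_name: str, catalog: dict[str, TableMetadata]) -> int:
    """Return the data volume for a node: number of rows times storage per row."""
    meta = catalog[table_name]
    return meta.row_count * meta.bytes_per_row


# Assumed example table: one million rows of 128 bytes each.
catalog = {"orders": TableMetadata("orders", row_count=1_000_000, bytes_per_row=128)}
orders_volume = node_data_volume("orders", catalog)
```

The lookup is keyed by table name because the text determines a node's data volume from the metadata together with the name of the table the node processes.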

Step 323: based on the second memory consumption of each node in the logical execution tree, determine the first memory consumption of the first engine executing the data operation statement.

Next, the first memory consumption of executing the data operation statement is determined based on the second memory consumption of each node in the logical execution tree. Specifically, the second memory consumptions of all nodes in the logical execution tree are summed to obtain the first memory consumption of executing the data operation statement.
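As a rough sketch of this summation, the following walks a small tree and adds up the per-node values. The `PlanNode` class and the numbers in the example plan are assumptions made purely for illustration; the engine's real plan representation is not described in the text.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Hypothetical logical-execution-tree node (an assumption for illustration)."""
    operation: str
    second_memory_consumption: float
    children: list["PlanNode"] = field(default_factory=list)


def first_memory_consumption(root: PlanNode) -> float:
    """Sum the second memory consumption over every node in the tree."""
    total = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        total += node.second_memory_consumption
        stack.extend(node.children)
    return total


# Assumed plan: an aggregation over a join of two scans.
plan = PlanNode("Aggregate", 20.0, [
    PlanNode("HashJoin", 30.0, [
        PlanNode("TableScan", 100.0),
        PlanNode("TableScan", 50.0),
    ]),
])
total_consumption = first_memory_consumption(plan)
```

An explicit stack is used instead of recursion so that deep plans do not run into Python's recursion limit; the traversal order does not matter for a sum.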

Thus, in the embodiments of the present invention, the analysis process by which the first engine actually executes the data operation statement is reproduced. This yields the resource consumption of the first engine during actual execution and the real memory resource consumption of executing the statement, which in turn ensures intelligent engine selection.

FIG. 5 is a schematic flowchart of step 322 in the embodiment shown in FIG. 4.

As shown in FIG. 5, on the basis of the embodiment shown in FIG. 3 above, in an exemplary embodiment of the present invention, step 322, in which the first memory consumption of executing the data operation statement is determined based on the metadata of the target data and the logical execution tree, may specifically include the following steps:

Step 3221: take the amount of data that the task corresponding to the node needs to process as the memory consumption, obtaining the third memory consumption corresponding to the node.

In a specific implementation, the device cluster 102 may determine the amount of data that the task corresponding to the node needs to process and use that amount as the node's third memory consumption.

Step 3222: determine the data processing operation under the task corresponding to the node.

In the embodiments of the present invention, the nodes of the logical execution tree represent several types of data processing operations, and a node usually indicates one data processing operation. Specifically, there may be three types. Type 1: data scanning (TableScan). Type 2: local processing, for example filtering (Filter) and projection (Project). Type 3: network communication, for example table join (JOIN) and data aggregation (Aggregate).

As shown in FIG. 6, the data processing operations include data scanning (TableScan), filtering (Filter), projection (Project), the hash join (HashJoin) that implements a table join, and data aggregation (Aggregate). Correspondingly, the third memory consumptions are cost(TableScan), cost(Filter), cost(Project), cost(HashJoin), and cost(Aggregate).

Step 3223: based on the data processing operation under the task corresponding to the node, determine the correction value for the third memory consumption.

For type 1: data scanning.

The data is not loaded into memory all at once; instead, maxPartition (the maximum degree of parallelism) determines the proportion of the data that is loaded into memory at the same time.

Therefore, when a node's data processing operation is a data scan, that is, type 1, the node's correction value represents the degree of data parallelism. For example, the correction value may be W1, where W1 = 1/maxPartition, so that it reflects the actual memory usage.

For type 2: local processing.

The first engine processes the task corresponding to a node in parallel, and memory is occupied and released concurrently; the Trino engine, for example, processes the task corresponding to a node in parallel on multiple worker nodes. When calculating memory resource consumption, it is therefore inaccurate to consider only the total amount of memory requested; the memory released while the first engine processes the task in parallel must be taken into account as well.

Therefore, when a node's data processing operation is a local processing operation, that is, type 2, the node's correction value represents the ratio between memory requested and memory released while the first engine executes the task corresponding to the node, so that it reflects the actual memory usage. For example, the correction value may be W (also called the second value), where W = r1/T1, r1 denotes the real memory consumption while the first engine executes the actual task corresponding to the node, and T1 denotes the total amount of memory requested during that process. As shown in FIG. 7, the correction value for the node representing filtering (Filter) is W(Filter), and the correction value for the node representing projection (Project) is W(Project).
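One possible way to obtain W = r1/T1 from an observed run is sketched below. The text does not say how r1 is measured; this sketch assumes, purely for illustration, that r1 is the peak amount of memory held at any one time and that the task's memory activity is available as a list of signed byte counts (positive for a request, negative for a release). Both the trace format and the function name are hypothetical.

```python
def second_value(events: list[int]) -> float:
    """Estimate W = r1 / T1 from a trace of memory requests and releases.

    Assumptions (not from the source): r1 is taken to be the peak memory in
    use, and T1 is the sum of all requested amounts.
    """
    in_use = 0
    peak = 0
    total_requested = 0
    for delta in events:
        if delta > 0:
            total_requested += delta
        in_use += delta
        peak = max(peak, in_use)
    if total_requested == 0:
        raise ValueError("the trace contains no memory requests")
    return peak / total_requested


# Two 100-byte buffers, each released before the next request:
# the peak is 100 bytes while 200 bytes were requested in total.
w_filter = second_value([100, -100, 100, -100])
```

Under this reading, a task that frees memory as it goes has W well below 1, which is why multiplying the requested total by W gives a lower, more realistic estimate.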

For type 3: network communication.

On the one hand, the actual task corresponding to the node is executed in parallel by multiple nodes; on the other hand, network communication receives a large amount of data, so the memory actually occupied is N times the input data (N may also be called the first value). Here, N can be expressed as the number of times the task corresponding to the node requests memory during execution.

Therefore, when a node's data processing operation is based on network communication, that is, type 3, the node's correction value represents two things: the ratio between memory requested and memory released while the first engine executes the task corresponding to the node, and the additional memory occupied through network communication during that execution (expressed as the number of memory requests). Together these reflect the actual memory usage.

For example, the correction value may include the second value W and the first value N, where W is as described above and is not repeated here. Specifically, the correction value may be W*N. As shown in FIG. 7, the correction value for the node implementing the hash join (HashJoin) is W(HashJoin)*N(HashJoin), and the correction value for the data aggregation (Aggregate) node is W(Aggregate)*N(Aggregate).

Step 3224: correct the third memory consumption based on the correction value to determine the second memory consumption corresponding to the node.

Specifically, the device cluster 102 takes the product of the node's correction value and its third memory consumption as the node's second memory consumption.

As shown in FIG. 6, the second memory consumption of a type 1 node is cost(TableScan)/maxPartition; the second memory consumptions of type 2 nodes are cost(Filter)*W(Filter) and cost(Project)*W(Project); and the second memory consumptions of type 3 nodes are cost(HashJoin)*W(HashJoin)*N(HashJoin) and cost(Aggregate)*W(Aggregate)*N(Aggregate).
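Steps 3221 to 3224 can be summarized in a short sketch that picks a correction value by operation type and multiplies it into the uncorrected cost. The function and parameter names are hypothetical, the operation names are written with their conventional spellings (TableScan, HashJoin, Aggregate), and the numeric inputs are made-up values chosen only to show the arithmetic.

```python
LOCAL_OPERATIONS = {"Filter", "Project"}          # type 2: local processing
NETWORK_OPERATIONS = {"HashJoin", "Aggregate"}    # type 3: network communication


def correction_value(operation: str, *, max_partition: int = 1,
                     w: float = 1.0, n: int = 1) -> float:
    """Return the correction value for one node.

    max_partition: maximum degree of parallelism (type 1).
    w: second value W = r1 / T1 (types 2 and 3).
    n: first value N, the number of memory requests (type 3).
    """
    if operation == "TableScan":                   # type 1: data scanning
        return 1.0 / max_partition
    if operation in LOCAL_OPERATIONS:
        return w
    if operation in NETWORK_OPERATIONS:
        return w * n
    raise ValueError(f"unknown operation: {operation}")


def second_memory_consumption(third_memory_consumption: float,
                              operation: str, **factors) -> float:
    """Second memory consumption = third memory consumption * correction value."""
    return third_memory_consumption * correction_value(operation, **factors)


scan_cost = second_memory_consumption(1024, "TableScan", max_partition=4)
filter_cost = second_memory_consumption(512, "Filter", w=0.5)
join_cost = second_memory_consumption(1000, "HashJoin", w=0.5, n=3)
```

Keeping the correction value separate from the multiplication mirrors the split between step 3223 and step 3224, so a different weighting scheme could be swapped in without touching the final product.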

Thus, in the embodiments of the present invention, a correction value is obtained by considering the memory resources each node consumes during actual task execution, and the memory resource consumption is corrected based on that value. This yields each node's resource consumption during actual task execution and the real memory resource consumption of executing the data operation statement, which in turn ensures intelligent engine selection.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the numbering should not limit the implementation of the embodiments of the present invention in any way.

Based on the calculation engine determination method provided above, a specific application of the method is now described. FIG. 6 is a schematic diagram of a method for selecting between the Trino engine and the Hive engine provided by an embodiment of the present invention. As shown in FIG. 6, the method includes the following:

1. The Hive engine serves as the unified SQL entry point and receives the SQL statement. It hands the SQL statement to the Trino engine for processing.

2. The Trino engine parses the SQL statement to generate an AST, and generates a logical execution tree based on the AST.

3. The Trino engine computes the memory consumption of executing the SQL statement.

4. The Trino engine compares the memory consumption with the remaining amount of the total memory allocated to it. If the consumption is smaller, the task is considered to involve a small amount of data and is submitted to Trino for execution to obtain the best performance and efficiency. If the consumption is larger, the task is considered to involve a large amount of data, and the SQL statement is handed to the Hive engine for execution to ensure stability under large data volumes.
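The comparison in item 4 reduces to a single threshold test, sketched below. The function name and the returned engine labels are assumptions for illustration. The tie case (consumption equal to remaining memory) is routed to Hive, following the "less than or equal to" wording of claim 2.

```python
def choose_engine(estimated_consumption: int, remaining_memory: int) -> str:
    """Pick the engine for a statement from its estimated memory consumption.

    Returns "trino" (memory-based first engine) only when the remaining memory
    strictly exceeds the estimate; otherwise "hive" (disk-based second engine).
    """
    if remaining_memory > estimated_consumption:
        return "trino"
    return "hive"


GIB = 1024 ** 3
small_query = choose_engine(estimated_consumption=2 * GIB, remaining_memory=8 * GIB)
large_query = choose_engine(estimated_consumption=16 * GIB, remaining_memory=8 * GIB)
```

Because the remaining memory changes as other statements run, the same statement may be routed to different engines at different times.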

In the embodiments of the present invention, the engine that executes the SQL statement can be selected from the Trino engine and the Hive engine by comparing the Trino engine's remaining memory with the memory the Trino engine would consume executing the SQL statement. The SQL statement can then be processed quickly by the Trino engine or processed stably by the Hive engine, which achieves intelligent engine selection and a balance between efficiency and stability.

Based on the same concept as the method embodiments of the present invention, an embodiment of the present invention further provides a calculation engine determination apparatus. The apparatus includes several modules, each used to perform a step of the calculation engine determination method provided in the first aspect of the embodiments of the present invention; the division into modules is not limited here. For the specific functions performed by each module of the apparatus and the beneficial effects achieved, refer to the functions of steps 310 to 340b of the calculation engine determination method described above, which are not repeated here.

FIG. 8 is a schematic structural diagram of a calculation engine determination apparatus provided by an embodiment of the present invention.

As shown in FIG. 8, an embodiment of the present invention provides a calculation engine determination apparatus. A memory-based first engine is installed on the apparatus, the first engine is used to manage target data, and the apparatus includes:

a statement determination module 801, configured to acquire a data operation statement to be processed;

a resource consumption determination module 802, configured to determine a first memory consumption of the first engine executing the data operation statement, wherein the first engine is a memory-based engine and is used to process target data;

a remaining resource determination module 803, configured to determine the current remaining amount of the total memory allocated to the first engine;

an engine selection module 804, configured to determine that the first engine processes the data operation statement if the current remaining amount of memory is greater than the first memory consumption.

For the detailed functions of the statement determination module 801, the resource consumption determination module 802, the remaining resource determination module 803, and the engine selection module 804, refer to the description of steps 310 to 340b above; they are not repeated here.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into modules is used only as an example. In practice, the functions may be assigned to different modules as needed; that is, the internal structure of the apparatus may be divided into different modules to perform all or part of the functions described above. The modules in the embodiments may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware, for example the calculation engine determination apparatus may be several electronic devices in the device cluster 102, or in the form of software functional units, for example the calculation engine determination apparatus may be deployed in the first engine. In addition, the specific names of the functional units and modules serve only to distinguish them from one another and do not limit the protection scope of the present invention. For the specific working process of the modules in the apparatus, refer to the corresponding process in the foregoing method embodiments, which is not repeated here.

Based on the same concept as the method embodiments of the present invention, an embodiment of the present invention further provides an electronic device.

As shown in FIG. 9, the electronic device 900 includes a processor 901, a memory 902, and a network interface 903.

The processor 901 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

The memory 902 may include one or more computer program products, and a computer program product may include various forms of computer-readable storage media, which may be volatile memory, non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). One or more computer programs may be stored on the computer-readable storage medium. When the processor 901 executes the computer programs, the steps of the above embodiments of the calculation engine determination method are implemented, for example steps 310 to 340b shown in FIG. 3.

For example, the computer program may be divided into one or more modules/units, which are stored in the memory 902 and executed by the processor 901 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions. For example, the computer program may be divided into the statement determination module 801, the resource consumption determination module 802, the remaining resource determination module 803, and the engine selection module 804; for the specific functions of each module, see the description above.

The network interface 903 is used to send and receive data, for example to send data processed by the processor 901 to other electronic devices, or to receive data sent by other electronic devices.

For simplicity, FIG. 9 shows only some of the components of the electronic device 900 that are relevant to the present invention and omits components such as buses and input/output interfaces. Depending on the application, the electronic device 900 may also include any other appropriate components. The electronic device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 9 is only an example of the electronic device 900 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device may also include an input device, an output device, a network access device, a bus, and the like. The input device may, for example, be a microphone array, and may also include a keyboard, a mouse, and so on. The output device may output various information to the outside and may include, for example, a display, a speaker, a printer, and a communication network together with the remote output devices connected to it.

In addition to the above method, apparatus, and electronic device, an embodiment of the present invention may further provide a computer program product that includes computer program instructions which, when run by a processor, cause the processor to perform the steps of the calculation engine determination method of the embodiments of the present invention. The computer program code of the computer program product for performing the operations of the embodiments of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer program code may be in source code form, object code form, an executable file, or some intermediate form. The computer program code may execute entirely on the electronic device, partly on the electronic device, as a stand-alone software package, partly on the electronic device and partly on a remote electronic device, or entirely on a remote electronic device.

In addition, the present invention further provides a device cluster 102. The device cluster 102 includes a plurality of electronic devices; for example, each electronic device is the electronic device 900 described above. The device cluster 102 is used to perform the steps of the calculation engine determination method of the embodiments of the present invention.

In practice, the processors 901 of the electronic devices 900 in the device cluster 102 on which the first engine is installed run the program of the first engine to perform the calculation engine determination method provided by the embodiments of the present invention. More specifically, the processor 901 of the target device hosting the scheduling node of the first engine runs the program of the first engine to perform the method.

Optionally, both the first engine and the second engine are installed on the device cluster 102. For example, both engines are installed on the target device described above.

Optionally, the device cluster 102 may be divided into a first cluster and a second cluster, with the first engine installed on the first cluster and the second engine installed on the second cluster.

Installation here can be understood as storing the engine's program in the memory 902 so that the processor 901 can run it and provide the functions the engine supports.

In addition, an embodiment of the present invention may further provide a computer-readable storage medium on which computer program instructions are stored; when run by a processor, the instructions cause the processor to perform the steps of the calculation engine determination method of the embodiments of the present invention. The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example and without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. Note that the content covered by the term computer-readable medium may be expanded or narrowed as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals.

在上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述或记载的部分，可以参见其它实施例的相关描述。In the above-mentioned embodiments, the descriptions of each embodiment have their own emphases, and for parts that are not detailed or recorded in a certain embodiment, refer to the relevant descriptions of other embodiments.

以上结合具体实施例描述了本发明的基本原理，但是，需要指出的是，在本发明中提及的优点、优势、效果等仅是示例而非限制，不能认为这些优点、优势、效果等是本发明的各个实施例必须具备的。另外，上述公开的具体细节仅是为了示例的作用和便于理解的作用，而非限制，上述细节并不限制本发明为必须采用上述具体的细节来实现。The basic principles of the present invention have been described above in conjunction with specific embodiments, but it should be pointed out that the advantages, advantages, effects, etc. mentioned in the present invention are only examples rather than limitations, and these advantages, advantages, effects, etc. Every embodiment of the invention must have. In addition, the specific details disclosed above are only for the purpose of illustration and understanding, rather than limitation, and the above details do not limit the present invention to be implemented by using the above specific details.

本发明中涉及的器件、装置、设备、系统的方框图仅作为例示性的例子并且不意图要求或暗示必须按照方框图示出的方式进行连接、布置、配置。如本领域技术人员将认识到的，可以按任意方式连接、布置、配置这些器件、装置、设备、系统。诸如“包括”、“包含”、“具有”等等的词语是开放性词汇，指“包括但不限于”，且可与其互换使用。这里所使用的词汇“或”和“和”指词汇“和/或”，且可与其互换使用，除非上下文明确指示不是如此。这里所使用的词汇“诸如”指词组“诸如但不限于”，且可与其互换使用。The block diagrams of devices, devices, equipment, and systems involved in the present invention are only illustrative examples and are not intended to require or imply that they must be connected, arranged, and configured in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, devices, devices, systems may be connected, arranged, configured in any manner. Words such as "including", "comprising", "having" and the like are open-ended words meaning "including but not limited to" and may be used interchangeably therewith. As used herein, the words "or" and "and" refer to the word "and/or" and are used interchangeably therewith, unless the context clearly dictates otherwise. As used herein, the word "such as" refers to the phrase "such as but not limited to" and can be used interchangeably therewith.

It should also be noted that, in the apparatuses, devices, and methods of the present invention, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalent solutions of the present disclosure.

The above description has been given for purposes of illustration and description. Furthermore, it is not intended to limit the embodiments of the present invention to what is described above. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

It can be understood that the various numerals used in the embodiments of the present invention are only distinctions made for convenience of description and are not intended to limit the scope of the embodiments of the present invention.

Claims

1. A method for computing engine determination, the method comprising:

acquiring a data operation statement to be processed;

determining a first memory consumption of a first engine executing the data operation statement, wherein the first engine is a memory-based engine used for processing target data;

determining a current remaining memory amount of the total memory allocated to the first engine;

and if the current remaining memory amount is greater than the first memory consumption, determining that the first engine processes the data operation statement.

2. The method of claim 1, further comprising:

if the current remaining memory amount is less than or equal to the first memory consumption, determining that a second engine processes the data operation statement;

wherein the second engine is a disk-based engine used for processing the target data.
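The selection rule of claims 1 and 2 can be illustrated with a minimal Python sketch. The names `EngineChoice` and `choose_engine`, the engine labels `"memory"` and `"disk"`, and the use of bytes as the unit are assumptions made for illustration only; the claims do not prescribe any particular implementation.

```python
from dataclasses import dataclass


@dataclass
class EngineChoice:
    """Result of the selection; field names are hypothetical."""
    engine: str  # "memory" (first engine) or "disk" (second engine)
    reason: str


def choose_engine(first_memory_consumption: int, remaining_memory: int) -> EngineChoice:
    """Pick the memory-based engine only if the estimate strictly fits.

    Both arguments are assumed to be in the same unit (for example bytes).
    """
    if remaining_memory > first_memory_consumption:
        return EngineChoice("memory", "estimate fits in remaining memory")
    # Less than or equal: fall back to the disk-based engine (claim 2).
    return EngineChoice("disk", "estimate does not fit in remaining memory")
```

Note that the comparison is strict: when the remaining memory exactly equals the estimate, the sketch falls back to the disk-based engine, matching the "less than or equal to" condition of claim 2.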

3. The method of claim 1 or 2, wherein determining the first memory consumption of the first engine executing the data operation statement comprises:

determining a logical execution tree of the data operation statement, wherein the logical execution tree indicates the logical flow of the data processing represented by the data operation statement;

and determining the first memory consumption of the first engine executing the data operation statement based on the metadata of the target data and the logical execution tree.

4. The method of claim 3, wherein determining the first memory consumption of the first engine executing the data operation statement based on the metadata of the target data and the logical execution tree comprises:

for each node in the logical execution tree, determining a data amount corresponding to the node based on the metadata of the target data, and determining a second memory consumption corresponding to the node based on the data amount corresponding to the node, wherein the second memory consumption indicates the memory consumed when the first engine executes the task corresponding to the node;

and determining the first memory consumption of the first engine executing the data operation statement based on the respective second memory consumption of each node in the logical execution tree.

5. The method according to claim 4, wherein determining the second memory consumption corresponding to the node based on the data amount corresponding to the node comprises:

taking the data amount to be processed by the task corresponding to the node as memory consumption, to obtain a third memory consumption corresponding to the node;

determining a data processing operation corresponding to the node;

determining a correction value corresponding to the third memory consumption based on the data processing operation corresponding to the node;

and correcting the third memory consumption based on the correction value to obtain the second memory consumption corresponding to the node.

6. The method according to claim 4 or 5, wherein determining the first memory consumption of the first engine executing the data operation statement based on the respective second memory consumption of each node in the logical execution tree comprises:

summing the respective second memory consumption of each node in the logical execution tree, and taking the sum as the first memory consumption for executing the data operation statement.
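The per-node estimation and summation of claims 4 and 6 can be sketched as a tree walk. This is an illustrative sketch only: `PlanNode`, its fields, and `first_memory_consumption` are hypothetical names, each node's `data_bytes` is assumed to have already been derived from the metadata of the target data, and the correction step of claim 5 is deliberately omitted so that the second memory consumption equals the raw data amount.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """One node of a logical execution tree (hypothetical structure)."""
    operation: str    # e.g. "scan", "join"; labels are illustrative
    data_bytes: int   # data amount for this node, assumed derived from metadata
    children: List["PlanNode"] = field(default_factory=list)


def second_memory_consumption(node: PlanNode) -> int:
    # Simplifying assumption: no correction, consumption equals data amount.
    return node.data_bytes


def first_memory_consumption(root: PlanNode) -> int:
    """Sum the per-node consumption over the whole tree (claim 6)."""
    total = second_memory_consumption(root)
    for child in root.children:
        total += first_memory_consumption(child)
    return total


# Usage: a join over two scans.
plan = PlanNode("join", 300, [PlanNode("scan", 1000), PlanNode("scan", 500)])
estimate = first_memory_consumption(plan)  # 300 + 1000 + 500
```

A plain sum treats every node as if its memory were held at the same time, which makes the estimate conservative.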

7. The method according to claim 5 or 6, wherein determining the correction value corresponding to the third memory consumption based on the data processing operation corresponding to the node comprises:

if the data processing operation is scanning, the correction value is the reciprocal of the data parallelism; or

if the data processing operation is based on network communication, the correction value comprises a first value and a second value, wherein the first value indicates the number of times memory is requested while the first engine executes the task corresponding to the node, and the second value indicates the proportion of memory requested and released while the first engine executes the task corresponding to the node; or

if the data processing operation is a local processing operation, the correction value is the second value.
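The three correction cases of claim 7 can be sketched as follows. The claims do not state how the correction value is applied or how the first and second values are combined; this sketch assumes both are multiplicative, which is an assumption and not part of the source. The operation labels (`"scan"`, `"network"`, `"local"`) and the parameter names `parallelism`, `alloc_count`, and `alloc_release_ratio` are likewise hypothetical.

```python
def correction_value(operation: str,
                     parallelism: int = 1,
                     alloc_count: int = 1,
                     alloc_release_ratio: float = 1.0) -> float:
    """Return a multiplicative correction factor for one node.

    alloc_count stands in for the "first value" and alloc_release_ratio
    for the "second value"; multiplying them is an assumption.
    """
    if operation == "scan":
        return 1.0 / parallelism                   # reciprocal of data parallelism
    if operation == "network":
        return alloc_count * alloc_release_ratio   # first value and second value
    if operation == "local":
        return alloc_release_ratio                 # second value only
    raise ValueError(f"unknown operation: {operation}")


def corrected_memory_consumption(third_memory_consumption: float,
                                 operation: str, **params) -> float:
    """Apply the correction to the raw (third) estimate, assumed multiplicative."""
    return third_memory_consumption * correction_value(operation, **params)
```

For example, under these assumptions a scan over 1000 units of data with a parallelism of 4 would be estimated at 250 units per task.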

8. The method according to any one of claims 1 to 7, wherein acquiring the data operation statement to be processed comprises:

taking a data operation statement received from a terminal as the data operation statement to be processed, or taking a data operation statement received by the second engine as the data operation statement to be processed.

9. The method according to any one of claims 1 to 8, wherein, after determining that the first engine processes the data operation statement when the current remaining memory amount is greater than the first memory consumption, the method further comprises:

determining a task for each node in the logical execution tree, the task being executed by the first engine.

10. A computing engine determination device, comprising a processor and a memory;

the memory stores program instructions;

the processor is configured to execute the program instructions to cause the device to perform the method of any one of claims 1 to 9.