CN118227656A

CN118227656A - Query method and device based on data lake

Info

Publication number: CN118227656A
Application number: CN202410650121.4A
Authority: CN
Inventors: 陈刚; 陈纯; 伍赛; 赵俊博; 张东祥; 唐秀; 宋明黎; 高云君
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2024-05-24
Filing date: 2024-05-24
Publication date: 2024-06-21
Anticipated expiration: 2044-05-24
Also published as: CN118227656B

Abstract

The present invention discloses a query method and device based on a data lake, including: sampling the target data set of the query according to the user input, obtaining the pattern information M and the data sample information, thereby constructing the query; decomposing the query into a number of subtasks, thereby constructing a processing graph; correcting the processing graph, using the shuffle technology and/or the collapse technology, and optimizing the corrected processing graph in combination with the cost model; generating and executing the code according to the optimized processing graph to output the user query result. The present invention does not require an intermediary mode, simplifies the query process, does not require data conversion and loading, simplifies the operation, and improves the query efficiency as a whole. In terms of query details, a query optimizer for LLM generated code is designed, which greatly improves the execution efficiency of the LLM generated code and the interpretability of the corresponding method, wherein the correction of the processing graph to assist LLM can improve the query accuracy, so that the accuracy of the entire natural language query task exceeds the traditional method.

Description

A query method and device based on data lake

技术领域Technical Field

本发明涉及数据管理技术领域，尤其涉及一种基于数据湖的查询方法和装置。The present invention relates to the field of data management technology, and in particular to a query method and device based on a data lake.

背景技术Background technique

数据湖被设计成一个综合性的存储中心，旨在适应、管理和保护大量的数据，不论其结构如何。无论数据是精心组织的、半结构化的，还是完全非结构化的，数据湖都能够以其原始格式保存这些数据。此外，它擅长处理广泛的数据类型，不受大小限制。然而，提供数据湖服务需要复杂的工程努力。传统数据湖通常采用两种不同的方法。在这两种策略中，定义统一的、中介的模式并建立每个源模式与这个中心模式之间的映射关系是必要的。在查询驱动的方法中，针对中介模式的任何查询都会根据这些映射关系转换为每个源数据集的相应查询。然后这些查询由源系统单独处理。数据湖的作用是将结果合并成一个连贯的输出。相反，在数据驱动的模型中，我们将所有传入的数据转换以与中介模式一致，然后将它们加载到数据湖内的宿主系统中。这里，宿主系统负责处理所有查询，提供一个集中的数据处理解决方案。Data lakes are designed as comprehensive storage centers that are designed to accommodate, manage, and protect large amounts of data, regardless of their structure. Whether the data is well-organized, semi-structured, or completely unstructured, the data lake is able to preserve this data in its original format. In addition, it excels at handling a wide range of data types, regardless of size. However, providing data lake services requires complex engineering efforts. Traditional data lakes usually take two different approaches. In both strategies, it is necessary to define a unified, intermediary schema and establish mapping relationships between each source schema and this central schema. In the query-driven approach, any query against the intermediary schema is converted into a corresponding query for each source dataset based on these mapping relationships. These queries are then processed separately by the source systems. The role of the data lake is to merge the results into a coherent output. In contrast, in the data-driven model, we transform all incoming data to be consistent with the intermediary schema and then load them into the host system within the data lake. Here, the host system is responsible for processing all queries, providing a centralized data processing solution.

实施数据湖系统面临三个主要挑战：There are three main challenges in implementing a data lake system:

1、为模式定义需要专家知识。制定适当的中介模式需要广泛的领域专业知识和对所有源数据的全面了解。达到这种洞察力可能并非总是可行的。1. Expert knowledge is required for schema definition. Developing an appropriate mediation schema requires extensive domain expertise and a comprehensive understanding of all source data. Achieving this insight may not always be feasible.

2、模式映射的复杂性。即使有半自动工具的帮助，建立模式映射也是一个复杂和劳动密集的任务。这一过程经常导致显著的处理开销，并需要大量的人工干预。2. Complexity of schema mapping. Even with the help of semi-automatic tools, creating schema mapping is a complex and labor-intensive task. This process often results in significant processing overhead and requires a lot of manual intervention.

3、数据转换困境。这个挑战有两面，取决于所选择的方法。在数据驱动的方法中，数据转换和加载的过程是资源密集型和昂贵的。另一方面，查询驱动的方法引入了查询重写和结果合并的复杂性，使得确保查询结果的正确性和可追溯性变得具有挑战性。3. Data conversion dilemma. This challenge has two sides, depending on the chosen approach. In the data-driven approach, the process of data conversion and loading is resource-intensive and expensive. On the other hand, the query-driven approach introduces the complexity of query rewriting and result merging, making it challenging to ensure the correctness and traceability of query results.

发明内容Summary of the invention

本发明的目的在于针对当前多模态数据库领域中的模式定义困难、模式映射复杂、数据转化困难的问题，提供一种基于数据湖的查询方法和装置。The purpose of the present invention is to provide a query method and device based on a data lake to address the problems of difficult schema definition, complex schema mapping, and difficult data conversion in the current multimodal database field.

本发明的目的是通过以下技术方案来实现的：一种基于数据湖的查询方法，包括：The object of the present invention is to achieve the following technical solution: a query method based on a data lake, comprising:

（1）根据用户输入，对查询的目标数据集进行采样，得到模式信息M和数据样本信息S，从而构建查询Q；(1) Based on the user input, the target data set of the query is sampled to obtain the pattern information M and data sample information S, thereby constructing the query Q;

（2）将查询Q分解成若干个子任务，从而构建处理图G；(2) Decompose the query Q into several subtasks to construct a processing graph G;

（3）修正处理图G，采用shuffle技术和/或Collapse技术，并结合成本模型对修正后的处理图G进行优化；(3) Modify the processing graph G, use the shuffle technology and/or the collapse technology, and optimize the modified processing graph G in combination with the cost model;

（4）根据优化后的处理图生成代码并执行，以输出用户查询结果。(4) Generate code based on the optimized processing graph and execute it to output the user query results.

进一步地，从目标数据集中进行数据采样，获取最具代表性的数据样本，其核心逻辑在于数据特征统计和空值去除。此外，遵循主外键关系从关联的数据中检索可连接的元组，整合数据关联信息，加入到模式信息M，最终形成查询Q。Furthermore, data sampling is performed from the target data set to obtain the most representative data samples. The core logic is data feature statistics and null value removal. In addition, the primary and foreign key relationships are followed to retrieve connectable tuples from the associated data, and the data association information is integrated and added to the pattern information M to finally form the query Q.

进一步地，所述将查询Q分解成若干个子任务，包括：Furthermore, the query Q is decomposed into several subtasks, including:

使用大模型将查询Q分解成若干个子任务。Use the big model to decompose the query Q into several subtasks.

进一步地，所述大模型还经过如下训练：Furthermore, the large model is trained as follows:

以子任务和对应的操作符为训练集，用于训练大模型；The subtasks and corresponding operators are used as training sets to train the large model;

训练好的大模型用于将所述子任务转化为对应的操作符。The trained large model is used to convert the subtasks into corresponding operators.

进一步地，所述修正处理图G包括：形式修正、逻辑修正和全局修正；所述形式修为针对处理图中子任务描述引用的表名、列名进行检查；所述逻辑修正为针对子任务描述中的条件进行检查；所述全局修正为针对子任务之间的关系进行检查。Furthermore, the modified processing graph G includes: formal correction, logical correction and global correction; the formal correction is to check the table name and column name referenced by the subtask description in the processing graph; the logical correction is to check the conditions in the subtask description; the global correction is to check the relationship between the subtasks.

所述采用shuffle技术和/或Collapse技术，并结合成本模型对修正后的处理图G进行优化，包括：The shuffle technology and/or collapse technology are used, and the modified processing graph G is optimized in combination with the cost model, including:

采用shuffle技术，并结合成本模型改变处理图G中的操作符的顺序来优化查询计划；The shuffle technology is used, and the order of operators in the processing graph G is changed in combination with the cost model to optimize the query plan;

采用Collapse技术，并结合成本模型循环计算合并多个操作符带来的成本收益，从而优化查询计划；Collapse technology is used, and the cost model is combined to cyclically calculate the cost benefits of merging multiple operators, thereby optimizing the query plan;

其中，所述操作符由所述子任务转化而来。The operator is transformed from the subtask.

进一步地，还包括：重复执行步骤（2）和步骤（3），得到多个优化后的处理图，并选取最优的处理图执行步骤（4）。Furthermore, the method further includes: repeatedly executing step (2) and step (3) to obtain multiple optimized processing graphs, and selecting the optimal processing graph to execute step (4).

进一步地，所述最优的处理图是通过成本模型从多个优化后的处理图选取的。Furthermore, the optimal processing graph is selected from a plurality of optimized processing graphs through a cost model.

进一步地，还包括：在生成代码之后进行代码验证，若验证失败，则重新执行步骤（4），若仍然失败，则重新执行步骤（2）-步骤（4）。Furthermore, the method further includes: performing code verification after the code is generated, and if the verification fails, re-executing step (4); if the verification still fails, re-executing steps (2) to (4).

进一步地，所述进行代码验证，包括：Furthermore, the code verification includes:

根据生成的代码和对应的优化后的处理图进行操作符级别的验证。Operator-level verification is performed based on the generated code and the corresponding optimized processing graph.

本发明还提供了一种基于数据湖的查询装置，包括：The present invention also provides a query device based on a data lake, comprising:

预处理模块，用于根据用户输入，对查询的目标数据集进行采样，得到模式信息M和数据样本信息S，从而构建查询Q；The preprocessing module is used to sample the target data set of the query according to the user input, obtain the pattern information M and data sample information S, and thus construct the query Q;

处理图构建模块，用于将查询Q分解成若干个子任务，从而构建处理图G；The processing graph construction module is used to decompose the query Q into several subtasks to construct the processing graph G;

处理图优化模块，用于修正处理图G，采用shuffle技术和/或Collapse技术，并结合成本模型对修正后的处理图G进行优化；A processing graph optimization module is used to modify the processing graph G, and optimize the modified processing graph G by using the shuffle technology and/or the collapse technology in combination with the cost model;

代码生成和执行模块，用于根据优化后的处理图生成代码并执行，以输出用户查询结果。The code generation and execution module is used to generate and execute code according to the optimized processing graph to output user query results.

与现有技术相比，本发明的有益效果是：与现有数据湖查询方法相比，无需中介模式，简化了查询过程，不需要数据转换和加载，简化了操作，从整体上提高了查询效率。Compared with the prior art, the beneficial effects of the present invention are: compared with the existing data lake query method, no intermediary mode is required, the query process is simplified, data conversion and loading are not required, the operation is simplified, and the query efficiency is improved overall.

在查询细节上，设计了针对LLM生成代码的查询优化器，极大提高了LLM生成代码的执行效率和对应方法的可解释性，其中对处理图进行修正以辅助LLM能极大提高查询准确性，使整个自然语言查询任务的准确性超过传统方法。In terms of query details, a query optimizer was designed for LLM generated code, which greatly improved the execution efficiency of LLM generated code and the interpretability of the corresponding method. Among them, modifying the processing graph to assist LLM can greatly improve query accuracy, making the accuracy of the entire natural language query task exceed that of traditional methods.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for use in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without creative work.

图1是本发明实施例提供的一种基于数据湖的查询方法流程示意图；FIG1 is a schematic diagram of a query method based on a data lake according to an embodiment of the present invention;

图2是本发明实施例提供的查询优化示例图；FIG2 is a diagram showing an example of query optimization provided by an embodiment of the present invention;

图3是本发明的功能模块图。FIG. 3 is a functional module diagram of the present invention.

具体实施方式Detailed ways

下面结合附图，对本发明进行详细说明。在不冲突的情况下，下述的实施例及实施方式中的特征可以相互组合。The present invention is described in detail below in conjunction with the accompanying drawings. In the absence of conflict, the features of the following embodiments and implementations can be combined with each other.

本发明的一种基于数据湖的查询方法，如图1所示，包括：A query method based on a data lake of the present invention, as shown in FIG1 , includes:

例如，当用户对目标数据集提出查询任务：查询学校教师总人数，之后会对整个数据库的相关表（基于关键词的相似性比较进行简单过滤）进行扫描，获取每列的列名，类型，主外键关系，统称为模式信息M，并进行数据扫描采样，根据统计获取具有代表性的统计信息（如评分等级区间1-5），主要高频值等，统称为数据样本信息S。For example, when a user asks for a query on the target data set: to query the total number of teachers in a school, the relevant tables of the entire database will be scanned (based on simple filtering based on keyword similarity comparison) to obtain the column name, type, and primary and foreign key relationships of each column, collectively referred to as pattern information M, and data scanning and sampling will be performed to obtain representative statistical information (such as the rating level range 1-5), main high-frequency values, etc., collectively referred to as data sample information S.

具体的，使用大型语言模型（LLMs）将查询Q分解成若干个子任务J_i，以构建处理图G。其中大模型还经过如下预训练：将操作符数据库的操作符列表（包含详细用法）和查询处理实例训练LLM，训练后的LLM用于将输入的查询Q转换为多个子任务组成的处理图G。Specifically, a large language model (LLM) is used to decompose the query Q into several subtasks _Ji to construct a processing graph G. The large model is also pre-trained as follows: the operator list of the operator database (including detailed usage) and query processing examples are used to train the LLM, and the trained LLM is used to convert the input query Q into a processing graph G composed of multiple subtasks.

操作使用一个操作符，操作符的描述信息具体来说，类似于数据库中的执行计划，我们将整个查询逻辑分解为若干子操作，每个操作符中将描述操作的具体条件，例如filter操作符，其中将包含对数据进行过滤的确切条件，一个完整的查询任务，可能由read（读取数据集），filter（过滤条件），select（挑选目标列），write（写结果到文件）以及其他各种查询操作组成。The operation uses an operator. The description information of the operator is similar to the execution plan in the database. We decompose the entire query logic into several sub-operations. Each operator will describe the specific conditions of the operation. For example, the filter operator will contain the exact conditions for filtering data. A complete query task may consist of read (reading data sets), filter (filtering conditions), select (selecting target columns), write (writing results to files) and various other query operations.

在另一实施例中，还包括：构建离线操作符数据库，该数据库包含了从一系列典型的自然语言查询（即用户输入）中衍生的典型查询对(Ji, Gi)，Gi表示基于子任务Ji构建的处理图，这里的处理图来自预包含的多个查询示例，其中基本包含了各种操作符的结合示例，此外，也会根据步骤（1）中的用户输入和步骤（2）中构建的新的处理图更新该数据库。In another embodiment, it also includes: constructing an offline operator database, which includes typical query pairs (Ji, Gi) derived from a series of typical natural language queries (i.e., user input), Gi represents a processing graph constructed based on subtask Ji, and the processing graph here comes from a plurality of pre-included query examples, which basically include examples of combinations of various operators. In addition, the database will also be updated according to the user input in step (1) and the new processing graph constructed in step (2).

（3）修正处理图G，并采用shuffle技术和/或Collapse技术，结合成本模型对修正后的处理图G进行优化；(3) Modify the processing graph G, and use the shuffle technology and/or collapse technology in combination with the cost model to optimize the modified processing graph G;

LLM生成的子任务是不可靠的，因此必须对其进行逻辑修正以提高其准确率，修正主要包括形式修正、逻辑修正和全局修正，形式修正针对处理图中子任务描述引用的表名、列名进行检查，并尝试进行修正，无法修正则返回步骤二进行重试；逻辑修正则针对子任务描述中的条件进行检查，例如join操作条件是否符合外键关系，read操作对不同模态数据的操作描述是否正确；全局修正则针对子任务之间的关系进行检查，如后续操作的目标表和列是否在前面任务中读取或生成。The subtasks generated by LLM are unreliable, so they must be logically corrected to improve their accuracy. The corrections mainly include formal correction, logical correction and global correction. Formal correction checks the table names and column names referenced by the subtask description in the processing diagram and attempts to correct them. If the correction cannot be made, return to step 2 and try again. Logical correction checks the conditions in the subtask description, such as whether the join operation conditions comply with the foreign key relationship and whether the read operation description of different modal data is correct. Global correction checks the relationship between subtasks, such as whether the target table and column of the subsequent operation are read or generated in the previous task.

在完成修正后，对处理图进一步进行逻辑优化，以提高效率，具体的，采用shuffle技术，并结合成本模型改变处理图G中的操作符的顺序来优化查询计划；采用Collapse技术，并结合成本模型循环计算合并多个操作符带来的成本收益，从而优化查询计划；根据具体问题，采用shuffle技术和Collapse技术中的一种或者同时采用，以优化处理图G，从而优化查询计划。After the correction is completed, the processing graph is further logically optimized to improve efficiency. Specifically, the shuffle technology is used, and the order of operators in the processing graph G is changed in combination with the cost model to optimize the query plan; the collapse technology is used, and the cost model is combined to cyclically calculate the cost benefits of merging multiple operators to optimize the query plan; according to the specific problem, one of the shuffle technology and the collapse technology is used or both are used to optimize the processing graph G, thereby optimizing the query plan.

Shuffle技术的实现以通常情况下的join和filter操作符结合为例，如图2所示，从逻辑上LLM生成的处理图会尝试先执行join再执行filter，但是在多数情况下这种操作顺序导致效率低下。shuffle技术首先获取调整顺序后的可能更优状态（如常见的先执行filter再执行join），然后根据数据量，数据操作次数，操作行数，列数等信息计算CPU的总计算量，称为计算成本，比较两者成本，确定后者成本更低，将处理图转化为后者。The implementation of the shuffle technology takes the combination of join and filter operators as an example, as shown in Figure 2. Logically, the processing graph generated by LLM will try to execute join first and then filter, but in most cases this operation order leads to low efficiency. The shuffle technology first obtains a possible better state after adjusting the order (such as the common execution of filter first and then join), and then calculates the total CPU computing amount based on information such as the amount of data, the number of data operations, the number of rows and columns operated, which is called the computing cost. The costs of the two are compared, and the latter is determined to have a lower cost, and the processing graph is converted to the latter.

Collapse技术的实现以定义的合并规则为基础，通过成本模型计算确切成本再决定执行合并，具体而言，Collape技术考虑子查询展开和代码最佳实现两部分，子查询展开指查询任务中的子查询实现可能导致大量重复计算，需要将查询展开为非子查询形式，这基于大量子查询预定义规则进行获取等价形式。代码最佳实现则是以最终生成更高效的代码为目标，如group分组操作和max等聚合函数操作符中会单独呈现，但如果为每个操作符生成对应单独的代码将导致数据在内存中重复复制，更佳选择是合并两个操作符，最终使用一行代码达到目标。The implementation of Collapse technology is based on the defined merging rules. The exact cost is calculated through the cost model before deciding to execute the merge. Specifically, Collapse technology considers subquery expansion and optimal code implementation. Subquery expansion means that the subquery implementation in the query task may lead to a large number of repeated calculations, and the query needs to be expanded into a non-subquery form, which is based on a large number of subquery predefined rules to obtain an equivalent form. The optimal code implementation aims to eventually generate more efficient code. For example, group operations and aggregate function operators such as max will be presented separately, but if a separate code is generated for each operator, the data will be duplicated in memory. The better choice is to merge the two operators and finally use one line of code to achieve the goal.

在另一实施例中，还包括：重复执行步骤（2）和步骤（3），得到多个优化后的处理图，并选取最优的处理图执行步骤（4）。其中，最优的处理图是通过成本模型从多个优化后的处理图选取的。In another embodiment, the method further includes: repeatedly executing step (2) and step (3) to obtain multiple optimized processing graphs, and selecting the optimal processing graph to execute step (4). The optimal processing graph is selected from the multiple optimized processing graphs through a cost model.

具体的，根据优化后的处理图或者最优的处理图，生成运行状态s，迭代进行生成代码，代码生成过程是迭代进行的，从没有前驱的图顶点（操作符）开始，依序通过图进行。这创建了一个运行状态s = [J',G',C']，其中J'代表已经被现有子图G'完成的任务，C'表示当前的代码片段集合，每一个转换成代码的新操作符都会更新状态s，指导后续的代码生成工作。最后将代码片段链接成一个连贯的输出。在生成代码的每一步当中，我们会根据对应的操作符种类生成对应的定制提示，并选择最优的代码实现方式示例。Specifically, according to the optimized processing graph or the optimal processing graph, a running state s is generated, and the code is generated iteratively. The code generation process is iterative, starting from the graph vertex (operator) without a predecessor, and proceeding through the graph in sequence. This creates a running state s = [J', G', C'], where J' represents the task that has been completed by the existing subgraph G', and C' represents the current set of code snippets. Each new operator converted into code will update the state s to guide the subsequent code generation work. Finally, the code snippets are linked into a coherent output. In each step of generating code, we will generate corresponding customized prompts according to the corresponding operator type and select the best code implementation example.

在一实施例中，还包括：在生成代码之后进行代码验证，若验证失败，则重新执行步骤（4），若仍然失败，则重新执行步骤（2）-步骤（4）。In one embodiment, the method further includes: performing code verification after the code is generated, and if the verification fails, re-executing step (4); if the verification still fails, re-executing steps (2) to (4).

进行代码验证，包括：根据生成的代码和对应的优化后的处理图进行操作符级别的验证。具体的，我们基于处理图中的每个操作符按照顺序在代码中寻找对应实现，这是步骤验证，并尤其关注操作符的条件是否在代码中有对应准确实现即逻辑验证，例如filter的操作符条件会转化为代码语言，需要对两者进行判别，这个判别是相对宽松的，当在预设规则下无法判别时会允许进行下一步执行，确定能否得到有效结果。在操作符验证的同时，对代码的语法进行验证，最后执行代码，返回结果，根据用户反馈验证结果是否符合预期。Code verification is performed, including: operator-level verification based on the generated code and the corresponding optimized processing graph. Specifically, we search for the corresponding implementation in the code in order based on each operator in the processing graph. This is step verification, and we pay special attention to whether the operator conditions have corresponding accurate implementations in the code, that is, logical verification. For example, the operator conditions of the filter will be converted into code language, and the two need to be distinguished. This distinction is relatively loose. When it cannot be distinguished under the preset rules, the next step will be allowed to be executed to determine whether a valid result can be obtained. While verifying the operator, the syntax of the code is verified, and finally the code is executed and the result is returned. Verify whether the result meets expectations based on user feedback.

因此本发明通过为提供LLM来自源数据集的全面元数据，使其使用有详细语义说明的预定义的数据处理操作符来回答查询，其中还包括对处理图进行修正和优化以及对后续生成的代码进行验证。从而允许用户用自然语言提交查询，以提升直观性，完全绕过中介模式，简化了查询过程，不需要数据转换和加载，简化了操作。同时简化了代码生成过程，提高了性能和结果的准确性。Therefore, the present invention provides LLM with comprehensive metadata from the source data set, so that it uses predefined data processing operators with detailed semantic descriptions to answer queries, which also includes modifying and optimizing the processing graph and verifying the subsequently generated code. This allows users to submit queries in natural language to improve intuitiveness, completely bypasses the intermediary mode, simplifies the query process, does not require data conversion and loading, and simplifies operations. At the same time, the code generation process is simplified, improving performance and accuracy of results.

本发明还提供了一种基于数据湖的查询装置，如图3所示，包括：The present invention also provides a query device based on a data lake, as shown in FIG3 , comprising:

处理图优化模块，修正处理图G，采用shuffle技术和/或Collapse技术，并结合成本模型对修正后的处理图G进行优化；A processing graph optimization module modifies the processing graph G, uses the shuffle technology and/or the collapse technology, and optimizes the modified processing graph G in combination with the cost model;

需要说明的是，本实施例中示出的装置实施例与上述方法实施例的内容相匹配，可以参考上述方法实施例的内容，在此不再赘述。It should be noted that the device embodiment shown in this embodiment matches the content of the above method embodiment, and the content of the above method embodiment can be referred to, which will not be repeated here.

以上实施例仅用于说明本发明的设计思想和特点，其目的在于使本领域内的技术人员能够了解本发明的内容并据以实施，本发明的保护范围不限于上述实施例。所以，凡依据本发明所揭示的原理、设计思路所作的等同变化或修饰，均在本发明的保护范围之内。The above embodiments are only used to illustrate the design ideas and features of the present invention, and their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly. The protection scope of the present invention is not limited to the above embodiments. Therefore, any equivalent changes or modifications made based on the principles and design ideas disclosed by the present invention are within the protection scope of the present invention.

Claims

1. A data lake-based query method, comprising:

(1) Sampling the target data set to be queried according to user input to obtain mode information M and data sample information S, so as to construct query Q;

(2) Decomposing the query Q into a plurality of subtasks, thereby constructing a processing diagram G;

(3) Modifying the processing diagram G, adopting a shuffle technology and/or a Collapse technology, and optimizing the modified processing diagram G by combining a cost model;

(4) And generating codes according to the optimized processing diagram and executing the codes to output a user query result.

2. The method of claim 1, wherein data sampling is performed from a target data set to obtain a most representative data sample, and wherein core logic is data feature statistics and null removal; in addition, connectable tuples are retrieved from the associated data following the primary foreign key relationship, the data association information is integrated, and added to the pattern information M, ultimately forming a query Q.

3. The method of claim 1, wherein decomposing the query Q into a number of sub-tasks comprises:

the query Q is broken down into several subtasks using a large model.

4. A method according to claim 3, wherein the large model is further trained as follows:

The subtasks and the corresponding operators are used as training sets for training the large model;

the trained large model is used to translate the subtasks into corresponding operators.

5. The method of claim 1, wherein the modifying the process map G comprises: form correction, logic correction, and global correction; the form is modified to be checked for table names and column names of subtask description references in the processing diagram; the logic is modified to check for conditions in the subtask description; the global correction is to check the relation among subtasks;

the optimization of the corrected processing graph G by adopting a shuffle technology and/or a Collapse technology and combining a cost model comprises the following steps:

Adopting a shuffle technology, and changing the sequence of operators in the processing diagram G by combining a cost model to optimize a query plan;

adopting a Collapse technology, and circularly calculating the cost benefit brought by combining a plurality of operators by combining a cost model, so as to optimize a query plan;

Wherein the operator is translated from the subtask.

6. The method as recited in claim 1, further comprising: and (3) repeatedly executing the step (2) and the step (3) to obtain a plurality of optimized processing graphs, and selecting the optimal processing graph to execute the step (4).

7. The method of claim 6, wherein the optimal process map is selected from a plurality of optimized process maps by a cost model.

8. The method as recited in claim 1, further comprising: after the code is generated, code verification is carried out, if verification fails, the step (4) is re-executed, and if the verification still fails, the steps (2) - (4) are re-executed.

9. The method of claim 8, wherein the performing code verification comprises:

and verifying the level of the operators according to the generated codes and the corresponding optimized processing diagram.

10. A data lake-based query device, comprising:

The preprocessing module is used for sampling the target data set to be inquired according to user input to obtain mode information M and data sample information S so as to construct inquiry Q;

The processing diagram construction module is used for decomposing the query Q into a plurality of subtasks so as to construct a processing diagram G;

the processing diagram optimizing module is used for correcting the processing diagram G, adopting a shuffle technology and/or a Collapse technology, and combining a cost model to optimize the corrected processing diagram G;

and the code generation and execution module is used for generating codes according to the optimized processing diagram and executing the codes so as to output a user query result.