CN115982230A - Cross-data-source query method, system, equipment and storage medium of database - Google Patents
Cross-data-source query method, system, equipment and storage medium of database
- Publication number
- CN115982230A (application number CN202211579504.4A)
- Authority
- CN
- China
- Prior art keywords
- query
- sub
- data source
- query task
- target data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a cross-data-source query method, a cross-data-source query system, a cross-data-source query device and a storage medium of a database, wherein the method comprises the following steps: analyzing a target query statement input by a user to obtain a sub-query task corresponding to each target data source; acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task; acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task; and searching in the corresponding target data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task to obtain the final query result of the target query statement. The invention can reasonably distribute data resources, improve the resource utilization rate and improve the data use efficiency.
Description
Technical Field
The invention relates to the technical field of big data, in particular to a cross-data-source query method, a cross-data-source query system, cross-data-source query equipment and a storage medium for a database.
Background
In enterprise management, a report system has to be built for business docking, performance statistics, personnel management and the like. Index queries in such a report system usually involve several different data sources. Because the different data sources are built by different users, index data is not defined uniformly, and conventional Structured Query Language (SQL) techniques have difficulty querying across different data sources.
Traditional technical schemes cannot meet this requirement because of their limited scalability and processing performance. Emerging schemes based on message queues scale well, but they require the different data sources to be gathered into a single database. As a result, the data computation link is long and complex, resource utilization is low, storage is heterogeneous, data islands form easily, and frequent data synchronization wastes resources.
Disclosure of Invention
The invention provides a cross-data-source query method, a cross-data-source query system, a cross-data-source query device and a storage medium for a database, and mainly aims to automatically analyze a general cross-data-source query statement input by a user, allocate corresponding computing engines and clusters for the general cross-data-source query statement according to the characteristics of the query statement obtained by analysis, reasonably allocate data resources, improve the utilization rate of the resources and improve the use efficiency of data.
In a first aspect, an embodiment of the present invention provides a cross-data-source query method for a database, including:
analyzing target query statements input by a user to obtain a sub-query task corresponding to each target data source, wherein the target query statements relate to data in a plurality of different data sources;
acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task;
acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task;
and searching in the corresponding target data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task to obtain a query result corresponding to each sub-query task, and further obtaining a final query result of the target query statement according to the query result corresponding to each sub-query task.
Preferably, the analyzing the target query statement input by the user to obtain the sub-query task corresponding to each target data source includes:
acquiring a sub-query path corresponding to each target data source according to the geographic position of the target data source, the type of the database of the target data source and the data content stored in the target data source, wherein the geographic position of the target data source, the type of the database of the target data source and the data content stored in the target data source are contained in the target query statement;
and obtaining an abstract syntax tree through a parser according to the sub-query path corresponding to each target data source, and obtaining the sub-query task corresponding to each target data source, wherein the sub-query task is an SQL query statement.
Preferably, the obtaining, by the parser, an abstract syntax tree according to the sub-query path corresponding to each target data source, and obtaining the sub-query task corresponding to each target data source further includes:
optimizing the sub-query tasks corresponding to each target data source to obtain the optimized sub-query tasks corresponding to each target data source;
and using the optimized sub-query task corresponding to each target data source as the sub-query task corresponding to that target data source again.
Preferably, the optimizing the sub-query task corresponding to each target data source to obtain the optimized sub-query task corresponding to each target data source includes:
judging whether the attribute fields defined in the sub-query task corresponding to each target data source have naming conflict, and if so, renaming the conflicting attribute fields according to the actual requirements of the attribute fields;
judging whether a data unit defined in a sub-query task corresponding to each target data source is a preset unit or not, and if not, converting data defined in the non-conforming sub-query task into the preset unit;
and judging whether the format of the sub-query task corresponding to each target data source meets the preset requirement, and if not, performing format conversion on the sub-query tasks which are not met.
Preferably, the acquiring the optimal computing engine corresponding to each sub-query task according to the historical query records of the plurality of computing engines and each sub-query task includes:
aggregating the historical query records of each computing engine to obtain task characteristics corresponding to each computing engine, wherein the task characteristics comprise a geographical range of data processing and a type range of data processing;
and comparing the geographic position of the target data source in each sub-query task with the geographic range of data processing corresponding to each computing engine, and comparing the database type of the target data source in each sub-query task with the type range of data processing corresponding to each computing engine, to obtain the optimal computing engine corresponding to each sub-query task.
Preferably, the obtaining an optimal cluster corresponding to each sub-query task according to the current real-time resource status of the multiple clusters and each sub-query task includes:
dividing all clusters into a plurality of preset types according to the current real-time resource state of each cluster;
dividing all the sub-query tasks into a plurality of preset types;
and taking, for each sub-query task, the cluster whose preset type is the same as that of the sub-query task as the optimal cluster of that sub-query task.
Preferably, the preset types include a CPU-intensive type and an IO-intensive type.
In a second aspect, an embodiment of the present invention provides a cross-data source query system for a database, including:
the analysis module is used for analyzing target query statements input by a user and acquiring sub-query tasks corresponding to each target data source, wherein the target query statements relate to data in a plurality of different data sources;
the engine module is used for acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task;
the cluster module is used for acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task;
and the query module is used for searching in the corresponding target data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task to obtain a query result corresponding to each sub-query task, and further obtaining a final query result of the target query statement according to the query result corresponding to each sub-query task.
In a third aspect, an embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the cross-data-source query method for a database when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the cross-data-source query method for a database described above are implemented.
The embodiment of the invention provides a cross-data source query method, a cross-data source query system, a cross-data source query device and a storage medium of a database, wherein a target query statement is analyzed to split the target query statement into a plurality of sub-query tasks, an optimal computing engine is selected according to the data characteristics of the sub-query tasks and the historical query record of each computing engine, an optimal cluster is selected according to the current real-time resource state of each cluster and the data characteristics of the sub-query tasks, and then the optimal computing engine on the optimal cluster is used for executing the corresponding sub-query tasks, so that the query result of the target query statement can be obtained. According to the embodiment of the invention, the universal cross-data-source query statement input by the user is automatically analyzed, and the corresponding computing engine and the corresponding cluster are distributed to the universal cross-data-source query statement according to the characteristics of the sub-query task obtained through analysis, so that the data resources can be reasonably distributed, the resource utilization rate is improved, and the data use efficiency is improved.
Drawings
Fig. 1 is a schematic view of an application scenario of a cross-data-source query method for a database according to an embodiment of the present invention;
Fig. 2 is a flowchart of a cross-data-source query method for a database according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a cross-data-source query system for a database according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic view of an application scenario of a cross-data-source query method for a database according to an embodiment of the present invention. As shown in Fig. 1, a user inputs a target query statement at a client; the client receives the target query statement and sends it to a server; and the server executes the cross-data-source query method for the database according to the target query statement to obtain a query result of the target query statement.
It should be noted that the server may be implemented by an independent server or by a server cluster formed by multiple servers. The client may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The client and the server may be connected through Bluetooth, USB (Universal Serial Bus), or other communication connection manners, which is not limited in this embodiment of the present invention.
The embodiment of the invention can acquire and process related data based on an artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning, deep learning and the like.
Fig. 2 is a flowchart of a cross-data-source query method for a database according to an embodiment of the present invention, as shown in fig. 2, the method includes:
S210, analyzing target query statements input by a user to obtain a sub-query task corresponding to each target data source, wherein the target query statements relate to data in a plurality of different data sources;
First, a target query statement input by a user is acquired. The target query statement is a general query statement, that is, a query statement that users frequently use and that matches their habits. Common target query statements include simple queries, conditional queries, advanced queries, and aliasing of tables and fields. Simple queries include querying all fields and querying specified fields. Conditional queries include queries with relational operators, queries with the IN keyword, queries with the BETWEEN ... AND keywords, null-value queries, multi-condition queries with the AND keyword, and queries with the OR keyword. Advanced queries include aggregate functions, ordering of query results, and grouped queries. Aliasing of tables and fields includes giving a table an alias and giving a field an alias. The target query statement may include one common query statement or several, and its type may be one of the types described above or a combination of several different types.
The target query statement is analyzed and the data it involves is listed. The data involved in the target query statement resides in different data sources; a data source is used to store data and can be regarded as a database. In practical applications, because the business of an enterprise may span different cities, data sources are formed by collecting users' historical usage data, and different data sources are generally located in different cities. The data sources involved in the target query statement are called target data sources, and the target query statement is applied to each target data source to obtain the sub-query task corresponding to that target data source. That is, the target query statement can be split into sub-query statements, one for each target data source.
In the embodiment of the invention, the target query statement is split into the sub-query tasks corresponding to each target data source. The splitting may use a machine-learning method from artificial intelligence: a feature-extraction neural network divides the user's target query statement into a plurality of keywords, the degree of matching between each keyword and each target data source is calculated, and the data corresponding to the keywords is then used as the sub-query task corresponding to the target data source. The feature-extraction neural network must be trained before it is used. Its network parameters are adjusted continuously during training, training stops once the loss function satisfies a preset condition, and the network as it stands at that point is used to extract the keywords of the target query statement.
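The query categories listed above can be illustrated with ordinary SQL. The following is only a minimal sketch that uses Python's built-in `sqlite3` module; the `orders` table, its columns and its rows are hypothetical and do not come from the patent.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, city TEXT, amount REAL, note TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Beijing", 120.0, None), (2, "Shanghai", 80.0, "rush"),
     (3, "Beijing", 40.0, "gift"), (4, "Shenzhen", 200.0, None)],
)

def run(sql):
    """Execute one statement and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Simple query on specified fields.
specified = run("SELECT id, city FROM orders ORDER BY id")

# Conditional queries: BETWEEN ... AND, IN, null value, and AND / OR conditions.
between = run("SELECT id FROM orders WHERE amount BETWEEN 50 AND 150 ORDER BY id")
in_kw = run("SELECT id FROM orders WHERE city IN ('Beijing', 'Shenzhen') ORDER BY id")
nulls = run("SELECT id FROM orders WHERE note IS NULL ORDER BY id")
multi = run(
    "SELECT id FROM orders "
    "WHERE (city = 'Beijing' AND amount > 100) OR city = 'Shanghai' ORDER BY id"
)

# Advanced query: aggregate function, grouping and ordering, with table and field aliases.
grouped = run(
    "SELECT o.city AS c, SUM(o.amount) AS total "
    "FROM orders AS o GROUP BY o.city ORDER BY total DESC"
)
```

A target query statement in the sense of the description may combine several of these forms in one statement.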
S220, acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task;
Because there are multiple data sources, the number of concurrent target query statements is large, and different types of target query statements are best served by different computing engines. Each cluster usually has several different computing engines, or different clusters have different computing engines; this may be determined according to the actual situation and is not specifically limited in the embodiment of the present application. A computing engine is a highly abstract collection of computing rules: a user writes the corresponding interface code in the prescribed manner and then executes it to obtain the required result. Big data computing scenarios fall into two categories, batch processing and stream processing. Common batch and stream computing engines are as follows:
1. MapReduce, the batch engine that ships with Hadoop, divides a computation into two stages, Map (mapping) and Reduce (reduction). An upper-layer application has to find some way to split its algorithm to fit these stages, and may even have to chain several Jobs together at the application layer to complete a full algorithm, such as an iterative computation.
2. Computing engines that support DAG (directed acyclic graph) computation, such as Tez and Oozie, mostly handle batch tasks. This second generation of engines, added in Hadoop 2 after MapReduce, optimizes the original MapReduce framework structure, merges unnecessary computation steps, and reduces the number of times data is stored, which greatly shortens execution time.
3. Computing engines with built-in DAG support form the third generation, represented by Spark. Their main characteristics are DAG support inside a Job and real-time computation, and they also run batch Jobs well. Unlike MapReduce, which provides only two simple programming interfaces, Spark provides a variety of programming interfaces for manipulating data; implementing the same operations with MapReduce requires more code. For batch processing, Spark places higher demands on hardware than MapReduce, and on the same equipment MapReduce can process more data than Spark.
4. Unified stream and batch computing engines. Flink is a unified engine for processing large volumes of data, distinguished mainly by its support for stream computing and by a further step forward in real-time performance. Flink can also support batch tasks and DAG operations.
As can be seen from the above, different computing engines have different characteristics and are good at processing different sub-query tasks. Selecting, from all the computing engines, the optimal computing engine best suited to each sub-query task therefore makes the most efficient use of computing-engine resources and improves the processing efficiency of the sub-query tasks. In the embodiment of the invention, for each computing engine, the historical query records of the computing engine, that is, the records of the historical queries it has processed, are collected first; the historical query records include attributes of each historical query request such as data type, data location and processing time. These attributes are then aggregated to find the data type, data location and so on that the computing engine is best suited to process. In this way the best-suited data type and data location are obtained for every computing engine. For the sub-query task corresponding to each target data source, the data type and data location of the sub-query task are matched against the best-suited data type and data location of each computing engine, so as to obtain the optimal computing engine corresponding to each sub-query task.
S230, acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task;
A server cluster is a parallel or distributed system made up of interconnected servers, in which the servers work on the same computing task. Although the computing capacity of a single server is limited, once hundreds of servers form a server cluster the system as a whole has strong computing capacity and can support the computing load of big data analysis. Because the data is varied and the data volume is large, the embodiment of the invention includes a plurality of clusters, and sub-query tasks are allocated according to the current real-time resource state of each cluster. For example, at the current moment some clusters are processing many tasks and are busy, while others are processing few tasks and are idle; sub-query tasks are allocated to idle clusters as far as possible, and only a small number are allocated to busy clusters, so that cluster resources are used reasonably. As another example, some clusters have high CPU utilization while others have high IO utilization; sub-query tasks with low CPU usage are allocated to the clusters with high CPU utilization, and sub-query tasks with low IO usage are allocated to the clusters with high IO utilization. The allocation may be determined according to the actual situation and is not specifically limited in the embodiment of the present invention.
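One way to picture this allocation is a rule that sends each sub-query task to the cluster whose busiest resource would remain lowest after taking the task. The sketch below is an assumption for illustration, not the scoring rule of the embodiment; the names `ClusterState`, `SubQueryTask` and `pick_cluster`, and the demand estimates, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    name: str
    cpu_util: float  # current CPU utilization, 0.0 to 1.0
    io_util: float   # current IO utilization, 0.0 to 1.0

@dataclass
class SubQueryTask:
    name: str
    cpu_demand: float  # hypothetical estimate of added CPU utilization
    io_demand: float   # hypothetical estimate of added IO utilization

def pick_cluster(task, clusters):
    """Return the cluster whose busiest resource stays lowest after adding the task."""
    def projected_peak(cluster):
        return max(cluster.cpu_util + task.cpu_demand,
                   cluster.io_util + task.io_demand)
    return min(clusters, key=projected_peak)

clusters = [
    ClusterState("cluster-a", cpu_util=0.80, io_util=0.20),  # CPU is busy
    ClusterState("cluster-b", cpu_util=0.25, io_util=0.75),  # IO is busy
]
io_heavy = SubQueryTask("scan", cpu_demand=0.05, io_demand=0.30)
cpu_heavy = SubQueryTask("aggregate", cpu_demand=0.30, io_demand=0.05)

scan_target = pick_cluster(io_heavy, clusters).name
aggregate_target = pick_cluster(cpu_heavy, clusters).name
```

Under these assumed numbers, the task with low CPU usage lands on the CPU-busy cluster and the task with low IO usage lands on the IO-busy cluster, which matches the second example in the paragraph above.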
S240, searching in the corresponding data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task, obtaining the query result corresponding to each sub-query task, and further obtaining the final query result of the target query statement according to the query result corresponding to each sub-query task.
After the optimal cluster and the optimal calculation engine corresponding to each sub-query task are obtained, the optimal calculation engine on the optimal cluster is used for executing the corresponding sub-query task, so that a query result corresponding to each sub-query task can be obtained, and a final query result of the target query statement is obtained according to the query result corresponding to each sub-query task.
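Step S240 can be sketched as running every sub-query task and then combining the per-task results. The description does not state how the results are combined, so the concatenation below is an assumption; `execute_all` and the data-source names are hypothetical, and each callable merely stands in for the optimal computing engine on the optimal cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def execute_all(sub_tasks):
    """Run each sub-query task and merge the per-task rows into one result list.

    `sub_tasks` maps a data-source name to a zero-argument callable that stands
    in for the optimal engine on the optimal cluster executing that sub-query.
    """
    with ThreadPoolExecutor() as pool:
        futures = {source: pool.submit(run) for source, run in sub_tasks.items()}
        per_source = {source: future.result() for source, future in futures.items()}
    # Assumption: the final result is the concatenation of all sub-results,
    # with each row tagged by the data source it came from.
    merged = [(source, row)
              for source in sorted(per_source)
              for row in per_source[source]]
    return per_source, merged

sub_tasks = {
    "beijing_mysql": lambda: [("order-1", 120.0)],
    "shanghai_hbase": lambda: [("order-7", 80.0), ("order-9", 15.5)],
}
per_source, final_result = execute_all(sub_tasks)
```

A real merge step might instead join or aggregate the sub-results, depending on the target query statement.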
The embodiment of the invention provides a cross-data source query method of a database, which comprises the steps of analyzing a target query statement, dividing the target query statement into a plurality of sub-query tasks, selecting an optimal computing engine according to the data characteristics of the sub-query tasks and the historical query record of each computing engine, selecting an optimal cluster according to the current real-time resource state of each cluster and the data characteristics of the sub-query tasks, and executing the corresponding sub-query tasks by utilizing the optimal computing engine on the optimal cluster to obtain the query result of the target query statement. In the embodiment of the invention, the universal cross-data-source query statement input by the user is automatically analyzed, and the corresponding calculation engine and the corresponding cluster are distributed to the query statement according to the characteristics of the query statement obtained by analysis, so that the data resources can be reasonably distributed, the resource utilization rate is improved, and the data use efficiency is improved.
In some embodiments, the analyzing the target query statement input by the user to obtain the sub-query task corresponding to each target data source includes:
acquiring a sub-query path corresponding to each target data source according to the geographic position of the target data source, the type of the database of the target data source and the data content stored in the target data source contained in the target query statement;
and obtaining the abstract syntax tree through the resolver according to the sub-query path corresponding to each target data source, and obtaining the sub-query task corresponding to each target data source, wherein the sub-query task is an SQL query statement.
Specifically, the target query statement input by the user is analyzed to obtain the sub-query task corresponding to each target data source. The data involved in the target query statement is called target data. Because the target data spans data sources, the geographic locations of different target data may be different or the same. The database type of the target data source refers to the type of database in which the target data source is stored; database types include MySQL, Oracle, SQL Server, SQLite, Informix, Redis, MongoDB, HBase, Neo4j, CouchDB and the like, and the type may be determined according to the database actually used by the target data source, which is not specifically limited in the embodiment of the present invention. The data content stored in the target data source includes the value range of the data in the target data source, the specific types of data it contains, and so on; it may be determined according to the actual situation and is not specifically limited in the embodiment of the present invention. After this content is obtained, a sub-query path is established for each target data source, with the geographic position, database type and data content of the target data source used as the operation content of the sub-query task, so that the sub-query path corresponding to each target data source is obtained. The sub-query path in the embodiment of the invention is a text path that indicates the geographic position of the content to be searched, the type of the database holding it, and the data content to be searched.
And then obtaining an abstract syntax tree through a parser according to the sub-query path corresponding to each target data source to obtain the sub-query task corresponding to each target data source.
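As a rough illustration of a text sub-query path and the SQL sub-query task derived from it, the sketch below encodes the geographic position, database type and data content in one string and turns that string into a SELECT statement. The path layout, the `SubQueryPath` class and the `to_sub_query` helper are hypothetical; the sketch also skips the abstract syntax tree and uses plain string splitting where the embodiment uses a parser.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubQueryPath:
    location: str   # geographic position of the target data source
    db_type: str    # database type of the target data source, e.g. "MySQL"
    table: str      # hypothetical: table that holds the requested data content
    fields: tuple   # hypothetical: fields requested from that table

    def as_text(self):
        """Render the path as text: location/db_type/table/field1,field2."""
        return f"{self.location}/{self.db_type}/{self.table}/{','.join(self.fields)}"

def to_sub_query(path_text):
    """Turn a text path into a (location, db_type, sql) sub-query task."""
    location, db_type, table, field_list = path_text.split("/")
    sql = f"SELECT {', '.join(field_list.split(','))} FROM {table}"
    return location, db_type, sql

path = SubQueryPath("Shanghai", "MySQL", "orders", ("id", "amount"))
path_text = path.as_text()
task = to_sub_query(path_text)
```

The location and database type travel with the SQL text so that later steps can use them when choosing a computing engine.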
On the basis of the foregoing embodiment, preferably, the obtaining, by the parser according to the sub-query path corresponding to each target data source, the abstract syntax tree to obtain the sub-query task corresponding to each target data source further includes:
optimizing the sub-query tasks corresponding to each target data source to obtain the optimized sub-query tasks corresponding to each target data source;
and the optimized sub-query task corresponding to each target data source is used as the sub-query task corresponding to each target data source again.
Specifically, after the sub-query task corresponding to each target data source is obtained, it needs to be optimized. Because the target query statement is only a general query statement that matches the user's habits, not one tailored to the computing engine, the sub-query tasks derived from it need to be optimized. The optimized sub-query task replaces the original sub-query task and takes part in subsequent computation.
On the basis of the foregoing embodiment, preferably, the optimizing the sub-query task corresponding to each target data source to obtain an optimized sub-query task corresponding to each target data source includes:
judging whether the attribute fields defined in the sub-query task corresponding to each target data source have naming conflict, and if so, renaming the attribute fields in conflict according to the actual requirements of the attribute fields;
judging whether a data unit defined in a sub-query task corresponding to each target data source is a preset unit or not, and if not, converting the data defined in the non-conforming sub-query task into the preset unit;
and judging whether the format of the sub-query task corresponding to each target data source meets the preset requirement, and if not, performing format conversion on the sub-query tasks which are not met.
Specifically, it is first judged whether the attribute fields defined in the sub-query tasks corresponding to the target data sources have naming conflicts. Because the sub-query tasks are reconstructed after the target query statement is split, attribute fields defined in several sub-query tasks may share the same name, in which case the conflicting attribute fields are renamed. Next, the data units defined in each sub-query task are checked. The same index data may be expressed in different units, and the units need to be unified so that the subsequent search statements can be executed. In the embodiment of the invention, each kind of data is assigned a preset unit; the data in a sub-query task is compared with its preset unit, and if it does not conform, it is converted into the preset unit. Finally, the format of each sub-query task is checked against the preset requirement, and if the format does not meet the requirement it is converted, which facilitates execution by the subsequent computing engine.
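The three checks can be sketched as three small functions. The concrete rules used here (prefixing conflicting fields with the data-source name, metres as the preset unit, and single-space SQL without a trailing semicolon as the preset format) are assumptions chosen for illustration; the embodiment leaves these rules to the actual requirements.

```python
def rename_conflicts(tasks):
    """Prefix field names that occur in more than one sub-query task with the source name."""
    seen = {}
    for source, fields in tasks.items():
        for field in fields:
            seen.setdefault(field, []).append(source)
    return {
        source: [f"{source}_{field}" if len(seen[field]) > 1 else field
                 for field in fields]
        for source, fields in tasks.items()
    }

# Hypothetical conversion factors into the preset unit (metres).
TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}

def to_preset_unit(value, unit):
    """Convert a length to the preset unit; values already in metres are unchanged."""
    return value * TO_METRES[unit]

def normalize_format(sql):
    """Hypothetical preset format: single spaces and no trailing semicolon."""
    return " ".join(sql.split()).rstrip(";").rstrip()

tasks = {"bj": ["id", "amount"], "sh": ["id", "city"]}
renamed = rename_conflicts(tasks)
distance = to_preset_unit(2.5, "km")
formatted = normalize_format("SELECT  id\n FROM orders ;")
```

Here only the shared field `id` is renamed, and fields that are unique to one sub-query task keep their names.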
On the basis of the foregoing embodiment, preferably, each sub-query task includes a geographic location of a target data source and a database type of the target data source, and the obtaining an optimal computing engine corresponding to each sub-query task according to a historical query record of a plurality of computing engines and each sub-query task includes:
aggregating the historical query records of each computing engine to obtain task characteristics corresponding to each computing engine, wherein the task characteristics comprise a geographic range of data processing and a type range of data processing;
and comparing the geographic position of the target data source in each sub-query task with the geographic range of data processing corresponding to each computing engine, and comparing the database type of the target data source in each sub-query task with the type range of data processing corresponding to each computing engine to obtain the optimal computing engine corresponding to each sub-query task.
In the embodiment of the invention, each sub-query task comprises the geographic position of a target data source and the database type of the target data source, and an optimal calculation engine is allocated to each sub-query task as follows. First, the historical query records of each computing engine are aggregated to obtain how that engine has processed historical query statements. The historical query records comprise the geographic position and the database type of the data source in each processed historical query statement; by aggregating the geographic positions and database types of all historical query statements of a computing engine, the geographic positions and database types that the engine handles best are extracted, which gives the geographic range of data processing and the type range of data processing corresponding to each computing engine. Then, the geographic position of the target data source in each sub-query task is compared with the geographic range of data processing corresponding to each calculation engine, and the database type of the target data source in each sub-query task is compared with the type range of data processing corresponding to each calculation engine, to obtain the optimal calculation engine corresponding to each sub-query task. In the embodiment of the present invention, the type range of data processing of the optimal computing engine includes the database type of the target data source of the corresponding sub-query task, and the geographic range of data processing of the optimal computing engine includes the geographic location of the target data source of the corresponding sub-query task.
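The aggregation and comparison steps can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(engine, region, db_type)`, the engine names, and the tie-breaking rule (first matching engine in alphabetical order) are hypothetical, and the description does not say how to choose when several engines qualify or when none does.

```python
from collections import defaultdict


def aggregate_history(records):
    """Aggregate historical query records into per-engine task characteristics.

    records: iterable of (engine, region, db_type) tuples, one per historical
    query statement (an assumed layout).
    """
    features = defaultdict(lambda: {"regions": set(), "db_types": set()})
    for engine, region, db_type in records:
        features[engine]["regions"].add(region)    # geographic range of data processing
        features[engine]["db_types"].add(db_type)  # type range of data processing
    return dict(features)


def pick_engine(task_region, task_db_type, features):
    """Return an engine whose ranges contain both the task's region and database type.

    Assumption: the first match in alphabetical order wins; None means no engine
    covers the task.
    """
    for engine in sorted(features):
        ranges = features[engine]
        if task_region in ranges["regions"] and task_db_type in ranges["db_types"]:
            return engine
    return None


# Usage with made-up history records.
history = [
    ("engine_a", "east", "mysql"),
    ("engine_a", "east", "hive"),
    ("engine_b", "west", "postgresql"),
]
features = aggregate_history(history)
best = pick_engine("west", "postgresql", features)
```

A production system would likely weight engines by how often or how quickly they handled similar queries; the set-membership test here only reflects the containment condition stated in the paragraph above.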
On the basis of the foregoing embodiment, preferably, the obtaining an optimal cluster corresponding to each sub-query task according to the current real-time resource status of the multiple clusters and each sub-query task includes:
dividing all clusters into a plurality of preset types according to the current real-time resource state of each cluster;
dividing all the sub-query tasks into a plurality of preset types;
and taking, for each sub-query task, a cluster whose preset type is the same as that of the sub-query task as the optimal cluster of the sub-query task.
The optimal cluster corresponding to each sub-query task is obtained according to the current real-time resource state of each cluster and each sub-query task as follows: the clusters are divided into a plurality of preset types according to the current real-time resource state of each cluster, and the sub-query tasks are divided into the same preset types according to the characteristics of the sub-query tasks; each sub-query task is then allocated to a cluster of the same preset type, and that cluster is taken as the optimal cluster corresponding to the sub-query task.
In some embodiments, the preset types include a CPU-concentrated type and an IO-concentrated type.
In the embodiment of the invention, the clusters can be divided into a CPU-concentrated type and an IO-concentrated type. The CPU-concentrated type, also called the compute-intensive type, means that the performance of the system's hard disk and memory is much better than that of the CPU; in this case most of the system's running time is spent on the CPU, and the I/O reads and writes that the CPU issues complete in a short time. According to these characteristics, sub-query tasks with higher requirements on system hard disk and memory performance are allocated to a CPU-concentrated cluster, which is taken as the optimal cluster of those sub-query tasks. The IO-concentrated type covers tasks involving network and disk IO; such tasks consume little CPU and spend most of their time waiting for IO operations to complete (because IO is far slower than the CPU and memory). Sub-query tasks involving network and disk IO are therefore allocated to an IO-concentrated cluster, which is taken as the optimal cluster of those sub-query tasks.
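The type-matching step can be sketched in a few lines of Python. The description does not state how a cluster or a task is classified, so the rules below are assumptions chosen only to make the example concrete: a cluster is labelled by whichever resource currently has more headroom, and a task is labelled IO-concentrated when its estimated share of time spent waiting on IO exceeds a hypothetical threshold of 0.5.

```python
from dataclasses import dataclass

CPU = "cpu_concentrated"
IO = "io_concentrated"


@dataclass
class Cluster:
    name: str
    cpu_free: float  # fraction of CPU capacity currently idle, 0..1 (assumed metric)
    io_free: float   # fraction of disk/network bandwidth currently idle, 0..1 (assumed metric)


@dataclass
class Task:
    name: str
    io_wait_ratio: float  # estimated share of run time spent waiting on IO, 0..1 (assumed metric)


def cluster_type(cluster):
    """Assumed rule: label the cluster by the resource with more headroom."""
    return CPU if cluster.cpu_free >= cluster.io_free else IO


def task_type(task, threshold=0.5):
    """Assumed rule: a task that mostly waits on IO is IO-concentrated."""
    return IO if task.io_wait_ratio > threshold else CPU


def assign(tasks, clusters):
    """Map each task name to the first cluster of the same preset type, or None."""
    by_type = {}
    for cluster in clusters:
        by_type.setdefault(cluster_type(cluster), []).append(cluster)
    result = {}
    for task in tasks:
        candidates = by_type.get(task_type(task), [])
        result[task.name] = candidates[0].name if candidates else None
    return result


# Usage with made-up resource states.
clusters = [Cluster("c1", cpu_free=0.8, io_free=0.2), Cluster("c2", cpu_free=0.1, io_free=0.7)]
tasks = [Task("scan", io_wait_ratio=0.9), Task("aggregate", io_wait_ratio=0.1)]
plan = assign(tasks, clusters)
```

Because the classification uses the real-time resource state, the same cluster may receive a different label on the next scheduling round; the sketch recomputes the labels on every call to `assign`.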
Fig. 3 is a schematic structural diagram of a cross-data-source query system for a database according to an embodiment of the present invention, as shown in fig. 3, the system includes:
the analysis module is used for analyzing target query statements input by a user and acquiring sub-query tasks corresponding to each target data source, wherein the target query statements relate to data in a plurality of different data sources;
the engine module is used for acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task;
the cluster module is used for acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task;
and the query module is used for searching in the corresponding target data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task to obtain a query result corresponding to each sub-query task, and further obtaining a final query result of the target query statement according to the query result corresponding to each sub-query task.
The present embodiment is a system embodiment corresponding to the above method embodiment, and the specific implementation process thereof is the same as that of the above method embodiment, and please refer to the above method embodiment for details, which is not described herein again.
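How the four modules hand work to one another can be summarized with a small orchestration sketch. The function name `run_cross_source_query` and the callable parameters are hypothetical; each callable stands in for one module (or, for `merge`, for the final combination step of the query module), and their internals are deliberately left out.

```python
def run_cross_source_query(statement, analyze, pick_engine, pick_cluster, execute, merge):
    """Wire the analysis, engine, cluster, and query modules together.

    analyze(statement)            -> list of sub-query tasks, one per target data source
    pick_engine(task)             -> optimal calculation engine for the task
    pick_cluster(task)            -> optimal cluster for the task
    execute(task, engine, cluster)-> query result of the sub-query task
    merge(results)                -> final query result of the target query statement
    """
    partial_results = []
    for task in analyze(statement):
        engine = pick_engine(task)
        cluster = pick_cluster(task)
        partial_results.append(execute(task, engine, cluster))
    return merge(partial_results)


# Usage with trivial stand-in callables.
result = run_cross_source_query(
    "demo statement",
    analyze=lambda s: ["task_1", "task_2"],
    pick_engine=lambda t: "engine_a",
    pick_cluster=lambda t: "cluster_1",
    execute=lambda t, e, c: f"{t}@{e}/{c}",
    merge=lambda rs: rs,
)
```

Passing the modules in as callables keeps the sketch independent of any particular engine or cluster manager; the sub-query tasks are run sequentially here only for simplicity.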
In some embodiments, the analysis module comprises a path unit and a statement unit, wherein:
the path unit is used for acquiring a sub-query path corresponding to each target data source according to the geographic position of the target data source, the database type of the target data source and the data content stored in the target data source contained in the target query statement;
the statement unit is used for obtaining the abstract syntax tree through the parser according to the sub-query path corresponding to each target data source, and obtaining the sub-query task corresponding to each target data source, wherein the sub-query task is an SQL query statement.
In some embodiments, the analysis module further comprises an optimization unit, wherein:
the optimization unit is used for optimizing the sub-query tasks corresponding to the target data sources to obtain the optimized sub-query tasks corresponding to the target data sources, and the optimized sub-query tasks corresponding to the target data sources are used as the sub-query tasks corresponding to the target data sources again.
In some embodiments, the optimization unit includes a naming unit, a unit-conversion unit, and a format unit, wherein:
the naming unit is used for judging whether the attribute fields defined in the sub-query task corresponding to each target data source have naming conflict, and if so, renaming the conflicting attribute fields according to the actual requirements of the attribute fields;
the unit-conversion unit is used for judging whether a data unit defined in the sub-query task corresponding to each target data source is a preset unit or not, and if not, converting the data defined in the non-conforming sub-query task into the preset unit;
the format unit is used for judging whether the format of the sub-query task corresponding to each target data source meets the preset requirement, and if not, format conversion is carried out on the sub-query tasks which are not met.
In some embodiments, each of the sub-query tasks includes a geographic location of a target data source and a database type of the target data source, and the engine module includes an aggregation unit and a comparison unit, wherein:
the aggregation unit is used for aggregating the historical query records of each computing engine to obtain task characteristics corresponding to each computing engine, and the task characteristics comprise a geographical range of data processing and a type range of data processing;
the comparison unit is used for comparing the geographic position of the target data source in each sub-query task with the geographic range of data processing corresponding to each calculation engine, comparing the database type of the target data source in each sub-query task with the type range of data processing corresponding to each calculation engine, and obtaining the optimal calculation engine corresponding to each sub-query task.
In some embodiments, the cluster module comprises a type unit, a partition unit, and an allocation unit, wherein:
the type unit is used for dividing all clusters into a plurality of preset types according to the current real-time resource state of each cluster;
the partition unit is used for dividing all the sub-query tasks into a plurality of preset types;
the allocation unit is used for taking, for each sub-query task, a cluster whose preset type is the same as that of the sub-query task as the corresponding optimal cluster.
In some embodiments, the preset types include a CPU-concentrated type and an IO-concentrated type.
The modules in the cross-data-source query system of the database can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present invention, where the computer device may be a server, and an internal structural diagram of the computer device may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the running of computer programs in the computer storage media. The database of the computer device is used for storing data generated or acquired in the process of executing a cross-data-source query method of the database, such as target query statements and sub-query tasks. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a cross-data source query method for a database.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the cross-data-source query method of a database in the above embodiments. Alternatively, the processor, when executing the computer program, implements the functionality of the modules/units of an embodiment of a cross-data source query system for a database.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of a cross-data-source query method of a database in the above-described embodiments. Alternatively, the computer program, when executed by a processor, implements the functionality of the modules/units of the cross-data source query system for a database described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (10)
1. A cross-data-source query method for a database is characterized by comprising the following steps:
analyzing target query statements input by a user to obtain a sub-query task corresponding to each target data source, wherein the target query statements relate to data in a plurality of different data sources;
acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task;
acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task;
and searching in the corresponding target data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task to obtain a query result corresponding to each sub-query task, and further obtaining a final query result of the target query statement according to the query result corresponding to each sub-query task.
2. The method for querying across data sources of a database according to claim 1, wherein the analyzing the target query statement input by the user to obtain the sub-query task corresponding to each target data source comprises:
acquiring a sub-query path corresponding to each target data source according to the geographic position of the target data source, the type of the database of the target data source and the data content stored in the target data source contained in the target query statement;
and obtaining the abstract syntax tree through the parser according to the sub-query path corresponding to each target data source, and obtaining the sub-query task corresponding to each target data source, wherein the sub-query task is an SQL query statement.
3. The method according to claim 2, wherein the obtaining an abstract syntax tree by a parser according to the sub-query path corresponding to each target data source, and obtaining the sub-query task corresponding to each target data source further comprises:
optimizing the sub-query tasks corresponding to each target data source to obtain the optimized sub-query tasks corresponding to each target data source;
and the optimized sub-query task corresponding to each target data source is used as the sub-query task corresponding to each target data source again.
4. The method for querying across data sources of a database according to claim 3, wherein the optimizing the sub-query task corresponding to each target data source to obtain the optimized sub-query task corresponding to each target data source includes:
judging whether the attribute fields defined in the sub-query task corresponding to each target data source have naming conflict, and if so, renaming the attribute fields in conflict according to the actual requirements of the attribute fields;
judging whether a data unit defined in a sub-query task corresponding to each target data source is a preset unit or not, and if not, converting data defined in the non-conforming sub-query task into the preset unit;
and judging whether the format of the sub-query task corresponding to each target data source meets the preset requirement, and if not, performing format conversion on the sub-query tasks which are not met.
5. The method according to claim 1, wherein each sub-query task includes a geographic location of a target data source and a database type of the target data source, and the obtaining an optimal computing engine corresponding to each sub-query task according to a historical query record of a plurality of computing engines and each sub-query task includes:
aggregating the historical query records of each computing engine to obtain a geographic range of data processing and a type range of data processing corresponding to each computing engine;
and comparing the geographic position of the target data source in each sub-query task with the geographic range of data processing corresponding to each computing engine, and comparing the database type of the target data source in each sub-query task with the type range of data processing corresponding to each computing engine to obtain the optimal computing engine corresponding to each sub-query task.
6. The method according to claim 1, wherein the obtaining an optimal cluster corresponding to each sub-query task according to the current real-time resource status of the plurality of clusters and each sub-query task comprises:
dividing all clusters into a plurality of preset types according to the current real-time resource state of each cluster;
dividing all the sub-query tasks into a plurality of preset types;
and taking, for each sub-query task, a cluster whose preset type is the same as that of the sub-query task as the optimal cluster of the sub-query task.
7. The cross-data-source query method for the database according to claim 6, wherein the preset types include a CPU-concentrated type and an IO-concentrated type.
8. A cross-data-source query system for a database, comprising:
the analysis module is used for analyzing target query statements input by a user and acquiring sub-query tasks corresponding to each target data source, wherein the target query statements relate to data in a plurality of different data sources;
the engine module is used for acquiring an optimal calculation engine corresponding to each sub-query task according to the historical query records of the plurality of calculation engines and each sub-query task;
the cluster module is used for acquiring an optimal cluster corresponding to each sub-query task according to the current real-time resource states of the plurality of clusters and each sub-query task;
and the query module is used for searching in the corresponding target data source by using the optimal calculation engine on the optimal cluster corresponding to each sub-query task to obtain a query result corresponding to each sub-query task, and further obtaining a final query result of the target query statement according to the query result corresponding to each sub-query task.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor when executing the computer program implements the steps of the cross-data-source query method of the database according to any one of claims 1 to 7.
10. A computer storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of a cross-data-source query method for a database according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211579504.4A CN115982230A (en) | 2022-12-09 | 2022-12-09 | Cross-data-source query method, system, equipment and storage medium of database |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115982230A (en) | 2023-04-18 |
Family
ID=85975019
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211579504.4A Pending CN115982230A (en) | 2022-12-09 | 2022-12-09 | Cross-data-source query method, system, equipment and storage medium of database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115982230A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116302457A (en) * | 2023-05-25 | 2023-06-23 | 之江实验室 | Cloud primary workflow engine implementation method, system, medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10831562B2 (en) | Method and system for operating a data center by reducing an amount of data to be processed | |
US20230124520A1 (en) | Task execution method and storage device | |
CN103678609B (en) | Large data inquiring method based on distribution relation-object mapping processing | |
CN109815283B (en) | Heterogeneous data source visual query method | |
US20170083573A1 (en) | Multi-query optimization | |
CN105677812A (en) | Method and device for querying data | |
CN105824957A (en) | Query engine system and query method of distributive memory column-oriented database | |
CN108073696B (en) | GIS application method based on distributed memory database | |
CN112269887A (en) | Distributed system based on graph database | |
CN114443680A (en) | Database management system, related apparatus, method and medium | |
US12026162B2 (en) | Data query method and apparatus, computing device, and storage medium | |
CN115982230A (en) | Cross-data-source query method, system, equipment and storage medium of database | |
CN115168389A (en) | Request processing method and device | |
CN112182031B (en) | Data query method and device, storage medium and electronic device | |
CN114969441A (en) | Knowledge mining engine system based on graph database | |
CN108319604B (en) | Optimization method for association of large and small tables in hive | |
CN108932258B (en) | Data index processing method and device | |
CN116775041B (en) | Real-time decision engine implementation method based on stream calculation and RETE algorithm | |
CN111125108A (en) | HBASE secondary index method, device and computer equipment based on Lucene | |
CN116431635A (en) | Lake and warehouse integrated-based power distribution Internet of things data real-time processing system and method | |
CN114297260A (en) | Distributed RDF data query method and device and computer equipment | |
CN110928938B (en) | Interface middleware system | |
CN114443686A (en) | Compression graph construction method and device based on relational data | |
KR20230033911A (en) | Method for processing data in the etl process, and apparatus implementing the same method | |
CN113742346A (en) | Asset big data platform architecture optimization method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||