CN112579610A

CN112579610A - Multi-data source structure analysis method, system, terminal device and storage medium

Info

Publication number: CN112579610A
Application number: CN202011573145.2A
Authority: CN
Inventors: 王刚
Original assignee: Anhui Aisino Corp
Current assignee: Anhui Aisino Corp
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2021-03-30

Abstract

The invention provides a method and a system for analyzing a structure of multiple data sources, terminal equipment and a storage medium, and relates to the technical field of data analysis. The method comprises the steps of obtaining a target SQL statement, analyzing the target SQL statement and constructing an abstract syntax tree; traversing the abstract syntax tree to construct a first query plan, and replacing a corresponding original two-dimensional table in the first query plan by adopting a preset configuration parameter table to obtain a second query plan; executing the second query plan to query the target data from each data source storing the target data, and converting the queried target data in a data warehouse to generate target data in a uniform format; and summarizing and analyzing the target data in the uniform format. By the method, the unified query analysis of the data source structure is realized, and the development speed of data analysis application is improved; in addition, the method omits the process of constructing a data aggregation program and saves system resources.

Description

Multi-data source structure analysis method, system, terminal device and storage medium

Technical Field

The invention relates to the technical field of data analysis, in particular to a method, a system, terminal equipment and a storage medium for analyzing a multi-data-source structure.

Background

Existing relational data analysis tools, such as Oracle and MySQL databases, and big data analysis frameworks built on Hadoop can only process data stored within their own systems. Data from a business system must be extracted, cleaned, and transformed before it is loaded into a data warehouse, i.e. the Extract-Transform-Load (ETL) process, and only then can it be summarized and analyzed. The existing data analysis method therefore consumes a large amount of system resources and becomes inefficient when there are many ETL tasks. In addition, the data source and the data warehouse each store a copy of the data in the same format, which wastes storage space.

Disclosure of Invention

The invention solves the problems that the existing data analysis method has lower efficiency and large system resource consumption.

In order to solve the above problem, a first aspect of the present invention provides a method for analyzing a structure of multiple data sources, including:

acquiring a target SQL (Structured Query Language) statement, and analyzing the target SQL statement to construct an abstract syntax tree;

traversing the abstract syntax tree to construct a first query plan, and replacing a corresponding original two-dimensional table in the first query plan by adopting a preset configuration parameter table to obtain a second query plan, wherein index and/or access information are prestored in the configuration parameter table;

executing the second query plan to query the target data from each data source storing the target data, and converting the queried target data in a data warehouse to generate target data in a uniform format;

and summarizing and analyzing the target data in the uniform format.

Further, the acquiring the target SQL statement comprises:

if the pre-stored SQL statement exists, taking the pre-stored SQL statement as the target SQL statement;

and if the pre-stored SQL statement does not exist, when the SQL statement is received, taking the received SQL statement as the target SQL statement.

Further, the replacing the original two-dimensional table corresponding to the first query plan by using a preset configuration parameter table to obtain a second query plan includes:

traversing the first query plan, and determining the original two-dimensional table in the first query plan;

and replacing the corresponding original two-dimensional table with the configuration parameter table according to a preset replacement algorithm to obtain the second query plan.

Further, the executing the second query plan to query for the target data from each data source storing the target data includes:

executing the second query plan, and simultaneously querying each data source according to the type of the data source;

the simultaneously querying each data source according to the type of the data source comprises:

if the data source is a database, executing the second query plan to generate an executable SQL statement and a query field command corresponding to the data source, acquiring a field type of the data source according to the query field command, checking the target data according to the field type, and if the target data is legal, performing query operation on the data source according to the executable SQL statement and returning a query result;

if the data source is a file, executing the second query plan to obtain the corresponding access information and the field type, calling the data source according to the access information, analyzing the data source to query the target data, checking the target data according to the field type, and if the target data is legal, returning a query result.

Further, the converting the target data queried in the data warehouse to generate the target data in the unified format includes:

determining each original storage data source corresponding to the target data;

executing a preset conversion algorithm corresponding to each original storage data source according to the original storage data sources, and converting target data in a unified format in the data warehouse according to the conversion algorithm and the target data.

Further, the converting the target data in the unified format in the data warehouse according to the conversion algorithm and the target data includes:

and if the original storage data source is a database, converting the field names and the field types of the target data according to the corresponding conversion algorithm, and generating the target data in the unified format in the data warehouse.

Further, the converting the target data in the unified format in the data warehouse according to the conversion algorithm and the target data further includes:

if the original storage data source is a file, analyzing the original storage data source, circularly reading data in the original storage data source, inquiring the target data, converting the inquired target data according to the corresponding conversion algorithm, and generating the target data in the unified format in the data warehouse.

A second aspect of the present invention provides an analysis system for a multiple data source structure, including:

the parsing module is used for acquiring a target SQL statement and parsing the target SQL statement to construct an abstract syntax tree;

the replacing module is used for traversing the abstract syntax tree to construct a first query plan and replacing a corresponding original two-dimensional table in the first query plan by adopting a preset configuration parameter table to obtain a second query plan, wherein indexes and/or access information are prestored in the configuration parameter table;

the execution module is used for executing the second query plan so as to query the target data from each data source in which the target data are stored, and converting the queried target data in a data warehouse to generate target data in a uniform format;

and the analysis module is used for summarizing and analyzing the target data in the uniform format.

A third aspect of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the multiple data source structure analysis method as described in any one of the above when executing the computer program.

A fourth aspect of the present invention provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the multiple data source structure analysis method as described in any one of the above.

The invention has the following beneficial effects: a first query plan is constructed by traversing an abstract syntax tree, and the corresponding original two-dimensional table in the first query plan is replaced with a preset configuration parameter table to obtain a second query plan. Executing the second query plan queries a plurality of data sources simultaneously, and target data in a uniform format is generated directly in a data warehouse from the queried target data. No separate data extraction is needed, which improves the overall efficiency of data summarization and makes subsequent summary analysis easy. The method shields the data format differences among the data sources, realizes uniform query analysis, and increases the development speed of data analysis applications. In addition, each data source is called only when data analysis is carried out, so the process of constructing a data aggregation program is omitted and system resources are saved.

Drawings

FIG. 1 is a flow chart of a method for analyzing a structure of multiple data sources according to an embodiment of the present invention;

FIG. 2 is an exemplary diagram of an abstract syntax tree in accordance with an embodiment of the present invention;

FIG. 3 is an exemplary diagram of a first query plan in accordance with an embodiment of the invention;

FIG. 4 is an exemplary diagram of a second query plan in accordance with an embodiment of the present invention;

FIG. 5 is a block diagram of a multiple data source structural analysis system according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a terminal device according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, or system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

The terms "first", "second", "third", etc., described herein are used only to distinguish devices, components, subassemblies, parts, etc., and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. A feature defined as "first", "second" or "third" may therefore explicitly or implicitly include at least one such feature. Unless explicitly and specifically defined otherwise, "a plurality" means at least two, e.g., two or three. Those skilled in the art can understand the specific meaning of the above terms in the present invention according to the specific situation.

As shown in FIG. 1, a method for analyzing a multiple data source structure according to an embodiment of the present invention includes:

S101: Acquire a target SQL statement, and parse the target SQL statement to construct an abstract syntax tree.

In this embodiment, the data source is defined as an original storage source of the target data, and the original storage source may be a database or a file.

It should be noted that the structured query language is a database query and programming language, including SQL syntax rules for accessing data and managing a relational database system.

Specifically, the server parses the target SQL statement with an SQL parser; parsing comprises lexical parsing and syntax parsing. Lexical parsing splits the target SQL statement into lexical units that cannot be divided further. In SQL grammar, lexical units include keywords, identifiers, literals, operators, and delimiters. For example, when the server reads the UPDATE of the target SQL statement, it determines that the first character "U" satisfies the rules for keywords and identifiers, that the second character "P" also satisfies them, and so on, until it reaches the 7th character, a space, which does not satisfy the rules; at that point the server has recognized one lexical unit. UPDATE is a keyword defined by the SQL specification but also satisfies the identifier rule, so whenever a lexical unit matches the identifier rule, the SQL parser checks whether it is a keyword, because the keyword rule has higher priority than the identifier rule. Other rules are as follows: characters beginning with a digit are read according to the numeric literal rule; characters beginning with a double or single quotation mark are read according to the string literal rule; operators and delimiters are identified by their symbol characteristics.
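By way of illustration only, the lexical parsing described above may be sketched as follows. The keyword set, the token names, and the function `tokenize` are hypothetical and are not taken from the embodiment; a regular expression stands in for the character-by-character matching described above, and only the keyword-over-identifier priority rule is shown.

```python
import re

# Hypothetical, deliberately small keyword set for illustration.
KEYWORDS = {"SELECT", "FROM", "WHERE", "UPDATE", "SET"}

# Each alternative begins with a different class of character,
# so the first character decides which rule is applied.
TOKEN_RE = re.compile(r"""
    (?P<WORD>[A-Za-z_][A-Za-z0-9_]*)        # keyword or identifier
  | (?P<NUMBER>\d+(?:\.\d+)?)               # numeric literal
  | (?P<STRING>'[^']*'|"[^"]*")             # string literal
  | (?P<OPERATOR><=|>=|<>|[=<>+\-*/])       # operator
  | (?P<DELIMITER>[(),;])                   # delimiter
  | (?P<SKIP>\s+)                           # whitespace ends a lexical unit
""", re.VERBOSE)


def tokenize(sql):
    """Split an SQL string into (kind, text) lexical units."""
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {sql[pos]!r} at position {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        if kind == "WORD":
            # The keyword rule takes priority over the identifier rule.
            kind = "KEYWORD" if text.upper() in KEYWORDS else "IDENTIFIER"
        tokens.append((kind, text))
    return tokens
```

For instance, `tokenize("UPDATE t SET a = 1")` classifies `UPDATE` and `SET` as keywords and `t` and `a` as identifiers.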

Syntax parsing obtains one lexical unit at a time from lexical parsing. If the lexical unit satisfies the grammar rules, the next lexical unit is extracted and matched, until the statement ends; if it does not, an error is reported and parsing stops. Syntax parsing eventually converts the target SQL statement into an abstract syntax tree.

In this embodiment, the following SQL statement is used as an example for explanation.

The grammar rules are given as follows:

<Query> ::= <SFW> - a query statement can be described as a select-from-where statement

<Query> ::= ( <Query> ) - a query statement can be enclosed in brackets to serve as another query statement

<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition> - a select-from-where statement can be described as selecting attributes from multiple queryable elements (tables or views), with the WHERE expression returning either true or false

<SelList> ::= <Attribute> , <SelList> - the attributes of the result to be queried can be described as one attribute, a comma, and then other attributes

<SelList> ::= <Attribute> - the attributes of the result to be queried can be described as a single attribute

<FromList> ::= <Relation> , <FromList> - multiple queryable relations (tables or views) can be described as one relation (table or view), a comma, and then other queryable tables or views

<FromList> ::= <Relation> - multiple queryable tables or views can be described as a single relation (table or view)

<Condition> ::= <Attribute> = <Query> - a conditional expression can be described as an attribute equal to the result of a query

<Condition> ::= <Attribute> = <Pattern> - a conditional expression can be described as an attribute equal to a parameter

<Condition> ::= <Attribute> = <Attribute> - a conditional expression can be described as an attribute equal to an attribute (equality is used here only as an example).

The relations are given as follows:

STUDENT(name, age, classid)

CLASS(id, classname)

Given an SQL statement (query the names of the students in Class One):

SELECT name FROM STUDENT WHERE classid = (SELECT id FROM CLASS WHERE classname = 'Class One')

The abstract syntax tree shown in FIG. 2 can be derived by using the above SQL grammar rules as a guide and applying the LL(1) algorithm.
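By way of illustration only, the derivation may be sketched as a hand-written recursive-descent parser with one token of lookahead, which is one common way of realizing LL(1) parsing; the table-driven form is not shown. The tuple-shaped tree nodes, the class `Parser`, and the function `parse` are hypothetical, the rules that share a common prefix are left-factored by looking at the next token, and the `=` comparison and the literal in the usage example below are assumptions about the example statement. The actual tree of FIG. 2 may differ.

```python
import re

TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_][A-Za-z0-9_]*)|('[^']*')|([(),=]))")
KEYWORDS = {"SELECT", "FROM", "WHERE"}


def tokenize(sql):
    """Minimal tokenizer covering only what the example grammar needs."""
    tokens, pos = [], 0
    sql = sql.strip()
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        word, string, symbol = m.groups()
        if word is not None:
            is_keyword = word.upper() in KEYWORDS
            tokens.append(("KEYWORD", word.upper()) if is_keyword else ("NAME", word))
        elif string is not None:
            tokens.append(("PATTERN", string))
        else:
            tokens.append(("SYMBOL", symbol))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind, text=None):
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def query(self):
        # <Query> ::= ( <Query> ) | <SFW>
        if self.peek() == ("SYMBOL", "("):
            self.expect("SYMBOL", "(")
            node = self.query()
            self.expect("SYMBOL", ")")
            return node
        return self.sfw()

    def sfw(self):
        # <SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
        self.expect("KEYWORD", "SELECT")
        attributes = self.name_list()
        self.expect("KEYWORD", "FROM")
        relations = self.name_list()
        self.expect("KEYWORD", "WHERE")
        return ("Query", ("SelList", attributes), ("FromList", relations), self.condition())

    def name_list(self):
        # Left-factored form shared by <SelList> and <FromList>.
        names = [self.expect("NAME")]
        while self.peek() == ("SYMBOL", ","):
            self.expect("SYMBOL", ",")
            names.append(self.expect("NAME"))
        return names

    def condition(self):
        # <Condition> ::= <Attribute> = ( <Pattern> | <Attribute> | <Query> )
        left = self.expect("NAME")
        self.expect("SYMBOL", "=")
        kind = self.peek()[0]
        if kind == "PATTERN":
            return ("Condition", left, "=", ("Pattern", self.expect("PATTERN")))
        if kind == "NAME":
            return ("Condition", left, "=", ("Attribute", self.expect("NAME")))
        return ("Condition", left, "=", self.query())


def parse(sql):
    """Parse one statement and return its abstract syntax tree as nested tuples."""
    parser = Parser(tokenize(sql))
    tree = parser.query()
    parser.expect("EOF")
    return tree
```

Calling `parse("SELECT name FROM STUDENT WHERE classid = (SELECT id FROM CLASS WHERE classname = 'Class One')")` yields an outer `Query` node whose `Condition` node holds the inner `Query` node as its right-hand side.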

Optionally, the obtaining the target SQL statement comprises:

In application, an SQL statement may be pre-stored in the corresponding system, in which case the pre-stored SQL statement is used as the target SQL statement. If no pre-stored SQL statement exists, an SQL statement may instead be sent by the corresponding program; when the server receives it, the received SQL statement is used as the target SQL statement. After the target SQL statement is determined, it is parsed by the SQL parser to construct the abstract syntax tree.
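By way of illustration only, the selection of the target SQL statement may be sketched as follows. The function name `choose_target_sql` and the treatment of a blank string as "no pre-stored statement" are assumptions.

```python
from typing import Optional


def choose_target_sql(stored_sql: Optional[str], received_sql: Optional[str]) -> Optional[str]:
    """Prefer the pre-stored statement; fall back to the received one."""
    if stored_sql is not None and stored_sql.strip():
        return stored_sql
    # No pre-stored statement: use whatever was received (None if nothing arrived yet).
    return received_sql
```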

As shown in FIGS. 2 to 4, S102: Traverse the abstract syntax tree to construct a first query plan, and replace the corresponding original two-dimensional table in the first query plan with a preset configuration parameter table to obtain a second query plan, wherein index and/or access information is pre-stored in the configuration parameter table.

The index corresponds to a database type data source, the access information corresponds to a file type data source, and the access information comprises an address of the file type data source. Specifically, the configuration parameter table at least prestores corresponding relations of data source addresses, table names and field names so as to determine addresses of target data, and further improve query efficiency.

The first query plan is constructed as follows. The whole syntax tree is traversed using in-order traversal. When the current node can be converted directly into a query plan node, the query plan of the current node is constructed; when it cannot be constructed directly (for example, when it contains a sub-query), the sub-tree is traversed in an attempt to construct the query plan of the sub-tree. Once the sub-tree query plan has been constructed, it is added to the current query node; if the sub-tree query plan cannot be constructed, it is ignored and the other nodes are traversed. The first query plan constructed from the abstract syntax tree shown in FIG. 2 is shown in FIG. 3.
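By way of illustration only, the construction of a query plan from a syntax tree may be sketched as follows. The tuple-shaped tree, the plan node shapes (`Project`, `Filter`, `Scan`, `Compare`), and the function `build_plan` are hypothetical, and the plan of FIG. 3 may differ. The sketch uses plain recursion over the tree rather than a strict in-order traversal, and it raises an error for an unsupported node instead of ignoring it.

```python
def build_plan(node):
    """Recursively turn a syntax-tree node into a query-plan node."""
    kind = node[0]
    if kind == "Query":
        _, (_, attributes), (_, relations), condition = node
        scans = [{"op": "Scan", "table": name} for name in relations]
        return {
            "op": "Project",
            "attributes": attributes,
            "child": {"op": "Filter", "condition": build_plan(condition), "children": scans},
        }
    if kind == "Condition":
        _, left, operator, right = node
        plan = {"op": "Compare", "left": left, "operator": operator}
        if right[0] == "Query":
            # Not directly convertible: build the sub-tree's plan, then attach it.
            plan["subquery"] = build_plan(right)
        else:
            plan["right"] = right[1]
        return plan
    raise ValueError(f"unsupported syntax-tree node: {kind}")


# Hypothetical syntax tree for the example statement (the literal is an assumption).
ast = ("Query", ("SelList", ["name"]), ("FromList", ["STUDENT"]),
       ("Condition", "classid", "=",
        ("Query", ("SelList", ["id"]), ("FromList", ["CLASS"]),
         ("Condition", "classname", "=", ("Pattern", "'Class One'")))))
first_plan = build_plan(ast)
```

In the resulting `first_plan`, the scan of `STUDENT` sits under the outer filter, and the plan for the sub-query on `CLASS` is attached to the outer comparison.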

After the first query plan is constructed, the server uses a preset replacement algorithm to replace the original two-dimensional tables in the first query plan with the preset configuration parameter table. Taking FIG. 3 and FIG. 4 as an example, the original two-dimensional tables in the first query plan of FIG. 3 are STUDENT and CLASS; after they are replaced with the preset configuration parameter table, the generated second query plan is as shown in FIG. 4. The configuration parameter table in FIG. 4 pre-stores the indexes corresponding to various types of databases, the access information corresponding to various types of files, and the corresponding conversion algorithms. After the data sources storing the target data are determined, the index or access information and the conversion algorithm corresponding to each data source can be obtained from the second query plan. The target data is queried according to the index or access information and, while it is being read, is converted by the conversion algorithm into target data in a uniform format in the data warehouse.

Optionally, the obtaining the second query plan by replacing the original two-dimensional table corresponding to the first query plan with the preset configuration parameter table includes:

In the process of replacing the original two-dimensional table in the first query plan, the first query plan needs to be traversed first to determine the address of the original two-dimensional table and the association between the original two-dimensional table and the first query plan; after the determination, the preset replacement algorithm can be executed, the corresponding original two-dimensional table is replaced by the configuration parameter table according to the replacement algorithm, namely, the association between the original two-dimensional table and the first query plan is released, and the association between the configuration parameter table and the first query plan is established, so that the second query plan is constructed and completed.
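By way of illustration only, the replacement step may be sketched as follows. The dictionary-shaped plan, the contents of `CONFIG_TABLE` (including the placeholder addresses), and the function `replace_tables` are hypothetical; the preset replacement algorithm of the embodiment is not disclosed in detail and may differ.

```python
import copy

# Hypothetical configuration parameter table: one entry per original table name.
CONFIG_TABLE = {
    "STUDENT": {"source_type": "database", "address": "db.example.invalid:3306",
                "table": "student", "fields": {"name": "name", "classid": "classid"}},
    "CLASS": {"source_type": "file", "address": "ftp://files.example.invalid/class.json",
              "fields": {"result.id": "id", "result.name": "classname"}},
}


def replace_tables(plan, config):
    """Return a copy of the plan in which every Scan node points at a config entry."""
    if isinstance(plan, list):
        return [replace_tables(item, config) for item in plan]
    if not isinstance(plan, dict):
        return plan
    if plan.get("op") == "Scan":
        # Release the association with the original table and
        # establish the association with the configuration entry.
        return {"op": "SourceScan", "source": copy.deepcopy(config[plan["table"]])}
    return {key: replace_tables(value, config) for key, value in plan.items()}


first_plan = {"op": "Project", "attributes": ["name"],
              "child": {"op": "Filter", "children": [{"op": "Scan", "table": "STUDENT"}],
                        "condition": {"op": "Compare", "left": "classid", "operator": "=",
                                      "subquery": {"op": "Project", "attributes": ["id"],
                                                   "child": {"op": "Scan", "table": "CLASS"}}}}}
second_plan = replace_tables(first_plan, CONFIG_TABLE)
```

Building a new plan rather than modifying the first one in place is a design choice of this sketch; it keeps the first query plan available for inspection.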

S103: Execute the second query plan to query the target data from each data source storing the target data, and convert the queried target data in a data warehouse to generate target data in the unified format.

As shown in the second query plan of FIG. 4, the target data corresponding to STUDENT and CLASS may be stored on a plurality of different servers. The data on a server may be held in a database, or in a JSON file, an Excel file, an XML file, or the like.

The server can obtain the type, address, database name, user name, password and table name of the database according to the second query plan, and then the server accesses the database and the table data in the database through the information to realize query of the target data.

For file type data sources such as txt, json, excel, Hadoop and the like, the server can obtain the corresponding data source type and access information of the data source according to the second query plan, wherein the access information comprises but is not limited to an address, a user name and a password, so that the target file is determined, the target file is called according to the information, and the target data is queried through the target file.
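By way of illustration only, querying several data sources at the same time according to their type may be sketched as follows. The handler functions are placeholders that return dummy rows, and the names `HANDLERS` and `query_all` and the `source_type` and `address` keys are hypothetical; a thread pool is merely one possible way of issuing the queries concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def query_database(source):
    # Placeholder for a real database query (assumption).
    return [{"origin": source["address"], "kind": "database"}]


def query_file(source):
    # Placeholder for fetching and parsing a file (assumption).
    return [{"origin": source["address"], "kind": "file"}]


HANDLERS = {"database": query_database, "file": query_file}


def query_all(sources):
    """Query every data source concurrently, dispatching on its type."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(HANDLERS[s["source_type"]], s) for s in sources]
        # Collect in submission order so results line up with `sources`.
        return [future.result() for future in futures]
```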

Optionally, the executing the second query plan to query for the target data from each data source storing the target data includes:

if the data source is a database, executing the second query plan to generate an executable SQL statement and a query field command corresponding to the data source, acquiring the field type of the data source according to the query field command, checking the target data according to the field type, and if the target data is legal, performing query operation on the data source according to the executable SQL statement and returning a query result.

The database field types are all stored in the database, and the corresponding query field command is called only according to the database types when the database field types are obtained.
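By way of illustration only, the database branch may be sketched with SQLite from the Python standard library standing in for the actual database. Here `PRAGMA table_info` plays the role of the query field command, and the check is applied to a filter value, which is one possible reading of "checking the target data according to the field type". The function name `query_with_type_check` and the two-type mapping are assumptions.

```python
import sqlite3


def query_with_type_check(conn, table, column, value):
    """Look up the column types, validate `value`, then run the query."""
    # SQLite's field query command; other databases use different commands.
    columns = {row[1]: row[2].upper() for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in columns:
        raise ValueError(f"unknown field: {column}")
    expected = int if columns[column] == "INTEGER" else str
    if not isinstance(value, expected):
        raise TypeError(f"{column} expects {expected.__name__}")
    # The column name was checked against the schema; `table` is assumed to come
    # from trusted configuration. The value itself is bound as a parameter.
    return conn.execute(f"SELECT * FROM {table} WHERE {column} = ?", (value,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class (id INTEGER, classname TEXT)")
conn.executemany("INSERT INTO class VALUES (?, ?)", [(1, "Class One"), (2, "Class Two")])
rows = query_with_type_check(conn, "class", "classname", "Class One")
```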

In this embodiment, the server queries the data source through a data reading interface; after the data reading interface receives a data reading request, the server queries the corresponding data source according to the received information. The data reading interface takes as input a data source connection mode (such as ftp, url or another type), a data source type, an original table name (or file name or another type), a page number, the number of rows per page, a filter condition, and the required fields, and returns multiple rows of data.

For different types of databases, a paging query statement can easily be constructed from the information passed in when the data reading interface is called, such as the data source connection mode, the data source type, the original table name, the page number, and the number of rows per page. For example, for MySQL, the following SQL statement for a paging query can be generated:

SELECT * FROM student LIMIT (curPage - 1) * pageSize, pageSize;

The target data is then read from the query result. Other types of databases can be queried in a similar manner, which is not described again here.
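By way of illustration only, the generation of the paging statement may be sketched as follows. MySQL's LIMIT clause takes integer constants rather than arithmetic expressions, so the offset is computed before the statement text is produced. The function name `build_page_query` and the identifier check on the table name are assumptions.

```python
def build_page_query(table: str, cur_page: int, page_size: int) -> str:
    """Build a MySQL-style paging statement for 1-based page numbers."""
    if cur_page < 1 or page_size < 1:
        raise ValueError("cur_page and page_size must be positive")
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    # Rows to skip before the requested page starts.
    offset = (cur_page - 1) * page_size
    return f"SELECT * FROM {table} LIMIT {offset}, {page_size};"
```

For example, `build_page_query("student", 3, 20)` produces `SELECT * FROM student LIMIT 40, 20;`.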

For a file type data source, a preset parsing program calls an interface such as FTP or a web service to obtain the corresponding file; the file is then read and parsed to obtain the target data.
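By way of illustration only, the parsing of an obtained file may be sketched as follows for comma-delimited text. Fetching the file over FTP or a web service is not shown; the sample content, the `FIELD_TYPES` mapping, and the function `parse_file_source` are hypothetical, and skipping rows that fail the field type check is a design choice of this sketch rather than a requirement of the embodiment.

```python
import csv
import io

FIELD_TYPES = {"id": int, "classname": str}  # assumed to come from the configuration


def parse_file_source(text, field_types):
    """Parse delimited text and keep only rows whose values match the field types."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        try:
            rows.append({name: cast(record[name]) for name, cast in field_types.items()})
        except (KeyError, ValueError, TypeError):
            continue  # illegal row: missing field or value of the wrong type
    return rows


sample = "id,classname\n1,Class One\nx,Class Two\n3,Class Three\n"
rows = parse_file_source(sample, FIELD_TYPES)
```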

Optionally, the converting, in a data warehouse, the target data into the uniform format according to the queried target data includes:

determining each original storage data source corresponding to the target data;

executing a preset conversion algorithm corresponding to each original storage data source according to the original storage data sources, and converting target data in a unified format in a data warehouse according to the conversion algorithm and the target data.

Because the target data may be stored in different types of data sources, a corresponding conversion algorithm needs to be established for each type of data source in order to generate target data in the unified format in the data warehouse. After the type of the data source is determined and the target data is queried, the corresponding conversion algorithm is executed, and the target data in the unified format is generated in the data warehouse.

Optionally, the converting the target data in a unified format in the data warehouse according to the conversion algorithm and the target data includes:

For target data from a database, the conversion process is simple: the field names and field types can be converted directly according to the configured field relationships.

Optionally, the converting the target data in a unified format in the data warehouse according to the conversion algorithm and the target data further includes:

The target data conversion process for XML, JSON, and similar files is somewhat more complex. Taking JSON data as an example, the conversion algorithm is as follows:

A field conversion configuration similar to {result.id -> id, result.name -> classname} is set up. Before data conversion, the result attribute of the first object is located directly and checked to determine whether it is an array. If it is an array, the id and name of each element are acquired in a loop; if it is a single object, the data is acquired only once.

If the relational data is constructed as follows:

result.id	result.name
1	Class One
2	Class Two
3	Class Three

After the field names and field types are converted according to the conversion algorithm, the results are as follows:

id	classname
1	Class One
2	Class Two
3	Class Three
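By way of illustration only, the JSON conversion described above may be sketched as follows. The shape of the JSON document (a top-level object with a `result` attribute), the `FIELD_MAP` dictionary, and the function `convert_json` are assumptions; conversion of field types is omitted.

```python
import json

# Field conversion configuration: result.id -> id, result.name -> classname.
FIELD_MAP = {"id": "id", "name": "classname"}


def convert_json(text, field_map):
    """Map the `result` attribute of a JSON document to unified-format rows."""
    result = json.loads(text)["result"]
    # A single object is handled as a one-element array, so it is read only once.
    items = result if isinstance(result, list) else [result]
    return [{target: item[source] for source, target in field_map.items()} for item in items]


doc = '{"result": [{"id": 1, "name": "Class One"}, {"id": 2, "name": "Class Two"}]}'
rows = convert_json(doc, FIELD_MAP)
single = convert_json('{"result": {"id": 3, "name": "Class Three"}}', FIELD_MAP)
```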

S104: Summarize and analyze the target data in the uniform format.

In this method, a first query plan is constructed by traversing an abstract syntax tree, and the corresponding original two-dimensional table in the first query plan is replaced with a preset configuration parameter table to obtain a second query plan. Executing the second query plan queries a plurality of different data sources simultaneously, and target data in a unified format is generated directly in a data warehouse from the queried target data. No separate data extraction is needed, which improves the overall efficiency of data summarization and makes subsequent summary analysis easy. The method shields the data format differences among the data sources, realizes uniform query analysis, and increases the development speed of data analysis applications. In addition, the system calls each data source only when data analysis is carried out, so the process of constructing a data aggregation program is omitted and system resources are saved.

As shown in FIG. 5, an embodiment of the present invention further provides an analysis system for a multiple data source structure, including:

the parsing module 51 is used for acquiring a target SQL statement and parsing the target SQL statement to construct an abstract syntax tree;

the replacing module 52 is configured to traverse the abstract syntax tree to construct a first query plan, and replace a corresponding original two-dimensional table in the first query plan by using a preset configuration parameter table to obtain a second query plan, wherein the configuration parameter table is pre-stored with index and/or access information;

the execution module 53 executes the second query plan to query the target data from each data source storing the target data, and converts the queried target data in a data warehouse to generate target data in a uniform format;

and the analysis module 54 summarizes and analyzes the target data in the uniform format.

As shown in FIG. 6, an embodiment of the present invention further provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the multiple data source structure analysis method described in any one of the above, such as S101 to S104 shown in FIG. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the above system embodiment, such as the functions of the modules 51 to 54 shown in FIG. 5.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a storage medium and executed by a processor, to instruct related hardware to implement the steps of the embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The storage medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the content of the storage medium may be increased or decreased as required by legislation and patent practice in the jurisdiction, for example, in some jurisdictions, the storage medium does not include electrical carrier signals and telecommunication signals according to legislation and patent practice.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Although the present disclosure has been described above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present disclosure, and these changes and modifications are intended to be within the scope of the present disclosure.

Claims

1. A method for analyzing a structure of multiple data sources is characterized by comprising the following steps:

acquiring a target SQL statement, and analyzing the target SQL statement to construct an abstract syntax tree;

and summarizing and analyzing the target data in the uniform format.

2. The method for multiple data source structural analysis of claim 1, wherein said obtaining a target SQL statement comprises:

3. The multiple data source structure analysis method of claim 1, wherein the replacing the corresponding original two-dimensional table in the first query plan with the preset configuration parameter table to obtain the second query plan comprises:

4. The multiple data source structure analysis method as claimed in claim 1, wherein said executing the second query plan to query the target data from each data source storing the target data comprises:

executing the second query plan, acquiring the type of the data source storing the target data, and simultaneously querying each data source according to the type of the data source;

5. The multiple data source structure analysis method as claimed in claim 1, wherein the generating of the target data in a unified format by converting the queried target data in a data warehouse comprises:

determining each original storage data source corresponding to the target data;

executing a preset conversion algorithm corresponding to each original storage data source according to the original storage data source, and converting the target data in the unified format in the data warehouse according to the conversion algorithm and the target data.

6. The multiple data source structure analysis method of claim 5, wherein said converting the target data in the unified format in a data warehouse according to the conversion algorithm and the target data comprises:

7. The multiple data source structure analysis method of claim 6, wherein said converting the target data in the unified format in a data warehouse according to the conversion algorithm and the target data further comprises:

8. An analysis system for a multiple data source architecture, comprising:

9. A terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the multiple data source structure analysis method according to any one of claims 1 to 7 when executing said computer program.

10. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the multiple data source structure analysis method of any one of claims 1 to 7.