CN115510021A

CN115510021A - Method and system for constructing standard layer of data warehouse

Info

Publication number: CN115510021A
Application number: CN202210749186.5A
Authority: CN
Inventors: 杨立才; 邵宏力; 胡超; 刘磊; 李云; 邓知知
Original assignee: Jiangsu Kunshan Rural Commercial Bank Co ltd
Current assignee: Jiangsu Kunshan Rural Commercial Bank Co ltd
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-12-23
Anticipated expiration: 2042-06-29
Also published as: CN115510021B

Abstract

The invention relates to a method and a system for constructing a data warehouse standard layer, where the standard layer comprises a table model and a field model. The method comprises the following steps: for each table in the database, determining whether the table is an island table, that is, a table that has no foreign key relationship with any other table, and putting each non-island table into the standard layer as a table model; determining whether each field of each table in the database is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than a threshold and the values are non-default; when the data type inferred by analysis is inconsistent with the original type and the proportion of data judged to be of the inferred type is 100%, recommending a type conversion; and if the field is a code value field, recommending that a code value conversion rule be configured. The invention improves the degree of data standardization.

Description

Method and system for constructing standard layer of data warehouse

Technical Field

The invention belongs to the technical field of business intelligence, and particularly relates to a method and a system for constructing a data warehouse standard layer.

Background

In the construction of data systems such as data warehouses, data governance platforms and data lakes, data needs to be standardized before it is loaded. In the traditional technical scheme, data standardization is carried out by manually inspecting the PDM file of the database, the table remarks, the field contents and so on, in order to judge whether each field of each table needs to be standardized and how. This approach depends heavily on manpower. Data standardization becomes very difficult if the naming of fields, tables and field code values is not standardized and naming explanations are missing, if personnel do not understand the data structures and relationships, if PDM files or description documents are lacking, or if personnel do not understand the business processes of the organization. In particular, when the organization is complex and the system contains a large number of tables, a great deal of manpower must be invested to identify and judge the data, and problems such as incomplete standardization and missing standards still remain.

Disclosure of Invention

The invention provides a method and a system for constructing a data warehouse standard layer.

In order to solve the technical problems in the prior art, the invention provides a method for constructing a data warehouse standard layer, wherein the standard layer comprises a table model and a field model, and the method comprises the following steps:

for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model; an island table is a table that has no foreign key relationship with any other table;

determining whether each field of each table in the database is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than a threshold (for example, 2%) and the values are non-default;

when the data type inferred by analysis is inconsistent with the original type and the proportion of data judged to be of the inferred type is 100%, recommending a type conversion; for example, when the original type is text and all (100%) of the stored data are floating point numbers (that is, decimal numbers), conversion to the more accurate floating point type is recommended; and if the field is a code value field, recommending that a code value conversion rule be configured. The code values described here are enumerated values.
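By way of illustration only, the field-level rules above can be sketched in Python as follows. The function names, the reading of "fill rate" as the share of rows that are neither NULL nor the column default, and the use of `float()` parsing to judge whether a text value is a floating point number are all assumptions of this sketch; the description does not fix these details.

```python
from typing import Optional


def fill_rate(values: list, default=None) -> float:
    """Share of rows whose value is neither NULL nor the column default (assumed reading)."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and v != default)
    return filled / len(values)


def keep_field(values: list, is_master_data: bool,
               default=None, threshold: float = 0.02) -> bool:
    """A field enters the standard layer if it is master data or sufficiently filled."""
    return is_master_data or fill_rate(values, default) > threshold


def _is_float(text: str) -> bool:
    # Assumption: a value counts as a floating point number if float() can parse it.
    try:
        float(text)
        return True
    except ValueError:
        return False


def recommend_type(values: list, original_type: str) -> Optional[str]:
    """Recommend 'float' only when 100% of the non-NULL text values parse as floats."""
    non_null = [v for v in values if v is not None]
    if original_type == "text" and non_null and all(_is_float(str(v)) for v in non_null):
        return "float"
    return None
```

With the example threshold of 2%, a non-master field in which only 1 row out of 100 is filled would be left out of the standard layer, while a text field holding only values such as "1.5" and "2" would receive a recommendation to convert to a floating point type.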

As a preferred embodiment, whether each table in the database is an island table is determined through a table-level knowledge graph. The table-level knowledge graph is a knowledge graph that presents the tables and the foreign key relationships among them as a visual graph structure; it comprises nodes and edges, where each node represents a table and each edge represents a foreign key relationship. Whether a table has a foreign key relationship is determined by whether its node has an edge to another node in the table-level knowledge graph: when no edge exists between a node and any other node, the table represented by that node is an island table.
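The island-table test described above amounts to finding nodes without edges. A minimal sketch, assuming the graph is available as a list of table names and a list of directed (primary key table, foreign key table) pairs; the function name and the example edge are invented for illustration:

```python
def find_island_tables(tables: list[str],
                       fk_edges: list[tuple[str, str]]) -> set[str]:
    """Return the tables that have no foreign key edge to any other table."""
    connected: set[str] = set()
    for pk_table, fk_table in fk_edges:
        # A self-reference is not a relationship with another table, so it is ignored.
        if pk_table != fk_table:
            connected.update((pk_table, fk_table))
    return set(tables) - connected


tables = [
    "ods.ods_s03_ctr_loan_cont",
    "ods.ods_s03_prd_bank_info",
    "ods.ods_s57_tb_fss_transbook",
]
# Hypothetical edge: the bank information table is referenced by the contract table.
edges = [("ods.ods_s03_prd_bank_info", "ods.ods_s03_ctr_loan_cont")]
islands = find_island_tables(tables, edges)
```

In this invented example only `ods.ods_s57_tb_fss_transbook` would be treated as an island table and excluded from the table model.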

As a preferred embodiment, whether each field of each table in the database is a master data field is determined through a field-level knowledge graph. The field-level knowledge graph is a knowledge graph that presents the fields and the inter-table relationships as a visual graph structure; it comprises nodes and edges, where each node represents a field and each edge represents a relationship between fields. The inter-table relationships are embodied as relationships between fields from different tables and comprise at least the foreign key relationship, data equality and null-removed data equality. To determine the master data fields, pairs of fields whose inter-table relationship is a foreign key relationship, data equality or null-removed data equality are found through the field-level knowledge graph; when the original data of the two fields come from different service systems, both fields are taken as master data fields.
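As an illustrative sketch of this selection rule, the field-level graph can be reduced to a mapping from each field to its service system number plus a list of labelled edges. The labels `fk`, `equal` and `same` follow the relationship markers used later in this description; the function name, the data layout and the example field names are assumptions of the sketch.

```python
Field = tuple[str, str]  # (table name, field name)

# Relationship types that qualify a pair of fields as master data candidates.
MASTER_RELATIONS = {"fk", "equal", "same"}


def master_data_fields(field_system: dict[Field, str],
                       edges: list[tuple[Field, Field, str]]) -> set[Field]:
    """Fields linked by a qualifying relationship whose data come from different systems."""
    result: set[Field] = set()
    for field_a, field_b, relation in edges:
        if relation in MASTER_RELATIONS and field_system[field_a] != field_system[field_b]:
            result.update((field_a, field_b))
    return result


# Hypothetical example data.
systems = {
    ("ods.ods_s58_m_ci_person", "cust_no"): "S58",
    ("ods.ods_s03_ctr_loan_cont", "cust_id"): "S03",
    ("ods.ods_s03_ctr_loan_cont", "bank_no"): "S03",
    ("ods.ods_s03_prd_bank_info", "bank_no"): "S03",
}
graph_edges = [
    (("ods.ods_s58_m_ci_person", "cust_no"), ("ods.ods_s03_ctr_loan_cont", "cust_id"), "fk"),
    (("ods.ods_s03_prd_bank_info", "bank_no"), ("ods.ods_s03_ctr_loan_cont", "bank_no"), "fk"),
]
masters = master_data_fields(systems, graph_edges)
```

Here only the customer number pair would qualify, because the two bank number fields both come from the same service system.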

As a preferred embodiment, the method for obtaining the table-level knowledge graph comprises the following steps: acquiring, for each table in the database, the service system it comes from, its table name and the field names in it; for each table, analyzing the characteristics of each field according to the field values in the table; for each table, calculating the intra-table functional dependencies among its fields according to the table name, the field names and the field values; for each table, identifying the primary key according to the intra-table functional dependencies, searching the other tables for the corresponding foreign key according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key; and presenting the tables and the foreign key relationships among them as a visual graph structure, which serves as the table-level knowledge graph.

As a preferred embodiment, the method for obtaining the inter-table relationships in the field-level knowledge graph comprises the following steps: selecting the table A to which the foreign key belongs and, through its intra-table functional dependencies, finding the closure of the foreign key field; deduplicating the data of the fields in the closure to form a temporary table B whose primary key is the foreign key field; through the foreign key relationship, taking the table C containing the primary key as the left table and the temporary table B as the right table, and performing an inner join to form a new temporary table D; and comparing, within the temporary table D, the values of the fields that come from table A and table C to obtain the following relationships:

data equality, namely the two columns of data in the temporary table D, one from a field of table A and one from a field of table C, are completely equal;

null-removed data equality, namely the two columns of data in the temporary table D, one from a field of table A and one from a field of table C, are equal after null values are removed.

The invention also provides a system for constructing a data warehouse standard layer, which comprises: a processor; a database storing tables; and a memory in which a program is stored,

wherein when the processor executes the program, the following operations are performed:

for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model, an island table being a table that has no foreign key relationship with any other table; determining whether each field of each table in the database is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than a threshold (for example, 2%) and the values are non-default; when the data type inferred by analysis is inconsistent with the original type and the proportion of data judged to be of the inferred type is 100%, recommending a type conversion; and if the field is a code value field, recommending that a code value conversion rule be configured.

Compared with the prior art, the invention has the remarkable advantages that:

(1) The invention exports all data files from a plurality of upstream service systems and loads them into a big data platform. Using the mass storage and computing power of the big data platform, it computes and analyzes all table data of all service systems to obtain the primary key and foreign key relationships and the functional dependencies between each table and the other tables, and standardizes each table and each field according to these relationships;

(2) In the construction of a data warehouse or data lake system, all tables and fields that need to enter the data warehouse or data lake can be standardized directly, without relying on manual identification and judgment or on personnel who are familiar with the data tables, relationships, field contents and field code values, and without investing costly human resources. This improves the efficiency of standardization, raises the degree of data standardization, and ensures that the data entering the data warehouse or data lake are unified.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

FIG. 1 is a schematic flow diagram of one embodiment of the present invention.

Fig. 2 is a schematic flow chart of step 300 in fig. 1.

FIG. 3 is a simplified diagram of a table-level knowledge graph.

FIG. 4 is a schematic overview of a field-level knowledge graph.

FIG. 5 is a detailed diagram of one part of a field-level knowledge graph.

FIG. 6 is a detailed diagram of another part of a field-level knowledge graph.

Detailed Description

It is easily understood that those skilled in the art can conceive of various embodiments of the present invention according to its technical solution without departing from its essential spirit. Therefore, the following detailed description and the accompanying drawings merely illustrate the technical solution of the present invention and should not be construed as the whole of the invention or as limiting its technical solution. Rather, these embodiments are provided so that this disclosure will be thorough and complete. The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof and which, together with the embodiments, serve to explain the innovative concepts of the invention.

The method and the system for constructing the data warehouse standard layer can complete the construction of the standard layer using only key characteristic information obtained through data analysis.

The invention relates to a method for constructing a standard layer of a data warehouse, wherein the standard layer comprises a table model and a field model, and the method comprises the following steps:

for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model; an island table is a table that has no foreign key relationship with any other table;

determining whether each field of each table in the database is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than 2% and the values are non-default;

when the data type inferred by analysis is inconsistent with the original type and the proportion of data judged to be of the inferred type is 100%, recommending a type conversion; and if the field is a code value field, recommending that a code value conversion rule be configured.

In the invention, as a preferred mode, whether each table in the database is an island table is determined through a table-level knowledge graph. The table-level knowledge graph is a knowledge graph that presents the tables and the foreign key relationships among them as a visual graph structure; it comprises nodes and edges, where each node represents a table and each edge represents a foreign key relationship. Whether a table has a foreign key relationship is determined by whether its node has an edge to another node in the table-level knowledge graph: when no edge exists between a node and any other node, the table represented by that node is an island table. The method for obtaining the table-level knowledge graph comprises the following steps: acquiring, for each table in the database, the service system it comes from, its table name and the field names in it; for each table, analyzing the characteristics of each field according to the field values in the table; for each table, calculating the intra-table functional dependencies among its fields according to the table name, the field names and the field values; for each table, identifying the primary key according to the intra-table functional dependencies, searching the other tables for the corresponding foreign key according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key; and presenting the tables and the foreign key relationships among them as a visual graph structure, which serves as the table-level knowledge graph.

In the invention, as a preferred mode, whether each field of each table in the database is a master data field is determined through a field-level knowledge graph. The field-level knowledge graph is a knowledge graph that presents the fields and the inter-table relationships as a visual graph structure; it comprises nodes and edges, where each node represents a field and each edge represents a relationship between fields. The inter-table relationships are embodied as relationships between fields from different tables and comprise at least the foreign key relationship, data equality and null-removed data equality. To determine the master data fields, pairs of fields whose inter-table relationship is a foreign key relationship, data equality or null-removed data equality are found through the field-level knowledge graph; when the original data of the two fields come from different service systems, both fields are taken as master data fields. The method for obtaining the inter-table relationships in the field-level knowledge graph comprises the following steps: selecting the table A to which the foreign key belongs and, through its intra-table functional dependencies, finding the closure of the foreign key field; deduplicating the data of the fields in the closure to form a temporary table B whose primary key is the foreign key field; through the foreign key relationship, taking the table C containing the primary key as the left table and the temporary table B as the right table, and performing an inner join to form a new temporary table D; and comparing, within the temporary table D, the values of the fields that come from table A and table C to obtain the following relationships:

data equality, namely the two columns of data in the temporary table D, one from a field of table A and one from a field of table C, are completely equal;

null-removed data equality, namely the two columns of data in the temporary table D, one from a field of table A and one from a field of table C, are equal after null values are removed.

In another aspect, the present invention further provides a system for constructing a standard layer of a data warehouse, comprising: a processor; a database storing tables; and a memory in which a program is stored, wherein when the processor executes the program, the following operations are performed: for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model, an island table being a table that has no foreign key relationship with any other table; determining whether each field of each table in the database is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than 2% and the values are non-default; when the data type inferred by analysis is inconsistent with the original type and the proportion of data judged to be of the inferred type is 100%, recommending a type conversion; and if the field is a code value field, recommending that a code value conversion rule be configured.

The method and system for building a data warehouse standard layer according to the present invention will be described in detail below with reference to a specific embodiment. In practice, in order to save the calculation results obtained from each step, a series of tables are established in the calculation system to store the result data of each step. Of course, in actual operation, various tools such as text documents may be used to store the calculation results of the steps. As one embodiment, the following series of data tables may be used in building the data warehouse standard layer to store the calculation results of each step:

Table 1: table list table TABLES_LIST;

Table 2: field list table COLUMNS_LIST;

Table 3: master data field information table MASTER_DATA_INFO;

Table 4: field characteristic information table COLUMNS_FEATURE_INFO;

Table 5: standard layer model table.

The table templates constructed above may be placed in advance in the storage device of the system. As shown in FIG. 1, the method for constructing a data warehouse standard layer according to this embodiment includes the following steps:

S100, obtaining the table names of the data tables used for constructing the data warehouse, and storing the table names in the table list table TABLES_LIST.

The table data reading device reads the list of all tables from the database and stores the table names of all tables in the list table template preset in the storage device, forming the table list table TABLES_LIST of the full database as shown in Table 1. If the tables come from different service systems, this step also includes obtaining the service system number of each table.

Table 1 shows a listing of all tables read from the database.

Table 1: table list table TABLES_LIST (partial example)

SYS_CODE	TABLE_CODE	COMMENT
S03	ods.ods_s03_acc_accp	Silver tent
S03	ods.ods_s03_ctr_loan_cont	Contract main table
S03	ods.ods_s03_prd_bank_info	Bank information
S55	ods.ods_s55_bt_discount_batch	Bisection buy batch
S58	ods.ods_s58_m_ci_customer	Customer basic information table
S58	ods.ods_s58_m_ci_person	Personal customer information main table
S57	ods.ods_s57_tb_fss_transbook	Transfer information flow chart

The meanings of the individual items in table 1 are as follows:

SYS_CODE is the number of the business system. The business systems are the working systems used by an organization; for example, a bank may operate a loan system, a payroll system and so on at the same time, and the data of these business systems are stored in the data warehouse in the form of tables.

TABLE_CODE is the English name of the table in the data warehouse.

COMMENT is the Chinese name of each table. The Chinese names shown in the COMMENT column are for convenience of illustration only; in actual implementation, this column is not necessarily required.

S200, obtaining the fields of each table and storing them in the field list table COLUMNS_LIST.

The table data reading device obtains the field information of each table from the data tables stored in the data warehouse and stores it in the field list table template preset in the storage device, forming the field list table COLUMNS_LIST. A portion of the field list table is shown in Table 2.

Table 2: field list table COLUMNS_LIST (partial example)

SYS_CODE	TABLE_CODE	COL_NUM	COL_CODE	COMMENT
S58	ods.ods_s58_m_ci_person	1	cust_no	Customer number
S58	ods.ods_s59_m_ci_person	2	cust_name	Name of customer
S58	ods.ods_s60_m_ci_person	3	cust_eng_name	English name of customer
S58	ods.ods_s61_m_ci_person	4	py_name	Phonetic names

The meanings of the individual items in table 2 are as follows:

SYS_CODE is the service system number.

TABLE_CODE is the English name of the table in the data warehouse.

COL_NUM is the field number.

COL_CODE is the field name.

COMMENT is the Chinese name of each field. The Chinese names shown in the COMMENT column are for convenience of illustration only; in actual implementation, this column is not necessarily required.

S300, acquiring a table-level knowledge graph and a field-level knowledge graph.

As shown in fig. 2, the present step specifically includes the following steps:

S301, for each table, analyzing the characteristics of each field according to the values of the field in the table.

The features include qualitative features and quantitative features; the qualitative features may include the data type of the field, and the quantitative features may include the length of the field.

In this embodiment, the qualitative features of a field are obtained by performing the following qualitative analysis on the values of the field (that is, the data in the field):

COL_TYPE is the data type of the field, for example character string (with various storage lengths), text, numeric value, date, time and so on.

COL_NULLABLE indicates whether the field is nullable. It is a qualitative feature of the field and is optional: in some embodiments, nullability need not be used as a qualitative feature.

COL_PK indicates whether the field is a primary key and is a qualitative feature of the field. This feature cannot be obtained in the present step; it is recorded in Table 4 (the field characteristic information table) after the primary keys have been identified in the subsequent steps.

COL_AUTOINCRE indicates whether the field is an auto-increment field. It is a qualitative feature of the field and is optional: in some embodiments it need not be used as a qualitative feature.

COL_DEFULT indicates whether the field holds a default value. It is a qualitative feature of the field and is optional: in some embodiments it need not be used as a qualitative feature.

CODE_VALUE_FLG indicates whether the field is a code value field. It is a qualitative feature of the field and is optional: in some embodiments it need not be used as a qualitative feature.

In this embodiment, the quantitative features include:

COL_RECORDS is the number of rows of the field; it is an indicator feature.

COL_DISTINCT is the number of rows of the field after deduplication; it is an indicator feature.

COL_NOTNULL is the number of rows whose field value is not NULL; it is an indicator feature. It is optional: in some embodiments it need not be used as an indicator feature of the field.

Of course, not all of the qualitative and quantitative features previously described are required in the present invention.
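For illustration, several of the quantitative features, together with the nullability flag, can be computed from a column of values as sketched below. The function name and the `MAX_LEN` / `MIN_LEN` keys are hypothetical, the non-NULL row count is stored under the key `COL_NOTNULL`, and the sketch assumes that NULL does not count as a distinct value, which the description leaves open.

```python
def field_features(values: list) -> dict:
    """Compute a few per-field features from the raw column values."""
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "COL_RECORDS": len(values),               # total number of rows
        "COL_DISTINCT": len(set(non_null)),       # distinct non-NULL values (assumption)
        "COL_NOTNULL": len(non_null),             # rows with a non-NULL value
        "COL_NULLABLE": len(non_null) < len(values),  # at least one NULL was observed
        "MAX_LEN": max(lengths, default=0),       # hypothetical key: maximum value length
        "MIN_LEN": min(lengths, default=0),       # hypothetical key: minimum value length
    }


features = field_features(["001", "002", None, "002", "10"])
```

The minimum and maximum lengths gathered here are the kind of field characteristics that a later step can compare against a primary key when searching for foreign key candidates.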

S302, for each table, calculating the functional dependencies among the fields of the same table according to the table name, the field names and the field values; these are called intra-table functional dependencies.

The prior art offers various methods for calculating functional dependencies, and this embodiment does not elaborate on them. For ease of understanding, only a brief description is given: a functional dependency consists of a determinant field (the field used for the derivation) and a dependent field (the derivation result). For example, in the table prd_bank_info the determinant field is the bank code bank_no and the dependent field is the bank name bank_name. This intra-table functional dependency can be understood as follows: the bank name bank_name can be derived from the bank code bank_no, or in other words, bank_name depends on bank_no.
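As a minimal illustration of what such a dependency means on the data, the following sketch checks whether one field determines another in a set of rows. It is a plain verification of a single candidate dependency, not the discovery method used by the embodiment (which is not specified); the function name and the sample rows are invented.

```python
def holds_fd(rows: list[dict], determinant: str, dependent: str) -> bool:
    """True if every value of `determinant` maps to exactly one value of `dependent`."""
    seen: dict = {}
    for row in rows:
        key = row[determinant]
        if key in seen and seen[key] != row[dependent]:
            return False  # same determinant value, two different dependent values
        seen.setdefault(key, row[dependent])
    return True


# Invented sample rows for the prd_bank_info example.
prd_bank_info = [
    {"bank_no": "001", "bank_name": "Alpha Bank", "region": "East"},
    {"bank_no": "002", "bank_name": "Beta Bank", "region": "East"},
    {"bank_no": "001", "bank_name": "Alpha Bank", "region": "East"},
]
```

On these rows `bank_no` determines `bank_name`, while `region` does not, because the value "East" appears with two different bank names.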

S303, for each table, identifying the primary key of the table according to the intra-table functional dependencies.

The prior art offers many methods for calculating the primary key of a table, and this embodiment does not elaborate on them. The method of the invention preferably finds the primary key from the set of candidate keys; the set may contain one or more candidate keys.
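One common way to derive candidate keys from functional dependencies is to compute attribute closures: a field whose closure covers every field of the table is a candidate key. The sketch below is restricted to single-field determinants and is offered only as an assumed illustration; the embodiment does not state which algorithm it uses, and the function names are hypothetical.

```python
def closure(field: str, fds: dict[str, set[str]]) -> set[str]:
    """All fields derivable from `field` through single-field functional dependencies."""
    result = {field}
    stack = [field]
    while stack:
        current = stack.pop()
        for dependent in fds.get(current, ()):
            if dependent not in result:
                result.add(dependent)
                stack.append(dependent)
    return result


def single_field_candidate_keys(fields: list[str],
                                fds: dict[str, set[str]]) -> list[str]:
    """Fields whose closure covers the whole table."""
    all_fields = set(fields)
    return [f for f in fields if closure(f, fds) == all_fields]


fields = ["bank_no", "bank_name", "region"]
fds = {"bank_no": {"bank_name"}, "bank_name": {"region"}}
keys = single_field_candidate_keys(fields, fds)
```

The same closure computation is what the later inter-table step relies on when it collects all fields that can be derived from a foreign key field.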

S304, searching the other tables for the corresponding foreign key according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key.

The prior art offers many methods for obtaining foreign key relationships, and this embodiment does not elaborate on them.

The invention preferably obtains the foreign key relationship in the following manner:

a field in another table that matches the data type and field length of the primary key is taken as a foreign key, where matching means that the data type of the field is the same as that of the primary key, the minimum length of the field is greater than or equal to the minimum length of the primary key, and the maximum length of the field is less than or equal to the maximum length of the primary key.
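The matching condition can be written directly as a predicate. The `FieldProfile` class and its attribute names are hypothetical containers for the field characteristics collected earlier; the comparison itself follows the rule stated above.

```python
from dataclasses import dataclass


@dataclass
class FieldProfile:
    data_type: str  # analyzed data type of the field
    min_len: int    # minimum observed value length
    max_len: int    # maximum observed value length


def matches_primary_key(candidate: FieldProfile, primary_key: FieldProfile) -> bool:
    """Same data type, and the candidate's length range lies within the primary key's."""
    return (
        candidate.data_type == primary_key.data_type
        and candidate.min_len >= primary_key.min_len
        and candidate.max_len <= primary_key.max_len
    )


pk = FieldProfile("string", 8, 12)
```

Under this rule a string field with lengths between 8 and 10 would be kept as a foreign key candidate for `pk`, whereas a string field with a maximum length of 20, or a numeric field, would be discarded.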

Further, the fields that match the data type and field length of the primary key may be filtered again, for example as follows:

traversing the primary keys in turn and, for each primary key, generating a corresponding Bloom filter from its values by hashing;

comparing the values of each field that matches the data type and field length of the primary key against the Bloom filter of that primary key, and taking the field as the finally determined foreign key when the data coincidence rate is greater than a preset threshold.
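A minimal sketch of this filtering step is given below. The Bloom filter parameters (bit array size, number of hash functions), the use of SHA-256 as the hash, and the definition of the coincidence rate as the share of non-NULL candidate values found in the filter are all assumptions made for illustration; the description does not specify them.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, a low rate of false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, value):
        data = str(value).encode("utf-8")
        for seed in range(self.num_hashes):
            # Derive several hash functions by prefixing a one-byte seed.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def coincidence_rate(pk_values: list, candidate_values: list) -> float:
    """Share of non-NULL candidate values that the primary key's Bloom filter accepts."""
    bloom = BloomFilter()
    for value in pk_values:
        bloom.add(value)
    candidates = [v for v in candidate_values if v is not None]
    if not candidates:
        return 0.0
    hits = sum(1 for v in candidates if v in bloom)
    return hits / len(candidates)
```

A candidate field would then be accepted as a foreign key when `coincidence_rate(...)` exceeds the preset threshold. Because a Bloom filter can report false positives, the computed rate is an upper estimate of the true overlap.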

S305, presenting the tables and the foreign key relationships among them as a visual graph structure, which serves as the table-level knowledge graph.

After the foreign key relationships are obtained, the tables in the database and the foreign key relationships among them are stored as a graph structure in a graph database preset in the storage device, forming a visual table-level knowledge graph that is convenient to query.

A table-level knowledge graph is shown in FIG. 3. The table-level knowledge graph comprises one type of node and one type of edge. Each circular node represents a table and stores information describing that table, including its basic meta information and related characteristic information, such as the English name of the table, the number of fields, the table annotation (Chinese name), the number of rows and so on. Of these items, everything other than the English table name is optional additional information that the node may or may not store. The table-level knowledge graph contains only one kind of relationship, the foreign key relationship, which is represented as an edge connecting two nodes and drawn as an arrow in FIG. 3, where the label FK on the edge denotes the foreign key relationship. Each edge is directed: the starting node is the table to which the primary key belongs, and the node the arrow points to is the table to which the foreign key belongs. Information about the foreign key relationship, such as the English name of the primary key field, the English name of the foreign key field and the coincidence rate between primary key and foreign key, is also stored on each edge. Preferably, the foreign key may be a combined (composite) foreign key; in that case the primary key fields and the foreign key fields are each stored on the edge as a list, and fields with the same index are associated with each other, so that the field mapping of the combined foreign key is stored completely.
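The edge layout for combined foreign keys can be illustrated with a small data class in which the primary key fields and the foreign key fields are kept in two parallel lists. The class and attribute names and the example field names are hypothetical; a graph database would store equivalent properties on the edge.

```python
from dataclasses import dataclass


@dataclass
class ForeignKeyEdge:
    pk_table: str             # start node: table that owns the primary key
    fk_table: str             # end node: table that owns the foreign key
    pk_fields: list[str]      # primary key fields, in order
    fk_fields: list[str]      # foreign key fields; same index means same mapping
    coincidence_rate: float   # overlap between primary key and foreign key values

    def __post_init__(self) -> None:
        if len(self.pk_fields) != len(self.fk_fields):
            raise ValueError("pk_fields and fk_fields must have the same length")

    def field_mapping(self) -> list[tuple[str, str]]:
        """Pair each primary key field with the foreign key field at the same index."""
        return list(zip(self.pk_fields, self.fk_fields))


edge = ForeignKeyEdge(
    pk_table="ods.ods_s03_prd_bank_info",
    fk_table="ods.ods_s03_ctr_loan_cont",
    pk_fields=["bank_no", "branch_no"],        # invented composite key
    fk_fields=["open_bank_no", "open_branch_no"],
    coincidence_rate=0.98,
)
```

Keeping the two lists index-aligned means that a single edge is enough to describe a composite key, instead of one edge per field pair.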

S306, calculating the inter-table relationships.

In the invention, the inter-table relationships are embodied as relationships between fields from different tables, and include the functional dependency, the data equality relationship and the null-removed data equality relationship between fields in different tables. The invention refers to a functional dependency between fields in different tables as an inter-table functional dependency. Inter-table functional dependencies include one-way dependencies and two-way dependencies. Therefore, in the invention, the inter-table relationships comprise four relationships:

one-way dependency;

two-way dependency;

data equality;

null-removed data equality.

These inter-table relationships supplement the foreign key relationships, which greatly enriches the relationships among the tables and enables more functions.

The four inter-table relationships are calculated as follows, for the primary key and the foreign key in each foreign key relationship:

first, the table A to which the foreign key belongs is selected and, through the intra-table functional dependencies, the foreign key field (including a joint foreign key) and the closure of the foreign key field are found; because all other fields in the closure can be derived from the foreign key, the fields in the closure are deduplicated to form a temporary table B with the foreign key field as its primary key;

second, the table C in which the primary key is located is taken as the left table and the temporary table B as the right table, and an inner join is performed to form a new temporary table D, the fields of which are derived from table A and table C;

third, the intra-table functional dependencies of each field in the temporary table D are calculated; the intra-table functional dependencies of the temporary table D are the inter-table functional dependencies of table A and table C;

finally, the field values from table A, to which the foreign key belongs, are compared with the field values from table C, to which the primary key belongs, so that the following relationships are obtained:

(1) One-way dependency: a field from table A and a field from table C have a one-way dependency in the temporary table D, and the relationship type is marked as fd; this embodiment only stores dependencies between single fields;

(2) Two-way dependency: a field from table A and a field from table C have a two-way dependency in the temporary table D, that is, the values of the two fields correspond one to one, and the relationship type is marked as bfd; this embodiment only stores dependencies between single fields;

(3) Data equality: the two columns of data of a field from table A and a field from table C are completely equal in the temporary table D; the data can be considered to have a strong association or redundancy relationship, and the relationship type is marked as equal;

(4) Data null equality: the two columns of data of a field from table A and a field from table C are equal in the temporary table D after null values are removed; the data can be considered to have a weak association or redundancy relationship, and the relationship type is marked as same;
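The calculation above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: tables are assumed to be lists of dictionaries, the closure of the foreign key is assumed to be known already, "null values removed" is interpreted as dropping every row in which either column is null, and the function names `determines`, `classify_pair` and `inter_table_relations` as well as the sample data are hypothetical.

```python
from itertools import product


def determines(rows, left, right):
    """Return True if column `left` functionally determines column `right`."""
    seen = {}
    for row in rows:
        # setdefault keeps the first right-hand value seen for each left-hand value.
        if seen.setdefault(row[left], row[right]) != row[right]:
            return False
    return True


def classify_pair(rows, col_a, col_c):
    """Classify one (table A field, table C field) pair inside temporary table D."""
    if not rows:
        return []
    rels = []
    a_to_c = determines(rows, col_a, col_c)
    c_to_a = determines(rows, col_c, col_a)
    if a_to_c and c_to_a:
        rels.append(("bfd", col_a, col_c))
    elif a_to_c:
        rels.append(("fd", col_a, col_c))  # (type, determinant, dependent)
    elif c_to_a:
        rels.append(("fd", col_c, col_a))
    if all(r[col_a] == r[col_c] for r in rows):
        rels.append(("equal", col_a, col_c))
    else:
        # Assumption: drop rows where either value is null, then compare.
        kept = [r for r in rows if r[col_a] is not None and r[col_c] is not None]
        if kept and all(r[col_a] == r[col_c] for r in kept):
            rels.append(("same", col_a, col_c))
    return rels


def inter_table_relations(table_a, table_c, fk, pk, closure):
    # Step 1: project table A onto the foreign key closure and deduplicate -> B.
    cols = [fk] + [c for c in closure if c != fk]
    seen, table_b = set(), []
    for row in table_a:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            table_b.append({f"A.{c}": row[c] for c in cols})
    # Step 2: inner join C (left) with B (right) on pk = fk -> D.
    by_fk = {}
    for b_row in table_b:
        by_fk.setdefault(b_row[f"A.{fk}"], []).append(b_row)
    table_d = []
    for c_row in table_c:
        for b_row in by_fk.get(c_row[pk], []):
            table_d.append({**{f"C.{k}": v for k, v in c_row.items()}, **b_row})
    # Steps 3 and 4: dependencies and value comparison for every A/C field pair.
    a_cols = [f"A.{c}" for c in cols]
    c_cols = [f"C.{c}" for c in table_c[0]] if table_c else []
    rels = []
    for col_a, col_c in product(a_cols, c_cols):
        rels.extend(classify_pair(table_d, col_a, col_c))
    return rels


# Hypothetical sample data: table C holds the primary key, table A the foreign key.
table_c = [
    {"cust_id": 1, "name": "Ann", "city": "Kunshan"},
    {"cust_id": 2, "name": "Bob", "city": "Kunshan"},
    {"cust_id": 3, "name": "Cid", "city": "Suzhou"},
]
table_a = [
    {"order_id": 10, "cust_id": 1, "cust_name": "Ann"},
    {"order_id": 11, "cust_id": 1, "cust_name": "Ann"},
    {"order_id": 12, "cust_id": 2, "cust_name": None},
    {"order_id": 13, "cust_id": 3, "cust_name": "Cid"},
]
rels = inter_table_relations(table_a, table_c, "cust_id", "cust_id",
                             ["cust_id", "cust_name"])
```

In this sample, the customer number columns are both two-way dependent and equal, the customer name columns are equal only after the null value is dropped, and the city is determined by the customer number but not the other way round.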

S307, displaying the foreign key relationships, the intra-table functional dependencies and the inter-table relationships in the form of a visual graph structure to serve as the field-level knowledge graph.

In this step, the fields are connected together by the foreign key relationships, the intra-table functional dependencies and the inter-table relationships, stored in a graph database preset in the storage device, and displayed in the form of a visual graph structure to serve as the field-level knowledge graph. An overview of the field-level knowledge graph is shown in fig. 4. The field-level knowledge graph comprises 1 kind of node and 7 kinds of edges in total. Each circular node represents a field and stores information describing that field, including the table name, the field English name, the service system number, the field number, the Chinese name, the analyzed data type, the analyzed field length, whether the field can be empty, whether the field is a primary key, whether the field is an auto-increment field, whether the field is a default value, the field type judgment data proportion, whether the field contains Chinese, the Chinese data proportion, whether the field is a code value field, the field row count, the field row count after deduplication, the maximum field length, the minimum field length, the average field length, the field length variance, the median length, the number of non-NULL rows of the field, and the like. Of the above information, everything except the table name and the field English name is optional, and in practical application information may be added or removed according to practical requirements. Because the space of fig. 4 is limited, only part of the field-level knowledge graph is shown there and the 7 kinds of edges cannot all be shown; the present invention therefore further illustrates the field-level knowledge graph with the partial detail views of fig. 5 and fig. 6. It should be noted that fig. 5 and fig. 6, like fig. 4, show parts of the field-level knowledge graph; they are not parts of fig. 4.

The 7 kinds of edges are respectively:

(1) Foreign key: in fig. 5 or fig. 6, this is an edge connecting two nodes, and FK marked on the edge indicates a foreign key relationship. Each edge is a directed edge, the starting node is the primary key and the node pointed to by the arrow is the foreign key. Each edge also stores related analysis information, mainly the primary key/foreign key coincidence rate.

(2) Joint foreign key: in fig. 5 or fig. 6, this is an edge connecting two nodes, and JFK marked on the edge indicates a joint foreign key relationship. Because multiple fields are associated, a joint key composed of several fields generates the same number of edges; for example, when the joint primary key is composed of 3 fields, the joint foreign key generates 3 edges. Each edge is a directed edge, the starting node belongs to the table of the primary key and the node pointed to by the arrow belongs to the table of the foreign key. Each edge also stores related analysis information, mainly the primary key/foreign key coincidence rate.

(3) Intra-table functional dependency: in fig. 5 or fig. 6, this is an edge connecting two nodes, and FD marked on the edge indicates an intra-table functional dependency. Since the intra-table functional dependencies are usually complicated, only the records with FD_LEVEL equal to 1 in the functional dependency record table are selected to generate these edges. Each edge is a directed edge, the starting node is the field in LEFT_COLUMN of the functional dependency record table and the node pointed to by the arrow is the corresponding field in RIGHT_COLUMN, indicating that RIGHT_COLUMN depends on LEFT_COLUMN.

(4) One-way inter-table functional dependency: in fig. 5 or fig. 6, this is an edge connecting two nodes, and EXFD marked on the edge indicates a one-way functional dependency between tables. All rows with REL_TYPE equal to fd in the inter-table relationship record table are converted into this relationship. Each edge is a directed edge, the starting node is the field in LEFT_COL_CODE of the inter-table relationship record table and the node pointed to by the arrow is the corresponding field in RIGHT_COL_CODE, indicating that RIGHT_COL_CODE depends on LEFT_COL_CODE.

(5) Two-way inter-table functional dependency: in fig. 5 or fig. 6, this is an edge connecting two nodes, and EXBFD marked on the edge indicates a two-way functional dependency between tables. All rows with REL_TYPE equal to bfd in the inter-table relationship record table are converted into this relationship. Each edge is an undirected edge (it is drawn as a directed edge in the figure because of a limitation of the graph database in the storage device, but is processed as an undirected edge in actual calculation); the starting node is the field in LEFT_COL_CODE of the inter-table relationship record table and the node pointed to by the arrow is the corresponding field in RIGHT_COL_CODE, indicating that RIGHT_COL_CODE and LEFT_COL_CODE depend on each other.

(6) Inter-table data equality: in fig. 5 or fig. 6, this is an edge connecting two nodes, and EQUALS marked on the edge indicates a data equality relationship between tables. All rows with REL_TYPE equal to equals in the inter-table relationship record table are converted into this relationship. Each edge is an undirected edge (it is drawn as a directed edge in the figure because of a limitation of the graph database in the storage device, but is processed as an undirected edge in actual calculation); the starting node is the field in LEFT_COL_CODE of the inter-table relationship record table and the node pointed to by the arrow is the corresponding field in RIGHT_COL_CODE, indicating that the data of RIGHT_COL_CODE and LEFT_COL_CODE are equal.

(7) Inter-table data null equality: in fig. 5 or fig. 6, this is an edge connecting two nodes, and SAME marked on the edge indicates a data null equality relationship between tables. All rows with REL_TYPE equal to same in the inter-table relationship record table are converted into this relationship. Each edge is an undirected edge (it is drawn as a directed edge in the figure because of a limitation of the graph database in the storage device, but is processed as an undirected edge in actual calculation); the starting node is the field in LEFT_COL_CODE of the inter-table relationship record table and the node pointed to by the arrow is the corresponding field in RIGHT_COL_CODE, indicating that the data of RIGHT_COL_CODE and LEFT_COL_CODE are equal after null values are removed.

Of course, not all of the above-described inter-table relationships in a field-level knowledge graph need be used in the present invention.
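The conversion of inter-table relationship records into edges of kinds (4) to (7) may be sketched as follows. The sketch keeps edges in a plain Python list rather than in a graph database; the edge label EQUALS, the acceptance of both spellings equal and equals for REL_TYPE, and the names `EDGE_KINDS`, `rows_to_edges` and `neighbors` are assumptions made for illustration only.

```python
# REL_TYPE -> (edge label, is the edge directed?).
# Assumption: "equal" and "equals" are both accepted and mapped to EQUALS.
EDGE_KINDS = {
    "fd": ("EXFD", True),
    "bfd": ("EXBFD", False),
    "equal": ("EQUALS", False),
    "equals": ("EQUALS", False),
    "same": ("SAME", False),
}


def rows_to_edges(rel_rows):
    """Turn rows of the inter-table relationship record table into edges."""
    edges = []
    for row in rel_rows:
        label, directed = EDGE_KINDS[row["REL_TYPE"]]
        edges.append({
            "label": label,
            "source": row["LEFT_COL_CODE"],   # starting node
            "target": row["RIGHT_COL_CODE"],  # node the arrow points to
            "directed": directed,
        })
    return edges


def neighbors(edges, node):
    """Fields reachable from `node`; undirected edges are followed both ways."""
    found = set()
    for edge in edges:
        if edge["source"] == node:
            found.add((edge["label"], edge["target"]))
        elif edge["target"] == node and not edge["directed"]:
            found.add((edge["label"], edge["source"]))
    return found


# Hypothetical records.
rel_rows = [
    {"REL_TYPE": "fd", "LEFT_COL_CODE": "a.cust_id", "RIGHT_COL_CODE": "c.city"},
    {"REL_TYPE": "same", "LEFT_COL_CODE": "a.cust_name", "RIGHT_COL_CODE": "c.name"},
]
edges = rows_to_edges(rel_rows)
```

Storing the direction flag on the edge mirrors the remark that bfd, equals and same edges are drawn with an arrow but treated as undirected in calculation.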

S400, acquiring all main data fields of the tables in the table list of step S100.

Through the field-level knowledge graph, pairs of fields whose inter-table relationship is a foreign key relationship, data equality or data null equality are found, and when the original data of the two fields come from different service systems, the two fields are taken as main data fields. As previously described, the field-level knowledge graph includes nodes and edges, each node representing a field and each edge representing a relationship between fields; a foreign key relationship, data equality or data null equality is embodied as the corresponding edge in the field-level knowledge graph. All the fields found are recorded as main data fields in a main data information table preset in the storage device.

Part of the main data information table is shown in Table 3 as an example.
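A minimal sketch of this selection rule is given below. The edge list, the field names and the mapping from field to service system are hypothetical, the label EQUALS is assumed for data equality edges, and only single-field foreign keys (label FK) are considered.

```python
# Edge labels that qualify a pair of fields as main data candidates.
# Assumption: joint foreign key edges (JFK) are left out of this sketch.
MASTER_EDGE_LABELS = {"FK", "EQUALS", "SAME"}


def find_master_fields(edges, system_of):
    """Return the fields linked by a qualifying edge across service systems.

    edges     -- iterable of (label, left_field, right_field)
    system_of -- dict mapping each field to its service system number
    """
    master = set()
    for label, left, right in edges:
        if label in MASTER_EDGE_LABELS and system_of[left] != system_of[right]:
            master.add(left)
            master.add(right)
    return master


# Hypothetical fields from three service systems.
system_of = {
    "sysA.cust.tel": "sysA",
    "sysB.cust.tel": "sysB",
    "sysA.cust.id": "sysA",
    "sysA.loan.cust_id": "sysA",
    "sysC.org.city": "sysC",
}
edges = [
    ("EQUALS", "sysA.cust.tel", "sysB.cust.tel"),  # different systems: kept
    ("FK", "sysA.cust.id", "sysA.loan.cust_id"),   # same system: ignored
    ("EXFD", "sysA.cust.id", "sysC.org.city"),     # wrong edge kind: ignored
]
master_fields = find_master_fields(edges, system_of)
```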

Table 3 Main data information table MASTER_DATA_INFO (partial example)

SYS_CODE	TABLE_CODE	COL_CODE	MASTER_ID	ORDER
s58	ods.ods_s58_m_ci_org	regi_regis_date	1821	1
s53	ods.ods_s53_vai_cus_com_xd	reg_start_date	1821	2
s03	ods.ods_s03_cus_com	reg_start_date	1821	3
s28	ods.ods_s28_cus_com	reg_start_date	1821	4
s53	ods.ods_s53_vai_cus_com_xd	fina_per_tel	1825	1
s03	ods.ods_s03_cus_com	fina_per_tel	1825	2
s28	ods.ods_s28_cus_com	fina_per_tel	1825	3

The meanings of the individual items in table 3 are as follows:

SYS_CODE is the service system number;

TABLE_CODE is the English name of the table in the data warehouse;

COL_CODE is the field name;

MASTER_ID is the main data group number; fields with the same group number belong to the same group, which means the data are shared;

ORDER is the sequence number within the group; the sequence is determined by the descending order of the field cardinality (i.e., the number of rows after deduplication), and the smaller the sequence number within the group, the more important the field.
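How MASTER_ID and ORDER could be filled in may be sketched as follows. The text does not state how fields are assigned to a group; the sketch assumes that fields connected through main data pairs form one group (connected components found with a union-find structure) and that group numbers are simply counted from 1. Only the ordering rule, descending row count after deduplication, is taken from the description. All names and figures are hypothetical.

```python
def assign_groups(master_pairs, distinct_count):
    """Group linked main data fields and rank them inside each group."""
    parent = {}

    def find(field):
        parent.setdefault(field, field)
        while parent[field] != field:
            parent[field] = parent[parent[field]]  # path compression
            field = parent[field]
        return field

    # Assumption: every pair of linked fields belongs to the same group.
    for left, right in master_pairs:
        parent[find(left)] = find(right)

    groups = {}
    for field in list(parent):
        groups.setdefault(find(field), []).append(field)

    rows = []
    # Sort groups by their smallest member so the numbering is deterministic.
    for master_id, members in enumerate(sorted(groups.values(), key=min), start=1):
        # Larger deduplicated row count first; ties broken by field name.
        ranked = sorted(members, key=lambda f: (-distinct_count[f], f))
        for order, field in enumerate(ranked, start=1):
            rows.append({"COL": field, "MASTER_ID": master_id, "ORDER": order})
    return rows


# Hypothetical main data pairs and deduplicated row counts.
pairs = [
    ("s03.cus.tel", "s28.cus.tel"),
    ("s28.cus.tel", "s53.cus.tel"),
    ("s03.cus.date", "s58.org.date"),
]
counts = {
    "s03.cus.tel": 900, "s28.cus.tel": 500, "s53.cus.tel": 1200,
    "s03.cus.date": 300, "s58.org.date": 800,
}
rows = assign_groups(pairs, counts)
```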

S500, acquiring the characteristics of all fields in the field list through the field-level knowledge graph.

The field characteristics generally include:

1) COL_RECORDS: number of rows of the field

2) COL_NOTCNULL: number of rows in which the field value is non-NULL

3) COL_DISTINCT: number of rows of the field after deduplication

4) COL_TYPE: analyzed type (field data type)

5) COL_TYPE_JUDGE_RATE: field type judgment data proportion

6) COL_DEFULT: whether the field is a default value

7) CODE_VALUE_FLG: whether the field is a code value field

8) FILL_RATE: fill rate

Features 1) to 7) are field features obtained directly from the field-level knowledge graph, and the fill rate FILL_RATE is calculated by the formula COL_NOTCNULL / COL_RECORDS × 100%. The aforementioned features are recorded in the field characteristic information table COLUMNS_FEATURE_INFO preset in the storage device.

Table 4 Field characteristic information table COLUMNS_FEATURE_INFO

Of course, not all of the aforementioned field features need be used in the present invention.
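As an illustration of the fill rate formula, the following sketch derives the row count, the non-NULL row count, the deduplicated row count and the fill rate from the raw values of one column. The function name `column_features` is hypothetical, NULL is represented by Python's `None`, and counting only non-NULL values as distinct is an assumption.

```python
def column_features(values):
    """Compute basic field features from a list of column values."""
    col_records = len(values)
    non_null = [v for v in values if v is not None]
    # Fill rate = non-NULL rows / total rows x 100%; 0 for an empty column.
    fill_rate = len(non_null) / col_records * 100 if col_records else 0.0
    return {
        "COL_RECORDS": col_records,
        "COL_NOTCNULL": len(non_null),
        # Assumption: NULL is not counted as a distinct value.
        "COL_DISTINCT": len(set(non_null)),
        "FILL_RATE": fill_rate,
    }


features = column_features(["a", None, "b", "a", None, None, None, None])
```

For the eight sample values, three are non-NULL, so the fill rate is 3 / 8 × 100% = 37.5%.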

S600, determining whether each table is an island table or not through a table-level knowledge graph.

As previously mentioned, a table-level knowledge graph is a knowledge graph that shows the tables and the foreign key relationships between tables in the form of a visual graph structure; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relationship. Whether a table has a foreign key relationship is determined by whether its node has edges to other nodes in the table-level knowledge graph: when no edge exists between a node and any other node, the table represented by that node has no foreign key relationship with other tables and is an island table. Whether a table is an island table is marked in the table list table TABLES_LIST.

Table 1 Table list table TABLES_LIST (partial example)

In the table, IS_ISLAND indicates whether the table is an island table, where Y indicates yes and N indicates no.
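The island check can be sketched as follows, assuming the foreign key edges of the table-level knowledge graph are available as pairs of table names. The function name `mark_island_tables` and the sample tables are hypothetical; a foreign key from a table to itself is not counted, since an island table is defined by the absence of relations with other tables.

```python
def mark_island_tables(tables, fk_edges):
    """Return {table: "Y" or "N"}, where "Y" marks an island table.

    tables   -- iterable of table names (nodes)
    fk_edges -- iterable of (primary_key_table, foreign_key_table) pairs
    """
    connected = set()
    for pk_table, fk_table in fk_edges:
        if pk_table != fk_table:  # a self-reference links no other table
            connected.add(pk_table)
            connected.add(fk_table)
    return {table: ("N" if table in connected else "Y") for table in tables}


# Hypothetical table list and foreign key edges.
is_island = mark_island_tables(
    ["cust", "loan", "audit_log", "tmp_import"],
    [("cust", "loan"), ("audit_log", "audit_log")],
)
```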

S700, forming the standard layer from the results of the above steps according to the following assembly rules.

According to the content of IS_ISLAND in Table 1, the tables whose value is N (that is, the non-island tables) are added to the standard layer, i.e., each non-island table is put into the standard layer as a table model.

Fields are then recommended for the standard layer according to Tables 3 and 4:

when a field is a main data field, the suggestion shown for the field in the display device is to retain the field, i.e., the field is added to the standard layer;

when a field is not a main data field, the field is recommended to be retained, i.e., added to the standard layer, if its fill rate is greater than 2% and it is not a default value; otherwise the field is not recommended to be added to the standard layer;

when the analyzed data type is inconsistent with the original type and the field type judgment data proportion is 100%, a type conversion is recommended; if the field is a code value field, configuring a code value conversion rule is recommended.
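These assembly rules for fields can be sketched as a single function. The sketch assumes the field features are supplied as a dictionary keyed by the feature names listed earlier, and that the type and code value suggestions are produced for every field regardless of whether it is retained; the function name `recommend_field` and the wording of the suggestion strings are hypothetical.

```python
def recommend_field(is_master, features, source_col_type, threshold=2.0):
    """Return (keep, advice) for one field.

    is_master       -- True if the field is a main data field
    features        -- dict with FILL_RATE, COL_DEFULT, COL_TYPE,
                       COL_TYPE_JUDGE_RATE and CODE_VALUE_FLG
    source_col_type -- original data type of the field
    threshold       -- fill rate threshold in percent (2% in the embodiment)
    """
    keep = is_master or (
        features["FILL_RATE"] > threshold and not features["COL_DEFULT"]
    )
    advice = ["retain field" if keep else "do not add field"]
    if (features["COL_TYPE"] != source_col_type
            and features["COL_TYPE_JUDGE_RATE"] == 100):
        advice.append(f"convert type {source_col_type} -> {features['COL_TYPE']}")
    if features["CODE_VALUE_FLG"]:
        advice.append("configure code value conversion rule")
    return keep, advice


# Hypothetical field: stored as text, but every value parses as a date.
sample = {
    "FILL_RATE": 40.0,
    "COL_DEFULT": False,
    "COL_TYPE": "date",
    "COL_TYPE_JUDGE_RATE": 100,
    "CODE_VALUE_FLG": False,
}
keep, advice = recommend_field(False, sample, "varchar")
```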

The resulting model of the standard layer is shown in Table 5.

Table 5 Information of ods.ods_s03_ctr_loan_cont in the display device

In the display device, COL_CODE is the field English name, SORCE_COL_TYPE is the original data type of the field, COL_TYPE is the analyzed type of the field, and ANALY_INFO is the assembled field analysis result.

The structure of each table in the above embodiments is merely an example, and in actual operation, the number of column data items is not necessarily only the items shown in each table in the above embodiments, and other item data may be provided.

The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that may be easily made by those skilled in the art within the technical scope of the present disclosure are also intended to be covered by the scope of the present invention.

It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes described in a single embodiment or with reference to a single figure, for the purpose of streamlining the disclosure and aiding in the understanding of various aspects of the invention by those skilled in the art. However, this should not be construed to mean that all features included in the exemplary embodiments are essential technical features of the patent claims.

Those skilled in the art will appreciate that all or part of the flow of the method of the above embodiments may be implemented by a computer program instructing related hardware, the program being stored in a computer readable storage medium. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.

It should be understood that the devices, modules, units, components, etc. included in the system of one embodiment of the present invention may be adaptively changed to be provided in an apparatus or system different from that of the embodiment. The different devices, modules, units or components comprised by the system of an embodiment may be combined into one device, module, unit or component or may be divided into a plurality of sub-devices, sub-modules, sub-units or sub-components.

The means, modules, units or components in the embodiments of the present invention may be implemented in hardware, or may be implemented in software running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that embodiments in accordance with the present invention may be practiced using a microprocessor or Digital Signal Processor (DSP). The present invention may also be embodied as a computer program product or computer readable medium for carrying out a portion or all of the methods described herein.

Claims

1. A method for constructing a standard layer of a data warehouse is characterized in that the standard layer comprises a table model and a field model;

for each table in the database, determining whether the table is an island table, and putting a non-island table serving as a table model into a standard layer; the island table means that the table has no foreign key relation with other tables;

determining whether the fields of all tables in the database are main data fields; when the field is a main data field, the field is put into a standard layer; when the field is not the main data field, if the filling rate in the field characteristic is greater than the threshold value and is a non-default value, putting the field into a standard layer;

2. The method of constructing a data warehouse standard layer of claim 1, wherein determining whether tables in the database are island tables is performed by a table-level knowledge graph;

the table-level knowledge graph is a knowledge graph which displays the tables and the foreign key relations among the tables in a visual graph structure; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;

whether the corresponding table has a foreign key relation or not is determined by whether edges exist among all nodes in the table-level knowledge graph or not, and when no edge exists between a certain node and any other node, the table represented by the node is an island table.

3. The method of constructing a data warehouse standard layer of claim 1, wherein determining whether the fields of the tables in the database are main data fields is performed by a field-level knowledge graph;

the field-level knowledge graph is a knowledge graph which displays the fields and the relations among the tables in a visual graph structure form; the field-level knowledge graph comprises nodes and edges, wherein each node represents a field, and each edge represents a relationship among the fields; the relationships between the tables are embodied as relationships between fields from different tables, and at least comprise foreign key relationships, data equality or data null equality;

when a main data field is determined, two fields whose inter-table relationship is a foreign key relationship, data equality or data null equality are found through the field-level knowledge graph, and when the original data of the two fields come from different service systems, the two fields are taken as main data fields.

4. The method of building a data warehouse standard layer as claimed in claim 2 wherein the method of obtaining the table-level knowledge-graph is:

acquiring a service system and a table name from which each table in a database comes, and a field name in each table;

analyzing the characteristics of each field according to the value of the field in the table aiming at each table; calculating and obtaining the in-table function dependency relationship among the fields in the table according to the table name, the field name and the field value aiming at each table;

aiming at each table, identifying the primary key of each table according to the function dependency relationship in the table, searching and determining the corresponding foreign key in other tables according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key;

and displaying the tables and the foreign key relations among the tables in a visual graph structure form to be used as a table-level knowledge graph.

5. The method of building a data warehouse standard layer as claimed in claim 3 wherein the method of obtaining the relationships between the tables in the field level knowledge graph is:

determining a table A to which the foreign key belongs through the function dependency relationship in the table, finding a closure of the field of the foreign key, and removing the duplication of the field in the closure to form a temporary table B taking the field of the foreign key as a main key;

taking the table C with the main key as a left table and taking the temporary table B as a right table through the relation of the external keys, and performing internal connection to form a new temporary table D;

the values of the fields in temporary table D in table a and table C are compared to form the following relationships:

the data are equal, namely the fields between the table A and the table C are completely equal in the two columns of data in the temporary table D;

6. A system for building a standardized layer of a data warehouse, comprising:

a processor; a database; and a memory in which a program is stored, a database storing tables,

determining whether the field of each table in the database is a main data field; when the field is a main data field, the field is put into a standard layer; when the field is not the main data field, if the filling rate in the field characteristic is greater than the threshold value and is a non-default value, putting the field into a standard layer;

7. The system for building a data warehouse standard layer of claim 6 wherein the determination of whether each table in the database is an islanding table is made through a table-level knowledge graph;

whether the corresponding table has a foreign key relationship is determined by whether edges exist among the nodes in the table-level knowledge graph, and when no edge exists between a certain node and any other node, the table represented by the node is an island table.

8. The system for building a data warehouse standard layer of claim 6, wherein determining whether the fields of the tables in the database are main data fields is performed by a field-level knowledge graph;

the field-level knowledge graph is a knowledge graph which displays the fields and the relationships among the tables in a visual graph structure form; the field-level knowledge graph comprises nodes and edges, wherein each node represents a field, and each edge represents a relationship between fields; the relationships between the tables are embodied as relationships between fields from different tables, and at least comprise foreign key relationships, data equality or data null equality;

when a main data field is determined, two fields whose inter-table relationship is a foreign key relationship, data equality or data null equality are found through the field-level knowledge graph, and when the original data of the two fields come from different service systems, the two fields are taken as main data fields.

9. The system for building a data warehouse standards layer of claim 7 wherein the method for obtaining the table-level knowledge-graph is:

10. The system for building a data warehouse standard layer as claimed in claim 8, wherein the method for obtaining the relationships between the tables in the field level knowledge graph is:

the values of the fields in temporary table D in tables a and C are compared to form the following relationships: