CN115328883A - Data warehouse modeling method and system - Google Patents
Data warehouse modeling method and system Download PDFInfo
- Publication number
- CN115328883A CN115328883A CN202210749147.5A CN202210749147A CN115328883A CN 115328883 A CN115328883 A CN 115328883A CN 202210749147 A CN202210749147 A CN 202210749147A CN 115328883 A CN115328883 A CN 115328883A
- Authority
- CN
- China
- Prior art keywords
- field
- tables
- foreign key
- path
- association
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 29
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 23
- 238000012545 processing Methods 0.000 claims abstract description 22
- 230000000007 visual effect Effects 0.000 claims abstract description 9
- 230000006870 function Effects 0.000 claims description 37
- 238000013507 mapping Methods 0.000 description 14
- 238000012546 transfer Methods 0.000 description 10
- 238000004458 analytical method Methods 0.000 description 8
- 238000004364 calculation method Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 4
- 238000004590 computer program Methods 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 230000002093 peripheral effect Effects 0.000 description 2
- BQCADISMDOOEFD-UHFFFAOYSA-N Silver Chemical compound [Ag] BQCADISMDOOEFD-UHFFFAOYSA-N 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 229910052709 silver Inorganic materials 0.000 description 1
- 239000004332 silver Substances 0.000 description 1
- 238000000547 structure data Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000001502 supplementing effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/211—Schema design and management
- G06F16/212—Schema design and management with details for data modelling support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a data warehouse modeling method and system. Displaying the tables and the foreign key relation among the tables in a visual graph structure form to be used as a table-level knowledge map, operating a minimum spanning tree algorithm on the basis of the table-level knowledge map to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; acquiring an association mode and an association field relation between the main table and the expected association table according to the optimal path; simultaneously acquiring selectable fields in all the expected association tables; removing fields which are not needed by the data warehouse modeling from the optional fields, reserving the needed fields, and determining the processing rule of the reserved fields; and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule. The invention improves the modeling quality.
Description
Technical Field
The invention belongs to the technical field of business intelligence, and particularly relates to a data warehouse modeling method and system.
Background
In the prior art, when modeling a data warehouse for table structure data, the relationships between tables are usually manually analyzed by field naming specifications, manual discrimination, compiling PDM files and the like, and then modeling is performed. Obviously, in the prior art, the dependence on manual work is very serious, the time and the labor for manual analysis are more, and if the field naming is not standard, an analyst is not familiar with the data structure and the relation, or lacks ER modeling files or description documents, the data warehouse model is very difficult to construct for the data table. Even if a model is constructed, the quality of the model is difficult to guarantee, and problems such as correlation errors and incomplete coverage are possible.
Disclosure of Invention
The invention provides a data warehouse modeling method and system.
In order to solve the technical problems in the prior art, the invention provides a data warehouse modeling method system, which comprises the following steps: the dependency relationship of the functions in each table, the primary key, the foreign key and the foreign key relationship of each table;
showing the tables and the foreign key relations among the tables in a visual graph structure form to be used as a table-level knowledge graph; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;
determining a main table and an expected association table expected to be associated with the main table, running a minimum spanning tree algorithm on the basis of a table-level knowledge graph to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges from the main table to the expected associated table;
acquiring an association mode and an association field relationship between a main table and an expected association table according to the optimal path, wherein the association mode is a foreign key relationship; simultaneously acquiring selectable fields in all the expected association tables;
removing fields which are not needed by data warehouse modeling from the optional fields, reserving the needed fields, and determining processing rules of the reserved fields;
and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule.
As a preferred implementation, after the table-level knowledge graph is established, the dimension value of the foreign key field is calculated, and the dimension value is stored in the attribute of the edge of the table-level knowledge graph corresponding to the foreign key relation and used as the weight of the edge; the dimension value refers to the number of rows of the external key field after the duplication is removed; and when a plurality of paths with the shortest path length are found by using the minimum spanning tree algorithm, selecting one path with the largest sum of the path weights as the optimal path.
As a preferred embodiment, the process of obtaining the functional dependency relationship in the table is: acquiring the table name of each table in a database and the field name in each table; analyzing the characteristics of each field according to the value of the field in the table aiming at each table; calculating to obtain a function dependency relationship among fields in the table as a function dependency relationship in the table according to the table name, the field name and the value of the field for each table; the features of the field include qualitative features and quantitative features; the qualitative characteristics include a data type of the field and the quantitative characteristics include a length of the field.
As a preferred embodiment, the process of obtaining the foreign key relationship is: and aiming at each table, identifying the primary key of each table according to the dependency relationship of the function in the table, searching and determining the corresponding foreign key in other tables according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key.
In a preferred embodiment, when searching and determining the foreign key, a field in the other table matching with the data type of the primary key and the field length is used as the foreign key, the field matching with the data type of the primary key and the field length means that the data type of the field is the same as the data type of the primary key, the minimum length of the field is greater than or equal to the minimum length of the primary key, and the maximum length of the field is less than or equal to the maximum length of the primary key.
As a preferred embodiment, two tables are arbitrarily selected from the service master table and all tables desired to be associated, a shortest path algorithm is used to calculate all selectable shortest paths between the two tables, and the selectable shortest path is used to replace the foreign key associated path.
In another aspect of the present invention, a data warehouse modeling system is further provided, including: a processor; a database; and a memory in which a program is stored, a database storing tables,
wherein when the processor executes the program, the following operations are performed:
showing the tables and the foreign key relations among the tables in a visual graph structure form to be used as a table-level knowledge graph; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;
determining a main table and an expected association table expected to be associated with the main table, running a minimum spanning tree algorithm based on a table-level knowledge map to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges from the main table to the expected associated table;
acquiring an association mode and an association field relationship between a main table and an expected association table according to the optimal path, wherein the association mode is a foreign key relationship; simultaneously acquiring selectable fields in all the expected association tables;
removing fields which are not needed by the data warehouse modeling from the optional fields, reserving the needed fields, and determining the processing rule of the reserved fields;
and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule.
Compared with the prior art, the invention has the remarkable advantages that:
when the model is constructed in a business intelligence and data warehouse system, the association relation between tables can be automatically calculated by a background only by manually selecting the main table and the expected association table, so as to assist in generating the model. The phenomena of low manual analysis efficiency and high error rate are avoided, and the modeling quality is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
FIG. 1 is a schematic flow diagram of one embodiment of the present invention.
FIG. 2 is a schematic diagram of a table-level knowledge graph in accordance with the present invention.
FIG. 3 is a schematic diagram of the association relationship of the transfer message flow meter according to an embodiment of the invention.
FIG. 4 is a diagram of an example of path selection in one embodiment of the invention.
Detailed Description
It is easily understood that various embodiments of the present invention can be conceived by those skilled in the art according to the technical solution of the present invention without changing the essential spirit of the present invention. Therefore, the following detailed description and the accompanying drawings are merely illustrative of the technical aspects of the present invention, and should not be construed as all of the present invention or as limitations or limitations on the technical aspects of the present invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete. Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with embodiments of the invention serve to explain the innovative concepts of the invention.
The data warehouse modeling method and the data warehouse modeling system can generate the association relationship of the data warehouse model by only obtaining the foreign key relationship through data analysis, and the association relationship comprises an association (join) mode and an association (join) condition and assists in the development of the data warehouse model.
The invention relates to a data warehouse modeling method system, which comprises the following steps: the dependency relationship of the functions in each table, the primary key, the foreign key and the foreign key relationship of each table;
displaying the tables and the foreign key relation among the tables in a visual graph structure form to be used as a table-level knowledge graph; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;
determining a main table and an expected association table expected to be associated with the main table, running a minimum spanning tree algorithm based on a table-level knowledge map to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges from the main table to the expected associated table;
acquiring an association mode and an association field relation between a main table and an expected association table according to the optimal path, wherein the association mode is a foreign key relation; simultaneously acquiring selectable fields in all the expected association tables;
removing fields which are not needed by the data warehouse modeling from the optional fields, reserving the needed fields, and determining the processing rule of the reserved fields;
and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule.
As a preferred implementation, after the table-level knowledge graph is established, the dimension value of the foreign key field is calculated, and the dimension value is stored in the attribute of the edge of the table-level knowledge graph corresponding to the foreign key relation and used as the weight of the edge; the dimension value refers to the number of rows of the external key field after the duplication is removed; and when a plurality of paths with the shortest path length are found by using the minimum spanning tree algorithm, selecting one path with the largest sum of the path weights as the optimal path.
As a preferred embodiment, the process of obtaining the functional dependency relationship in the table is: acquiring the table name of each table in a database and the field name in each table; analyzing the characteristics of each field according to the value of the field in the table aiming at each table; calculating to obtain a function dependency relationship among fields in the table as a function dependency relationship in the table according to the table name, the field name and the value of the field for each table; the characteristics of the field comprise qualitative characteristics and quantitative characteristics; the qualitative characteristic comprises a data type of the field and the quantitative characteristic comprises a length of the field.
As a preferred embodiment, the process of obtaining the foreign key relationship is: and aiming at each table, identifying the primary key of each table according to the dependency relationship of the function in the table, searching and determining the corresponding foreign key in other tables according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key.
In a preferred embodiment, when searching and determining the foreign key, a field in the other table matching with the data type of the primary key and the field length is used as the foreign key, the field matching with the data type of the primary key and the field length means that the data type of the field is the same as the data type of the primary key, the minimum length of the field is greater than or equal to the minimum length of the primary key, and the maximum length of the field is less than or equal to the maximum length of the primary key.
As a preferred embodiment, two tables are arbitrarily selected from the service master table and all tables desired to be associated, a shortest path algorithm is used to calculate all selectable shortest paths between the two tables, and the selectable shortest path is used to replace the foreign key associated path.
In another aspect of the present invention, a data warehouse modeling system is further provided, including: a processor; a database; and a memory in which a program is stored, a database storing tables,
wherein when the processor executes the program, the following operations are performed:
displaying the tables and the foreign key relation among the tables in a visual graph structure form to be used as a table-level knowledge graph; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;
determining a main table and an expected association table expected to be associated with the main table, running a minimum spanning tree algorithm on the basis of a table-level knowledge graph to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges from the main table to the expected associated table;
acquiring an association mode and an association field relation between a main table and an expected association table according to the optimal path, wherein the association mode is a foreign key relation; simultaneously acquiring selectable fields in all the expected association tables;
removing fields which are not needed by data warehouse modeling from the optional fields, reserving the needed fields, and determining processing rules of the reserved fields;
and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule.
The method and system for building a data warehouse model according to the present invention will be described in detail below with reference to a specific embodiment. In practice, in order to facilitate the storage of the calculation results obtained from each step, a series of tables are established in the calculation system to store the result data of each step. Of course, in actual operation, various tools such as text documents may be used to store the calculation results of the steps. As one example, the following series of data tables may be used in building the data warehouse model to store the results of the calculations for each step:
table 1 table LIST table TABLES _ LIST;
table 2 field LIST table column _ LIST;
table 3 FUNCTION DEPENDENCY record table FUNCTION _ default _ TAB;
table 4 foreign key record table FK _ INFO _ TAB;
TABLE 5 TABLE relationship information TABLE MAP _ TABLE _ RELATED;
table 6 field MAPPING information table MAP _ MAPPING _ COL _ INFO.
The above-constructed form template may be placed in advance in the storage device of the system. As shown in fig. 1, the method for constructing a data warehouse model according to the embodiment includes the following steps:
s1, obtaining table names of data tables used for constructing a data warehouse, and storing the table names into an obtained table list table.
And reading the list of all the tables from the data warehouse by the table data reading device, and storing the table list information into the storage device to form the list table of the full data warehouse.
Table 1 shows a listing of all tables read from the database.
Table 1 table LIST table TABLES _ LIST (partial example)
SYS_CODE | TABLE_CODE | COMMENT |
S03 | ods.ods_s03_acc_accp | Silver tent |
S03 | ods.ods_s03_ctr_loan_cont | Contract main table |
S03 | ods.ods_s03_prd_bank_info | Bank information |
S55 | ods.ods_s55_bt_discount_batch | Bisection buy batch |
S58 | ods.ods_s58_m_ci_customer | Customer basic information table |
S58 | ods.ods_s58_m_ci_person | Personal customer information main table |
S57 | ods.ods_s57_tb_fss_transbook | Transfer information flow chart |
The meanings of the individual items in table 1 are as follows:
the SYS _ CODE is a serial number of a business system, the business system is each working system used by a certain unit, for example, a certain bank has a loan system, a payroll system and the like at the same time, and data in the business systems are stored in a data warehouse in a table form.
TABLE _ CODE is the English name listed in the data warehouse.
COMMENT is the Chinese name of each table. The Chinese names shown in the COMMENT column are for convenience of illustration only, and in actual implementation, the column of data information including the Chinese names is not necessarily required.
And S2, obtaining fields of each table and storing the fields in a field LIST table COLUMNS _ LIST.
The table data reading device reads the meta information in each table from the information stored in the data warehouse and stores the meta information in the field LIST table template preset in the storage device to form a field LIST table COLUMNS _ LIST. A portion of the field list table is shown in table 2.
Table 2 field LIST table column _ LIST (partial example)
SYS_CODE | TABLE_CODE | COL_NUM | COL_CODE | COMMENT |
S58 | ods.ods_s58_m_ci_person | 1 | cust_no | Customer number |
S58 | ods.ods_s58m_ci_person | 2 | cust_name | Name of customer |
S58 | ods.ods_s58_m_ci_person | 3 | cust_eng_name | English name of customer |
S58 | ods.ods_s58_m_ci_person | 4 | py_name | Phonetic names |
The meanings of the items in table 2 are as follows:
the SYS _ CODE is a service system number;
TABLE _ CODE is the English name listed in the data warehouse;
COL _ NUM is a field number;
COL _ CODE is a field name;
COMMENT is the Chinese name of each field. The Chinese names shown in the column of COMMENT are merely for convenience of illustration, and in actual implementation, the column of data information including the Chinese names is not necessarily required.
And S3, calculating to obtain the in-table function dependency relationship, identifying the primary key according to the in-table function dependency relationship, obtaining the external key according to the primary key, and forming the external key relationship between the primary key and the external key.
And performing function dependence analysis on each table in the table list to obtain the function dependence relationship in the table. In the prior art, various algorithms can be used for calculating to obtain a function dependency relationship, identifying a main key according to the function dependency relationship in a table, and acquiring an external key according to the main key. One example of this step is as follows.
Firstly, for each table, according to the table name, the field name and the value of the field, calculating to obtain the functional dependency relationship among the fields in the same table, which is called as the functional dependency relationship in the table. In this embodiment, a table with a dimension division FLAG DIM _ FLAG of 1 in a table list configuration table ANALYSIS _ CONF _ TAB is obtained, and then function dependency ANALYSIS is performed on the table requiring function dependency ANALYSIS to obtain a function dependency relationship in the table. The intra-table FUNCTION DEPENDENCY is stored in the FUNCTION DEPENDENCY record table FUNCTION _ default _ TAB shown in table 3. Of course, as another embodiment, the dimension division FLAG DIM _ FLAG is not considered or set, but function-dependent analysis is performed on all tables. In the prior art, various algorithms can calculate and obtain the function dependence relationship.
Table 3 FUNCTION DEPENDENCY record table FUNCTION _ DEPENDENCY _ TAB (partial example)
Table 3 is merely an example of intra-table functional dependencies for some of the tables in this embodiment.
Next, for each table, the primary key of each table is identified based on the intra-table functional dependencies. In this embodiment, the primary key is a candidate code set, which may be one or more candidate codes, and is referred to as the primary key. After the function dependency calculation is completed, the invention selects a relation set of rows in which the function dependency derivation LEVEL FD _ LEVEL is less than or equal to 3 in table 6, and performs candidate code solution by using a method for fast solving candidate codes. Therefore, a certain range can be defined, and the calculation pressure is relieved.
And finally, searching and determining corresponding foreign keys in other tables according to the characteristics of the primary keys, and forming foreign key relationships between the primary keys and the foreign keys. This step is carried out by the following means:
and traversing the obtained primary keys in sequence by using an inter-table relationship analysis and generation device in the system, and generating a corresponding bloom filter for the value of each primary key by using a Hash method. In particular, for federated primary keys, the data (values for each primary key) is federated to construct the corresponding bloom filter.
For any two tables, such as Table A and Table B, the fields in Table B that are likely to be foreign keys are selected based on the qualitative and quantitative characteristics of the primary key in Table A. For example, fields that are likely to be foreign keys are sorted based on the data TYPE COL _ TYPE in the qualitative feature and the length of the field in the quantitative feature (MAX _ LEN is the maximum length of the field and MIN _ LEN is the minimum length of the field). When the data type of a field in the table B is a subset of the data type of the field of the primary key in the table A and the length range of the field is within the length range of the field of the primary key, the field can be selected as a field which can be used as a foreign key. For example, if the value of the primary key in Table A is numeric data with a length of 14-18 bits, the fields in Table B that may be used as foreign keys should be characterized by a minimum length of 14 or more, a maximum length of 18 or less, and a data type of numeric data.
And comparing the data possibly used as the foreign key field with the bloom filter obtained in the step S71 to obtain the main foreign key data overlapping rate, and if the overlapping rate is 100%, using the data as the finally determined foreign key. In one embodiment, when the main foreign key data overlapping rate is greater than the threshold value, the main foreign key data overlapping rate may be set as the finally determined foreign key. For example, considering data quality, if necessary, a certain tolerance is considered for the overlap rate of the main foreign key, and when the overlap rate of the main foreign key in the comparison field in the bloom filter is greater than 98%, the main foreign key is determined to be a foreign key. After the foreign key is determined, a main foreign key relationship is formed between the main key and the foreign key, and main foreign key relationship information is stored in a table preset in the storage device, thereby forming a foreign key record table. The foreign key record table FK _ INFO _ TAB is shown in table 4. Table 4 shows no federated foreign key, as this example does not contain a federated primary key.
TABLE 4 foreign key record Table FK _ INFO _ TAB
FK_SYS_CODE | S03 | S03 | S03 | S03 |
FK_TABLE_CODE | prd_bank_info | prd_bank_info | prd_bank_info | prd_bank_info |
FK_COL_CODE | bank_no | bank_no | bank_no | bank_no |
SYS_CODE | S03 | S55 | S55 | S57 |
TABLE_CODE | acc_accp | bt_discount_batch | bt_discount_batch | tb_fss_transbook |
COL_CODE | aorg_no | s_mbfebankcode | s_inacc_banknum | payeebankno |
DATA_RATE | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
The meaning of each item in the foreign key record table FK _ INFO _ TAB shown in table 4 is as follows:
FK _ SYS _ CODE is the number of the main service system.
FK _ TABLE _ CODE is the parent TABLE name in the foreign key relationship.
FK _ COL _ CODE is the name of the primary key field in the parent table in the foreign key relationship.
And SYS _ CODE is the serial number of the business system where the sub-table is located in the foreign key relationship.
TABLE _ CODE is the name of the child TABLE in the foreign key relationship.
COL _ CODE is the foreign key field name in the foreign key relationship.
DATA _ RATE is the foreign key registration ratio. I.e., the proportion of the foreign key field present in the primary key field. For example, if the value is 0.99, it indicates that 99% of the data in the foreign key field is present in the primary key. This value should be 1 when there is a foreign key constraint, but in practice dirty data may appear sometimes without establishing a foreign key constraint to guarantee system performance. This value functions to some tolerance of real dirty data.
And S4, displaying the tables and the foreign key relations among the tables in a visual graph structure form to be used as a table-level knowledge graph.
After the foreign key relationship is obtained, the tables in the database and the foreign key relationship among the tables are stored in a graph database preset in a storage device in a graph structure form, and a visual table-level knowledge graph which is convenient to query is formed.
A table-level knowledge graph is shown in fig. 2. The table-level knowledge graph comprises 1 node and 1 edge, wherein the round node represents a table, and each node stores information representing the table, including basic meta information and related characteristic information of the table, such as English name, field number, table annotation (Chinese name), table row number and the like of the table. In each item of information, other information than the table english name may be used as the preferred additional information, and the node may or may not store the information. The table-level knowledge graph only contains one relationship of foreign keys, and is represented in fig. 2 as an edge which is connected with two nodes and is represented by an arrow, wherein FK marked on the edge represents the relationship of the foreign keys, each edge is a directed edge, a starting node is a table to which a main key belongs, and a node pointed by the arrow is a table to which the foreign key belongs. The information of the foreign key relationship, such as the English name of the primary key field, the English name of the foreign key field, the coincidence rate of the primary and foreign keys, etc., is also stored on each edge.
Preferably, since the foreign key may be a combined foreign key, the storage of the fields on the edge of the primary key and the foreign key is stored by using a list, and the fields with the same subscript have an association, the field mapping relationship of the combined foreign key is completely stored. If the joint main key consists of two fields C1 and C2, the peripheral field thereof is C3 and C4, and the peripheral relation thereof consists of mapping relation C1 → C3 and mapping relation C2 → C4, in order to ensure the correct corresponding relation, list storage is adopted, and because the subscript exists in the list, the fields with the same subscript indicate that the mapping relation exists.
The embodiment may store the obtained foreign key relationship in a database in the storage device, for example, a neo4j graph database, and then display the foreign key relationship in the form of graph data. It should be understood that neo4j is only one type of graph database, and commonly used graph databases also include ArangoDB, orientDB, janussgraph, and other graph databases, which may exhibit foreign key relationships in the form of graph data.
And S5, calculating the dimension value of the foreign key field, and storing the dimension value in the attribute of the edge of the table-level knowledge graph corresponding to the foreign key relation as the weight of the edge.
The dimension value refers to the number of rows of the foreign key field after deduplication.
In this embodiment, the dimension value of the foreign key field is calculated and stored in the attribute of the edge corresponding to the foreign key relationship in the neo4j graph database as the weight of the edge.
Preferably, this step can be omitted and the weight defaults to 1.
Preferably, the weight value may be the reciprocal of the dimension value of the foreign key field multiplied by 1000000 to be suitable for the original minimum spanning tree algorithm, so as to simplify the calculation. At this time, the smaller the weight is, the larger the weight is, and the coefficient of 1000000 solves the problem that the decimal place is too long due to when the data is excessive in the table.
S6, the user selects the appointed service main table and other tables expected to be associated with the main table on the display device.
The business master table and the expectation-association table have been exposed as nodes in the table-level knowledge graph.
For example, if it is desired to create a warehouse model that takes account of transfer information flow as a main part and supplements relevant personal customer information, the transfer information flow table ods.ods _ s57_ tb _ fss _ transport may be selected as the main table, and the personal customer information main table ods.ods _ s58_ m _ ci _ person may be selected as the desired association table.
And S7, based on the foreign key relation shown in the form of graph data, processing the business main table and the expected associated table selected by the user by using a minimum spanning tree algorithm to obtain each path from the business main table to each expected associated table. And then selecting a path with the shortest path length as an optimal path, wherein the path with the shortest path length refers to a path with the minimum number of edges between tables.
And selecting one path with the minimum sum of the path weights as an optimal path when the shortest path has a plurality of paths in the process of running the minimum spanning tree algorithm. I.e., the native minimum spanning tree algorithm, the path of the minimum weight is calculated.
And if the step S5 is not carried out, namely if the edges in the path have no weight value or the weight values are all defaulted to 1, selecting the shortest path with the path length.
In this embodiment, the path is composed of a plurality of edges (foreign key relationships), each edge (foreign key relationship) is a field reference relationship, and is represented as a tree-like relationship.
Fig. 3 shows the relationship of transferring the information flow meter, i.e. the foreign key relationship. Two paths (regardless of direction) in fig. 3 show partial foreign key relationships of the main table transfer information pipeline table ods.ods _ s57_ tb _ fss _ transfook, where,
the first one is: ods _ s58_ m _ ci _ person — > ods _ s57_ tb _ fss _ transfook;
a second bar:
ods.ods_s57_tb_fss_transbook<——ods.ods_s58_m_ci_customer——>ods.ods_s58_ m_ci_person,
the first path length is 1, the second path length is 2, and the first path is selected for association because the length of the first path is shortest.
And S8, acquiring the association mode (namely, foreign key relationship, a path in the figure) and the association field relationship between the service main table and the table expected to be associated according to the optimal path, reading the optional field information (field information in the path) of all the tables expected to be associated from the table 2, and displaying the optional field information on a display device.
Preferably, the relevance of the return can be demonstrated using a directed graph. Fig. 4 shows an example of path selection.
Finally, a transfer information flow meter ods _ 57_ tb _ fss _ transbook is selected as a main table through a minimum spanning tree algorithm, the personal customer information main table ods _ 58_ m _ ci _ person is used as a shortest path of the expected association table, and the path length is 1. The optional fields are all fields in TABLE 2 field LIST TABLE column _ LIST, TABLE _ CODE is ods.ods _ s57_ tb _ fss _ shift, ods.ods _ s58_ m _ ci _ person (only a part of TABLE 2).
And S9, selecting fields needing to be reserved and corresponding field processing rules from the selectable fields by the user, and removing the fields not needing to be reserved.
In general, in the data warehouse model construction, not all fields are required, so fields that need to be reserved can be selected from optional fields, and fields that do not need to be reserved can be removed.
The processing rule refers to an ETL processing mode, for example, the first 6 bits of the identity card number field idno are intercepted, and the processing rule is substring (idno, 1, 6).
Preferably, because a plurality of foreign key relationships may be included between the two tables, the two tables may be arbitrarily selected from the service master table and all tables desired to be associated, the shortest path algorithm is used to calculate all selectable shortest paths between the two tables, and the selectable shortest path is used to replace a foreign key associated path. I.e. the association means and association field relationship between the add or replace service master table and the table desired to be associated. For example, two tables of a and B are selected to find the shortest path, and when three associated paths of a → C → B, a → D → B, a → C → F → B actually exist, a → C → B, a → D → B are the shortest, any one of the two paths can be selected.
Preferably, the user can modify and adjust the association condition by himself. Such as: wherein, the transaction status field transaction status in the transfer information flow meter ods, ods _ S57_ tb _ fss _ transaction has two statuses of success (S) and failure (F), when the transaction needs to ignore the failed transaction, the association condition may be modified from automatically generated t1. Cumtno = t2. Cumt _ no to t1. Cumtno = t2. Cumt _ no and t2. Franstatus = 'S'.
Preferably, the user can add the data filtering condition by himself.
The proportion of the above foreign key relation is 99.895%, it is easy to understand that 99.895% of the client numbers in the transfer information flow meter ods _ s57_ tb _ fss _ transfer book are the client numbers in the personal client information main table ods _ s58_ m _ ci _ person, that is, a small number of other clients such as public clients can be added with the data filtering condition t2. Cut _ non node null because a model of supplementing the relevant personal client information mainly by the transfer information flow is expected to be established;
preferably, if the model needs to be previewed, starting a previewing function, generating an SQL statement by using an association mode, an association field relation, a field and a processing rule, and browsing to generate a data warehouse previewing model, and when the result does not accord with the expectation, repeating the steps 6, 7, 8 and 9 until the model accords with the expectation.
And S10, after the steps are finished and the user confirms that the steps are finished, the generated processing rule is recorded in a table relation information table shown in a table 5 and a field mapping information table shown in a table 6 which are preset in the storage device.
TABLE 5 TABLE relationship information TABLE MAP _ TABLE _ RELATED (partial example)
TABLE 5 TABLE relationship information TABLE MAP _ TABLE _ RELATED stores path association information, i.e. the association manner and association field relationship between the traffic master TABLE and the TABLE desired to be associated in step S8.
TABLE 5 TABLE in the relational information TABLE MAP TABLE RELATED,
the JOB _ CODE is a JOB number, and rows having the same JOB number represent an association between an original table set and an original table included in one model table.
JOB _ NAME is the NAME of the JOB,
DATABASE _ ENV is the target DATABASE Schema,
JOIN _ TABLE is a TABLE name,
TABLE _ ALIAS is a TABLE name,
the JOIN _ TYPE is in an association mode, wherein the JOIN _ TYPE has 4 association modes in total, 1 represents an inner connection, 2 represents a left outer connection, 3 represents a right outer connection, and 4 represents a full outer connection;
JOIN _ CONDITION is an association rule.
Table 6 field MAPPING information table MAP _ MAPPING _ COL _ INFO
Table 6 field MAPPING information table MAP _ MAPPING _ COL _ INFO, stores the fields to be reserved and the processing rules in step 9.
Table 6 in the field mapping information table,
JOB _ CODE is a JOB number,
JOB _ NAME is the NAME of the JOB,
DATABASE _ ENV is the target DATABASE Schema,
TAB _ NAME is the NAME of the target table,
COL _ NO is the target table field number,
COL _ NAME is the target field NAME,
COL _ COMMENT is annotated for the target field,
TYPE _ NAME is the target field TYPE,
COL _ LENGTH is the LENGTH of the target field,
COL _ SCALE is the target field precision,
IS _ PK IS whether the primary key IS or not,
source TAB NAME is the source table NAME,
source COL VALUE is the source table field name,
SOURCE ALIAS is the SOURCE table field ALIAS,
source _ TYPE is a source table field TYPE,
source LENGTH is the LENGTH of the source table field,
source _ SCALE is the source table field precision,
COL _ MAPPING is a field MAPPING relationship.
And finally, automatically reading the table 5 and the table 6 by the system, generating operation configuration, and establishing a corresponding data warehouse model table in the data warehouse.
The structure of each table in the above embodiments is merely an example, and in actual operation, the number of column data items is not necessarily only the items shown in each table in the above embodiments, and other item data may be provided.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto,
any changes or substitutions that may be easily made by those skilled in the art within the technical scope of the present disclosure are also intended to be covered by the scope of the present invention.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes described in the context of a single embodiment, or with reference to a single figure, for the purpose of streamlining the disclosure and aiding in the understanding of various aspects of the invention by those skilled in the art. However, the present invention should not be construed to include features of the exemplary embodiments all of which are essential features of the patent claims.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware, and the program may be stored in a computer readable storage medium. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
It should be understood that the devices, modules, units, components, etc. included in the system of one embodiment of the present invention may be adaptively changed to be provided in an apparatus or system different from that of the embodiment. The different devices, modules, units or components comprised by the system of an embodiment may be combined into one device, module, unit or component or may be divided into a plurality of sub-devices, sub-modules, sub-units or sub-components.
The means, modules, units or components in the embodiments of the present invention may be implemented in hardware, or may be implemented in software running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that embodiments in accordance with the present invention may be practiced using a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as a computer program product or computer-readable medium for performing a portion or all of the methods described herein.
Claims (10)
1. A data warehouse modeling method is characterized by comprising the dependence relationship of functions in tables, the main keys, the foreign keys and the foreign key relationship of the tables;
displaying the tables and the foreign key relation among the tables in a visual graph structure form to be used as a table-level knowledge graph; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;
determining a main table and an expected association table expected to be associated with the main table, running a minimum spanning tree algorithm based on a table-level knowledge map to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges from the main table to the expected association table;
acquiring an association mode and an association field relationship between a main table and an expected association table according to the optimal path, wherein the association mode is a foreign key relationship; simultaneously acquiring selectable fields in all the expected association tables;
removing fields which are not needed by data warehouse modeling from the optional fields, reserving the needed fields, and determining processing rules of the reserved fields;
and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule.
2. The data warehouse modeling method of claim 1, wherein after the table-level knowledge graph is established, the dimension values of the foreign key fields are calculated and stored in the attributes of the edges of the table-level knowledge graph corresponding to the foreign key relationships as the weight of the edges; the dimension value refers to the number of rows of the external key field after the duplication is removed;
and when a plurality of paths with the shortest path length are found by using the minimum spanning tree algorithm, selecting one path with the largest sum of the path weights as the optimal path.
3. The data warehouse modeling method of claim 1 or 2, wherein the process of obtaining the functional dependencies within the table is:
acquiring the table name of each table in the database and the field name in each table;
analyzing the characteristics of each field according to the value of the field in the table aiming at each table; calculating to obtain a function dependency relationship among fields in the table as a function dependency relationship in the table according to the table name, the field name and the field value;
the characteristics of the field comprise qualitative characteristics and quantitative characteristics; the qualitative characteristics include a data type of the field and the quantitative characteristics include a length of the field.
4. The data warehouse modeling method of claim 3, wherein the process of obtaining foreign key relationships is:
and aiming at each table, identifying the primary key of each table according to the function dependency relationship in the table, searching and determining the corresponding foreign key in other tables according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key.
5. The data warehouse modeling method of claim 4,
when searching and determining the foreign key, using a field matched with the data type of the primary key and the field length in other tables as the foreign key, wherein the field matched with the data type of the primary key and the field length means that the data type of the field is the same as the data type of the primary key, the minimum length of the field is greater than or equal to the minimum length of the primary key, and the maximum length of the field is less than or equal to the maximum length of the primary key.
6. The data warehouse modeling method of claim 1, wherein two tables are arbitrarily selected from the business master table and all tables desired to be associated, and using a shortest path algorithm, all optional shortest paths between the two tables are calculated, and the optional shortest paths are used to replace foreign key associated paths.
7. A data warehouse modeling system, comprising:
a processor; a database; and a memory in which a program is stored, a database storing tables,
wherein when the processor executes the program, the following operations are performed:
showing the tables and the foreign key relations among the tables in a visual graph structure form to be used as a table-level knowledge graph; the table-level knowledge graph comprises nodes and edges, each node represents a table, and each edge represents a foreign key relation;
determining a main table and an expected association table expected to be associated with the main table, running a minimum spanning tree algorithm on the basis of a table-level knowledge graph to obtain each path from the main table to each expected association table, and selecting the path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges from the main table to the expected association table;
acquiring an association mode and an association field relationship between a main table and an expected association table according to the optimal path, wherein the association mode is a foreign key relationship; simultaneously acquiring selectable fields in all the expected association tables;
removing fields which are not needed by data warehouse modeling from the optional fields, reserving the needed fields, and determining processing rules of the reserved fields;
and establishing a corresponding table serving as a model in the data warehouse according to the association mode, the association field relation, the field and the processing rule.
8. The data warehouse modeling system of claim 7, wherein after the table-level knowledge graph is established, the dimension values of the foreign key fields are calculated and stored in the attributes of the edges of the table-level knowledge graph corresponding to the foreign key relationships as the weight of the edges; the dimension value refers to the number of rows of the external key field after the duplication is removed;
and when a plurality of paths with the shortest path length are found by using the minimum spanning tree algorithm, selecting one path with the largest sum of the path weights as the optimal path.
9. The data warehouse modeling system of claim 7 or 8,
the process of obtaining the functional dependency relationship in the table is as follows:
acquiring the table name of each table in a database and the field name in each table;
analyzing the characteristics of each field according to the value of the field in the table aiming at each table; calculating to obtain a function dependency relationship among fields in the table as a function dependency relationship in the table according to the table name, the field name and the field value;
the characteristics of the fields include qualitative characteristics and quantitative characteristics; the qualitative characteristics include a data type of the field, and the quantitative characteristics include a length of the field;
the process of acquiring the foreign key relation comprises the following steps:
aiming at each table, identifying the primary key of each table according to the function dependency relationship in the table, searching and determining the corresponding foreign key in other tables according to the characteristics of the primary key, and forming a foreign key relationship between the primary key and the foreign key; when searching and determining the foreign key, using a field matched with the data type of the primary key and the field length in other tables as the foreign key, wherein the field matched with the data type of the primary key and the field length means that the data type of the field is the same as the data type of the primary key, the minimum length of the field is greater than or equal to the minimum length of the primary key, and the maximum length of the field is less than or equal to the maximum length of the primary key.
10. The data warehouse modeling system of claim 7, wherein two tables are arbitrarily selected from the business master table and all tables desired to be associated, a shortest path algorithm is used to calculate all optional shortest paths between the two tables, and the optional shortest paths are used to replace foreign key associated paths.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210749147.5A CN115328883B (en) | 2022-06-29 | 2022-06-29 | Data warehouse modeling method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210749147.5A CN115328883B (en) | 2022-06-29 | 2022-06-29 | Data warehouse modeling method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115328883A true CN115328883A (en) | 2022-11-11 |
CN115328883B CN115328883B (en) | 2024-06-18 |
Family
ID=83917927
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210749147.5A Active CN115328883B (en) | 2022-06-29 | 2022-06-29 | Data warehouse modeling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115328883B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115712691A (en) * | 2022-11-17 | 2023-02-24 | 创新奇智(重庆)科技有限公司 | Data relation processing method and system |
CN117539869A (en) * | 2024-01-08 | 2024-02-09 | 北京睿企信息科技有限公司 | Data processing system for acquiring data table |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101286151A (en) * | 2007-04-13 | 2008-10-15 | 国际商业机器公司 | Method for establishing multidimensional model and data store mode mappings and relevant system |
CN111460047A (en) * | 2020-03-09 | 2020-07-28 | 平安科技(深圳)有限公司 | Method, device and equipment for constructing characteristics based on entity relationship and storage medium |
CN111581393A (en) * | 2020-04-28 | 2020-08-25 | 国家电网有限公司客户服务中心 | Construction method of knowledge graph based on customer service data in power industry |
WO2022037624A1 (en) * | 2020-08-19 | 2022-02-24 | 第四范式(北京)技术有限公司 | Method and apparatus for determining association relationship between data tables, and device |
CN114385764A (en) * | 2021-08-25 | 2022-04-22 | 广东粤财金融云科技股份有限公司 | Reverse analysis method and device for relational database table structure |
CN114595294A (en) * | 2022-03-11 | 2022-06-07 | 北京梦诚科技有限公司 | Data warehouse modeling and extracting method and system |
CN116226082A (en) * | 2022-12-29 | 2023-06-06 | 金篆信科有限责任公司 | Database model generation method and device, storage medium and electronic equipment |
-
2022
- 2022-06-29 CN CN202210749147.5A patent/CN115328883B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101286151A (en) * | 2007-04-13 | 2008-10-15 | 国际商业机器公司 | Method for establishing multidimensional model and data store mode mappings and relevant system |
CN111460047A (en) * | 2020-03-09 | 2020-07-28 | 平安科技(深圳)有限公司 | Method, device and equipment for constructing characteristics based on entity relationship and storage medium |
CN111581393A (en) * | 2020-04-28 | 2020-08-25 | 国家电网有限公司客户服务中心 | Construction method of knowledge graph based on customer service data in power industry |
WO2022037624A1 (en) * | 2020-08-19 | 2022-02-24 | 第四范式(北京)技术有限公司 | Method and apparatus for determining association relationship between data tables, and device |
CN114385764A (en) * | 2021-08-25 | 2022-04-22 | 广东粤财金融云科技股份有限公司 | Reverse analysis method and device for relational database table structure |
CN114595294A (en) * | 2022-03-11 | 2022-06-07 | 北京梦诚科技有限公司 | Data warehouse modeling and extracting method and system |
CN116226082A (en) * | 2022-12-29 | 2023-06-06 | 金篆信科有限责任公司 | Database model generation method and device, storage medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
张颖琛: "空间路径聚类算法的建模与研究", 中国优秀硕士学位论文全文数据库 信息科技辑, no. 6, 15 June 2012 (2012-06-15) * |
机器学习简明教程: "知识图谱构建之二:从结构化数据到知识图谱", Retrieved from the Internet <URL:www.bing.com> * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115712691A (en) * | 2022-11-17 | 2023-02-24 | 创新奇智(重庆)科技有限公司 | Data relation processing method and system |
CN117539869A (en) * | 2024-01-08 | 2024-02-09 | 北京睿企信息科技有限公司 | Data processing system for acquiring data table |
CN117539869B (en) * | 2024-01-08 | 2024-03-15 | 北京睿企信息科技有限公司 | Data processing system for acquiring data table |
Also Published As
Publication number | Publication date |
---|---|
CN115328883B (en) | 2024-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210334248A1 (en) | Generating and reusing transformations for evolving schema mapping | |
US11651036B2 (en) | Integrated index blocks and searching in blockchain systems | |
US9916350B2 (en) | Automated creation of join graphs for unrelated data sets among relational databases | |
US8332366B2 (en) | System and method for automatic weight generation for probabilistic matching | |
US7606817B2 (en) | Primenet data management system | |
US9507875B2 (en) | Symbolic hyper-graph database | |
JP4397978B2 (en) | Binding ordering method using concentration | |
US7797307B2 (en) | Query generation method for queries for inline views for aggregation referring to identified subgraphs | |
US20030225761A1 (en) | System for managing and searching links | |
US9934289B2 (en) | Fuzzy full text search | |
WO2016029230A1 (en) | Automated creation of join graphs for unrelated data sets among relational databases | |
CN115328883A (en) | Data warehouse modeling method and system | |
CN115292508A (en) | Knowledge graph construction method and system based on table data | |
US11853400B2 (en) | Distributed machine learning engine | |
CN116028678A (en) | Method and system for searching full-quantity path in knowledge graph | |
CN111428095A (en) | Graph data quality verification method and graph data quality verification device | |
US11550792B2 (en) | Systems and methods for joining datasets | |
US8423523B2 (en) | Apparatus and method for utilizing context to resolve ambiguous queries | |
WO2020008180A1 (en) | Method and system for integrating data sets | |
CN113553477A (en) | Graph splitting method and device | |
CN115292297B (en) | Method and system for constructing data quality monitoring rule of data warehouse | |
WO2019030405A1 (en) | Systems and methods for compiling a database | |
Xia et al. | A novel data schema integration framework for the human-centric services in smart city | |
US20220342879A1 (en) | Data searching system, device, method and program | |
CN117009588A (en) | Three-dimensional data association retrieval method based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |