CN114091426A - Method and device for processing field data in data warehouse - Google Patents
Method and device for processing field data in data warehouse
- Publication number
- CN114091426A (application number CN202011118899.9A)
- Authority
- CN
- China
- Prior art keywords
- field
- processed
- data
- similarity
- fields
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for processing field data in a data warehouse, and relates to the technical field of computers. One embodiment of the method comprises: acquiring characteristic data of a field to be processed in a data warehouse; acquiring one or more standard fields, wherein the standard fields have characteristic data; calculating the similarity between the field to be processed and one or more standard fields according to the characteristic data of the field to be processed and the characteristic data of the standard fields; and according to the calculated similarity, determining a target field corresponding to the field to be processed from the one or more standard fields so as to associate the field to be processed by using the target field. The method can dig out the mapping relation between each field in the data warehouse and the standard field, reduces the maintenance cost, improves the development efficiency and is beneficial to code normalization.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for processing field data in a data warehouse.
Background
A data warehouse, abbreviated DW or DWH, is a data storage collection containing different storage structures for various types of data. In fields such as e-commerce, banking and finance, as a data warehouse holds more and more data, its fields are not named uniformly: for example, an increasing number of differently named fields with the same meaning coexist in the warehouse, which reduces development efficiency and increases maintenance cost.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for processing field data in a data warehouse, which can dig out a mapping relationship between each field in the data warehouse and a standard field, reduce maintenance cost, improve development efficiency, and facilitate code normalization.
To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method of processing field data in a data warehouse.
The method for processing the field data in the data warehouse comprises the following steps: acquiring characteristic data of a field to be processed in a data warehouse; acquiring one or more standard fields, wherein the standard fields have characteristic data; according to the feature data of the field to be processed and the feature data of the standard field, calculating the similarity between the field to be processed and the one or more standard fields respectively; and according to the calculated similarity, determining a target field corresponding to the field to be processed from the one or more standard fields so as to associate the field to be processed by using the target field.
Optionally, the step of obtaining feature data of the field to be processed in the data warehouse includes: acquiring an execution statement of the data warehouse; and analyzing the execution statement, and determining the field to be processed and the characteristic data of the field to be processed.
Optionally, the step of analyzing the execution statement and determining the field to be processed and the feature data of the field to be processed includes: analyzing the execution statement and determining the associated task corresponding to the field to be processed; and determining the associated field of the field to be processed according to the associated task, and taking the associated field as the field to be processed.
Optionally, the step of analyzing the execution statement and determining the associated task corresponding to the field to be processed includes: analyzing the execution statement by using an abstract syntax tree, and determining an associated task corresponding to the field to be processed; and the associated task comprises a parent task corresponding to the field to be processed.
Optionally, the step of obtaining one or more standard fields includes: acquiring the category information of the fields to be processed in the data warehouse; and acquiring one or more standard fields corresponding to the fields to be processed according to the category information.
Optionally, the feature data includes a field name, a field type, and a field description.
Optionally, the step of calculating the similarity between the field to be processed and the one or more standard fields respectively includes: calculating first similarity between the field name and the field type of the field to be processed and the field name and the field type of the one or more standard fields respectively by adopting a keyword query model; calculating second similarity between the field description of the field to be processed and the field description of the one or more standard fields by adopting a semantic analysis model; and according to the first similarity and the second similarity, calculating the similarity between the field to be processed and the one or more standard fields respectively.
Optionally, the step of calculating the similarity between the field to be processed and the one or more standard fields respectively includes: representing the field name and the field type as vectors by using a TF-IDF model, so as to calculate a first similarity between the field name and field type of the field to be processed and the field name and field type of each of the one or more standard fields; representing the field description as a vector by using a BERT model, so as to calculate a second similarity between the field description of the field to be processed and the field description of each of the one or more standard fields; and calculating the similarity between the field to be processed and each of the one or more standard fields according to the first similarity and the second similarity.
Optionally, the step of associating the field to be processed with the target field includes: replacing the field to be processed with the target field; and, when a new data table is created in the data warehouse, representing the fields of the new data table using the one or more standard fields.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided an apparatus for processing field data in a data warehouse.
The device for processing field data in the data warehouse of the embodiment of the invention comprises: the field to be processed acquisition module is used for acquiring the characteristic data of the field to be processed in the data warehouse; the standard field acquisition module is used for acquiring one or more standard fields, and the standard fields have characteristic data; the similarity determining module is used for calculating the similarity between the field to be processed and the one or more standard fields according to the characteristic data of the field to be processed and the characteristic data of the standard fields; and the target field acquisition module is used for determining a target field corresponding to the field to be processed from the one or more standard fields according to the calculated similarity so as to associate the field to be processed by using the target field.
Optionally, the to-be-processed field obtaining module is further configured to: acquiring an execution statement of the data warehouse; and analyzing the execution statement, and determining the field to be processed and the characteristic data of the field to be processed.
Optionally, the to-be-processed field obtaining module is further configured to: analyzing the execution statement and determining the associated task corresponding to the field to be processed; and determining the associated field of the field to be processed according to the associated task, and taking the associated field as the field to be processed.
Optionally, the to-be-processed field obtaining module is further configured to: analyzing the execution statement by using an abstract syntax tree, and determining an associated task corresponding to the field to be processed; and the associated task comprises a parent task corresponding to the field to be processed.
Optionally, the standard field obtaining module is further configured to: acquiring the category information of the fields to be processed in the data warehouse; and acquiring one or more standard fields corresponding to the fields to be processed according to the category information.
Optionally, the feature data includes a field name, a field type, and a field description.
Optionally, the similarity determination module is further configured to: calculating first similarity between the field name and the field type of the field to be processed and the field name and the field type of the one or more standard fields respectively by adopting a keyword query model; calculating second similarity between the field description of the field to be processed and the field description of the one or more standard fields by adopting a semantic analysis model; and according to the first similarity and the second similarity, calculating the similarity between the field to be processed and the one or more standard fields respectively.
Optionally, the similarity determination module is further configured to: represent the field name and the field type as vectors by using a TF-IDF model, so as to calculate a first similarity between the field name and field type of the field to be processed and the field name and field type of each of the one or more standard fields; represent the field description as a vector by using a BERT model, so as to calculate a second similarity between the field description of the field to be processed and the field description of each of the one or more standard fields; and calculate the similarity between the field to be processed and each of the one or more standard fields according to the first similarity and the second similarity.
Optionally, the target field obtaining module is further configured to: replace the field to be processed with the target field; and, when a new data table is created in the data warehouse, represent the fields of the new data table using the one or more standard fields.
To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.
An electronic device of an embodiment of the present invention includes: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by one or more processors, the one or more processors implement the method for processing field data in the data warehouse of the embodiment of the invention.
To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided a computer-readable medium.
A computer-readable medium of an embodiment of the present invention has a computer program stored thereon, and when executed by a processor, the computer program implements a method of processing field data in a data warehouse of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: by calculating the similarity between the field to be processed and one or more standard fields corresponding to the field to be processed and determining the target field corresponding to the field to be processed from the one or more standard fields, the mapping relation between each field in the data warehouse and the standard fields can be mined, the maintenance cost is reduced, the development efficiency is improved, and the code normalization is facilitated. In addition, the category information of the field to be processed can be obtained, the similarity between the field to be processed and one or more standard fields corresponding to the category information is calculated, then the target field corresponding to the field to be processed is determined, the target field can be determined by calculating the field similarity under each category, and for application scenes with large data magnitude in a data warehouse, an effective method for standardizing the field is provided, the calculated amount is greatly reduced, and the efficiency of standardizing the field is improved.
Further effects of the above non-conventional optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of processing field data in a data warehouse, according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main flow of a method of processing field data in a data warehouse, according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of obtaining characterization data for fields to be processed in a data warehouse, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a method of processing field data in a data warehouse, according to an embodiment of the invention;
FIG. 5 is a diagram of the syntax rules of SelectStatement in Hive SQL;
FIG. 6 is a schematic diagram of determining similarity between fields according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the major modules of an apparatus for processing field data in a data warehouse, according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of the main flow of a method for processing field data in a data warehouse according to an embodiment of the present invention. As shown in FIG. 1, the method mainly includes:
Step S101: acquiring feature data of a field to be processed in a data warehouse;
Step S102: acquiring one or more standard fields, wherein the standard fields have feature data;
Step S103: calculating the similarity between the field to be processed and each of the one or more standard fields according to the feature data of the field to be processed and the feature data of the standard fields;
Step S104: according to the calculated similarity, determining a target field corresponding to the field to be processed from the one or more standard fields, so as to associate the field to be processed by using the target field.
The feature data of the field to be processed refers to the various characteristics of the field, such as the field name (i.e., the identifier of the field), the field description, and the field type. In the embodiment of the invention, the category information of the field to be processed in the data warehouse can be obtained first, and one or more standard fields corresponding to the field to be processed are then obtained according to that category information. The category information refers to the category to which the field belongs and may include multiple levels of categories whose scope narrows from one level to the next. For example, the field afs_ord_id has the after-sale subject as its first-level category, delivery service as its second-level category, and the Beijing area as its third-level category. Each category has one or more corresponding standard fields together with the feature data of those standard fields. When the category information includes multiple levels, standard fields may be defined at whichever level is needed. In the above example, one or more standard fields may be defined for the after-sale subject; then, for every field to be processed whose first-level category is the after-sale subject, the similarity between the field and the standard fields of the after-sale subject is calculated. Alternatively, a set of standard fields may be defined for after-sale subject/delivery service; then, for fields to be processed whose first-level category is the after-sale subject and whose second-level category is delivery service, the similarity between the field and the standard fields of after-sale subject/delivery service is calculated.
After the similarity between the field to be processed and each of the one or more standard fields is calculated, the standard field with the highest similarity value is determined as the target field of the field to be processed.
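The category lookup and the maximum-similarity selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the registry contents, the standard field names, the function names, the fallback from a deeper category level to a shallower one, and the toy `token_overlap` score are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical registry: category path -> standard field names (illustrative only).
STANDARD_FIELDS: Dict[Tuple[str, ...], List[str]] = {
    ("after-sale",): ["after_sale_order_id", "after_sale_apply_time"],
    ("after-sale", "delivery"): ["delivery_order_id", "delivery_time"],
}


def candidates_for(category: Tuple[str, ...]) -> List[str]:
    """Return the standard fields of the most specific configured category level.

    Assumption: if no standard fields are defined for the full category path,
    fall back to the next shallower level.
    """
    for depth in range(len(category), 0, -1):
        fields = STANDARD_FIELDS.get(category[:depth])
        if fields:
            return fields
    return []


def token_overlap(a: str, b: str) -> float:
    """Toy similarity (Jaccard overlap of underscore-separated tokens).

    A stand-in only; any similarity function can be plugged in instead.
    """
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def pick_target(
    field: str,
    category: Tuple[str, ...],
    similarity: Callable[[str, str], float],
) -> Optional[str]:
    """Pick the candidate standard field with the highest similarity value."""
    candidates = candidates_for(category)
    if not candidates:
        return None
    return max(candidates, key=lambda std: similarity(field, std))


target = pick_target("afs_ord_id", ("after-sale", "delivery", "Beijing"), token_overlap)
```

Because no standard fields are configured for the third-level category in this sketch, the lookup falls back to the after-sale/delivery level and compares the field only with the standard fields defined there.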
Further, the category information may also include the name of the method that calls the execution statement, the name of the class to which the calling method belongs, the name of the calling interface, and the like. For example, in the embodiment of the present invention, one or more standard fields may be configured for a method name that calls an execution statement, so that, when a field to be processed calls that method, a target field may be selected from the one or more standard fields configured for the method name.
FIG. 2 is a schematic diagram of the main flow of a method for processing field data in a data warehouse according to another embodiment of the present invention. As shown in FIG. 2, the method mainly includes:
Step S201: acquiring feature data and category information of the field to be processed in the data warehouse;
Step S202: acquiring one or more standard fields and the feature data of the standard fields according to the category information;
Step S203: calculating the similarity between the field to be processed and each of the one or more standard fields according to the feature data of the field to be processed and the feature data of the one or more standard fields;
Step S204: determining a target field corresponding to the field to be processed from the one or more standard fields according to the calculated similarity.
In the method shown in steps S201 to S204, after the category information of the field to be processed is obtained, the similarity is calculated only between the field to be processed and the one or more standard fields corresponding to that category information. A target field corresponding to the field to be processed is then determined from those standard fields according to the similarity, so the target field can be determined by calculating field similarity only within each category. For application scenarios in which the data warehouse holds a large volume of data, this provides an effective field standardization method that greatly reduces the amount of calculation and improves the efficiency of field standardization.
FIG. 3 is a schematic diagram of acquiring feature data of a field to be processed in a data warehouse according to an embodiment of the present invention. As shown in FIG. 3, acquiring the feature data of a field to be processed in a data warehouse includes:
Step S301: Obtain an execution statement of the data warehouse. Step S301 can mainly be implemented in two ways: obtaining the execution statement (SQL statement) through a script interface, or obtaining it from the execution logs of historical tasks.
Step S302: Analyze the execution statement and determine the feature data of the field to be processed. Step S302 mainly includes:
Step S3021: Analyze the execution statement and determine the associated task corresponding to the field to be processed. In this step, an abstract syntax tree is used to perform lexical and syntactic analysis of the execution statement and to determine the associated task corresponding to the field to be processed. The associated task includes a parent task corresponding to the field to be processed. It should be noted that the parent task in the embodiment of the present invention includes not only the directly depended-on upstream task, but also the upstream task of that task (i.e., the parent task of the parent task), the upstream task of that one in turn, and so on up to the ancestor task.
Step S3022: Determine the associated field of the field to be processed according to the associated task. The associated field includes the task name of the parent task and its processed atomic field, and the associated field can be used as a field to be processed. Preferably, when the execution statement is analyzed to determine the associated tasks, the abstract syntax tree is used for the lexical and syntactic analysis; the associated tasks include the parent tasks corresponding to the field to be processed, and the associated fields include the task names and processed atomic fields of those parent tasks.
In the embodiment of the invention, a parent task is an upstream computing task on which the calculation of a field in a table depends; an ancestor task is the most upstream computing task on which the calculation of a field in the table depends. The processed atomic field name is the name that the field has in the most upstream task. The field name may include a field alias: in an SQL statement, a field can be given another name through the AS keyword, in which case the two names denote the same field and carry the same business meaning. For example, the order table adm_m04_ord contains fields such as ord_id and gmv. Calculating the gmv field requires joining the adm_m04_sku table with the adm_m04_gene table, so the adm_m04_sku and adm_m04_gene tables are both parent tasks of adm_m04_ord. The parent tasks of the adm_m04_sku and adm_m04_gene tables are in turn upstream tasks of adm_m04_ord; if those upstream tasks are the most upstream tasks, they are ancestor tasks, and the corresponding fields of their tables are atomic fields. In the embodiment of the invention, the abstract syntax tree is used to analyze and extract the upstream lineage features of fields, which strengthens the basis for calculating the similarity between fields.
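Once the direct dependencies of each task have been extracted from the execution statements, collecting the parent tasks "up to the ancestor task" amounts to a traversal of the dependency graph. The sketch below assumes the dependencies are already available as a dictionary; the table names adm_m04_ord, adm_m04_sku and adm_m04_gene follow the example above, while `src_sku` and `src_gene` are hypothetical most-upstream tables added for illustration. Parsing the SQL into an abstract syntax tree is not shown.

```python
from typing import Dict, List, Set

# Direct (parent) dependencies per task. "src_sku" and "src_gene" are
# hypothetical most-upstream tables, not taken from the source text.
PARENTS: Dict[str, List[str]] = {
    "adm_m04_ord": ["adm_m04_sku", "adm_m04_gene"],
    "adm_m04_sku": ["src_sku"],
    "adm_m04_gene": ["src_gene"],
}


def upstream_tasks(task: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every upstream task: parents, parents of parents, and so on."""
    seen: Set[str] = set()
    stack = list(parents.get(task, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cyclic input
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def ancestor_tasks(task: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Keep only the most upstream tasks, i.e. those with no parents of their own."""
    return {t for t in upstream_tasks(task, parents) if not parents.get(t)}


all_upstream = upstream_tasks("adm_m04_ord", PARENTS)
ancestors = ancestor_tasks("adm_m04_ord", PARENTS)
```

In this sketch the fields of the tables returned by `ancestor_tasks` would play the role of the atomic fields.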
In another embodiment of the invention, the characteristic data and the category information of the field to be processed in the data warehouse are obtained; acquiring one or more standard fields and the characteristic data of the standard fields according to the category information; calculating the similarity between the field to be processed and one or more standard fields according to the characteristic data of the field to be processed and the characteristic data of the one or more standard fields; and determining a target field corresponding to the field to be processed from the one or more standard fields according to the calculated similarity. According to the method, the similarity between the field to be processed and the standard field is calculated according to the category information, and an effective field standardization method is provided for an application scene with a large data magnitude in a data warehouse, so that the calculation amount is greatly reduced, and the field standardization efficiency is improved.
Before the similarity between the field to be processed and one or more standard fields is calculated, index directory information of the field to be processed and of the standard fields can be obtained, where the index directory information includes at least the category information. Indexes of the fields to be processed and the standard fields are established according to the index directory information, and the feature data of the field to be processed and of the one or more standard fields are obtained through the established indexes. In a relational database, an index is a separate physical storage structure that sorts the values of one or more columns of a table; it consists of the values of those columns together with a list of logical pointers to the data pages that physically hold those values. An index is comparable to the table of contents of a book, which lets the reader quickly find the required content by page number. The index provides pointers to the data values stored in the specified columns of the table and orders those pointers according to the specified sort order. The database uses the index to find a particular value and then follows the pointer to the row containing that value. This allows SQL statements on the table to execute faster and specific information in the table to be accessed quickly.
In the embodiment of the invention, since fields accumulate in the enterprise data warehouse year after year, the fields need to be preliminarily divided by subject and business domain in the data warehouse before the similarity is calculated, and an index of the fields is established. Because similar fields necessarily belong to the same subject and business domain, each field only needs to be compared with the fields of the same subject and business domain, which avoids a large amount of unnecessary calculation and accelerates matching.
In another embodiment of the present invention, in the process of calculating the similarity between each field to be processed and one or more standard fields, each feature in the feature data is classified first, and a corresponding calculation model is determined according to the classification result. Then, a vector representation of the field to be processed and the one or more standard fields is determined according to the determined calculation model. And calculating the similarity of the field to be processed and one or more standard fields respectively according to the vector representation of the field to be processed and the one or more standard fields.
Preferably, in the process of classifying each feature in the feature data and determining the corresponding calculation model according to the classification result, the field description in the feature data is determined to be a meaningful feature, and the features other than the field description are determined to be non-meaningful features. The characteristic data may include the field name, the field type and the field description, so a keyword query model can be used to determine the vector representation of the field name and the field type, and a semantic analysis model can be used to determine the vector representation of the field description. Specifically, a keyword query model can be adopted to calculate a first similarity between the field name and field type of the field to be processed and the field name and field type of each of the one or more standard fields; a semantic analysis model can be adopted to calculate a second similarity between the field description of the field to be processed and the field description of each of the one or more standard fields; and the similarity between the field to be processed and each of the one or more standard fields is calculated according to the first similarity and the second similarity.
The keyword query model may select a tfidf model, that is, vector representation is performed on the field name and the field type by using the tfidf model, so as to calculate first similarities between the field name and the field type of the field to be processed and the field name and the field type of one or more standard fields, respectively. the tfidf model (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and data mining. tf is Term Frequency (Term Frequency) and idf is Inverse text Frequency index (Inverse Document Frequency). The semantic analysis model may select a bert model, that is, vector-represent the field descriptions by using the bert model to calculate a second similarity between the field descriptions of the field to be processed and the field descriptions of the one or more standard fields, respectively. The bert model is a very powerful generic pre-training language representation model proposed by Google. The current methods for obtaining pre-training representation models are mainly Feature-based (Feature-based) methods or Fine-tuning (Fine-tuning) methods.
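As a rough illustration of the first similarity, the following sketch builds tfidf vectors for field names and field types and compares them by cosine similarity. It is a minimal sketch under stated assumptions, not the implementation of the embodiment: tokens are taken to be single letters and digits, the idf is the smoothed variant log((1 + N) / (1 + df)) + 1, and the function names and the third sample field are hypothetical.

```python
import math
from collections import Counter

def char_tokens(name, ftype):
    """Split a field name and type into lowercase letters/digits (assumed tokenization)."""
    return [c for c in (name + ftype).lower() if c.isalnum()]

def tfidf_vectors(docs):
    """Return one {token: weight} dict per token list, using a smoothed idf (assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({tok: (cnt / len(doc)) * idf[tok] for tok, cnt in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

fields = [("brand_code", "string"), ("barnd_code", "string"), ("item_id", "int")]
vecs = tfidf_vectors([char_tokens(n, t) for n, t in fields])
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because a bag of letters ignores order, a transposed spelling such as barnd_code with the same type yields the same vector as brand_code under this tokenization, so the pair scores close to 1.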
In the embodiment of the present invention, there are many ways of calculating similarity for different application scenarios. Features having no practical meaning may include the field names, field types, model names and the like in the data warehouse, which are usually in the form of English characters joined by underscores; features having practical meaning may include the field descriptions of the field and of the corresponding fields in the parent tasks on which the field depends. Through sample tests, the embodiment of the invention uses tfidf for the English characters without specific meaning; for the meaningful Chinese field descriptions, after noise characters are removed, the bert model gives the best calculation effect; finally, the two calculation results are averaged and fused to obtain the final result. In addition, the first similarity between the field to be processed and one or more standard fields can be calculated by adopting a TextRank model, a semantic-based statistical language model or other keyword query models, the second similarity between the field to be processed and one or more standard fields can be calculated by adopting Word2Vec or other semantic analysis models, and the similarity between the field to be processed and one or more standard fields can be calculated according to the first similarity and the second similarity.
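Since the keyword model and the semantic model are interchangeable, the averaging fusion can be sketched with the two models passed in as functions. This is a hedged sketch: the dictionary layout of a field and all function names are assumptions, and the two stand-in scorers (a `difflib` ratio and a word-overlap ratio) merely take the place of tfidf and bert so that the example runs without external models.

```python
from difflib import SequenceMatcher

def keyword_stand_in(a, b):
    """Stand-in for a keyword query model such as tfidf (not the real model)."""
    return SequenceMatcher(None, a, b).ratio()

def description_stand_in(a, b):
    """Stand-in for a semantic model such as bert: word-overlap (Jaccard) ratio."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def fused_similarity(candidate, standard, keyword_sim, semantic_sim):
    """Average the first (name + type) and second (description) similarity."""
    first = keyword_sim(
        f"{candidate['name']} {candidate['type']}",
        f"{standard['name']} {standard['type']}",
    )
    second = semantic_sim(candidate["description"], standard["description"])
    return (first + second) / 2.0

standard = {"name": "brand_code", "type": "string", "description": "Brand code"}
candidate = {"name": "brand_cd", "type": "string", "description": "Brand cd"}
print(fused_similarity(candidate, standard, keyword_stand_in, description_stand_in))
```

Passing the scorers as parameters keeps the fusion step unchanged when a different keyword or semantic model is substituted.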
In another embodiment of the invention, after the target field of the field to be processed is determined from the plurality of standard fields according to the calculated similarity, the field to be processed in the data warehouse is replaced by the target field. And in the case of creating a new data table in the data warehouse, one or more standard fields are used to represent fields in the created new data table, that is, if a new data table is created in the data warehouse, the new data table can be created directly using the standard fields, that is, the fields in the created new data table are guaranteed to be the standard fields. Therefore, in the embodiment of the invention, the target field corresponding to the field to be processed is determined, and the target field is associated with the field to be processed, so that the batch replacement of old scripts in a data warehouse is realized, and the effect of uniform specification and constraint in the process of creating a new data table can be achieved.
Fig. 4 is a schematic diagram of a method for processing field data in a data warehouse according to an embodiment of the present invention, fig. 5 is a schematic diagram of a syntax rule of SelectStatement in Hive SQL, and fig. 6 is a schematic diagram of determining similarity between fields according to an embodiment of the present invention.
For an existing data warehouse, different developers have different habits, so fields with the same business meaning are written in multiple ways and represented by different fields. This ambiguity confuses other developers and downstream users such as data analysts, makes the data warehouse disorderly, and brings great inconvenience to data warehouse managers and data governance. A technical means of uniformly standardizing the fields is therefore urgently needed.
At present, the industry basically has only specifications and no means of constraint and unification, so neither old fields nor newly generated fields can be normalized, and ambiguous fields and synonymous fields in the data warehouse keep increasing. An ambiguous field is one field that has multiple business meanings; for example, item_id may indicate either the commodity code or the main commodity code. Synonymous fields are multiple spellings of fields for the same business meaning; for example, brand_id and barnd_code both indicate the business meaning of brand code. The current industry specification specifically refers to the naming specification of fields in a data warehouse, for example: 1. The field name in the data warehouse must start with a letter, and words or abbreviations with characteristic meanings cannot be included by double quotation marks. 2. Field primary key name: the prefix is PK_. The primary key name shall be a field name consisting of prefix + table name +. If the composite primary key has many fields, only the first field is included. The table name may have its prefix removed. 3. Field foreign key name: the prefix is FK_. The foreign key name is a field name formed by a prefix, a foreign key table name, a primary key table name and a foreign key table. The table name may have its prefix removed.
As shown in fig. 4, a method for processing field data in a data warehouse according to an embodiment of the present invention includes:
Step S401: acquire the sql scripts of the data warehouse. An sql script in the data warehouse is an sql statement script that is deployed online and executed periodically to run sql statements on the data in the data warehouse; the sql script contains the fields of the data model, so the required fields can be extracted from it. Because the sql scripts are executed periodically, each online big data system stores the sql task scripts uploaded by developers on the platform, and the scripts can be acquired through the interface of the big data system. In the data warehouse field, the data model is a two-dimensional table model abstracted according to business requirements.
In the data acquisition process of this step, the valid historical stock sql scripts in the data warehouse can be acquired in two ways: through a script interface, or from the execution logs of historical tasks.
Step S402: parse the acquired sql scripts. All historical sql scripts are parsed, and the extracted data features are divided into two parts: (1) the name, type and description of each field; (2) the model names of the parent tasks and ancestor tasks corresponding to each field, and the corresponding processing atomic field names, field types and field descriptions. In the embodiment of the present invention, the features in part (1) are obtained from the metadata of the data warehouse model, from which information such as the names, types and descriptions of all fields can be extracted. The features in part (2) are obtained by parsing the sql scripts through an abstract syntax tree (AST): the model names of the parent tasks and ancestor tasks corresponding to each field, the corresponding processing atomic fields and the aliases of each field are obtained by data relationship analysis. The specific process is as follows:
a. using Antlr to realize lexical and syntactic analysis of SQL;
b. after lexical and syntactic parsing, the abstract syntax tree (AST) support of Antlr is used to convert the input statement into an abstract syntax tree while parsing. Fig. 5 shows the syntax rule of SelectStatement in Hive SQL (for example only; no confidential information is disclosed), from which it can be seen that SelectStatement includes clauses such as select, from, where, groupby, having and orderby. In the embodiment of the present invention, the dependent data source (the source table queried by the sql statement) can be obtained from the from clause; the upstream dependent model fields and the corresponding aliases can be obtained from the select clause (that is, the fields of the queried table and their aliases are obtained through the select keyword); and a small number of aliases can also be obtained from other clauses (for example, through keywords such as case when, since a field name is usually assigned after a case when expression) and added to the subsequent similarity calculation. The abstract syntax tree (AST) is an abstract representation of the syntax structure of the source code. It represents the syntactic structure of the programming language in the form of a tree, each node of which represents a structure in the source code. The syntax is said to be "abstract" because it does not represent every detail that appears in the real syntax. For example, nested brackets are implicit in the structure of the tree and are not present in the form of nodes, whereas a conditional jump statement such as if-condition-then may be represented by a node with two branches.
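The two properties of an abstract syntax tree mentioned above can be observed directly. The sketch below uses Python's built-in `ast` module on Python source rather than Antlr on Hive SQL, which is a substitution made only so that the example is self-contained; the principle is the same.

```python
import ast

# Brackets only shape the tree: the multiplication node has the addition as its left child.
root = ast.parse("(a + b) * c", mode="eval").body
print(type(root.op).__name__)       # Mult
print(type(root.left.op).__name__)  # Add

# Redundant brackets leave no trace, so both spellings produce the same tree.
same = ast.dump(ast.parse("((a + b)) * c", mode="eval")) == ast.dump(
    ast.parse("(a + b) * c", mode="eval")
)
print(same)                         # True

# A conditional statement becomes one node holding a test and two branches.
stmt = ast.parse("if cond:\n    x = 1\nelse:\n    x = 2").body[0]
print(type(stmt).__name__, len(stmt.body), len(stmt.orelse))  # If 1 1
```

An SQL parser generated by Antlr exposes the clauses of a select statement in a comparable way, as child nodes of the statement node, which is what makes the from and select clauses easy to locate.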
Step S403: establish an index based on the business themes of the data warehouse. This step can be realized in two ways. First, fields of the same theme and the same business domain are placed in the same file and one very large index file is created, which is cumbersome and memory-consuming. Second, an index json file is maintained and organized as a multi-level directory, in which the first-level directory is the theme, the second-level directory is the business domain, the third-level directory may be further divided by a finer granularity such as business scene, and the corresponding field is finally used as the value of a json leaf node. For example, in {after-sale: {delivery: {field: afs_ord_id}}}, the first-level directory is the after-sale theme, the second-level directory is the delivery business domain, and the final leaf node is the field afs_ord_id (after-sale order number). Using this json structure as an index accelerates subsequent operations such as field lookup and calculation.
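The second way, the multi-level json index, might look like the following sketch. The function names are hypothetical, the leaf is stored as a list so that several fields can share one business domain (an assumption; the example above shows a single value), and the fields afs_ord_no and ord_id and the trade/order directory are made-up sample data.

```python
import json

def build_index(fields):
    """Build {theme: {business_domain: {"field": [names]}}} from (theme, domain, name) triples."""
    index = {}
    for theme, domain, name in fields:
        index.setdefault(theme, {}).setdefault(domain, {}).setdefault("field", []).append(name)
    return index

def candidates(index, theme, domain):
    """Return only the fields filed under the same theme and business domain."""
    return index.get(theme, {}).get(domain, {}).get("field", [])

index = build_index([
    ("after-sale", "delivery", "afs_ord_id"),
    ("after-sale", "delivery", "afs_ord_no"),  # hypothetical sample field
    ("trade", "order", "ord_id"),              # hypothetical sample field
])
print(json.dumps(index, indent=2))
print(candidates(index, "after-sale", "delivery"))
```

Restricting the comparison to `candidates(...)` is what keeps the later similarity calculation within one theme and business domain.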
Step S404: determine the similarity between fields. According to the maintained index, the calculation is performed only within the same theme and business domain; otherwise the amount of calculation would increase greatly. The standard fields for all fields in the data warehouse need to be defined before the similarity is calculated. The standard fields also need to be indexed, which speeds up the manual selection of standard fields, and the standardization is performed under the same index: after the similarity is calculated, one of several similar fields is manually selected as the standard for those fields. For example, brand_id and barnd_code both represent the brand code, and one field name, such as brand_id, is selected from them as the standard field. Since the standard may contain the specific business meaning and the writing specification of the field, in the embodiment of the present invention the standard fields generally need to be maintained by dedicated business personnel and may need to be updated continuously afterwards.
There are many ways of calculating similarity for different application scenarios. Features without practical meaning include the field names, field types, aliases, model names and the like in the data warehouse, which are usually in the form of English characters joined by underscores; features with practical meaning include the field descriptions of the field and of the corresponding fields in the parent tasks on which the field depends. Through sample tests, in the embodiment of the invention the tfidf model is used for the English characters without specific meaning; for the meaningful Chinese field descriptions, after noise characters are removed, the bert model gives the best calculation effect; finally, the two calculation results are averaged and fused to obtain the final result. Specifically, as shown in fig. 6, the technical solution is further illustrated by a specific example:
Step S4041: extract the unified standard fields standard_data. In the following table, brand_code is the field name (i.e. the field representation), and the characteristic data of the field are: field type, field description, dependent upstream model name (i.e., parent task name), dependent upstream model field name (i.e., associated field name), dependent upstream model field type (i.e., associated field type), dependent upstream model field description (i.e., associated field description), alias 1 (i.e., field alias), alias 2 (i.e., field alias), alias 3 (i.e., field alias).
Table 1 standard fields: standard_data
Step S4042: extract the fields to be standardized, test_data, from the data warehouse.
Table 2 fields to be standardized: test_data
Step S4043: calculate the similarity between each field in test_data and each field in standard_data, and select the field in standard_data with the highest similarity as the standardized field. Specifically, the field name, field type, alias, model name and the like are calculated with tfidf: during calculation, words such as the field name and field type are split into letters, which are then concatenated into vectors, so that a calculation result of n × m rows (n: number of rows in standard_data, m: number of rows in test_data) is obtained. The similarity between the meaningful short texts of the field descriptions is calculated with the bert algorithm. The final result is as follows (the numerical values in the following list are random numbers):
test_data | standard_data | tfidf_similarity | bert_similarity | average |
brand_codel | brand_code | 0.45 | 0.45 | 0.45 |
brand_codel | brand_name_en | 0.34 | 0.34 | 0.34 |
brand_codel | brand_name_full | 0.45 | 0.45 | 0.45 |
brand_codel | brand_name_local | 0.32 | 0.32 | 0.32 |
brand_codel | brand_name_cn | 0.45 | 0.45 | 0.45 |
brand_codel | brand_name | 0.87 | 0.87 | 0.87 |
brand_name_ens | brand_code | 0.32 | 0.32 | 0.32 |
brand_name_ens | brand_name_en | 0.44 | 0.44 | 0.44 |
brand_name_ens | brand_name_full | 0.66 | 0.66 | 0.66 |
brand_name_ens | brand_name_local | 0.23 | 0.23 | 0.23 |
brand_name_ens | brand_name_cn | 0.45 | 0.45 | 0.45 |
brand_name_ens | brand_name | 0.34 | 0.34 | 0.34 |
brand_name_fullx | brand_code | 0.45 | 0.45 | 0.45 |
brand_name_fullx | brand_name_en | 0.32 | 0.32 | 0.32 |
brand_name_fullx | brand_name_full | 0.45 | 0.45 | 0.45 |
brand_name_fullx | brand_name_local | 0.87 | 0.87 | 0.87 |
brand_name_fullx | brand_name_cn | 0.32 | 0.32 | 0.32 |
brand_name_fullx | brand_name | 0.44 | 0.44 | 0.44 |
brand_codec | brand_code | 0.66 | 0.66 | 0.66 |
brand_codec | brand_name_en | 0.23 | 0.23 | 0.23 |
brand_codec | brand_name_full | 0.45 | 0.45 | 0.45 |
brand_codec | brand_name_local | 0.34 | 0.34 | 0.34 |
brand_codec | brand_name_cn | 0.45 | 0.45 | 0.45 |
brand_codec | brand_name | 0.32 | 0.32 | 0.32 |
brand_name_env | brand_code | 0.45 | 0.45 | 0.45 |
brand_name_env | brand_name_en | 0.87 | 0.87 | 0.87 |
brand_name_env | brand_name_full | 0.32 | 0.32 | 0.32 |
brand_name_env | brand_name_local | 0.44 | 0.44 | 0.44 |
brand_name_env | brand_name_cn | 0.66 | 0.66 | 0.66 |
brand_name_env | brand_name | 0.23 | 0.23 | 0.23 |
For example, assume that the standard field (standard_data) has the following characteristic data:
field(s) | Type of field | Field description |
brand_code | string | Brand code |
The characteristic data of the fields to be standardized (test_data) are as follows:
field(s) | Type of field | Field description |
2Brand_id | int | Brand id |
barnd_code | string | Brand id |
brand_cd | string | Brand cd |
The field names and field types of the standard fields are vector represented by the tfidf model, and the results are as follows:
field(s) | Type of field | Field description | tfidf |
brand_code | string | Brand code | [26475.0,56354,87464,554324,455432,54325…] |
field(s) | Type of field | Field description | tfidf |
2Brand_id | int | Brand id | [14275.0,12354.2,7464,124324,255433,44325…] |
barnd_code | string | Brand id | [75375.0,31354,32464,2143224,155432,84325…] |
brand_cd | string | Brand cd | [97475.0,34354,12464,524324,415432,34325…] |
The field descriptions are vector-represented by the bert pre-training model, with the following results:
field(s) | Type of field | Field description | bert |
brand_code | string | Brand code | [4475.0,76354,17464,4324,5432,14325…] |
field(s) | Type of field | Field description | bert |
2Brand_id | int | Brand id | [45698.0,32354,64464,354324,355432,43253…] |
barnd_code | string | Brand id | [98765.0,42354,34464,875432,155432,56782…] |
brand_cd | string | Brand cd | [56789.0,31354,42464,424324,355432,14325…] |
The similarity between the records in test_data and the records in standard_data can be calculated from the vector representations using the conventional cosine formula, which is not described in detail in the embodiment of the invention. The calculation results include the tfidf similarity and the bert similarity, as shown in the following table:
test_data | standard_data | tfidf_similarity | bert_similarity |
2Brand_id | brand_code | 0.45 | 0.22 |
barnd_code | brand_code | 0.36 | 0.33 |
brand_cd | brand_code | 0.56 | 0.44 |
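For reference, the conventional cosine formula mentioned above can be written as follows for two vector representations of equal dimension n. The symbols a and b are introduced here only for illustration.

```latex
\mathrm{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

For vectors with non-negative components the value lies between 0 and 1, and a larger value indicates a smaller angle between the two vectors.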
The tfidf similarity and the bert similarity are averaged, and the results are given in the following table:
test_data | standard_data | tfidf_similarity | bert_similarity | average |
2Brand_id | brand_code | 0.45 | 0.22 | 0.335 |
barnd_code | brand_code | 0.36 | 0.33 | 0.345 |
brand_cd | brand_code | 0.56 | 0.44 | 0.50 |
Through the above process, if there are multiple standard fields, the standard_data in the record with the largest average is selected as the standardized field of the test_data field. In the above example, the target fields in standard_data corresponding to the fields in test_data are as follows:
test_data | standard_data |
brand_codel | brand_code |
brand_name_ens | brand_name_en |
brand_name_fullx | brand_name_full |
brand_codec | brand_code |
brand_name_env | brand_name_en |
brand_name_local | brand_name_local |
brand_name_cn | brand_name_cn |
brand_codeg | brand_code |
brand_nameg | brand_name |
brand_codeb | brand_code |
brand_codeb | brand_code |
barndname_cnx | brand_name_cn |
brand_name_en | brand_name_en |
brand_name_local | brand_name_local |
brand_name_cn | brand_name_cn |
brand_id | brand_code |
In the embodiment of the present invention, because different fields are compared using their name, type and description, and the similarity is kept to a precision of more than 10 decimal places, the case where field A has the same similarity to standardized fields B and C rarely occurs. If it does occur, the following two ways can be adopted for correction:
1. Pre-processing: the standard fields are maintained by developers in advance, and the standard library, which still has to be established manually, is maintained regularly. Monitoring is carried out synchronously during maintenance; not all fields are monitored, only those whose similarity differs greatly (because an erroneous field usually turns out to have a low similarity to the standard field), and these are handled so that errors are basically eliminated.
2. Post-processing: the data used by developers are collected and updated in real time. For example, when a user finds an error (because users check the data while using them), the user handles it in the correct way after checking, and the data flow is monitored; once the standard field actually used is found not to match the predicted standard field, the prediction is corrected. The correction is permanent and is recorded in a white list so that the judgment is not repeated next time.
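The selection rule and the tie handling described above can be sketched as follows. This is an illustrative sketch only: the function name `select_targets`, the tolerance of 1e-10 (chosen to echo the stated precision), the representation of the white list as a dictionary, and the sample scores are all assumptions.

```python
from collections import defaultdict

def select_targets(fused_scores, whitelist=None, tol=1e-10):
    """Pick the highest-scoring standard field for every field to be standardized.

    fused_scores: iterable of (test_field, standard_field, score).
    whitelist:    manually confirmed {test_field: standard_field} corrections.
    Returns {test_field: (target_or_None, tied_candidates)}; the target is None
    when several standard fields share the top score and no correction exists.
    """
    whitelist = whitelist or {}
    by_field = defaultdict(list)
    for test_field, standard_field, score in fused_scores:
        by_field[test_field].append((score, standard_field))

    result = {}
    for test_field, candidates in by_field.items():
        best = max(score for score, _ in candidates)
        tied = sorted(std for score, std in candidates if best - score <= tol)
        if test_field in whitelist:
            result[test_field] = (whitelist[test_field], tied)
        elif len(tied) == 1:
            result[test_field] = (tied[0], tied)
        else:
            result[test_field] = (None, tied)  # left for manual review
    return result

# Illustrative scores only; they are not taken from the tables above.
scores = [
    ("brand_cd", "brand_code", 0.50),
    ("brand_cd", "brand_name", 0.25),
    ("barnd_code", "brand_code", 0.345),
    ("barnd_code", "brand_name", 0.345),
]
print(select_targets(scores))
print(select_targets(scores, whitelist={"barnd_code": "brand_code"}))
```

In the first call the tied field is returned without a target so that it can be reviewed; in the second call the recorded correction decides it, which mirrors the permanent white-list correction.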
Step S405: output the field standardization result. The standardized field corresponding to each field to be standardized in the data warehouse is obtained through steps S401-S404; the result can be used for batch replacement in old scripts in the data warehouse, and also as a unified specification constraint when developers model new tables.
According to the embodiment of the invention, the technical scheme of field standardization is realized by utilizing the existing data model of the data warehouse. A fused field-similarity calculation is adopted, in which the meaningless characters of the fields in the data warehouse are calculated with tfidf and the meaningful field descriptions are calculated with the bert model. In addition, the abstract syntax tree is used to parse and extract the upstream lineage features of the fields, which strengthens the basis of the similarity calculation between fields and improves the calculation accuracy.
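Batch replacement in old scripts could, in its simplest form, be a whole-identifier text substitution driven by the field mapping. The sketch below is an assumption-laden simplification: the function name, the mapping and the sample statement are hypothetical, matching is case-sensitive, and plain text substitution does not distinguish identifiers from string literals or comments, so a rewrite based on the parsed syntax tree may be preferable in practice.

```python
import re

def replace_fields(sql, mapping):
    """Replace whole identifiers in `sql` according to {old_field: standard_field}."""
    if not mapping:
        return sql
    # Longest names first so that a longer field name is never shadowed by a shorter one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], sql)

mapping = {"barnd_code": "brand_code", "brand_cd": "brand_code"}
sql = "select barnd_code, brand_cd_old from t where brand_cd = 'x'"
print(replace_fields(sql, mapping))
```

Because the underscore counts as a word character, brand_cd_old is left untouched while the standalone brand_cd is replaced.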
Fig. 7 is a schematic diagram of main blocks of an apparatus for processing field data in a data warehouse according to an embodiment of the present invention, and as shown in fig. 7, an apparatus 700 for processing field data in a data warehouse according to an embodiment of the present invention includes: a to-be-processed field obtaining module 701, a standard field obtaining module 702, a similarity determining module 703 and a target field obtaining module 704.
The to-be-processed field obtaining module 701 is configured to: acquiring characteristic data of a field to be processed in a data warehouse; the standard field acquisition module 702 is configured to: acquiring one or more standard fields, wherein the standard fields have characteristic data; the similarity determination module 703 is configured to: calculating the similarity between the field to be processed and one or more standard fields according to the characteristic data of the field to be processed and the characteristic data of the standard fields; the target field obtaining module 704 is configured to: and according to the calculated similarity, determining a target field corresponding to the field to be processed from the one or more standard fields so as to associate the field to be processed by using the target field.
In this embodiment of the present invention, the to-be-processed field obtaining module 701 is further configured to: acquiring an execution statement of a data warehouse; and analyzing the execution statement, and determining the field to be processed and the characteristic data of the field to be processed.
In this embodiment of the present invention, the to-be-processed field obtaining module 701 is further configured to: analyzing the execution statement and determining the associated task corresponding to the field to be processed; and determining the associated field of the field to be processed according to the associated task, and taking the associated field as the field to be processed.
In this embodiment of the present invention, the to-be-processed field obtaining module 701 is further configured to: analyzing the execution statement by using the abstract syntax tree, and determining an associated task corresponding to the field to be processed; and the associated task comprises a parent task corresponding to the field to be processed.
In this embodiment of the present invention, the standard field obtaining module 702 is further configured to: acquiring category information of fields to be processed in a data warehouse; and acquiring one or more standard fields corresponding to the fields to be processed according to the category information.
In an embodiment of the present invention, the feature data includes: field name, field type, and field description.
In this embodiment of the present invention, the similarity determining module 703 is further configured to: calculating first similarity between the field name and the field type of the field to be processed and the field name and the field type of one or more standard fields respectively by adopting a keyword query model; calculating second similarity between the field description of the field to be processed and the field description of one or more standard fields by adopting a semantic analysis model; and calculating the similarity between the field to be processed and one or more standard fields respectively according to the first similarity and the second similarity.
Preferably, in this embodiment of the present invention, the similarity determining module 703 is further configured to: vector representation is carried out on the field name and the field type by adopting a tfidf model so as to calculate first similarity between the field name and the field type of the field to be processed and the field name and the field type of one or more standard fields respectively; vector representation is carried out on the field description by adopting a bert model so as to calculate second similarity between the field description of the field to be processed and the field description of one or more standard fields respectively; and calculating the similarity between the field to be processed and one or more standard fields respectively according to the first similarity and the second similarity.
In this embodiment of the present invention, the target field obtaining module 704 is further configured to: replacing the field to be processed with a target field; and, in the event a new data table is created in the data repository, representing fields in the created new data table with one or more standard fields.
According to the device for processing the field data in the data warehouse, the similarity between the field to be processed and one or more standard fields corresponding to the field to be processed is calculated, the target field corresponding to the field to be processed is determined from the one or more standard fields, the mapping relation between each field and the standard field in the data warehouse can be mined, the maintenance cost is reduced, the development efficiency is improved, and code normalization is facilitated. In addition, the category information of the field to be processed can be obtained, the similarity between the field to be processed and one or more standard fields corresponding to the category information is calculated, then the target field corresponding to the field to be processed is determined, the target field can be determined by calculating the field similarity under each category, and for application scenes with large data magnitude in a data warehouse, an effective method for standardizing the field is provided, the calculated amount is greatly reduced, and the efficiency of standardizing the field is improved.
Fig. 8 illustrates an exemplary system architecture 800 of a method of processing field data in a data warehouse or an apparatus for processing field data in a data warehouse to which embodiments of the present invention may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. Network 804 may include various types of connections, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. For example, the standard field may be configured with the terminal device 801, 802, 803.
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server providing various services, for example, a background management server (for example only) providing support during configuration of data fields by the user using the terminal devices 801, 802, 803; for another example, the server 805 may perform the association between the pending field and the standard field according to the embodiment of the present invention, so as to replace the pending field with the standard field, or represent data with the standard field when creating a new data table.
It should be noted that the method for processing field data in the data warehouse provided by the embodiment of the present invention is generally executed by the server 805, and accordingly, the apparatus for processing field data in the data warehouse is generally disposed in the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to fig. 9, shown is a block diagram of a computer system 900 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in fig. 9 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium of the present invention may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a to-be-processed field acquisition module, a standard field acquisition module, a similarity determination module, and a target field acquisition module. In some cases the names of the modules do not limit the modules themselves; for example, the to-be-processed field acquisition module may also be described as a module for acquiring the feature data of the field to be processed in the data warehouse.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: acquiring feature data of a field to be processed in a data warehouse; acquiring one or more standard fields, wherein the standard fields have feature data; calculating the similarity between the field to be processed and each of the one or more standard fields according to the feature data of the field to be processed and the feature data of the standard fields; and determining, according to the calculated similarity, a target field corresponding to the field to be processed from the one or more standard fields, so as to associate the field to be processed with the target field.
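The four steps listed above can be sketched end to end. The following is a minimal illustration, not the implementation of the invention: the `Field` class, the sample fields, and the token-overlap (Jaccard) score are all assumptions chosen for brevity, with the Jaccard score standing in for whatever similarity model an embodiment actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Feature data of a field: name, type, and description (hypothetical container)."""
    name: str
    type: str
    description: str


def tokens(field: Field) -> set[str]:
    # Split the name on underscores and the description on whitespace,
    # then add the type as one more token.
    parts = field.name.lower().split("_") + field.description.lower().split()
    parts.append(field.type.lower())
    return set(parts)


def similarity(a: Field, b: Field) -> float:
    # Jaccard overlap of the token sets; a stand-in for the real similarity model.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_target(pending: Field, standards: list[Field]) -> tuple[Field, float]:
    # Score every standard field and keep the highest-scoring one as the target.
    scored = [(std, similarity(pending, std)) for std in standards]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical sample data.
standards = [
    Field("user_id", "bigint", "unique identifier of the user"),
    Field("order_amount", "decimal", "total amount of the order"),
]
pending = Field("usr_id", "bigint", "identifier of the user")

target, score = find_target(pending, standards)
print(target.name, round(score, 2))  # prints: user_id 0.75
```

Here the pending field `usr_id` shares six of eight distinct tokens with `user_id`, so `user_id` is selected as the target field.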
In the embodiment of the invention, the similarity between the field to be processed and one or more corresponding standard fields is calculated, and the target field corresponding to the field to be processed is determined from the one or more standard fields. In this way the mapping between each field in the data warehouse and its standard field can be mined, which reduces maintenance cost, improves development efficiency, and facilitates code normalization. In addition, the category information of the field to be processed can be obtained, and the similarity is then calculated only against the one or more standard fields corresponding to that category information before the target field is determined. Because field similarity is calculated within each category, this provides an effective way to standardize fields in application scenarios where the data warehouse holds a large volume of data: the amount of calculation is greatly reduced and the efficiency of field standardization is improved.
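The reduction in calculation from category information can be illustrated as follows. This is a sketch under assumptions: the category labels, the field names, and the helper functions `build_index` and `candidates` are hypothetical and only show how grouping standard fields by category limits the number of similarity computations.

```python
from collections import defaultdict

# Hypothetical standard fields as (category, name) pairs.
STANDARD_FIELDS = [
    ("user", "user_id"),
    ("user", "user_name"),
    ("order", "order_id"),
    ("order", "order_amount"),
    ("order", "order_time"),
    ("product", "sku_id"),
]


def build_index(fields):
    # Group standard field names by category so a lookup touches one bucket only.
    index = defaultdict(list)
    for category, name in fields:
        index[category].append(name)
    return index


def candidates(index, category):
    # Only standard fields that share the pending field's category are scored.
    return index.get(category, [])


index = build_index(STANDARD_FIELDS)
to_score = candidates(index, "order")
print(len(STANDARD_FIELDS), "->", len(to_score))  # prints: 6 -> 3
```

With this toy data, a field in the `order` category is compared against three standard fields instead of six; the saving grows with the number of categories and standard fields.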
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (12)
1. A method of processing field data in a data warehouse, comprising:
acquiring feature data of a field to be processed in a data warehouse;
acquiring one or more standard fields, wherein the standard fields have feature data;
according to the feature data of the field to be processed and the feature data of the standard field, calculating the similarity between the field to be processed and the one or more standard fields respectively;
and according to the calculated similarity, determining a target field corresponding to the field to be processed from the one or more standard fields so as to associate the field to be processed by using the target field.
2. The method of claim 1, wherein the step of acquiring feature data of the field to be processed in the data warehouse comprises:
acquiring an execution statement of the data warehouse;
and parsing the execution statement to determine the field to be processed and the feature data of the field to be processed.
3. The method according to claim 2, wherein the step of parsing the execution statement and determining the field to be processed and the feature data of the field to be processed comprises:
parsing the execution statement and determining the associated task corresponding to the field to be processed;
and determining the associated field of the field to be processed according to the associated task, and taking the associated field as the field to be processed.
4. The method according to claim 3, wherein the step of parsing the execution statement and determining the associated task corresponding to the field to be processed comprises:
parsing the execution statement by using an abstract syntax tree, and determining the associated task corresponding to the field to be processed, wherein the associated task comprises a parent task corresponding to the field to be processed.
5. The method of claim 1, wherein the step of acquiring one or more standard fields comprises:
acquiring the category information of the fields to be processed in the data warehouse;
and acquiring one or more standard fields corresponding to the fields to be processed according to the category information.
6. The method of any of claims 1-5, wherein the feature data includes a field name, a field type, and a field description.
7. The method according to claim 6, wherein the step of calculating the similarity between the field to be processed and the one or more standard fields respectively comprises:
calculating first similarity between the field name and the field type of the field to be processed and the field name and the field type of the one or more standard fields respectively by adopting a keyword query model;
calculating second similarity between the field description of the field to be processed and the field description of the one or more standard fields by adopting a semantic analysis model;
and according to the first similarity and the second similarity, calculating the similarity between the field to be processed and the one or more standard fields respectively.
8. The method according to claim 7, wherein the step of calculating the similarity between the field to be processed and the one or more standard fields respectively comprises:
representing the field name and the field type as vectors by using a TF-IDF model, so as to calculate a first similarity between the field name and field type of the field to be processed and the field name and field type of each of the one or more standard fields;
representing the field description as a vector by using a BERT model, so as to calculate a second similarity between the field description of the field to be processed and the field description of each of the one or more standard fields;
and according to the first similarity and the second similarity, calculating the similarity between the field to be processed and the one or more standard fields respectively.
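As a rough illustration of the combination recited in claims 7 and 8, the sketch below computes a first similarity from TF-IDF vectors of the field name and type, and merges it with a second similarity. Several parts are assumptions: the underscore tokenizer, the smoothed IDF formula, the weight `alpha`, and the sample fields are hypothetical, and the second similarities are placeholder numbers standing in for cosine similarities of BERT description embeddings, which are not computed here. The claims do not specify how the two similarities are combined; a weighted sum is just one option.

```python
import math
from collections import Counter


def tokenize(name, ftype):
    # Assumed tokenizer: split the field name on underscores, append the type.
    return name.lower().split("_") + [ftype.lower()]


def tfidf_vectors(docs):
    # Smoothed IDF, log((1 + N) / (1 + df)) + 1, so shared terms keep some weight.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combine(first, second, alpha=0.5):
    # Weighted sum of the two similarities; alpha is an assumed tuning weight.
    return alpha * first + (1 - alpha) * second


# Hypothetical sample fields as (name, type) pairs.
standards = [("user_id", "bigint"), ("order_amount", "decimal")]
pending = ("user_identifier", "bigint")

docs = [tokenize(*pending)] + [tokenize(*s) for s in standards]
vecs = tfidf_vectors(docs)
first = [cosine(vecs[0], v) for v in vecs[1:]]

# Placeholder second similarities (hypothetical values, not model output).
second = [0.9, 0.1]

total = [combine(f, s) for f, s in zip(first, second)]
best = max(range(len(standards)), key=total.__getitem__)
print(standards[best][0])  # prints: user_id
```

The pending field shares the tokens `user` and `bigint` with `user_id` and nothing with `order_amount`, so the first similarity alone already separates the two candidates; the second similarity matters most when names are uninformative but descriptions are not.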
9. The method of claim 1, wherein the step of associating the field to be processed with the target field comprises:
replacing the field to be processed with the target field; and,
in the case of creating a new data table in the data warehouse, representing fields in the created new data table by the one or more standard fields.
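The replacement step of claim 9 might look like the following when applied to an execution statement. This is only a sketch under assumptions: the function `replace_field`, the mapping, and the sample SQL are hypothetical, and a plain regular expression is used instead of a SQL parser, so it would also rewrite matching words inside string literals or comments.

```python
from __future__ import annotations

import re


def replace_field(statement: str, mapping: dict[str, str]) -> str:
    # Match whole identifiers only, so "uid" does not rewrite "uid_hash".
    if not mapping:
        return statement
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], statement)


# Hypothetical statement and pending-to-target mapping.
sql = "SELECT uid, amt, uid_hash FROM orders WHERE uid > 0"
result = replace_field(sql, {"uid": "user_id", "amt": "order_amount"})
print(result)
# prints: SELECT user_id, order_amount, uid_hash FROM orders WHERE user_id > 0
```

Because the underscore counts as a word character, the word-boundary anchors leave `uid_hash` untouched while both occurrences of `uid` are replaced.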
10. An apparatus for processing field data in a data warehouse, comprising:
a to-be-processed field acquisition module, used for acquiring the feature data of the field to be processed in the data warehouse;
a standard field acquisition module, used for acquiring one or more standard fields, wherein the standard fields have feature data;
a similarity determining module, used for calculating the similarity between the field to be processed and each of the one or more standard fields according to the feature data of the field to be processed and the feature data of the standard fields;
and a target field acquisition module, used for determining a target field corresponding to the field to be processed from the one or more standard fields according to the calculated similarity, so as to associate the field to be processed with the target field.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011118899.9A CN114091426A (en) | 2020-10-19 | 2020-10-19 | Method and device for processing field data in data warehouse |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114091426A (en) | 2022-02-25 |
Family
ID=80295826
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202011118899.9A | Pending | CN114091426A (en) | 2020-10-19 | 2020-10-19 | Method and device for processing field data in data warehouse |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114091426A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||