HK1112762A1

HK1112762A1 - Method and system for data processing in data warehouse

Info

Publication number: HK1112762A1
Application number: HK08107424.3A
Authority: HK
Inventors: 史健; 樊亚飞; 张金银
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2006-07-11
Filing date: 2008-07-07
Publication date: 2008-09-12
Also published as: CN101105793A; CN100487700C

Abstract

The invention discloses a data handling method and a system in database to solve the problem that the single course treatment requires great amount of time and manpower cost and can not handle abnormities in ETL process. The method includes: arrange data synchronous configuration table and data refreshing configuration table; configuration information of the corresponding data source, start up a plurality of synchronous courses in parallel; refreshing grades, and start up a plurality of refreshing courses to make sequential synchronous and refreshing treatment on a plurality of data sources. The invention also includes parallel monitoring treatment; periodically checking synchronous refreshing information table; any abnormity discovered should be recorded in the alarm information table; the invention also includes parallel alarm treatment: Periodically checking alarm information table; alarm information should be sent immediately for new abnormities. The invention enhances the efficiency of synchronization and refreshing and saves a great deal of time and manpower cost. Real-time monitoring ensures timely handling of problems and safe, reasonable, and stable operation of database.

Description

Data processing method and system in data warehouse

Technical Field

The present invention relates to data processing of a data warehouse, and more particularly, to a method and a system for processing an online analytical processing (OLAP) system data extraction, transformation, and transshipment (ETL) process of a data warehouse.

Background

A data warehouse is a theme-oriented, integrated, time-dependent, non-modifiable data collection in enterprise Management and decision-making, that is, integration by theme and recording of entire historical changes for all applications, such as Customer Relationship Management (CRM) systems, financial systems, etc. With the continuous improvement of the informatization degree of the enterprise, a large amount of business data are accumulated in the enterprise, and the data warehouse system is used for uniformly processing the mutually independent and dispersed data so as to meet the high-level decision and analysis requirements of the enterprise.

Referring to FIG. 1, there is an architectural block diagram of a data warehouse system. The whole data warehouse system is an architecture comprising four layers, including a data source 101, a data warehouse 102, an On-line Analytical Processing (OLAP) system 103 and a front-end tool 104. Wherein:

data source 101, which is the basis for a data warehouse system, typically includes information internal to the enterprise and information external thereto. The internal information includes various business processing data and various document data, and the external information includes various laws and regulations, market information, competitor information and the like. E.g., CRM systems, financial systems, etc.;

a data warehouse 102, which stores the data of the data sources 101 in a data table structure, wherein each data table corresponds to one data object, and one data source can correspond to a plurality of data objects;

the OLAP system 103 is used for effectively integrating data required by analysis, organizing the data according to a multi-dimensional model so as to analyze in multiple angles and layers and find a trend;

the front-end tools 104 mainly include various reporting tools, query tools, data analysis tools, data mining tools, and various application development tools based on the data warehouse, and access to the data warehouse 102 is realized. The data analysis tool mainly aims at an OLAP server, and the report tool and the data mining tool mainly aim at a data warehouse.

The ETL process of the data warehouse refers to processes of data extraction (Extract), transformation (Transform), and loading (Load), and is a process for developing the OLAP system. Wherein the data extraction refers to extracting data from a source system; the data conversion means that a developer converts the extracted data into a target data structure according to business requirements and realizes summarization; the data loading refers to loading the converted and summarized data into the target data warehouse. The process of extracting the source data is called a synchronization process, and the process of converting and loading the source data is called a refresh process.

Referring to fig. 2, a flowchart of an ETL processing procedure of a data warehouse OLAP system in the prior art is shown, where the processing procedure is a pipelined processing mode, and includes function calls of an ETL main module, a synchronization module, and a refresh module. The treatment process comprises the following steps: step 201, calling an ETL main module through a timing trigger (Crontab) every day; step 202, the main module calls the synchronization module again to wait for the data source to be recovered (i.e. wait for whether the data source data is ready), and performs synchronization operation on the recovered data sources one by one, for example, the data of the data source of the CRM is synchronized first, then the data of the financial data source is synchronized, and the synchronization of a plurality of data objects corresponding to each data source is performed by adopting a single process; and step 203, after the synchronization module is completed, calling a refreshing module, and refreshing each data table one by adopting a single-process processing mode. The synchronization and refresh process completes the extraction, conversion and loading processing of the data source data.

The data warehouse building method has the advantages that the data warehouse building method is used for meeting the high-level decision and analysis requirements of enterprises, and is also provided with a monitoring module and an alarm module for processing abnormal conditions occurring in the ETL processing process of the data warehouse OLAP system in order to ensure the accuracy of data analysis of the data warehouse. After the data source data are synchronized and refreshed, executing: step 204, after the refreshing module is finished, starting the monitoring module, and checking whether the log record of the synchronous refreshing process is abnormal; step 205, after the monitoring module checks the abnormality, the monitoring module starts the alarm module to send alarm information to prompt the ETL engineer to re-execute the synchronous refresh processing.

In the processing method, a single-process processing mode is adopted for the synchronization and refreshing process of each data source and each data object, and after one data source is synchronized, the other data source is synchronized; and after one data table is refreshed, refreshing the other data table. On one hand, because the recovery time points of the data sources are different, the data cannot be synchronized in time in a single-process processing mode, the processing time of the ETL process is overtime, and the data processing efficiency is influenced. On the other hand, the serial mode of one process going to the end does not reasonably and fully utilize the network and server resources, and when the network or server resources are idle, the ETL processing process needs to spend much time waiting for the processing of a single process.

Moreover, with the continuous expansion of company business and the rapid change of market, the data analysis demand increases at a high speed, and the required data sources and data objects also increase continuously. In the single-process processing mode, a data source or a data object is newly added in the ETL process, and a program needs to be modified. Therefore, the method will consume a lot of development labor cost for different processing requirements.

In addition, in the ETL processing process, the four modules of synchronization, refreshing, monitoring and alarming are called in a serial scheduling mode, so that the monitoring of the synchronous refreshing process belongs to a post-event monitoring mode, the problem of abnormal ETL process cannot be solved in time, the timeliness of company decision analysis data acquisition is influenced, and the influence of a data warehouse in a client is influenced.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method and a system for processing an ETL process of an OLAP system in a data warehouse, so as to solve the problems of low processing efficiency, unreasonable utilization of network and server resources, inconvenience of program modification, and incapability of timely processing an exception in the ETL process.

In order to solve the technical problem, the invention provides a data processing method in a data warehouse, which comprises the following steps:

setting a data synchronization configuration table and a data refreshing configuration table comprising data source configuration information; wherein each data source comprises at least one data object;

starting a plurality of synchronous processes in parallel corresponding to the number of the data sources in the data source configuration information;

and starting a plurality of refreshing processes in parallel according to the number of the data objects.

Preferably, corresponding to each synchronization process, after the synchronization process is executed, the refresh process is started.

And starting the refreshing process after all the synchronous processes are executed.

Wherein, a plurality of synchronous processes are started in parallel according to the following steps: corresponding to the number of the data sources in the data source configuration information, starting synchronous waiting processes with corresponding number in parallel; and checking whether the data source is recovered or not at regular time, and if so, starting a plurality of synchronous processes in parallel by the corresponding synchronous waiting process.

Preferably, a plurality of refresh processes are started in parallel according to the following steps: the data refreshing configuration table is provided with refreshing levels, and corresponding number of refreshing waiting processes is started in parallel corresponding to the number of data objects defined by the refreshing levels; and regularly checking whether the refresh execution condition of the data object is met, and if so, starting a refresh process by a corresponding refresh waiting process.

Wherein the refresh execution condition includes: and completing synchronous processing of the corresponding data objects, and calling the upper-level data object of the data objects to complete refreshing processing.

Preferably, the plurality of synchronous processes or the plurality of refresh processes are invoked in a priority mode.

Further comprising: and recording the processing procedure information into a synchronous refreshing information table.

The method also comprises the following parallel monitoring processing: setting an alarm level configuration table; and checking the synchronous refreshing information table at regular time, and if abnormity is found, recording abnormal information into an alarm information table corresponding to the alarm level configuration table.

The method also comprises the following steps of: and checking the alarm information table at regular time, corresponding to the abnormal information, and immediately sending corresponding alarm information.

The present invention also provides a data processing system in a data warehouse, comprising:

the storage unit is used for storing a data synchronization configuration table comprising data source configuration information, a data refreshing configuration table comprising refreshing levels and a synchronous refreshing information table;

the synchronization unit is used for reading the data synchronization configuration table; corresponding to the number of the data sources in the data source configuration information, starting synchronous waiting processes with corresponding number in parallel; checking whether the data source is recovered at regular time, and if so, starting a plurality of synchronous processes in parallel by the corresponding synchronous waiting process;

the refreshing unit is used for reading the data refreshing configuration table; each data source comprises at least one data object, and corresponding to the number of the data objects defined by the refresh level, refreshing waiting processes with corresponding numbers are started in parallel; and regularly checking whether the refresh execution condition of the data object is met, and if so, starting a refresh process by a corresponding refresh waiting process.

Further comprising: and the recording unit is used for recording the information of the synchronization and refreshing processing process to the synchronous refreshing information table.

The storage unit also stores an alarm level configuration table and an alarm information table.

Further comprising: the monitoring unit is used for reading the alarm level configuration table; and checking the synchronous refreshing information table at regular time, if abnormity is found, corresponding to the alarm level configuration table, and recording abnormal information into the alarm information table.

Further comprising: the alarm unit is used for reading the alarm information table; and checking the alarm information table at regular time, and immediately sending corresponding alarm information corresponding to the abnormal information.

Compared with the prior art, the invention has the following advantages:

firstly, the invention adopts a multi-process concurrent processing mode, and the multi-process concurrently starts synchronization and refresh processing to multi-data source multi-data objects:

firstly, multi-data source multi-process synchronous waiting processing solves the problem of inconsistent data source synchronous time caused by different recovery time points of a plurality of data sources, and after a certain data source is recovered, a waiting process starts a synchronous process in time, so that the data processing time is ensured, network and server resources are saved for other data source processing, and the reasonable utilization of the network and server resources is achieved;

secondly, multi-data object multi-process synchronization and refresh are carried out, and synchronization or refresh operation is carried out on a plurality of data objects corresponding to the same data source or a plurality of data objects of a plurality of data sources at the same time, namely, the data objects are synchronized to a data warehouse at the same time or a plurality of data tables in the data warehouse are refreshed at the same time, so that the synchronization and refresh efficiency is improved, a large amount of synchronization and refresh time is saved, residual time resources are provided for data growth, and optimal utilization of network and server resources is achieved.

Thirdly, the synchronization and refresh processes are started concurrently, and the synchronization and refresh scheduling of the same data object is realized by checking the completion condition of the synchronization process and the scheduling execution relation between the refresh processes, so that the problem of timely refreshing of data is solved, the refresh time of the ETL process is saved, and the refresh efficiency is improved; the refresh scheduling between the data objects is realized by adopting a refresh priority mode, the association problem of the refresh objects and the multi-dimensional gradient dimensionality is solved, and after the gradient dimensionality refresh process is completed, the refresh process of the associated data objects is started, so that the accuracy of data logic is ensured.

Secondly, the invention adopts a configurable mode, and the synchronization, refreshing, monitoring and alarm processing are all carried out in a configuration table configuration mode, and the synchronous configuration table, the refreshing configuration table and the alarm configuration table are called through scripts, so that the ETL process can be configured, the human intervention degree and the programming workload of engineers are reduced, the development efficiency of ETL engineers is improved, the labor cost required by the development of the ETL process is saved, and the surplus human resources are provided for expanding the range of a data warehouse.

And thirdly, the invention adopts a real-time monitoring mode, simultaneously starts synchronous, refreshing, monitoring and alarming processing, adopts a real-time monitoring mode for the synchronous refreshing process, finds problems and timely reports the problems to an ETL engineer of the data warehouse according to the problem grade, ensures the timeliness of problem processing and the safe, reasonable and stable operating environment of the data warehouse, and ensures that customers can timely and accurately obtain analytic data, thereby providing a solid foundation for the status of the data warehouse in the mind of the customers.

Drawings

FIG. 1 is an architectural block diagram of a data warehouse system;

FIG. 2 is a flow diagram of a prior art ETL process for a data warehouse OLAP system;

FIG. 3 is a flow chart of a data warehouse synchronous refresh process according to the present invention;

FIG. 4 is a flow diagram of the optimized data warehouse ETL process of the present invention;

FIG. 5 is a process flow diagram of an embodiment of the invention;

FIG. 6 is a block diagram of a data warehouse ETL process processing system in accordance with the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The core idea of the invention is as follows: by setting a data synchronization configuration table and a data refreshing configuration table, a multi-process parallel synchronous refreshing mode is adopted for the ETL process of the data warehouse OLAP system, and after the data source is recovered, synchronous processing starts a plurality of synchronous processes for a plurality of data sources; the refreshing process starts a plurality of refreshing processes for the multiple data objects according to the synchronous process completion condition of the same data object and the refreshing scheduling relation of the data object; and simultaneously starting synchronization and refreshing, and refreshing in time after the same data object is subjected to synchronization. The monitoring processing and the alarming processing are started in parallel by setting an alarming level configuration table, the running condition of the synchronous refreshing process is monitored in real time, the monitoring result is checked in a real-time mode, an ETL engineer is immediately notified when an abnormal condition is found, the running condition of the synchronous refreshing process is reported in the fastest time, and the safe and stable running of a data warehouse is ensured.

Referring to fig. 3, a flow chart of the data warehouse synchronous refresh process according to the present invention is shown.

Step 301, setting a data synchronization configuration table and a data refresh configuration table. The data synchronization configuration table records data source configuration information, target data information, synchronization levels, increment synchronization fields and the like for synchronous processing and calling. The data source configuration information comprises the number of data objects contained in the data source, original data table names corresponding to the data objects, a server where the data source is located, data source link information and the like; the target data information comprises the number of target data objects, data table names corresponding to the target data objects and the like, is data mirror image information of the remote data source in the data warehouse, and the data mirror image is an environment integrating data from different databases, refers to backup of the remote data source in the data warehouse, and is used for data extraction, conversion and loading. And the data refreshing configuration table records the refreshing process name, the refreshing parameters, the refreshing execution conditions, the refreshing period (such as day and week execution periods), the refreshing level and the like for the refreshing processing and calling.

Step 302, the data warehouse ETL master module is started. The ETL process of the data warehouse OLAP system is realized by taking a Shell script and a Structured Query Language (SQL) script as main development scripts, a main module starts the main Shell script through Crontab, and the main module script is used for starting a synchronization and refresh processing script.

The ETL master module starts synchronization, refresh processing simultaneously, via step 302. For the parallel mode, there may be multiple processing modes, for example, corresponding to multiple processing objects, starting the corresponding number of processes to execute; it is also possible to set the number of processes in advance, execute a plurality of processing objects by a small number of parallel processes, and the like. But for the multi-process processing of the ETL process of the data warehouse OLAP system, the following manner is generally adopted. The following describes the synchronization and refresh processes separately. The synchronous processing flow comprises the following steps:

step 303, reading the data synchronization configuration table.

Step 304, start the synchronous wait process. And starting the synchronous waiting processes with corresponding numbers in parallel according to the number of the data sources recorded in the configuration table information.

Step 305, wait for whether the data source is recovered. The data source recovery refers to the recovery condition of a data source mirror image (Standby) server, and the data warehouse acquires data from the mirror image server, so that the performance of a formal server can be ensured. In the process of waiting for the data source to recover, the synchronous waiting process plays a role in monitoring whether the data source recovers or not. The synchronous waiting process regularly checks the recovery condition of the primary data source Standby server, and the timing time is generally 5 minutes. If so, continue to step 306, otherwise wait for the next timing point to arrive.

Because the recovery time points of the data sources are different, the recovered data sources can be processed in time by starting a plurality of synchronous waiting processes in parallel; in addition, the parallel processing of a plurality of data sources in the synchronization process can save the processing time, save the network and server resources for the processing of other data sources, and achieve the reasonable utilization of the network and server resources.

Step 306, start the synchronization process. And if the Data source Standby server finishes restoring, the corresponding synchronous waiting process starts a plurality of synchronous processes in parallel, and Data is extracted to an Operation Data Storage (ODS) according to the Data synchronous configuration table information. The ODS, also known as data mirror, is an environment that integrates data from different databases. Typically, each data source corresponds to multiple data objects, and thus multiple synchronization processes are initiated in parallel to handle synchronization of multiple data objects. However, the number of synchronization processes is allocated according to the usage of network and server resources, and does not correspond to the number of data objects one-to-one. Therefore, the synchronization level of the synchronization process is defined in the data synchronization configuration table, which is used for ensuring the network bandwidth, avoiding reaching the network bottleneck and freely adjusting the synchronization parallelism. For example, by configuring the synchronization level for 8 data objects corresponding to the same data source, firstly executing 4 synchronization processes in parallel, and then executing the remaining 4 synchronization processes in parallel, the plurality of synchronization processes are executed according to the priority order, so that the server pressure is reduced, and the network and server resources are reasonably and fully utilized.

In the ETL process, the abstraction of a data source from different perspectives is referred to as the varying dimensionality of the data. For example, a data table corresponding to a certain data object defines 4 fields, that is, data is extracted from 4 different angles, and if one field is added, the dimension of the data changes in an incremental manner. And in the synchronization processing, the incremental data are synchronized to the data warehouse according to the information of the incremental synchronization field. Because the data that changes in the data warehouse synchronously every day is many, reduce the data bulk of synchronization, can raise the synchronous efficiency. Therefore, a multi-process processing mode of a data source is adopted, a plurality of data objects are synchronized to the data warehouse at the same time, and the efficiency of synchronous processing is improved, so that a large amount of synchronous time is saved, residual time resources are provided for data growth, and optimal utilization of network and server resources is achieved.

Step 307, the synchronization process information is recorded. In the synchronization process, synchronous operation process information is recorded to a synchronous refresh information table (Promsg) at the same time, and a consistent enterprise data integration view is provided for users. The synchronous refreshing information table records the information of the completion of the synchronization time or the abnormal time, the abnormal reason and the like. The process has similar functions to the system log record, is convenient for data analysis of the synchronization process, and is also convenient for providing different process information for users according to different requirements of the users.

In the synchronous processing process, the multiple data sources and the multiple data objects are processed in parallel in multiple processes, so that the problem of inconsistent data source recovery time points is solved, and the synchronous processing efficiency is improved. And for the same data source data object, after synchronous processing is completed, entering a refreshing process, namely a data conversion and loading process. The refreshing processing flow comprises the following steps:

step 308, reading the data to refresh the configuration table.

Step 309, a refresh waiting process is initiated. And according to the refreshing level recorded by the configuration table, parallelly starting refreshing waiting processes corresponding to a plurality of data objects. For example, there are 8 data tables corresponding to the data object to be refreshed, and the sequence of the refresh priority is to refresh 3 data tables first and then 5 data tables, so that 3 refresh waiting processes are started in the first level refresh, and 5 refresh waiting processes are started in the second level refresh. The purpose of the refresh level is to:

firstly, due to the adoption of a multi-process parallel refreshing mode, the data table with the logic relation needs to be scheduled and refreshed in sequence, and the defined refreshing priority solves the association problem of a refreshing object and a multidimensional gradient dimension. The multidimensional gradient dimension refers to that the dimension of data (namely, fields contained in a data table) is changed, and in the ETL process, because some dimension attribute dimension data need to keep a history track, when the data table is refreshed, the history is recorded in a slowly gradient dimension mode, and a unique identification record history state is added in the data table. The problem of association between the refresh object and the multidimensional gradient dimension refers to scheduling between the data tables of the upper level and the lower level, and the data table of the next level can be refreshed only after the data table of the upper level is refreshed. And the multi-dimensional gradual change dimension adopts a mode of slowly changing dimension, and after the gradual change dimension refreshing process is completed, the associated object refreshing process is started, so that the accuracy of data logic is ensured.

And secondly, refreshing parallelism can be freely adjusted, hardware resources of the server can be reasonably utilized, and the IO operation bottleneck of the server is avoided.

In step 310, it is checked whether the refresh execution condition is satisfied. The refresh execution condition is defined in a data refresh configuration table, and comprises: firstly, the synchronous processing of the refreshing object is completed; secondly, the upper-level data object of the refreshing object is scheduled to complete the refreshing processing. The refresh waiting process plays a role of monitoring whether the refresh execution condition is met, the check is performed at regular time according to the refresh cycle, when the refresh object meets the two conditions at the same time, the step 311 is continuously executed, otherwise, the next refresh cycle is waited.

Step 311, a refresh process is initiated. And when the refreshing object meets the refreshing execution condition, starting a refreshing process by the refreshing waiting process to finish the refreshing of the data table. And meanwhile, the multi-process refreshing can save the refreshing processing time, improve the refreshing efficiency and reasonably utilize the network and server resources.

At step 312, the refresh process information is recorded. In the refresh process, the synchronized running process information is simultaneously recorded to a synchronized refresh information table (Promsg). The synchronous refreshing information table records the information of the refreshing time or the abnormal time, the abnormal reason and the like. Again, this is a log-like process.

The following exemplifies the procedure of the refresh process. For example: the refresh process starts a refresh waiting process of the product information table (Offer table), and checks the following conditions every 3 minutes: the waiting process first checks whether the synchronization of the Offer table is completed, and then checks whether the refresh of the schedule table Member information (Member table) of the Offer table is completed. And under the condition that the two conditions are met, starting to execute the refreshing process of the Offer table, and simultaneously recording the refreshed running condition information into a synchronous refreshing information table (Promsg).

In the multi-process synchronization and refresh processing process, if the synchronization processing and the refresh processing are still performed in a serial manner (i.e. all data sources complete the synchronization process and then perform refresh), in the synchronization process, multiple data sources and multiple data objects are synchronized to the data warehouse at the same time; in the refreshing process, the data objects of multiple data sources are refreshed simultaneously, so that the efficiency of the synchronization and refreshing process is improved. In the serial processing mode, step 310 only needs to determine the second refresh execution condition, and all synchronous processing before the refresh processing is completed, so the first refresh execution condition is not determined.

However, because the multi-process processing of each process is realized by adopting the configuration table, the parallel processing of synchronization and refresh can be conveniently realized, namely, the main module simultaneously starts the synchronization and refresh processing scripts, corresponding to each synchronization process, and starts the refresh process after the synchronization process is executed. Under this optimization method, step 310 needs to check the first and second refresh execution conditions, and when the first and second refresh execution conditions are met, a refresh process is started to implement the scheduling of the synchronous object and the refresh object. The scheduling can realize timely refreshing of the data object, save the refreshing time of the ETL process and improve the refreshing efficiency. The invention realizes synchronous refreshing parallel processing of a plurality of data objects of a plurality of data sources, but the synchronous refreshing processing is still executed for the same data object of the data source in sequence.

The synchronization and refreshing process completes the processes of extraction, conversion and loading of data source data, and in order to ensure safe and stable operation of the ETL process of the data warehouse, the monitoring and alarm process is started in parallel while the synchronization and refreshing process is started in parallel, so that the synchronous refreshing process is monitored in real time. Referring to FIG. 4, a flow diagram of an optimized data warehouse ETL process of the present invention is shown.

Step 401, setting a data synchronization configuration table, a data refreshing configuration table, and an alarm level configuration table. The data synchronization configuration table comprises data source configuration information, target data information, synchronization levels, increment synchronization fields, synchronization alarm levels and the like, the data refreshing configuration table comprises refreshing process names, refreshing parameters, refreshing execution conditions, refreshing periods (such as days and week execution periods), refreshing levels, refreshing alarm levels and the like, and the alarm level configuration table comprises alarm level distribution information, alarm receivers and the like. The alarm level allocation information recorded in the alarm level configuration table is used for dividing alarm levels according to abnormal levels, such as sending alarms immediately after several hours or several points. For example, level 1 corresponds to immediate transmission, level 2 corresponds to transmission after 2 hours, and so on. And the alarm receiver is set according to different settings of engineers of all modules of the ETL. In this step, the synchronous alarm level and the refresh alarm level information are added in the data synchronous configuration table and the data refresh configuration table. The synchronous alarm level and the refreshing alarm level respectively correspond to alarm level distribution information, and the alarm level corresponding to each alarm sending is recorded. For example, synchronization process 1 has an alarm of abnormal transmission level 2, refresh process 3 has an alarm of abnormal transmission level 1, and so on.

At step 402, the data warehouse ETL master module is started.

In step 403, the ETL master module starts synchronization, refresh, monitoring, and alarm processing at the same time. The synchronous and refreshing process is the same as the above, the monitoring process monitors the synchronous refreshing process in real time, and checks the synchronous refreshing information table at regular time. If the abnormal information is found, the abnormal information is recorded into the alarm information table according to the alarm level configuration table and the synchronous and refreshing alarm level information, and the ETL process completion information is recorded into the alarm information table every day, so that the safe, stable and reasonable operation of the ETL process of the data warehouse is ensured. The abnormal information comprises alarm sending time, an alarm receiver, alarm information and the like, and the alarm information table is used for alarm processing; the time interval for the timing check is typically 5 minutes. Different alarm information can be set for distinguishing corresponding to different processing processes.

And (3) parallelly started alarm processing is used for sending abnormal information of an ETL process in real time, and the alarm information table is checked at regular time, usually once every 5 minutes. If the abnormal information is found, the corresponding alarm information is immediately sent to the alarm receiver according to the alarm sending time, an ETL engineer is informed, the abnormal problem in the ETL process is guaranteed to be timely processed, and a foundation is laid for the timeliness of data. Meanwhile, the completion information is also sent to inform the ETL engineer of the end condition of the whole ETL process.

The parallel processing of synchronization, refreshing, monitoring and alarming can timely report problems to the ETL engineer of the data warehouse according to the problem level, thereby ensuring the timeliness of problem processing and the safe, reasonable and stable operating environment of the data warehouse, ensuring that customers can timely and accurately obtain analytic data, and further providing a solid foundation for the status of the data warehouse in the customer mind.

The invention adopts a configurable mode, and the synchronization, refreshing, monitoring and alarm processing are realized by a configuration table configuration mode and calling a synchronization configuration table, a refreshing configuration table and an alarm configuration table through a script, thereby realizing the configurable ETL process, reducing the human intervention degree and the workload of engineer programming, improving the development efficiency of ETL engineers, saving the labor cost required by the development of the ETL process and providing the residual human resources for expanding the range of a data warehouse.

To illustrate the core of the present invention more clearly, specific examples are given for illustration. Referring to FIG. 5, a process flow diagram of an embodiment of the invention is shown.

Step 501, setting a synchronous configuration table, refreshing the configuration table and alarming the configuration table. In the processing process, the number of the data sources is 3, the data source 1 corresponds to A, B, C, D four data tables, and the data source 2 and the data source 3 also correspond to a plurality of data tables respectively.

Step 502, the data warehouse ETL master module is started.

Step 503, the ETL master module starts synchronization, refresh, monitoring, and alarm processing at the same time.

Step 504, start the synchronous waiting process and refresh waiting process. According to the 3 data source information recorded by the configuration table, a synchronous waiting process of a source database 1, a source database 2 and a source database 3 is started in parallel; the refresh waiting process of the table A, B, D is started in parallel according to the number of refresh objects.

Step 505, start the synchronization process, corresponding to the source database 1 synchronization waiting process, according to the synchronization level and the recovery condition of the data source, start 2 synchronization processes in parallel, perform synchronization processing on table a and table C, and the synchronization of table B and table D waits for the call of the next level.

A refresh process is initiated. And the table A completes synchronous processing, and when the table A does not have the association scheduling of the upper level, the table A refreshing process is started immediately. And the next level of the table A is associated with the scheduling table B and the table D and needs to wait for the refreshing of the table A to be completed.

Step 506, waiting for synchronous processing of the table B and the table D according to the synchronous level; according to the refresh level, table B and table D wait for the refresh process.

And the monitoring and alarming process carries out real-time monitoring and real-time alarming processing on the synchronous refreshing process.

The invention also provides a data processing system in the data warehouse. Referring to fig. 6, there is a block diagram of a data warehouse ETL process processing system according to the present invention. The system comprises a storage unit 601, a main unit 602, a synchronization unit 603, a refresh unit 604, a monitoring unit 605, an alarm unit 606 and a recording unit 607. Wherein the content of the first and second substances,

the storage unit 601 is used for storing a data synchronization configuration table, a data refresh configuration table, an alarm level configuration table, a synchronization refresh information table, and an alarm information table. The data synchronization configuration table comprises data source configuration information, target data information, synchronization levels, increment synchronization fields, synchronization alarm levels and the like, the data refreshing configuration table comprises refreshing process names, refreshing parameters, refreshing execution conditions, refreshing periods (such as days and week execution periods), refreshing levels, refreshing alarm levels and the like, and the alarm level configuration table comprises alarm level distribution information, alarm receivers and the like. The synchronous refreshing information table records synchronous and refreshing operation process information, including synchronization completion or refreshing time, abnormal reasons and the like. The alarm information table records abnormal information, completion information and the like in synchronization and refreshing processes, wherein the abnormal information comprises alarm sending time, an alarm receiver, alarm information and the like.

And a main unit 602, configured to start a main Shell script through a bridge (operating the daemon of each user and the execution schedule), and call the synchronization unit, the refresh unit, the monitoring unit, the alarm unit, and the recording unit.

A synchronization unit 603, configured to read the data synchronization configuration table; corresponding to the number of data sources recorded by the data synchronization configuration table, and starting synchronization waiting processes with corresponding numbers in parallel; and checking whether the data source is recovered or not at regular time, and if so, starting a plurality of synchronous processes in parallel by the corresponding synchronous waiting process.

A refresh unit 604, configured to read the data refresh step configuration table; each data source comprises at least one data object, the number of the data objects defined by the corresponding refreshing level is corresponded, and refreshing waiting processes with corresponding number are started in parallel; and regularly checking whether the refresh execution condition of the data object is met, and if so, starting a refresh process by a corresponding refresh waiting process. The refreshing execution condition comprises that the first step is to complete synchronous processing of the corresponding data object, and the second step is to call the upper-level data object of the data object to complete refreshing processing; if the two conditions are checked simultaneously, the refreshing unit and the synchronizing unit are called in parallel; if only the second condition is checked, the refresh unit and the synchronization unit are called serially.

A monitoring unit 605, configured to read the alarm level configuration table; checking a synchronous refreshing information table at regular time, if abnormity is found, corresponding to an alarm level configuration table, and recording abnormal information into the alarm information table; and recording ETL process completion information to an alarm information table every day. The abnormal information comprises alarm sending time, an alarm receiver, alarm information and the like.

An alarm unit 606, configured to read the alarm information table; checking an alarm information table at regular time, corresponding to new abnormal information, immediately sending corresponding alarm information according to the alarm sending time, informing an ETL engineer, and ensuring that the abnormal problem in the ETL process is timely processed; and sending a completion message to inform an ETL engineer of the end condition of the whole ETL process.

A recording unit 607 for recording the synchronization and refresh processing procedure information to the synchronization refresh information table.

The processing system for the data warehouse data extraction, conversion and loading processes realizes synchronous refreshing processing of multiple data sources and multiple data objects, the main unit simultaneously calls the synchronization unit, the refreshing unit, the monitoring unit, the alarm unit and the recording unit, the synchronous refreshing process of data can be processed in time, synchronous refreshing efficiency is improved, data processing time is guaranteed, network and server resources are saved for processing other data sources, and reasonable utilization of the network and server resources is achieved. The monitoring unit and the alarm unit adopt real-time monitoring to the synchronous refreshing process, find problems and timely report to an ETL engineer of the data warehouse according to the problem grade, and therefore the timeliness of problem processing and the safe, reasonable and stable operation environment of the data warehouse are guaranteed. The invention also realizes a configurable mode, and the synchronization, refreshing, monitoring and alarming units acquire relevant configuration information by reading the configuration table, and call the synchronization configuration table, the refreshing configuration table and the alarming configuration table through the script, thereby realizing the process configurability of the ETL, reducing the human intervention degree and the workload of the engineer programming, improving the development efficiency of the ETL engineer, saving the labor cost required by the development of the ETL process, and providing the residual human resources for expanding the range of a data warehouse.

The data processing method and system in the data warehouse provided by the invention are introduced in detail, and the principle and the implementation mode of the invention are explained by applying specific examples, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A method of processing data in a data warehouse, comprising:

2. The method of claim 1, wherein: and corresponding to each synchronous process, starting the refreshing process after the synchronous process is executed.

3. The method of claim 1, wherein: and starting the refreshing process after all the synchronous processes are executed.

4. The method of claim 1, wherein the plurality of synchronization processes are initiated in parallel according to the steps of:

corresponding to the number of the data sources in the data source configuration information, starting synchronous waiting processes with corresponding number in parallel;

and checking whether the data source is recovered or not at regular time, and if so, starting a plurality of synchronous processes in parallel by the corresponding synchronous waiting process.

5. The method of claim 2, wherein a plurality of refresh processes are initiated in parallel according to the steps of:

the data refreshing configuration table is provided with refreshing levels, and corresponding number of refreshing waiting processes is started in parallel corresponding to the number of data objects defined by the refreshing levels;

and regularly checking whether the refresh execution condition of the data object is met, and if so, starting a refresh process by a corresponding refresh waiting process.

6. The method according to claim 5, wherein the refresh execution condition includes: and completing synchronous processing of the corresponding data objects, and calling the upper-level data object of the data objects to complete refreshing processing.

7. The method according to claim 1 or 2, characterized in that: and calling the synchronous processes or the refreshing processes in a priority mode.

8. The method of claim 1, further comprising: and recording the information of the synchronization and refreshing process to a synchronization refreshing information table.

9. The method of claim 8, further comprising a parallel monitoring process:

setting an alarm level configuration table;

and checking the synchronous refreshing information table at regular time, and if abnormity is found, recording abnormal information into an alarm information table corresponding to the alarm level configuration table.

10. The method of claim 9, further comprising parallel alarm processing: and checking the alarm information table at regular time, corresponding to the abnormal information, and immediately sending corresponding alarm information.

11. A data processing system in a data warehouse, comprising:

12. The system according to claim 11, wherein the refresh execution condition includes: and completing synchronous processing of the corresponding data objects, and calling the upper-level data object of the data objects to complete refreshing processing.

13. The system of claim 11, further comprising: and the recording unit is used for recording the information of the synchronization and refreshing processing process to the synchronous refreshing information table.

14. The system of claim 11, wherein: the storage unit also stores an alarm level configuration table and an alarm information table.

15. The system of claim 14, further comprising:

the monitoring unit is used for reading the alarm level configuration table; and checking the synchronous refreshing information table at regular time, if abnormity is found, corresponding to the alarm level configuration table, and recording abnormal information into the alarm information table.

16. The system of claim 15, further comprising:

the alarm unit is used for reading the alarm information table; and checking the alarm information table at regular time, and immediately sending corresponding alarm information corresponding to the abnormal information.