CN112365244B

CN112365244B - Data life cycle management method and device

Info

Publication number: CN112365244B
Application number: CN202011359475.1A
Authority: CN
Inventors: 周统汉; 覃娆; 孙朝辉; 崖飞虎
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-11-27
Filing date: 2020-11-27
Publication date: 2024-04-26
Anticipated expiration: 2040-11-27
Also published as: CN112365244A

Abstract

An embodiment of the application provides a data lifecycle management method and device. The method comprises the following steps: a first device calculates the data active duration of a data table based on the access condition of the data table; when the creation duration of the data table is longer than the data active duration, the first device determines a target processing policy based on the type of the data table; and the first device performs data lifecycle management on the data table according to the target processing policy. This reduces manual intervention, improves the processing efficiency of data lifecycle management, ensures the safety of data archiving or data cleaning, and reduces the cost pressure of enterprise storage.

Description

Data life cycle management method and device

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a data life cycle management method and device.

Background

In the field of financial technology (Fintech), data volumes expand rapidly as business grows. A database or data cluster typically holds a large number of data tables, such as temporary tables and invalid intermediate tables, and the data is stored centrally on high-end storage devices. This occupies a large amount of storage and computing resources, and the investment cost of capacity expansion keeps rising. Data lifecycle management is therefore required to allocate storage resources reasonably and improve the utilization of system resources.

For any data table, the data has different performance, availability and storage requirements at different stages of its life cycle. In the early stage after the data is created, it is used frequently, and high-speed storage is required to ensure high availability. In the middle stage, the importance of the data gradually declines and it is used less often, so the data is stored in tiers that provide appropriate availability and storage space while reducing management cost and resource expenditure. In the late stage, most of the data is no longer used; it should be cleaned and then archived in case it is needed temporarily, or it should be deleted.

However, developers usually formulate the processing policy for data lifecycle management before the data goes online, based on their own understanding of the data: they register a retention period, and the data is later archived or destroyed according to that period. Because the retention period rests on subjective judgment, it is easily misestimated, which exposes data lifecycle management to the risk of mistaken deletion or redundant storage.

Disclosure of Invention

The embodiment of the application provides a data life cycle management method and a data life cycle management device, which can accurately recommend the data active time length of a data table and automatically recommend specific data life cycle management, intelligently realize the data life cycle management process of the data table, ensure the safety performance of data and reduce the cost pressure of enterprises for storing data.

In a first aspect, an embodiment of the present application provides a data lifecycle management method.

The method comprises the following steps: the first device calculates the data active duration of a data table based on the access condition of the data table; when the creation duration of the data table is longer than the data active duration, the first device determines a target processing policy based on the type of the data table; and the first device performs data lifecycle management on the data table according to the target processing policy.

By the method provided by the first aspect, the first device combines the access condition of one data table, so that the data active time length of the data table can be calculated, and the accurate recommendation of the data active time length is realized. When the creation time of the data table is longer than the data active time of the data table, the first device may determine that the data table needs to be subjected to data life cycle management. The first device may determine a target processing policy based on the type of the data table, accurately evaluating whether the data table requires data archiving or data cleaning. Therefore, the first device can intelligently conduct automatic data life cycle management on the data table according to the target processing strategy, manual intervention operation is reduced, processing efficiency of data life cycle management is improved, safety of data archiving or data cleaning is ensured, and cost pressure of enterprise storage is reduced.

In one possible design, the first device calculates a data active duration of a data table based on an access condition of the data table, including:

The first device obtains the creation duration of the data table, and the creation duration and latest access time of each partition table in the data table; when the creation duration of the data table is longer than a preset duration, the first device determines the differences between the creation duration and the latest access time of each partition table as a difference sequence; the first device performs density clustering on the difference sequence to obtain a plurality of core points, wherein each core point represents the access duration of partition tables of the same type in the data table; the first device calculates the maximum boundary point corresponding to the maximum core point among the plurality of core points, wherein the maximum boundary point represents the maximum access duration of all types of partition tables in the data table; and the first device determines the maximum boundary point as the data active duration.

In this way, the first device performs density clustering on the difference sequence formed by the differences between the creation duration and the latest access time of each partition table in the data table. The clustering eliminates noise points, which arise when a user sporadically and temporarily accesses certain data, and yields a plurality of core points that follow the way access frequency changes from the early stage through the middle stage to the late stage of the data; each core point represents the access duration of partition tables of the same type in the data table. The first device then selects the maximum core point from the plurality of core points. Because the maximum boundary point corresponding to that core point represents the maximum access duration of all types of partition tables in the data table, it can be taken as the data active duration of the data table. This protects the data, avoids performing data lifecycle management on the data table by mistake, and yields an accurate recommendation of the data active duration.
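The selection of the maximum boundary point can be illustrated with a minimal Python sketch. It assumes the clustering step has already grouped the per-partition durations (in days) into clusters with noise removed. The text does not say how a cluster's core point is represented numerically, so the sketch uses the cluster median as a stand-in; that choice, and the function name `data_active_duration`, are assumptions made for illustration only.

```python
import statistics
from typing import List, Optional


def data_active_duration(clusters: List[List[float]]) -> Optional[float]:
    """Return the largest member of the cluster with the largest core point.

    `clusters` holds per-partition durations in days, grouped by the
    density clustering step, with noise points already removed.
    """
    if not clusters:
        return None  # nothing clustered, so no recommendation is made
    # Assumption: a cluster's core point is approximated by its median.
    top_cluster = max(clusters, key=statistics.median)
    # The maximum boundary point is taken as the largest duration
    # that still belongs to that cluster.
    return max(top_cluster)


clusters = [[28, 29, 30, 31], [60, 61, 62]]
print(data_active_duration(clusters))
```

Taking the upper edge of the longest-lived cluster, rather than its center, errs on the side of keeping data longer, which matches the stated goal of avoiding mistaken deletion.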

In one possible design, the step of performing density clustering on the difference sequence by the first device to obtain a plurality of core points includes:

Step 1: the first device inputs the difference sequence L, a preset neighborhood radius Eps, and MinPts, the preset minimum number of points that the neighborhood of a given point must contain for that point to be a core point, into a preset algorithm module, which outputs a result sequence R and the reachability distance rd of each partition table's difference sample;

Step 2: based on the result sequence R, clustering results C corresponding to the plurality of core points are output.

In one possible design, the first device inputting the difference sequence L, the preset neighborhood radius Eps, and the minimum neighborhood point number MinPts into the preset algorithm module to output the result sequence R and the reachability distance rd of each partition table's difference sample includes:

Step 1.1: select a sample point that is a core point and is not yet in the result sequence R, and find all sample points that are directly density-reachable from it; put each such sample point that is not in the result sequence R into the ordered sequence Q, and sort Q by reachability distance rd from small to large, wherein each sample point is the active duration of one partition table;

Step 1.2: if the ordered sequence Q is empty, execute step 1.1; if the ordered sequence Q is not empty, take the first sample point m out of Q and store it into the result sequence R;

Step 1.3: iterate step 1.2 until the algorithm finishes, and output the result sequence R and the reachability distance rd of each partition table's difference sample.
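Steps 1.1 to 1.3 can be sketched as a small one-dimensional OPTICS ordering in Python. This is an illustrative sketch, not the preset algorithm module itself: the function name `optics_order` is hypothetical, a point counts as its own neighbor when MinPts is checked, and (as in textbook OPTICS, which the text simplifies) non-core points that are never reached are still appended to R with an infinite reachability distance.

```python
import math
from typing import List, Tuple


def optics_order(samples: List[float], eps: float,
                 min_pts: int) -> Tuple[List[int], List[float], List[float]]:
    """Return (R, rd, core_dist) for 1-D samples; R holds sample indices."""
    n = len(samples)

    def neighbors(i: int) -> List[int]:
        return [j for j in range(n) if abs(samples[i] - samples[j]) <= eps]

    def core_distance(i: int) -> float:
        dists = sorted(abs(samples[i] - samples[j]) for j in neighbors(i))
        # Distance to the MinPts-th nearest point (the point itself counts).
        return dists[min_pts - 1] if len(dists) >= min_pts else math.inf

    core = [core_distance(i) for i in range(n)]
    rd = [math.inf] * n
    in_result = [False] * n
    result: List[int] = []  # the result sequence R

    def update(i: int, queue: set) -> None:
        # Step 1.1: queue points directly density-reachable from core point i.
        for j in neighbors(i):
            if in_result[j]:
                continue
            new_rd = max(core[i], abs(samples[i] - samples[j]))
            if new_rd < rd[j]:
                rd[j] = new_rd
                queue.add(j)

    for start in range(n):
        if in_result[start]:
            continue
        in_result[start] = True
        result.append(start)
        if math.isinf(core[start]):
            continue  # not a core point, nothing to expand
        queue: set = set()  # the ordered sequence Q
        update(start, queue)
        while queue:
            # Step 1.2: take the point with the smallest rd out of Q into R.
            m = min(queue, key=lambda j: (rd[j], j))
            queue.remove(m)
            in_result[m] = True
            result.append(m)
            if not math.isinf(core[m]):
                update(m, queue)
    return result, rd, core


order, rd, core = optics_order([1, 2, 3, 30, 31, 32, 100], eps=2, min_pts=3)
print(order, rd, core)
```

Picking the minimum from a set on each iteration keeps the sketch short; a heap-based priority queue would be the usual choice for larger inputs.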

In one possible design, based on the result sequence R, a clustering result C corresponding to a plurality of core points is output, including:

Step 2.1: take sample points out of the result sequence R in order;

Step 2.2: if the core distance of the sample point is larger than the given neighborhood radius Eps, determine that the sample point is a noise point and ignore it; otherwise, determine that the sample point belongs to a new cluster; then jump to step 2.1;

Step 2.3: when the result sequence R has been fully traversed, output the clustering results C corresponding to the plurality of core points.
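Steps 2.1 to 2.3 can be sketched in Python as follows. The text only mentions the core-distance test; the sketch additionally uses the reachability distance to decide whether a point continues the current cluster or starts a new one, as the standard OPTICS cluster extraction does. That addition, the function name `extract_clusters`, and the sample input values are assumptions for illustration.

```python
import math
from typing import List, Tuple


def extract_clusters(order: List[int], rd: List[float], core: List[float],
                     eps: float) -> Tuple[List[List[int]], List[int]]:
    """Walk the result sequence R and return (clusters, noise) as index lists."""
    clusters: List[List[int]] = []
    noise: List[int] = []
    current = None
    for idx in order:                      # Step 2.1: take points in order
        if rd[idx] > eps:
            if core[idx] <= eps:           # Step 2.2: core point, new cluster
                current = [idx]
                clusters.append(current)
            else:                          # Step 2.2: noise point, ignored
                noise.append(idx)
        else:
            if current is None:            # defensive: no open cluster yet
                current = []
                clusters.append(current)
            current.append(idx)            # reachable: same cluster
    return clusters, noise                 # Step 2.3: clustering results C


# Hypothetical ordering for seven partition samples.
order = [0, 1, 2, 3, 4, 5, 6]
rd = [math.inf, 2, 1, math.inf, 2, 1, math.inf]
core = [2, 1, 2, 2, 1, 2, math.inf]
print(extract_clusters(order, rd, core, eps=2))
```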

In one possible design, the method further comprises: when the creation duration of the data table is less than or equal to the preset duration, the first device stops performing data lifecycle management on the data table. This avoids deleting data by mistake or retaining it for too long, and improves the effectiveness of data lifecycle management.

In one possible design, the first device determines the target processing policy based on a type of the data table, including:

The first device identifies basic data information of the data table, the basic data information including: whether the structures of the data table and the primary upstream data table are consistent, the access condition of the data table within a preset time length, the same-database output number and the same-database input number of the data table, the different-database output number and the different-database input number of the data table, and other output numbers and other input numbers of the data table; the first device determines the type of the data table based on the association relationship between the basic data information and the type of the data table; the first device determines a target processing policy based on an association between the type of the data table and the processing policy of the data table.

Therefore, the first device is preconfigured with the association relation between the basic data information and the type of the data table and the association relation between the type of the data table and the processing strategy of the data table, so that the first device can determine the target processing strategy based on the acquired basic data information of the data table. Thus, specific recommendations for data lifecycle management are given in efficient combination with the specific circumstances of the data table.
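The two preconfigured association relationships can be pictured as two lookups, as in the Python sketch below. The fields of `BasicInfo` follow the basic data information listed above, but the concrete rules in `TYPE_RULES`, the policy names in `POLICY_BY_TYPE` and the `"manual review"` fallback are placeholders invented for illustration; the actual associations are configured by the enterprise and are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class BasicInfo:
    same_structure_as_upstream: bool  # structure matches the primary upstream table
    accessed_recently: bool           # accessed within the preset duration
    same_db_outputs: int
    same_db_inputs: int
    cross_db_outputs: int
    cross_db_inputs: int
    other_outputs: int
    other_inputs: int


# Placeholder rules (hypothetical): basic data information -> table type.
TYPE_RULES: List[Tuple[Callable[[BasicInfo], bool], str]] = [
    (lambda b: b.same_structure_as_upstream and b.cross_db_inputs > 0,
     "direct patch source table"),
    (lambda b: not b.accessed_recently
     and b.same_db_outputs + b.cross_db_outputs + b.other_outputs == 0,
     "temporary table"),
]

# Placeholder mapping (hypothetical): table type -> processing policy.
POLICY_BY_TYPE: Dict[str, str] = {
    "direct patch source table": "archive",
    "temporary table": "clean",
}


def classify(info: BasicInfo) -> str:
    """Return the first table type whose rule matches, else 'other table'."""
    for predicate, table_type in TYPE_RULES:
        if predicate(info):
            return table_type
    return "other table"


def target_policy(info: BasicInfo) -> str:
    """Look up the processing policy for the classified table type."""
    return POLICY_BY_TYPE.get(classify(info), "manual review")


info = BasicInfo(True, True, 0, 0, 0, 1, 0, 0)
print(classify(info), target_policy(info))
```

Keeping both associations as data rather than code means they can be adjusted without touching the lookup logic.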

In one possible design, the types of data tables include: direct patch source table, secondary patch source table, intermediate table, results table, temporary table, and other tables.

In one possible design, the first device performs data lifecycle management for the data table according to the target processing policy, including:

The first device sends a first request to the second device, wherein the first request is used for requesting to trigger a target processing strategy; the first device receives a first response from the second device; the first device performs data lifecycle management on the data table according to the target processing policy when it is determined that the first response indicates that the developer approves the target processing policy.

Therefore, the first device further determines the data active time length of the data table according to the actual management requirements of enterprises or regulations on the data and in combination with the approval management flow of developers, ensures the safety of the data and ensures the integrity of the data life cycle management.

In one possible design, the method further comprises: when the first response indicates that the developer does not approve the target processing policy, the first device updates the data active duration to the data active duration carried in the first response, and then again performs the steps of determining the target processing policy based on the type of the data table when the creation duration of the data table is longer than the data active duration, and performing data lifecycle management on the data table according to the target processing policy when the first response indicates that the developer approves the target processing policy.

Therefore, the data active time length of the data table is effectively corrected, the safety of the data is ensured, and the integrity of data life cycle management is ensured.
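The request, response and correction loop described above might look like the following Python sketch. The `Response` shape, the callback signatures and the `max_rounds` cap are assumptions introduced for illustration; the text does not define a message format or limit the number of approval rounds.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    approved: bool
    corrected_active_days: Optional[float] = None  # set when not approved


def manage_with_approval(creation_days: float, active_days: float, policy: str,
                         request_approval: Callable[[str, float], Response],
                         execute: Callable[[str], None],
                         max_rounds: int = 3) -> str:
    """Ask for approval; on rejection adopt the corrected duration and re-check."""
    for _ in range(max_rounds):  # max_rounds is a hypothetical safety cap
        if creation_days <= active_days:
            return "skipped"     # table is still within its active duration
        response = request_approval(policy, active_days)
        if response.approved:
            execute(policy)      # carry out archiving or cleaning
            return "executed"
        if response.corrected_active_days is None:
            return "rejected"
        active_days = response.corrected_active_days
    return "rejected"


# Simulated second device: rejects once with a corrected duration, then approves.
answers = iter([Response(False, 100.0), Response(True)])
done = []
print(manage_with_approval(120, 90, "archive",
                           lambda policy, days: next(answers), done.append))
```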

In one possible design, the first device and the second device are the same device or different devices.

In a second aspect, an embodiment of the present application provides a data lifecycle management apparatus, which is applied to a first device.

The apparatus may include:

the computing module is used for computing the data active time length of the data table based on the access condition of the data table;

The determining module is used for determining a target processing strategy based on the type of the data table when the judging module determines that the creation time length of the data table is longer than the data active time length;

And the management module is used for carrying out data life cycle management on the data table according to the target processing strategy.

In one possible design, the computing module is specifically configured to obtain a creation time length of the data table, and a creation time length and a latest access time of each partition table in the data table; when the creation time length of the data table is longer than the preset time length, determining a difference value between the creation time length of each partition table and the latest access time as a difference value sequence; performing density clustering on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access time length of a partition table of the same type in the data table; calculating a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access time length of all types of partition tables in the data table; and determining the maximum boundary point as the data active duration.

In one possible design, the management module is further configured to stop performing data lifecycle management on the data table when the determination module determines that the creation duration of the data table is less than or equal to the preset duration.

In one possible design, the determining module is specifically configured to identify basic data information of the data table, where the basic data information includes: whether the structures of the data table and the primary upstream data table are consistent, the access condition of the data table within a preset time length, the same-database output number and the same-database input number of the data table, the different-database output number and the different-database input number of the data table, and other output numbers and other input numbers of the data table; determining the type of the data table based on the association relationship between the basic data information and the type of the data table; and determining a target processing strategy based on the association relation between the type of the data table and the processing strategy of the data table.

In one possible design, the management module is specifically configured to send a first request to the second device, where the first request is used to request triggering of the target processing policy; receiving a first response from a second device; and when the judging module determines that the first response indicates that the developer is feasible to approve the target processing strategy, carrying out data life cycle management on the data table according to the target processing strategy.

In one possible design, the apparatus further comprises an updating module, configured to update the data active duration to the data active duration carried in the first response when the judging module determines that the first response indicates that the developer does not approve the target processing policy; the determining module then again determines the target processing policy based on the type of the data table when the judging module determines that the creation duration of the data table is longer than the data active duration, and the management module performs data lifecycle management on the data table according to the target processing policy when the judging module determines that the first response indicates that the developer approves the target processing policy.

The advantages of the second aspect and the data lifecycle management apparatus provided in the possible designs of the second aspect may be referred to the advantages brought by the possible implementations of the first aspect and the first aspect, and are not described herein.

In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke program instructions in the memory to cause the electronic device to perform the data lifecycle management method of the first aspect and any of the possible designs of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the data lifecycle management method of the first aspect and any one of the possible designs of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product which, when run on a computer, causes the computer to perform the data lifecycle management method of the first aspect and any one of the possible designs of the first aspect.

In a sixth aspect, an embodiment of the present application provides a chip system, including: a processor; when the processor executes computer instructions stored in the memory, the electronic device performs the data lifecycle management method of the first aspect and any of the possible designs of the first aspect.

Drawings

FIG. 1 is a flow chart of a conventional data lifecycle management method;

FIG. 2 is a flow chart of a method for managing a data lifecycle according to an embodiment of the present application;

FIG. 3A is a flowchart illustrating a method for managing a data lifecycle according to an embodiment of the present application;

FIG. 3B is a flowchart illustrating a method for obtaining an active duration of data according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating a method for managing a data lifecycle according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating a method for managing a data lifecycle according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a data lifecycle management apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data lifecycle management apparatus according to an embodiment of the present application.

Detailed Description

First, some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

1. Data lifecycle management: a data management mode comprising full lifecycle management of the generation, use, migration, cleaning, destruction of data. The data scale of the production system can be effectively controlled through the data life cycle management, and the data access efficiency is improved, so that the overall efficiency of system operation is improved, and enterprises are helped to obtain maximum value at the lowest cost in each stage of data life.

2. Data active duration: the data access heat generally follows a rule that the access frequency gradually decreases with time. The access heat gradually decreases with the lapse of time until stabilized at a certain fixed period of time, which is the data active period.

An existing data lifecycle management method is described first. Referring to FIG. 1, FIG. 1 is a flow chart of a conventional data lifecycle management method. As shown in FIG. 1, in the existing method a developer sets the processing policy for data lifecycle management based on data requirements and an understanding of the data. The processing policy generally includes a data processing mode (e.g., permanent storage, or archiving/cleaning after a period of storage) and a data retention period (e.g., 30/60/90 days).

The processor may determine if the actual data retention period exceeds the data retention period set in the processing policy during the online phase of the data. If the actual data retention period exceeds the data retention period set in the processing strategy, the processor triggers the processing flow of data archiving or data cleaning. After the process is approved, the operation and maintenance personnel execute data archiving and data cleaning, namely enter the archiving stage of data or the destroying stage of data.

The archiving phase of data may include, among other things, online archiving and offline archiving. In online archiving of data, the data is stored in an archive cluster of high-end storage devices and supports online querying by users. In offline archiving of data, the data is stored in a low-end storage device and does not support online querying by the user.

However, in the existing data lifecycle management method exemplarily shown in fig. 1, one of the most critical places is to determine when data should be archived or cleaned, and the data retention period often depends on the data awareness of the developer, which has the following problems:

1. Low accuracy: the retention period is planned manually and is subject to estimation error, so a large gap easily arises between the planned value and the actual value.

2. Data security risk or increased equipment cost: data is an enterprise asset, so cleaning and archiving must be carried out carefully. Mistaken deletion causes production problems and puts the data at risk, while retaining data past its useful life increases equipment cost.

3. Low processing efficiency: the existing scheme relies on developers to set a data retention period before the data goes online, and the corresponding processing policy has to be reviewed repeatedly, which adds complexity and management cost to data lifecycle management.

In order to solve the above problems, embodiments of the present application provide a method, an apparatus, a device, and a computer storage medium for managing a data lifecycle, where an execution body of the method for managing a data lifecycle is a first device (e.g., a server), and the method is applied to the field of finance and technology, etc., and the first device can combine an access condition of a data table to accurately calculate a data active duration of the data table. And when the creation time of the data table is longer than the data active time, the first device may determine that the data table needs to be subjected to data life cycle management at the moment. And the first device can determine the target processing strategy based on the type of the data table, and accurately evaluate whether the data table needs to be filed or cleaned. Therefore, the first device can intelligently conduct automatic data life cycle management on the data table according to the target processing strategy, manual intervention operation is reduced, processing efficiency of data life cycle management is improved, safety of data archiving or data cleaning is ensured, and cost pressure of enterprise storage is reduced.

Referring to fig. 2, fig. 2 is a flowchart illustrating a data lifecycle management method according to an embodiment of the application.

As shown in fig. 2, with the first device as an execution body, the data lifecycle management method according to an embodiment of the present application may include:

S101, the first device calculates the data active duration of a data table based on the access condition of the data table. The access condition of a data table reflects how users actually access it, and the data active duration has to account both for how the access frequency changes from the early stage through the middle stage to the late stage of the data and for the fact that users sporadically and temporarily access certain data. The first device can therefore calculate the data active duration of a data table based on its access condition.

Wherein, the access condition of one data table can include but is not limited to: description information of the data table, creation time, access frequency, and the like of each partition table in the data table.

In addition, as will be appreciated by those skilled in the art, the density clustering algorithm assumes that the clustering structure can be determined by how densely the sample is distributed, and clusters based on how densely the data set is spatially distributed. It can be seen that the data active time length of the data table is matched with the density clustering algorithm. Thus, the first device may use a density clustering algorithm to calculate the data active duration of a data table based on the access condition of the data table.

The embodiment of the application does not limit the specific implementation of the density clustering algorithm. In some embodiments, the density clustering algorithm is the OPTICS (Ordering Points To Identify the Clustering Structure) algorithm or the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.
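To make the density clustering idea concrete, the following is a minimal pure-Python DBSCAN for one-dimensional durations. It is a sketch under stated assumptions rather than the embodiment's implementation: the function name `dbscan_1d`, the label convention (`-1` for noise) and the sample values are illustrative, and a point counts as its own neighbor when MinPts is checked.

```python
from typing import List, Optional


def dbscan_1d(values: List[float], eps: float, min_pts: int) -> List[int]:
    """Label each value with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(values)
    labels: List[Optional[int]] = [None] * n  # None means not yet visited

    def region(i: int) -> List[int]:
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nearby = region(i)
        if len(nearby) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster_id += 1               # i is a core point: start a new cluster
        labels[i] = cluster_id
        pending = [j for j in nearby if j != i]
        while pending:
            j = pending.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nearby_j = region(j)
            if len(nearby_j) >= min_pts:
                pending.extend(nearby_j)  # j is also a core point: keep expanding
    return [label if label is not None else -1 for label in labels]


# Hypothetical per-partition durations in days; 400 is a sporadic late access.
durations = [28, 30, 31, 29, 60, 61, 62, 400]
print(dbscan_1d(durations, eps=3, min_pts=3))
```

In this example the isolated value 400 is labeled as noise, which is how a sporadic late access is kept from inflating the recommended data active duration.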

S102, when the creation time of the data table is longer than the data active time, the first device determines a target processing strategy based on the type of the data table.

S103, the first device manages the data life cycle of the data table according to the target processing strategy.

Based on the data active duration determined in step S101, the first device may determine whether the creation duration of the data table is greater than the data active duration. When the creation time of the data table is longer than the data active time, the first device may determine that data lifecycle management of the data table is required at this time. Data lifecycle management may include, among other things, data archiving or data cleansing.

Because the first device stores in advance the correspondence between the type of a data table and the processing policy for data lifecycle management, the first device can determine the target processing policy based on the type of the data table, that is, whether to archive the data of the data table or to clean the data table. The first device then performs data lifecycle management on the data table according to the target processing policy.
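Steps S102 and S103 can be summarized in a short Python sketch. The function name `manage_table`, the policy names and the handler callbacks are hypothetical; the sketch only shows the comparison of creation duration with data active duration followed by a policy lookup and dispatch.

```python
from typing import Callable, Dict


def manage_table(name: str, creation_days: float, active_days: float,
                 table_type: str, policy_by_type: Dict[str, str],
                 handlers: Dict[str, Callable[[str], None]]) -> str:
    """Apply the stored policy once the table has outlived its active duration."""
    if creation_days <= active_days:
        return "active"                    # S102 condition not met: do nothing
    policy = policy_by_type[table_type]    # S102: stored type-to-policy mapping
    handlers[policy](name)                 # S103: archive or clean the table
    return policy


archived, cleaned = [], []
policy_by_type = {"results table": "archive", "temporary table": "clean"}
handlers = {"archive": archived.append, "clean": cleaned.append}

print(manage_table("t_tmp", 120, 45, "temporary table", policy_by_type, handlers))
print(manage_table("t_new", 10, 45, "results table", policy_by_type, handlers))
```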

The specific implementation manner of the type of the data table is not limited in the embodiment of the application. In some embodiments, the types of data tables include: direct patch source table, secondary patch source table, intermediate table, results table, temporary table, and other tables.

According to the data lifecycle management method provided by the embodiment of the application, the first device uses the access condition of the data table to calculate the data active duration of the data table, which yields an accurate recommendation of the data active duration. When the creation duration of the data table is longer than the data active duration of the data table, the first device may determine that the data table needs data lifecycle management. The first device may determine a target processing policy based on the type of the data table, accurately evaluating whether the data table requires data archiving or data cleaning. The first device can therefore perform data lifecycle management on the data table automatically according to the target processing policy, which reduces manual intervention, improves the processing efficiency of data lifecycle management, ensures the safety of data archiving or data cleaning, and reduces the cost pressure of enterprise storage.

Next, in conjunction with fig. 3A, a possible implementation manner of calculating the data active duration of the data table according to the access condition of the data table by the first device in step S101 is described.

Referring to fig. 3A, fig. 3A is a flowchart illustrating a data lifecycle management method according to an embodiment of the application.

As shown in fig. 3A, the data lifecycle management method according to an embodiment of the present application may include:

S201, the first device acquires the creation time of the data table, the creation time of each partition table in the data table and the latest access time.

One data table is typically in the form of a partition table, so one data table may include multiple partition tables. For example, the data table is a Hive table in a Hadoop cluster, and the partition table in the data table is a partition table classified according to a date directory in the Hive table.

In addition, in the initial stage of the establishment of the data table, metadata information such as description information, creation time, access frequency and the like of the data table and metadata information such as description information, creation time, access frequency, storage address and the like of each partition table are recorded.

Thus, the first device obtains the creation time of the data table, the creation time of each partition table in the data table and the latest access time from the module for storing the information. The creation time length is the difference between the creation time and the current time.

S202, the first device judges whether the creation duration of the data table is longer than a preset duration.

Based on the creation duration of the data table acquired in step S201, the first device may determine whether the creation duration of the data table is greater than a preset duration, to determine whether the data in the data table is in an active period.

The embodiment of the application does not limit the specific value of the preset duration. Typically, the preset duration may be set according to the circumstances of the enterprise, for example 90 days.

If yes, the first device may determine that the data in the data table is not in the active period, and execute step S203 to step S206; if not, the first device may determine that the data in the data table is in the active period, and perform step S207.
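The branch taken after S202 can be sketched as follows. This is only an illustration: the function name `steps_after_s202` is hypothetical, durations are assumed to be in days, and 90 days is the example value mentioned above rather than a required setting.

```python
PRESET_DURATION_DAYS = 90.0  # example value mentioned in the text


def steps_after_s202(table_creation_duration_days: float,
                     preset_days: float = PRESET_DURATION_DAYS) -> list:
    """Return the labels of the steps to execute after the S202 check."""
    if table_creation_duration_days > preset_days:
        # Data is no longer in the active period: compute the active duration.
        return ["S203", "S204", "S205", "S206"]
    # Data is still in the active period: stop lifecycle management.
    return ["S207"]
```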

S203, the first device determines the differences between the latest access time and the creation duration of each partition table as a difference sequence.

When the creation duration of the data table is longer than the preset duration, the first device may, based on the creation duration and latest access time of each partition table obtained in step S201, calculate the difference between the latest access time and the creation duration of each partition table, and take the differences of all partition tables as the difference sequence.

For example, for the i-th partition table in the data table, the creation duration is T_ci and the latest access time is T_ei, where i is a positive integer with 1 ≤ i ≤ n and n is the total number of partition tables in the data table.

Then the difference sequence is L = {(T_e1 - T_c1), (T_e2 - T_c2), ..., (T_en - T_cn)}.
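Building the difference sequence can be sketched as follows. This sketch reads each difference as the time elapsed from a partition table's creation to its latest access, assumes each partition table is described by a pair of timestamps, and expresses the result in days; the tuple layout and the function name `difference_sequence` are hypothetical.

```python
from datetime import datetime


def difference_sequence(partitions):
    """Build L from (created_at, last_accessed_at) pairs, one per partition table.

    Each element is the time from creation to latest access, in days.
    """
    return [
        (last_accessed_at - created_at).total_seconds() / 86400.0
        for created_at, last_accessed_at in partitions
    ]


partitions = [
    (datetime(2020, 1, 1), datetime(2020, 1, 4)),   # accessed over 3 days
    (datetime(2020, 1, 2), datetime(2020, 1, 12)),  # accessed over 10 days
]
L = difference_sequence(partitions)
```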

S204, the first device performs density clustering processing on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access time length of the partition table of the same type in the data table.

The difference sequence reflects not only how frequently data is accessed from the early, through the middle, to the late stage after creation, but also cases in which a user accesses certain data sporadically and temporarily. Therefore, the first device can perform density clustering on the difference sequence with a density clustering algorithm and remove the noise points that may exist, obtaining a plurality of core points. A core point may be used to represent the access duration of the partition tables of one type in the data table. A noise point may be used to represent a moment at which a user sporadically accesses certain data.

S205, the first device calculates a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access time length of all types of partition tables in the data table.

S206, the first device determines that the maximum boundary point is the data active duration.

A core point may represent the access duration of the partition tables of one type in the data table. Therefore, by taking the maximum core point among the plurality of core points, the first device obtains the maximum access duration over all types of partition tables in the data table. Although the maximum core point may represent the maximum access duration of the data table, for the sake of data management security the first device may determine the maximum boundary point corresponding to the maximum core point as the data active duration of the data table, so as to ensure the security of the data and avoid erroneous data life cycle management operations on the data table.

Based on step S204-step S206, taking the OPTICS density clustering algorithm as an example in conjunction with fig. 3B, the first device may execute the following steps to obtain the data active duration of the data table:

Step 1: the first device inputs the difference sequence L, a preset neighborhood radius Eps, and a preset minimum number of neighborhood points MinPts (the minimum number of points required in the neighborhood for a given point to be a core object) into an algorithm module of the first device. The algorithm module then outputs the result sequence R and the reachable distance rd of each partition-table difference sample according to the following flow.

The core object, i.e. the core point, can be understood as follows: draw a circle centered on a sample point with the neighborhood radius Eps as its radius. If the number of other sample points falling within the circle is greater than or equal to the minimum number of neighborhood points MinPts, the sample point is a core object, and the sample points falling within the circle are boundary points of that sample point. Otherwise, the sample point is not a core object.
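For one-dimensional duration samples, the "circle" reduces to an interval of half-width Eps around the sample. The core-object test can be sketched as follows; the function names and the sample values are illustrative assumptions.

```python
def neighborhood(samples, i, eps):
    """Indices of the other samples within distance eps of samples[i]."""
    return [j for j, v in enumerate(samples)
            if j != i and abs(v - samples[i]) <= eps]


def is_core_object(samples, i, eps, min_pts):
    """True if at least min_pts other samples fall within the Eps-neighborhood."""
    return len(neighborhood(samples, i, eps)) >= min_pts


samples = [1.0, 2.0, 3.0, 50.0]  # illustrative active durations in days
```

With `eps=2.0` and `min_pts=2`, the first three samples are core objects, while the isolated sample 50.0 is not.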

Step 1.1: the algorithm module selects a sample point that is not in the result sequence R and is a core object, and finds all sample points that are directly density-reachable from it (i.e., the set N of neighborhood object points of the core object c in fig. 3B). If such a sample point does not exist in the result sequence R, the algorithm module places it in the ordered sequence Q, which is sorted by reachable distance rd in ascending order.

The difference sequence L consists of the difference samples of the partition tables; one sample point is the active duration of one partition table, namely the difference between the latest access time of the partition table and the creation duration of the partition table.

Step 1.2: if the ordered sequence Q is empty, the algorithm module performs step 1.1. If the ordered sequence Q is not empty, the algorithm module takes the first sample point m from the ordered sequence Q and saves the taken sample point m to the result sequence R.

Step 1.2.1: the algorithm module determines whether the sample point m is a core object. If not, the algorithm module performs step 1.2. If so, the algorithm module finds the set N of all sample points that are directly density-reachable from the sample point m (i.e., the set N of neighborhood object points of the core object m in fig. 3B).

Step 1.2.2: the algorithm module determines whether a sample point in the set N already exists in the result sequence R. If so, the algorithm module does not process. If not, the algorithm module performs step 1.2.3 or step 1.2.4.

Step 1.2.3: if the directly density-reachable sample point already exists in the ordered sequence Q and its new reachable distance is smaller than its old reachable distance, the algorithm module replaces the old reachable distance with the new one and reorders the ordered sequence Q.

Step 1.2.4: if the directly density-reachable sample point does not exist in the ordered sequence Q, the algorithm module inserts the point and reorders the ordered sequence Q.

Step 1.3: the algorithm module iterates step 1.2 until the algorithm ends, and outputs the result sequence R and the reachable distance rd of each partition-table difference sample.

Here, the core distance cd is the smallest radius that makes a sample point a core object, i.e., the distance from the sample point to its MinPts-th nearest point. The reachable distance rd is defined as follows: for the neighbor points t_1, t_2, t_3, ..., t_n of a sample point t, if the distance from a neighbor point to t is greater than the core distance of t, the reachable distance of that neighbor point is its actual distance to t; if the distance is less than the core distance, its reachable distance is the core distance of t.
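Step 1 can be sketched for one-dimensional samples as follows. This is a minimal illustration of the general OPTICS ordering idea under stated assumptions, not a reproduction of fig. 3B: the ordered sequence Q is held in a binary heap with stale entries skipped on removal instead of being re-sorted, points that are not core objects are appended to R with an undefined (infinite) reachable distance, and the function name `optics_order` is hypothetical.

```python
import heapq
import math


def optics_order(samples, eps, min_pts):
    """Return (R, rd): the visiting order of sample indices and reachable distances."""
    n = len(samples)

    def dist(a, b):
        return abs(samples[a] - samples[b])

    def neighbors(i):
        return [j for j in range(n) if j != i and dist(i, j) <= eps]

    def core_distance(i):
        # Distance to the MinPts-th nearest neighbor inside the Eps-neighborhood.
        d = sorted(dist(i, j) for j in neighbors(i))
        return d[min_pts - 1] if len(d) >= min_pts else math.inf

    rd = [math.inf] * n
    processed = [False] * n
    order = []  # result sequence R

    def update(i, seeds):
        # Steps 1.2.2-1.2.4: insert neighbors into Q or lower their rd.
        cd = core_distance(i)
        for j in neighbors(i):
            if processed[j]:
                continue
            new_rd = max(cd, dist(i, j))
            if new_rd < rd[j]:
                rd[j] = new_rd
                heapq.heappush(seeds, (new_rd, j))

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append(start)
        if core_distance(start) == math.inf:
            continue  # not a core object: nothing to expand
        seeds = []  # ordered sequence Q
        update(start, seeds)
        while seeds:
            _, m = heapq.heappop(seeds)
            if processed[m]:
                continue  # stale heap entry
            processed[m] = True
            order.append(m)
            if core_distance(m) != math.inf:
                update(m, seeds)
    return order, rd


order, rd = optics_order([1.0, 2.0, 3.0, 50.0], eps=2.0, min_pts=2)
```

The heap is a design choice for brevity; re-sorting Q after every update, as the steps above describe, yields the same visiting order for distinct reachable distances.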

Step 2: the algorithm module outputs a final clustering result C based on the result sequence R obtained in step 1.

Step 2.1: the algorithm module sequentially fetches sample points from the result sequence R.

If the reachable distance of the sample point is not greater than the given neighborhood radius Eps, the algorithm module determines that the sample point belongs to the current category. Otherwise, the algorithm module performs step 2.2.

Step 2.2: if the core distance of the sample point is greater than a given neighborhood radius Eps, the algorithm module determines the sample point as a noise point and may ignore the sample point. Otherwise, the algorithm module determines that the sample point belongs to a new cluster, and jumps to step 2.1.

Step 2.3: when the algorithm module has traversed the whole result sequence R, the algorithm ends.
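Step 2 can be sketched as follows. The sketch assumes that step 1 has already produced the result sequence R (`order`), the reachable distances `rd`, and the core distances `cd`, all indexed by sample; the function name `extract_clusters` and the sample values are illustrative assumptions.

```python
import math


def extract_clusters(order, rd, cd, eps):
    """Split the result sequence R into clusters and noise points.

    order: sample indices in the order produced by step 1.
    rd, cd: reachable distance and core distance of each sample.
    """
    clusters, noise = [], []
    current = None
    for p in order:
        if rd[p] <= eps and current is not None:
            current.append(p)          # step 2.1: belongs to the current category
        elif cd[p] <= eps:
            current = [p]              # step 2.2: starts a new cluster
            clusters.append(current)
        else:
            noise.append(p)            # step 2.2: noise point, ignored later
    return clusters, noise


inf = math.inf
clusters, noise = extract_clusters(
    order=[0, 1, 2, 3],
    rd=[inf, 2.0, 1.0, inf],
    cd=[2.0, 1.0, 2.0, inf],
    eps=2.0,
)
```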

Therefore, the algorithm module obtains clustering results C_i, where i is a positive integer with 1 ≤ i ≤ n and n is the total number of types of partition tables in the data table. Each clustering result corresponds to a plurality of access durations (including core points and boundary points) of the partition tables of one type.

It can be understood that, through the above steps, the multiple access durations of the partition tables in the data table can be accurately analyzed from the creation duration and latest access time of each partition table, and the content of the data table that is finally to be deleted can then be determined from these access durations, so that the data table can subsequently be managed accurately and effectively.

Step 3: from the clustering results C_1, C_2, C_3, ..., C_n, the algorithm module takes the class C_x with the maximum value. To ensure the safety of data archiving or data cleaning, the algorithm module takes the maximum sample value in C_x (namely the maximum boundary point) as the recommended value of the data active duration.
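Step 3 can be sketched as follows. The sketch assumes that each cluster is given as a list of access durations in days with noise points already removed, and it reads "the class with the maximum value" as the cluster containing the largest durations; the function name `recommend_active_duration` is hypothetical.

```python
def recommend_active_duration(clusters):
    """clusters: list of clusters, each a list of access durations in days.

    Returns the largest sample of the cluster with the largest values,
    or None when no cluster was found.
    """
    if not clusters:
        return None
    c_x = max(clusters, key=max)  # cluster C_x with the largest values
    return max(c_x)               # maximum boundary point of C_x


# A sporadic access such as 400.0 days is assumed to have been removed as noise.
clusters = [[1.0, 2.0, 3.0], [30.0, 31.0, 35.0]]
recommended = recommend_active_duration(clusters)
```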

It should be noted that the first device may also use the DBSCAN density clustering algorithm in place of the OPTICS density clustering algorithm, provided that the neighborhood radius Eps and the minimum number of neighborhood points MinPts input to the algorithm module of the first device are appropriate.
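For comparison, a DBSCAN pass over one-dimensional duration samples can be sketched as follows. This is a minimal textbook-style illustration written in plain Python; the function name `dbscan_1d` and the sample values are assumptions, and the neighbor count excludes the point itself to match the core-object definition used above.

```python
def dbscan_1d(samples, eps, min_pts):
    """Label each sample with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(samples)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n)
                if j != i and abs(samples[i] - samples[j]) <= eps]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:
                queue.extend(jn)  # j is a core object: keep expanding
    return labels


labels = dbscan_1d([1.0, 2.0, 3.0, 50.0, 30.0, 31.0, 32.0], eps=2.0, min_pts=2)
```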

In addition, the first device may also calculate the data active duration of the data table with a clustering algorithm based on other ideas. For example, the first device may use the k-means clustering algorithm or a variant of it to calculate k centroid points, denoise the k centroid points, and then take the maximum boundary value among the k centroid points as the recommended value of the data active duration, so as to obtain an accurate data active duration and ensure the security of the data.

S207, the first device stops managing the data life cycle of the data table.

When the creation duration of the data table is less than or equal to the preset duration, the first device can determine that the data in the data table is in an active period. Thus, the first device may cease data lifecycle management for the data table, i.e., without considering the operations of data archiving or data cleaning.

In the embodiment of the application, the first device can perform density clustering on the difference sequence formed by the differences between the latest access time and the creation duration of each partition table in the data table, and eliminate the noise points that correspond to users accessing certain data sporadically and temporarily. This yields a plurality of core points that follow the regular pattern of data access frequency from the early, through the middle, to the late stage after creation, where each core point represents the access duration of the partition tables of one type in the data table. The first device selects the maximum core point from the plurality of core points and calculates the maximum boundary point corresponding to it, namely the maximum access duration of all types of partition tables in the data table. The first device then determines this maximum boundary point as the data active duration of the data table, which ensures the safety of the data, avoids erroneous data life cycle management operations on the data table, and yields an accurate recommendation of the data active duration.

Next, in connection with fig. 4, a possible implementation of determining the target processing policy by the first device based on the type of the data table in step S102 will be described.

Referring to fig. 4, fig. 4 is a flowchart illustrating a data lifecycle management method according to an embodiment of the application.

As shown in fig. 4, the data lifecycle management method according to an embodiment of the present application may include:

S301, the first device identifies basic data information of the data table.

The basic data information includes: whether the structure of the data table is consistent with that of its primary upstream data table; the access condition of the data table within a preset duration; the same-library output number and same-library input number of the data table; the different-library output number and different-library input number of the data table; and the other output numbers and other input numbers of the data table.

The first device may identify the basic data information of the data table based on the lineage data of the data table, such as the cluster, library name, and table name.

For example, the first device may compute triplets with Spark GraphX. Specifically, the first device may analyze the dependencies between tables and the cross-library dependencies according to the start node, edge, and end node of each triplet, to obtain the basic data information of the data table. Here, a node represents an upstream or downstream table, and an edge represents a hive/sqoop/mask task. The output number of a table is the number of tables downstream of it, and the input number is the number of tables upstream of it.
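Counting input and output numbers from triplets can be sketched as follows. The sketch uses plain Python rather than Spark GraphX, assumes tables are named in a hypothetical "library.table" form so that same-library and different-library edges can be told apart, and does not model the "other" input and output numbers; the function name `lineage_counts` is an assumption.

```python
from collections import defaultdict

KEYS = ("same_out", "same_in", "cross_out", "cross_in")


def lineage_counts(triplets):
    """Count distinct upstream/downstream tables per table, split by library.

    triplets: iterable of (start_table, task, end_table).
    """
    seen = defaultdict(lambda: {k: set() for k in KEYS})
    for start, _task, end in triplets:
        same = start.split(".")[0] == end.split(".")[0]
        seen[start]["same_out" if same else "cross_out"].add(end)
        seen[end]["same_in" if same else "cross_in"].add(start)
    return {t: {k: len(v) for k, v in s.items()} for t, s in seen.items()}


triplets = [
    ("ods.a", "sqoop", "dw.b"),
    ("dw.b", "hive", "dw.c"),
    ("dw.b", "hive", "dw.c"),  # a second task on the same edge is counted once
]
counts = lineage_counts(triplets)
```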

S302, the first device determines the type of the data table based on the association relation between the basic data information and the type of the data table.

Because the first device has preconfigured the association relationship between the basic data information and the type of the data table, the first device can determine the type of the data table based on the basic data information of the data table.

The embodiment of the application does not limit how the first device configures the association between the basic data information and the type of the data table. In some embodiments, the first device may construct this association based on whether the structure of the data table is consistent with that of its primary upstream table, whether the data table has been accessed in the last 30, 90, or 180 days, the same-library input and output numbers, the different-library input and output numbers, other input and output numbers, and so on, as shown in Table 1.

TABLE 1

Here, a direct source-pasting table is a target data table (Hive table) synchronized directly from an online-layer source database (e.g., MySQL). A secondary source-pasting table is a table derived from a direct source-pasting table. An intermediate table is a data table in the database dedicated to storing intermediate calculation results. A result table is a table that supports business applications.

S303, the first device determines a target processing strategy based on the association relationship between the type of the data table and the processing strategy of the data table.

Because the first device has preconfigured an association relationship between the type of the data table and the processing policy of the data table, the first device may determine the target processing policy based on the type of the data table.

The embodiment of the application does not limit how the first device configures the association between the type of the data table and the processing policy of the data table. In some embodiments, the processing policies corresponding to the direct source-pasting table, the secondary source-pasting table, the intermediate table, the result table, the temporary table, and other tables are shown in Table 2.

TABLE 2

Here, "permanent" means that the table was created no more than half a year ago, or has access records within the last half year, and the whole table is kept online permanently. "Day 0" means that the table has had no access record for more than half a year, and the whole table is archived.
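The two labels defined above can be expressed as a small rule. The following is a minimal sketch; the function name `retention_label`, the use of days as the unit, and the approximation of "half a year" as 183 days are assumptions made for illustration.

```python
HALF_YEAR_DAYS = 183  # assumption: "half a year" approximated as 183 days


def retention_label(table_age_days, days_since_last_access):
    """Return "permanent" or "day 0" following the two definitions above.

    days_since_last_access is None when the table has no access record.
    """
    recently_created = table_age_days <= HALF_YEAR_DAYS
    recently_accessed = (days_since_last_access is not None
                         and days_since_last_access <= HALF_YEAR_DAYS)
    if recently_created or recently_accessed:
        return "permanent"  # keep the whole table online
    return "day 0"          # archive the whole table
```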

In the embodiment of the application, the first device is preconfigured with the association relation between the basic data information and the type of the data table and the association relation between the type of the data table and the processing strategy of the data table, so that the first device can determine the target processing strategy based on the acquired basic data information of the data table. Thus, specific recommendations for data lifecycle management are given in efficient combination with the specific circumstances of the data table.

Based on step S101, the first device may calculate when data lifecycle management for the data table should be initiated. Based on step S102, the first device may determine whether the target processing policy for the data table is data archiving or data cleaning. To ensure that data archiving or data cleaning is reasonable and compliant, data life cycle management also requires strict process control by developers.

Next, a specific implementation of the data lifecycle management method in which the developer participates is described with reference to fig. 5.

Referring to fig. 5, fig. 5 is a flowchart illustrating a data lifecycle management method according to an embodiment of the application.

As shown in fig. 5, the data lifecycle management method according to an embodiment of the present application may include:

S401, the first device calculates the data active duration of a data table based on the access condition of the data table.

S402, when the creation duration of the data table is longer than the data active duration, the first device determines a target processing policy based on the type of the data table.

S401 and S402 are implemented similarly to S101 and S102 in the embodiment of fig. 2, respectively, and are not described again here.

S403, the first device sends a first request to the second device, wherein the first request is used for requesting to trigger the target processing strategy.

The first device may send a first request to the second device requesting triggering of the target processing policy, so that a developer may learn in time that data archiving or data cleaning is currently required to be performed on the data table. The specific implementation manners of the first request and the second device are not limited in the present application.

In some embodiments, the first device and the second device are the same device, e.g., the first device and the second device are both servers, and the developer may learn, through the servers, the first request in the form of a prompt message, a web interface, or the like.

In other embodiments, the first device and the second device are different devices, for example, the first device is a server, the second device is a terminal device, the server sends a first request to the terminal device, and a developer can obtain the first request in the form of a short message, a reminding interface of an application program, a web page interface, and the like through the terminal device.

S404, the first device receives the first response from the second device.

S405, the first device judges whether the first response indicates that the developer has approved the target processing policy as feasible.

The developer may approve or reject the feasibility of the target processing policy through the second device. The second device carries the approval result in the first response and sends it to the first device. Thus, the first device may determine whether the first response indicates that the developer has approved the target processing policy as feasible.

If yes, the first device may execute step S406; if not, the first device may execute step S407.

S406, the first device manages the data life cycle of the data table according to the target processing strategy.

Upon determining that the first response indicates that the developer has approved the target processing policy as feasible, the first device may perform data lifecycle management on the data table in accordance with the target processing policy.

S407, the first device updates the data active duration to the data active duration carried in the first response, and executes steps S402-S405.

When it is determined that the first response indicates that the developer has not approved the target processing policy as feasible, the first response may carry an updated data active duration; for example, the updated data active duration may be a duration input by the developer or a default duration. Thus, the first device may update the data active duration of the data table to the updated data active duration. The first device then performs steps S402-S405 again until it determines that the developer has approved the target processing policy, after which the first device may perform data lifecycle management on the data table according to the target processing policy.
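The approval loop of steps S402-S407 can be sketched as follows. This is an illustration under stated assumptions: the three callbacks stand in for policy determination, the request/response exchange with the second device, and policy execution; the function name `manage_with_approval` and the `max_rounds` safety limit are hypothetical and not part of the embodiment.

```python
def manage_with_approval(creation_duration, active_duration,
                         determine_policy, request_approval, apply_policy,
                         max_rounds=10):
    """Repeat S402-S407 until the developer approves or nothing is left to do.

    request_approval(policy, active_duration) must return a pair
    (approved, revised_active_duration).
    """
    for _ in range(max_rounds):
        # S402: only tables older than their active duration are managed.
        if creation_duration <= active_duration:
            return None
        policy = determine_policy()
        # S403-S405: send the first request and inspect the first response.
        approved, revised = request_approval(policy, active_duration)
        if approved:
            apply_policy(policy)  # S406
            return policy
        active_duration = revised  # S407
    return None


applied = []
responses = iter([(False, 150.0), (True, None)])  # reject once, then approve
result = manage_with_approval(
    creation_duration=200.0,
    active_duration=100.0,
    determine_policy=lambda: "archive",
    request_approval=lambda policy, duration: next(responses),
    apply_policy=applied.append,
)
```

If the revised active duration is at least the creation duration, the table is treated as still active and no policy is applied.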

In the embodiment of the application, the first device further confirms or corrects the data active duration of the data table through the developer approval flow, in line with the actual data management requirements of the enterprise or of regulations, thereby ensuring the safety of the data and the integrity of data life cycle management.

The embodiment of the application also provides a data life cycle management device.

Referring to fig. 6-7, fig. 6 is a schematic structural diagram of a data lifecycle management apparatus according to an embodiment of the present application.

The data life cycle management device of the embodiment of the application can be arranged in a server, and can implement the operations of the first device in the data life cycle management method of the embodiments of the application. As shown in fig. 6, the apparatus may include:

a calculation module 101, configured to calculate a data active duration of a data table based on an access condition of the data table;

a determining module 102, configured to determine a target processing policy based on the type of the data table when the judging module 103 determines that the creation duration of the data table is longer than the data active duration;

And the management module 104 is used for carrying out data life cycle management on the data table according to the target processing strategy.

In some embodiments, the calculation module 101 is specifically configured to: obtain the creation duration of the data table, and the creation duration and latest access time of each partition table in the data table; when the creation duration of the data table is longer than the preset duration, determine the differences between the latest access time and the creation duration of each partition table as a difference sequence; perform density clustering on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access duration of the partition tables of one type in the data table; calculate the maximum boundary point corresponding to the maximum core point among the plurality of core points, wherein the maximum boundary point is used for representing the maximum access duration of all types of partition tables in the data table; and determine the maximum boundary point as the data active duration.

In some embodiments, the calculation module 101 is specifically configured to perform: step 1, inputting the difference sequence L, the preset neighborhood radius Eps, and the preset minimum number of neighborhood points MinPts required for a given point to be a core point into a preset algorithm module of the first device, so as to output a result sequence R and a reachable distance rd of each partition-table difference sample; and step 2, outputting the clustering results C corresponding to the plurality of core points based on the result sequence R.

In some embodiments, the calculation module 101 is specifically configured to perform: step 1.1: selecting a sample point that is not in the result sequence R and is a core point, finding all sample points that are directly density-reachable from it, and, if such a sample point is not in the result sequence R, putting it into the ordered sequence Q, which is sorted by reachable distance rd in ascending order, wherein a sample point is the active duration of one partition table; step 1.2: if the ordered sequence Q is empty, executing step 1.1, and if the ordered sequence Q is not empty, taking the first sample point m from the ordered sequence Q and storing it in the result sequence R; step 1.3: iterating step 1.2 until the algorithm ends, and outputting the result sequence R and the reachable distance rd of each partition-table difference sample.

In some embodiments, the calculation module 101 is specifically configured to perform: step 2.1: sequentially taking sample points from the result sequence R; step 2.2: if the core distance of a sample point is greater than the given neighborhood radius Eps, determining the sample point as a noise point and ignoring it, and otherwise determining that the sample point belongs to a new cluster and jumping to step 2.1; step 2.3: finishing the traversal of the result sequence R and outputting the clustering results C corresponding to the plurality of core points.

In some embodiments, the management module 104 is further configured to stop performing data lifecycle management on the data table when the judging module 103 determines that the creation duration of the data table is less than or equal to the preset duration.

In some embodiments, the determining module 102 is specifically configured to identify basic data information of the data table, where the basic data information includes: whether the structures of the data table and the primary upstream data table are consistent, the access condition of the data table within a preset time length, the same-database output number and the same-database input number of the data table, the different-database output number and the different-database input number of the data table, and other output numbers and other input numbers of the data table; determining the type of the data table based on the association relationship between the basic data information and the type of the data table; and determining a target processing strategy based on the association relation between the type of the data table and the processing strategy of the data table.

In some embodiments, the types of data tables include: direct patch source table, secondary patch source table, intermediate table, results table, temporary table, and other tables.

In some embodiments, the management module 104 is specifically configured to: send a first request to the second device, where the first request is used to request triggering of the target processing policy; receive a first response from the second device; and, when the judging module 103 determines that the first response indicates that the developer has approved the target processing policy as feasible, perform data lifecycle management on the data table according to the target processing policy.

As shown in fig. 7, the data lifecycle management apparatus may further include, based on the apparatus structure shown in fig. 6:

an updating module 105, configured to: when the judging module 103 determines that the first response indicates that the developer has not approved the target processing policy as feasible, update the data active duration to the data active duration carried in the first response; and then execute again the steps of determining the target processing policy based on the type of the data table when the judging module 103 determines that the creation duration of the data table is longer than the data active duration, and performing data life cycle management on the data table according to the target processing policy when the judging module 103 determines that the first response indicates that the developer has approved the target processing policy as feasible.

In some embodiments, the first device and the second device are the same device or different devices.

In the embodiment of the present application, the data lifecycle management apparatus may be divided into functional modules according to the above method examples. For example, one functional module may be defined for each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or as software functional modules. It should be noted that the division of modules in the embodiments of the present application is merely a division by logical function, and other divisions are possible in practice.

The data lifecycle management apparatus of the embodiment of the present application may be used to execute the technical solution of the first device in the aforementioned data lifecycle management method, and its implementation principle and technical effects are similar, where the operations of the implementation of each module may further refer to the relevant descriptions of the method embodiments, which are not repeated herein. The modules herein may also be replaced with components or circuits.

The embodiment of the application also provides an electronic device, which comprises: a memory and a processor; the memory is used for storing program instructions; the processor is configured to invoke the program instructions in the memory to cause the electronic device to perform the data lifecycle management method of the previous embodiments.

Exemplary embodiments of the present application also provide a computer storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform the data lifecycle management method of the previous embodiments.

The present application also provides, illustratively, a computer program product which, when run on a computer, causes the computer to perform the data lifecycle management method of the previous embodiments.

Exemplary, an embodiment of the present application provides a chip system including: a processor; when the processor executes the computer instructions stored in the memory, the electronic device performs the data lifecycle management methods of the previous embodiments.

In the above-described embodiments, all or part of the functions may be implemented by software, hardware, or a combination of software and hardware. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium. Computer readable storage media can be any available media that can be accessed by a computer or data storage devices, such as servers, data centers, etc., that contain an integration of one or more available media. Usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid State Disk (SSD)) or the like.

Those of ordinary skill in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium includes a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

Claims

1. A method of data lifecycle management, the method comprising:

a first device calculates a data active duration of a data table based on an access condition of the data table;

when a creation duration of the data table is longer than the data active duration, the first device determines a target processing strategy based on the type of the data table;

the first device performs data lifecycle management on the data table according to the target processing strategy;

wherein the first device calculating the data active duration of the data table based on the access condition of the data table comprises:

the first device obtains the creation time of the data table, and the creation time and the latest access time of each partition table in the data table;

when the creation duration of the data table is longer than a preset duration, the first device determines the differences between the creation time and the latest access time of each partition table as a difference sequence;

the first device performs density clustering on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access time length of a partition table of the same type in the data table;

the first device calculates a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access time length of all types of partition tables in the data table;

and the first device determines the maximum boundary point as the data active duration.
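By way of non-limiting illustration, the calculation recited in claim 1 may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: durations are expressed in days, a core point is taken to be a difference value having at least `min_pts` samples (itself included) within `eps`, and the maximum boundary point is taken to be the largest sample lying within `eps` above the largest core point. The function names `difference_sequence`, `core_points`, and `data_active_duration`, and the parameters `eps` and `min_pts`, are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def difference_sequence(partitions):
    """Days between creation and latest access for each (created, last_access) pair."""
    return [(last - created).total_seconds() / 86400.0
            for created, last in partitions]


def core_points(values, eps, min_pts):
    """Sorted values that have at least min_pts samples (self included) within eps."""
    ordered = sorted(values)
    cores = []
    for v in ordered:
        lo = bisect_left(ordered, v - eps)
        hi = bisect_right(ordered, v + eps)
        if hi - lo >= min_pts:
            cores.append(v)
    return cores


def data_active_duration(values, eps, min_pts):
    """Largest sample within eps above the largest core point (assumed reading)."""
    cores = core_points(values, eps, min_pts)
    if not cores:
        return None  # no dense group of partitions; nothing to infer
    max_core = cores[-1]
    return max(v for v in values if max_core <= v <= max_core + eps)


# Hypothetical difference sequence, in days, for nine partition tables.
diffs = [1, 1.5, 2, 2.2, 30, 30.5, 31, 33, 90]
active_days = data_active_duration(diffs, eps=2, min_pts=3)
```

With these hypothetical inputs the largest core point is 31 and the resulting data active duration is 33 days; the isolated value 90 is treated as noise and does not extend the duration.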

2. The method of claim 1, wherein the step of the first device performing density clustering on the difference sequence to obtain a plurality of core points comprises:

Step 1: the first device inputs the difference sequence L, a preset neighborhood radius Eps, and a preset minimum number of neighborhood points MinPts required for a given point to be a core point into a preset algorithm module, which outputs a result sequence R and a reachable distance rd of the difference sample of each partition table;

Step 2: clustering results C corresponding to the plurality of core points are output based on the result sequence R.
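By way of non-limiting illustration, the inputs and outputs named in claim 2 (Eps, MinPts, a result sequence, and a reachable distance) match the terminology of the well-known OPTICS ordering algorithm. The following is a minimal one-dimensional sketch of that standard algorithm, on the assumption that the preset algorithm module behaves similarly; the claims do not confirm this. The names `optics_order`, `extract_clusters`, and `threshold` are hypothetical, and the cluster extraction is a simplified split of the ordering wherever the reachable distance exceeds the threshold.

```python
import heapq
import math


def optics_order(samples, eps, min_pts):
    """Return (order, rd): visiting order of sample indices and reachable distances."""
    n = len(samples)
    rd = [math.inf] * n          # reachable distance, undefined (inf) until set
    processed = [False] * n
    order = []                   # plays the role of the result sequence R

    def neighbors(i):
        return [j for j in range(n) if abs(samples[j] - samples[i]) <= eps]

    def core_distance(i, nbrs):
        if len(nbrs) < min_pts:
            return None          # not a core point
        dists = sorted(abs(samples[j] - samples[i]) for j in nbrs)
        return dists[min_pts - 1]

    def update(i, nbrs, cd, seeds):
        for j in nbrs:
            if processed[j]:
                continue
            new_rd = max(cd, abs(samples[j] - samples[i]))
            if new_rd < rd[j]:
                rd[j] = new_rd
                heapq.heappush(seeds, (new_rd, j))  # stale entries skipped later

    for start in range(n):
        if processed[start]:
            continue
        nbrs = neighbors(start)
        processed[start] = True
        order.append(start)
        cd = core_distance(start, nbrs)
        if cd is None:
            continue
        seeds = []               # plays the role of the ordered sequence Q
        update(start, nbrs, cd, seeds)
        while seeds:
            _, j = heapq.heappop(seeds)
            if processed[j]:
                continue
            nbrs_j = neighbors(j)
            processed[j] = True
            order.append(j)
            cd_j = core_distance(j, nbrs_j)
            if cd_j is not None:
                update(j, nbrs_j, cd_j, seeds)
    return order, rd


def extract_clusters(order, rd, threshold):
    """Split the ordering wherever the reachable distance exceeds the threshold."""
    clusters, current = [], []
    for i in order:
        if rd[i] > threshold and current:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    return clusters
```

Applied to the hypothetical difference sequence `[1, 1.5, 2, 2.2, 30, 30.5, 31, 33, 90]` with `eps=2` and `min_pts=3`, the sketch groups the first four samples, the next four samples, and the isolated last sample separately.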

3. The method according to claim 2, wherein the first device inputting the difference sequence L, the preset neighborhood radius Eps, and the preset minimum number of neighborhood points MinPts required for a given point to be a core point into the preset algorithm module to output the result sequence R and the reachable distance rd of the difference sample of each partition table comprises:

Step 1.1: selecting a sample point that is not in the result sequence R and is a core point, finding all sample points that are directly density-reachable from that sample point, placing each such sample point that is not in the result sequence R into an ordered sequence Q, and sorting the sample points in Q by reachable distance rd in ascending order, wherein each sample point is the active duration of a partition table;

4. The method according to claim 2, wherein outputting the clustering result C corresponding to the plurality of core points based on the result sequence R includes:

step 2.1: sequentially taking out sample points from the result sequence R;

5. The method according to claim 1, wherein the method further comprises:

when the creation duration of the data table is less than or equal to the preset duration, the first device stops performing data lifecycle management on the data table.

6. The method of claim 1, wherein the first device determining a target processing policy based on the type of the data table comprises:

the first device identifies basic data information of the data table, wherein the basic data information comprises: whether the structure of the data table is consistent with that of its primary upstream data table, the access condition of the data table within a preset duration, the same-database output number and the same-database input number of the data table, the different-database output number and the different-database input number of the data table, and other output numbers and other input numbers of the data table;

the first device determines the type of the data table based on the association relation between the basic data information and the type of the data table;

The first device determines the target processing policy based on an association between a type of the data table and a processing policy of the data table.

7. The method of claim 1, wherein the type of data table comprises: direct patch source table, secondary patch source table, intermediate table, results table, temporary table, and other tables.
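By way of non-limiting illustration, the association between the type of the data table and the processing strategy recited in claims 6 and 7 may be held in a simple lookup table. In the sketch below, the table types are those listed in claim 7, whereas the policy names and every type-to-policy assignment are hypothetical placeholders; the claims do not state which strategy applies to which type. The names `TableType`, `Policy`, `POLICY_BY_TYPE`, and `target_policy` are likewise hypothetical.

```python
from enum import Enum


class TableType(Enum):
    DIRECT_SOURCE = "direct patch source table"
    SECONDARY_SOURCE = "secondary patch source table"
    INTERMEDIATE = "intermediate table"
    RESULT = "results table"
    TEMPORARY = "temporary table"
    OTHER = "other table"


class Policy(Enum):
    ARCHIVE = "archive"
    CLEAN = "clean"
    REVIEW = "manual review"  # hypothetical fallback, not taken from the claims


# Hypothetical assignments, for illustration only.
POLICY_BY_TYPE = {
    TableType.DIRECT_SOURCE: Policy.ARCHIVE,
    TableType.SECONDARY_SOURCE: Policy.ARCHIVE,
    TableType.INTERMEDIATE: Policy.CLEAN,
    TableType.RESULT: Policy.ARCHIVE,
    TableType.TEMPORARY: Policy.CLEAN,
    TableType.OTHER: Policy.REVIEW,
}


def target_policy(table_type, creation_days, active_days):
    """Return a policy only when the table is older than its data active duration."""
    if creation_days <= active_days:
        return None
    return POLICY_BY_TYPE.get(table_type, Policy.REVIEW)
```

Keeping the association in data rather than in branching logic allows the assignments to be changed without touching the decision function.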

8. The method of any of claims 1-7, wherein the first device performing data lifecycle management of the data table in accordance with the target processing policy, comprises:

the first device sends a first request to the second device, wherein the first request is used for requesting to trigger the target processing strategy;

the first device receives a first response from the second device;

when the first device determines that the first response indicates that the developer approves the target processing strategy, the first device performs data lifecycle management on the data table according to the target processing strategy.

9. The method of claim 8, wherein the method further comprises:

when the first device determines that the first response indicates that the developer deems the target processing strategy infeasible, the first device updates the data active duration to the data active duration carried in the first response; when the creation duration of the data table is longer than the updated data active duration, the first device determines a target processing strategy based on the type of the data table, and performs data lifecycle management on the data table according to the target processing strategy when the first response indicates that the developer deems the target processing strategy feasible.
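By way of non-limiting illustration, the approval exchange of claims 8 and 9 may be sketched as a small loop on the first device. This is a sketch under stated assumptions: the request to the second device is modelled as a callable, a rejected response is assumed to optionally carry a revised data active duration, a fresh request is assumed to be sent after each revision, and the number of rounds is capped. The names `Response`, `manage_with_approval`, `request_approval`, `apply_policy`, and `max_rounds` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    approved: bool
    active_days: Optional[float] = None  # revised duration carried on rejection


def manage_with_approval(
    creation_days: float,
    active_days: float,
    policy: str,
    request_approval: Callable[[str, float], Response],
    apply_policy: Callable[[str], None],
    max_rounds: int = 3,  # hypothetical cap to avoid an endless exchange
) -> str:
    for _ in range(max_rounds):
        # Only tables older than their data active duration are candidates.
        if creation_days <= active_days:
            return "skipped"
        response = request_approval(policy, active_days)
        if response.approved:
            apply_policy(policy)
            return "applied"
        if response.active_days is None:
            return "rejected"
        # Adopt the duration supplied by the developer and re-evaluate.
        active_days = response.active_days
    return "unresolved"
```

For example, if the developer first rejects the strategy with a revised duration of 200 days for a table created 365 days ago, the table still exceeds the revised duration, so the strategy is proposed again and applied once approved.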

10. A data lifecycle management apparatus, the apparatus comprising:

a computing module, configured to calculate a data active duration of a data table based on an access condition of the data table;

a determining module, configured to determine a target processing strategy based on the type of the data table when a judging module determines that a creation duration of the data table is longer than the data active duration;

a management module, configured to perform data lifecycle management on the data table according to the target processing strategy;

wherein the computing module is specifically configured to: acquire the creation time of the data table, and the creation time and the latest access time of each partition table in the data table; when the creation duration of the data table is longer than a preset duration, determine the differences between the creation time and the latest access time of each partition table as a difference sequence; perform density clustering on the difference sequence to obtain a plurality of core points, wherein each core point is used for representing the access duration of partition tables of the same type in the data table; calculate a maximum boundary point corresponding to a maximum core point in the plurality of core points, wherein the maximum boundary point is used for representing the maximum access duration of all types of partition tables in the data table; and determine the maximum boundary point as the data active duration.

11. An electronic device, comprising: a memory and a processor;

the memory is used for storing program instructions;

The processor is configured to invoke program instructions in the memory to cause the electronic device to perform the data lifecycle management method of any of claims 1-9.

12. A computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the data lifecycle management method as claimed in any one of claims 1-9.

13. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to perform the data lifecycle management method as claimed in any of claims 1-9.