CN115712524A - Data recovery method and device

Info

Publication number: CN115712524A
Application number: CN202211402105.0A
Authority: CN
Inventors: 马阳阳; 丁国涛; 周致达
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2022-11-09
Filing date: 2022-11-09
Publication date: 2023-02-24

Abstract

An embodiment of the present application provides a data recovery method, including: when a preset condition is met, acquiring an original key value of an original checkpoint, wherein the preset condition at least includes a change in the maximum parallelism of a job; re-determining the key group to which the original key value belongs under the preset condition, and generating a new checkpoint according to the re-determined key group; and, when recovery is needed, recovering according to the new checkpoint. The data recovery method provided by the embodiment of the present application can recover a checkpoint even when the maximum parallelism of the job has changed, which improves the recoverability of the checkpoint.

Description

Data recovery method and device

Technical Field

The present application relates to the field of big data technologies, and in particular, to a data recovery method and apparatus, a computer device, and a storage medium.

Background

Flink is an open source stream processing framework developed by the Apache Software Foundation that provides a high-throughput, low-latency streaming data engine and supports event-time processing and state management. To achieve fault tolerance, Flink stores state data (state) so that a job can be recovered when an anomaly occurs.

Currently, Flink-based real-time computing is deployed widely and deeply on an increasing number of platforms. In a Flink job, as traffic or computational complexity changes, a user or the platform may need to change the maximum parallelism of the job to increase processing capacity. However, after the maximum parallelism has been changed, Flink cannot recover normally from a checkpoint.

Disclosure of Invention

The application aims to provide a data recovery method and apparatus, a computer device, and a storage medium, to solve the current problem that checkpoint recovery cannot be performed normally after the maximum parallelism changes.

One aspect of an embodiment of the present application provides a data recovery method, including: when a preset condition is met, acquiring an original key value of an original checkpoint, wherein the preset condition at least includes a change in the maximum parallelism of a job; re-determining the key group to which the original key value belongs under the preset condition, and generating a new checkpoint according to the re-determined key group; and, when recovery is needed, recovering according to the new checkpoint.

Optionally, acquiring the original key value of the original checkpoint includes: acquiring a first key value of the original checkpoint from an original state file, wherein the first key value is the serialized original key value; and deserializing the key in the first key value, and taking the deserialized result as the original key value.

Optionally, acquiring the first key value of the original checkpoint from the original state file includes: acquiring the first key value of the original checkpoint from the original state file through the State Processor API.

Optionally, acquiring the first key value of the original checkpoint from the original state file includes: reading first IDs of all operators of the original checkpoint from a metadata file; and calling the State Processor API for each first ID in sequence to acquire the first key value.

Optionally, before calling the State Processor API for each first ID in sequence to acquire the first key value, the method further includes: adding first information to the metadata file; and calling the State Processor API for each first ID in sequence to acquire the first key value includes: acquiring the first information from the metadata file; generating a state descriptor according to the first information; and calling the State Processor API for each first ID according to the state descriptor to acquire the first key value.

Optionally, the method further includes: adjusting the way the maximum parallelism of an operator is calculated, so as to increase the maximum parallelism of the operator.

Optionally, the method further includes: when the preset condition is met, generating the ID of an operator according to a preset rule, wherein the preset rule includes not chaining the upstream operator together with downstream operators.

Optionally, the method further includes: when the ID of an operator has changed, acquiring the operator's name and the operator's second ID in the directed acyclic graph of the current job; associating the first ID with the second ID according to the operator name; and distributing the state data of the original checkpoint to the operators of the directed acyclic graph according to the association between the first ID and the second ID.

An aspect of an embodiment of the present application further provides a data recovery apparatus, including: an acquisition module, configured to acquire an original key value of an original checkpoint when a preset condition is met, wherein the preset condition at least includes a change in the maximum parallelism of a job; a determining module, configured to re-determine the key group to which the original key value belongs under the preset condition and to generate a new checkpoint according to the re-determined key group; and a recovery module, configured to recover according to the new checkpoint when recovery is needed.

An aspect of the embodiments of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the data recovery method described above.

An aspect of the embodiments of the present application further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor to cause the at least one processor to execute the steps of the data recovery method described above.

The data recovery method, the data recovery device, the computer equipment and the storage medium provided by the embodiment of the application have the following advantages:

when a preset condition is met (at least a change in the maximum parallelism of the job), an original key value of an original checkpoint is acquired, the key group to which the original key value belongs under the preset condition is re-determined, a new checkpoint is generated according to the re-determined key group, and, when recovery is needed, the corresponding recovery operation is carried out according to the new checkpoint. Because the key group to which the original key value belongs is re-determined and the new checkpoint is generated from it, the state data corresponding to the new checkpoint can be found in the new state file, so the checkpoint can be recovered even when the maximum parallelism of the job has changed, which improves the recoverability of the checkpoint.

Drawings

FIG. 1 schematically illustrates an environmental architecture diagram of an embodiment of the present application;

FIG. 2 schematically illustrates an application environment of a data recovery method according to an embodiment of the present application;

FIG. 3 schematically shows a flowchart of a data recovery method according to a first embodiment of the present application;

FIG. 4 is a flowchart of the sub-steps of step S410 in FIG. 3;

FIG. 5 is a flowchart of the sub-steps of step S411 in FIG. 4;

FIG. 6 is a flowchart of the sub-steps of step S520 in FIG. 5;

FIG. 7 is a schematic illustration of the principle of a data recovery method;

FIG. 8 is a flowchart of additional steps for the method of FIG. 3;

FIG. 9 schematically shows a block diagram of a data recovery apparatus according to a second embodiment of the present application;

FIG. 10 schematically shows a hardware architecture diagram of a computer device according to a third embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the descriptions relating to "first", "second", etc. in the embodiments of the present application are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present application.

In the description of the present application, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present application and to distinguish each step, and therefore should not be construed as limiting the present application.

The following are explanations of terms referred to in the present application:

Flink: an open source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming data processing engine written in Java and Scala that executes arbitrary streaming data programs in a data-parallel and pipelined manner; its pipelined runtime system can execute both batch and stream processing programs.

Kafka: an open source stream processing platform developed by the Apache Software Foundation, which aims to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "large-scale publish/subscribe message queue built on a distributed transaction log architecture". Kafka can connect to external systems (for data input/output) through Kafka Connect and provides Kafka Streams (a Java stream processing library).

Kubernetes (often abbreviated as K8s): an open source system for automatically deploying, scaling and managing containerized applications.

Yarn (Yet Another Resource Negotiator): a new Hadoop resource manager and a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications. Its introduction brings great benefits to a cluster in terms of utilization, unified resource management, data sharing and the like.

HDFS: the Hadoop Distributed File System, which provides reliable distributed reading and writing of large-scale data.

ZooKeeper: a software project of the Apache Software Foundation that provides open source distributed configuration, synchronization and naming registration services for large-scale distributed computing.

RocksDB: a high-performance embedded database for key-value data. It is a fork of Google's LevelDB, optimized to exploit multi-core CPUs and to make effective use of fast storage such as solid state drives (SSDs) for input/output (I/O) bound workloads, and it is based on a log-structured merge-tree (LSM tree) data structure.

Fig. 1 schematically shows an environmental architecture diagram of an embodiment of the present application, as shown in the drawing:

The client 100 is connected to the compute engine 200, and the compute engine 200 is connected to the state backend 300. A user can change the maximum parallelism of a job of the compute engine 200 through the client 100. The compute engine 200 stores state data in the state backend 300 to perform checkpointing and, when a fault occurs, recovers the job from a checkpoint to achieve fault tolerance. In the data recovery method of the embodiment of the present application, when a preset condition is met (including a change in the maximum parallelism of a job of the compute engine 200), an original key value of an original checkpoint is obtained from an original state file of the state backend 300, the key group (key-group) to which the original key value belongs under the preset condition is re-determined, and a new checkpoint is generated in the state backend 300 according to the re-determined key group; when recovery is needed, the compute engine 200 recovers from the newly generated checkpoint. The compute engine 200 may be Flink, and the state backend 300 may be, but is not limited to, RocksDB.

Please refer to fig. 2, which is a diagram illustrating an application environment of a data recovery method according to an embodiment of the present application. As shown in the figure:

Flink, as the compute engine 200, may run in several cluster environments, such as a Yarn cluster deployed purely on physical machines, a Yarn cluster co-located with Kafka (to improve the resource utilization of Kafka), or a Yarn cluster co-located with K8s (to improve the resource utilization of online servers).

In the data recovery method of the embodiment of the present application, when Flink meets the preset condition (including a change in the maximum parallelism of the job), the original key value of the original checkpoint is obtained, the key group to which the original key value belongs under the preset condition is re-determined, and a new checkpoint is generated according to the re-determined key group; when recovery is needed, recovery is performed from the newly generated checkpoint.

In the related art, a user or the platform changes the maximum parallelism of a job to increase processing capacity, but after the maximum parallelism of the job has changed, Flink cannot recover normally from the checkpoint.

With the data recovery scheme of the present application, the checkpoint can be recovered even when the maximum parallelism of the job has changed.

The data recovery scheme will be described below by way of several embodiments, and for ease of understanding, the following description will be exemplarily described with the computing engine 200 of fig. 1 as the executing agent. In some embodiments, the compute engine 200 is described directly with Flink as an example.

Example one

Fig. 3 schematically shows a flowchart of a data recovery method according to an embodiment of the present application, and as shown in the figure, the method may include steps S410 to S430, which are specifically described as follows:

Step S410: when a preset condition is met, an original key value of an original checkpoint is obtained, wherein the preset condition at least includes a change in the maximum parallelism of the job.

A change in the maximum parallelism of the job includes an increase or a decrease of the maximum parallelism. When the maximum parallelism of a job changes, Flink's own execution logic cannot redistribute the state, so the checkpoint cannot be restored normally.

It is understood that steps S410 to S430 are executed when the preset condition is satisfied; besides a change in the maximum parallelism, the preset condition may include other conditions set according to actual needs, which is not limited herein.

So that recovery from state does not need to pull all state files, Flink uses a method similar to consistent hashing: the keys of the state are hashed and divided into a fixed number of shares, and each parallel instance of an operator is responsible for one range of those shares. When acquiring the original key value of the original checkpoint, Flink can obtain it from the original state file according to the ID that the operator had before the maximum parallelism changed.
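The share-based partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Flink's implementation: the CRC32 hash is a stand-in for Flink's own key hashing, all function names are hypothetical, the number of shares is assumed to equal the maximum parallelism, and the integer formulas that map a share to a parallel instance are modeled on the way Flink's key-group range assignment is commonly described.

```python
import zlib


def key_group_for(key: str, max_parallelism: int) -> int:
    """Map a key to one of `max_parallelism` shares (key groups).

    CRC32 is only a stand-in hash chosen for this sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def subtask_for(key_group: int, max_parallelism: int, parallelism: int) -> int:
    """Return the parallel instance that is responsible for a key group."""
    return key_group * parallelism // max_parallelism


def key_group_range(subtask: int, max_parallelism: int, parallelism: int) -> range:
    """Return the contiguous range of key groups owned by one parallel instance."""
    start = (subtask * max_parallelism + parallelism - 1) // parallelism
    end = ((subtask + 1) * max_parallelism - 1) // parallelism
    return range(start, end + 1)


if __name__ == "__main__":
    group = key_group_for("user-42", 128)
    owner = subtask_for(group, 128, 4)
    print(group, owner, key_group_range(owner, 128, 4))
```

Because the share of a key depends on the maximum parallelism, a checkpoint written under one maximum parallelism cannot simply be reread under another, which is the situation the method addresses.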

Step S420: the key group to which the original key value belongs under the preset condition is re-determined, and a new checkpoint is generated according to the re-determined key group.

When the maximum parallelism of the job changes, the number of operators changes and the original key values need to be redistributed to the changed operators, so the key group (key-group) to which each original key value belongs changes, and the key group that each operator is responsible for needs to be re-determined. For example, when the maximum parallelism of the job is increased, the number of operators increases and the original key values need to be redistributed over a larger number of operators, so at least some operators become responsible for fewer keys; the keys corresponding to each operator after the change then need to be determined in order to re-determine the key group to which each original key value belongs.

When the key group to which the original key value belongs is re-determined under the preset condition, the original key values can be reassigned to the operators under the preset condition according to the changed maximum parallelism, which determines the key group to which each original key value belongs under the preset condition. Reassigning an original key value to an operator means reassigning the key (key) in that key value to the operator. For example, if there are 2 operators before the maximum parallelism changes, the keys in the original key values are numbered 1 to 10, and there are 5 operators after the maximum parallelism changes, then keys 1-2, 3-4, 5-6, 7-8 and 9-10 are assigned to operator No. 0, operator No. 1, operator No. 2, operator No. 3 and operator No. 4, respectively, and the key group to which each key belongs after the maximum parallelism is increased is determined from the keys assigned to each operator.
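The numerical example above can be reproduced with a short sketch. It follows the simplified contiguous assignment used in the example (keys are split into equal consecutive blocks); the function name `assign_keys` is hypothetical, and a real engine would typically assign keys by hash rather than by their numeric order.

```python
def assign_keys(keys, num_operators):
    """Split the sorted keys into consecutive blocks, one block per operator."""
    ordered = sorted(keys)
    per_operator = -(-len(ordered) // num_operators)  # ceiling division
    return {
        op: ordered[op * per_operator:(op + 1) * per_operator]
        for op in range(num_operators)
    }


if __name__ == "__main__":
    before = assign_keys(range(1, 11), 2)
    after = assign_keys(range(1, 11), 5)
    print(before)  # {0: [1, 2, 3, 4, 5], 1: [6, 7, 8, 9, 10]}
    print(after)   # {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8], 4: [9, 10]}
```

Comparing the two results shows that every operator is responsible for fewer keys after the change, so the grouping recorded in the original checkpoint no longer matches the job.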

After the key group to which the original key value belongs has been re-determined, the new key groups can be written into a new state file, thereby generating a new checkpoint.
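Step S420 as a whole can be pictured as regrouping the entries of the original checkpoint. In the sketch below a checkpoint is modeled, purely for illustration, as a dictionary from key group to a list of (serialized key, serialized value) pairs; the CRC32 hash and the names `key_group_for` and `regroup_state` are assumptions of this sketch and not part of any real API.

```python
import zlib
from collections import defaultdict


def key_group_for(key: bytes, max_parallelism: int) -> int:
    """Stand-in hash partitioning; a real engine uses its own hash function."""
    return zlib.crc32(key) % max_parallelism


def regroup_state(original_state: dict, new_max_parallelism: int) -> dict:
    """Recompute the key group of every entry under the new maximum parallelism.

    Values are carried over untouched as raw bytes; only the grouping changes.
    """
    new_state = defaultdict(list)
    for entries in original_state.values():
        for key, raw_value in entries:
            group = key_group_for(key, new_max_parallelism)
            new_state[group].append((key, raw_value))
    return dict(new_state)


if __name__ == "__main__":
    original = defaultdict(list)
    for i in range(10):
        key = f"user-{i}".encode("utf-8")
        original[key_group_for(key, 4)].append((key, b"payload"))
    print(sorted(regroup_state(original, 16)))
```

The returned dictionary plays the role of the new state file from which the new checkpoint is generated.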

Step S430: when recovery is needed, recovery is performed according to the new checkpoint.

Since the new checkpoint has been generated based on the re-determined key group, when the computing engine 200 fails and needs to be restored, the state data corresponding to the new checkpoint can be retrieved from the new state file for restoration.

Specifically, on receiving a recovery instruction, Flink recovers according to the recovery instruction and the new checkpoint, restoring the state to the new checkpoint and achieving fault tolerance.

According to the data recovery method provided by the embodiment of the present application, when a preset condition is met (at least including a change in the maximum parallelism of the job), an original key value of an original checkpoint is obtained, the key group to which the original key value belongs under the preset condition is re-determined, a new checkpoint is generated according to the re-determined key group, and, when recovery is needed, the corresponding recovery operation is carried out according to the new checkpoint. Because the key group to which the original key value belongs is re-determined and the new checkpoint is generated from it, the state data corresponding to the new checkpoint can be found in the new state file, so the checkpoint can be recovered even when the maximum parallelism of the job has changed, which improves the recoverability of the checkpoint.

In an exemplary embodiment, as shown in fig. 4, the step S410 of obtaining an original key value of an original checkpoint may include steps S411 to S412, which are specifically as follows:

step S411, a first key value of the original checkpoint is obtained from the original state file, where the first key value is a serialized original key value.

Since the original key value is serialized when being written into the original state file, the serialized original key value, that is, the first key value, is obtained when the key value of the original checkpoint is obtained from the original state file.

Step S412, deserializing the keys in the first key value, and taking the deserialized result as the original key value.

Optionally, the computing engine 200 may deserialize both the key and the value in the first key value, and then obtain the original key value from the deserialized key and value. However, the value of the original key value usually contains a large amount of data, so deserializing it occupies considerable computing resources and affects the performance of the computing engine 200. Moreover, when the key group to which the original key value belongs is re-determined under the preset condition, the key alone is sufficient to determine the key group. Therefore, it is possible to deserialize only the key in the first key value and leave the value in the first key value serialized.

Specifically, the value in the first key value may be set as an object that does not need deserialization, for example, the value may be treated as a byte array, so that only the key in the first key value is deserialized and not the value in the first key value.
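Treating the value as a byte array can be sketched as follows. The record layout (a 4-byte big-endian length followed by the bytes, first for the key and then for the value) and the function names are assumptions made only for this illustration; the actual layout of a state file depends on the state backend and the serializers in use.

```python
import struct


def encode_entry(key: str, value: bytes) -> bytes:
    """Serialize one entry as: key length, key bytes, value length, value bytes."""
    key_bytes = key.encode("utf-8")
    return (
        struct.pack(">I", len(key_bytes)) + key_bytes
        + struct.pack(">I", len(value)) + value
    )


def read_entry_key_only(buf: bytes, offset: int = 0):
    """Deserialize only the key; return the value as an opaque byte array."""
    (key_len,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    key = buf[offset:offset + key_len].decode("utf-8")
    offset += key_len
    (value_len,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    raw_value = buf[offset:offset + value_len]  # never decoded
    return key, raw_value, offset + value_len


if __name__ == "__main__":
    data = encode_entry("user-1", b"\x00\x01") + encode_entry("user-2", b"abc")
    key, raw, next_offset = read_entry_key_only(data)
    print(key, raw, next_offset)
```

The raw value bytes can be copied into the new state file unchanged, so the cost of the operation is dominated by key deserialization and hashing rather than by the size of the values.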

In this embodiment, the first key value of the original checkpoint is obtained from the original state file, the key in the first key value is deserialized, and the deserialized result is used as the original key value. Because only the key is deserialized and the value is not, the performance impact of spending excessive computing resources on deserializing values is avoided, and the original key value is obtained more efficiently.

When the computing engine 200 obtains the first key value of the original checkpoint from the original state file, it may do so through a preset tool. In an exemplary embodiment, in step S411, obtaining the first key value of the original checkpoint from the original state file may include: acquiring the first key value of the original checkpoint from the original state file through the State Processor API.

That is, the preset tool is the State Processor API. The State Processor API can read, write or modify savepoints and checkpoints, and the state data can also be converted for analysis and processing with SQL (Structured Query Language) queries. Because the State Processor API is provided natively by the community, using it to obtain the first key value of the original checkpoint from the original state file reduces the implementation difficulty of obtaining the original key value.

The user interface provided by the State Processor API works on a single operator, whereas here all the states of the original checkpoint need to be read and processed, so the State Processor API needs to be modified appropriately. In an exemplary embodiment, as shown in FIG. 5, obtaining the first key value of the original checkpoint from the original state file in step S411 may include steps S510 to S520, as follows:

Step S510: the first IDs of all operators of the original checkpoint are read from the metadata file.

Step S520: the State Processor API is called for each first ID in sequence to obtain the first key value.

Specifically, because the metadata file contains the IDs of all operators, Flink may read the IDs of all operators of the original checkpoint (that is, the first IDs) from the metadata file, and then call the State Processor API for each first ID in sequence to process each operator of the original checkpoint, thereby obtaining the first key values corresponding to all operators of the original checkpoint.

In this embodiment, the first IDs of all operators of the original checkpoint are read from the metadata file, and the State Processor API is called for each first ID in sequence to obtain the first key values. Because the first key value of each operator of the original checkpoint can be obtained by calling the State Processor API with a first ID from the metadata file, the first key values of all operators can be obtained with the State Processor API, which overcomes the limitation that the user interface of the State Processor API works on only one operator at a time.
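The loop over operator IDs in steps S510 and S520 can be sketched as follows. The metadata layout (a dictionary with an `operators` list) and the `read_operator_state` callback are hypothetical stand-ins: the callback represents a single-operator read such as the one the State Processor API offers, and no real API signature is implied.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Entry = Tuple[bytes, bytes]  # (serialized key, serialized value)


def read_all_first_key_values(
    metadata: dict,
    read_operator_state: Callable[[str], Iterable[Entry]],
) -> Dict[str, List[Entry]]:
    """Call a single-operator reader once per operator ID found in the metadata."""
    result: Dict[str, List[Entry]] = {}
    for operator in metadata["operators"]:
        operator_id = operator["id"]  # the "first ID" of the text
        result[operator_id] = list(read_operator_state(operator_id))
    return result


if __name__ == "__main__":
    # In-memory stand-in for the original state file.
    fake_store = {"op-a": [(b"k1", b"v1")], "op-b": [(b"k2", b"v2"), (b"k3", b"v3")]}
    metadata = {"operators": [{"id": "op-a"}, {"id": "op-b"}]}
    print(read_all_first_key_values(metadata, lambda op_id: fake_store[op_id]))
```

Passing the reader in as a callback keeps the per-operator interface unchanged while still covering every operator of the checkpoint.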

In the foregoing embodiment, the key in the first key value is deserialized in step S412. Since the State Processor API can also deserialize the first key value, deserialization of the key in the first key value may be implemented through the State Processor API. However, the State Processor API deserializes the key and the value in the first key value at the same time. To deserialize only the key, the value in the first key value may be given special handling, for example by treating it as a byte array as described above; alternatively, the State Processor API may be set to skip the deserialization of the value in the first key value directly, so that only the key in the first key value is deserialized.

In an exemplary embodiment, before step S520, that is, before the State Processor API is called for each first ID in sequence to obtain the first key value, the method may further include: adding first information to the metadata file. Accordingly, as shown in FIG. 6, step S520 may include steps S521 to S523, as follows:

Step S521: first information is acquired from the metadata file.

The first information may include, but is not limited to, the state data name, the serializer, the state data type, and the like. The state data may include:

Value state data: a single-value state of type T bound to the corresponding key; the state value can be updated with the update method and obtained with the value() method;

List state data: the state value on the key is a list; values can be appended to the list with the add method, and the get() method returns an Iterable<T> for traversing the state values;

Reducing state data: the user passes in a ReduceFunction; each time the add method is called to add a value, the ReduceFunction is called, and the added values are finally merged into a single state value;

Folding state data: similar to Reducing state data;

Map state data: the state value is a map, and elements are added with the put or putAll method.

Step S522: a state descriptor (StateDescriptor) is generated according to the first information.

Flink defines a state through a state descriptor, which is an abstract class that internally defines basic information such as the state data name, the serializer and the state data type. Generating the state descriptor according to the first information therefore yields the state descriptor required for calling the State Processor API. Corresponding to the state data above, the state descriptor has derived classes such as the Value state descriptor and the List state descriptor.
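Generating descriptors from the first information can be sketched as follows. The `first_information` field name, the `StateDescriptorSketch` class and the serializer strings are hypothetical; the sketch only illustrates rebuilding a descriptor from the name, type and serializer recorded in the metadata, and does not reproduce Flink's StateDescriptor class hierarchy.

```python
from dataclasses import dataclass
from typing import List

SUPPORTED_STATE_TYPES = {"value", "list", "reducing", "folding", "map"}


@dataclass(frozen=True)
class StateDescriptorSketch:
    """Illustrative stand-in for a state descriptor."""
    name: str
    state_type: str
    serializer: str


def descriptors_from_metadata(metadata: dict) -> List[StateDescriptorSketch]:
    """Build one descriptor per state recorded in the metadata's first information."""
    descriptors = []
    for info in metadata.get("first_information", []):
        if info["state_type"] not in SUPPORTED_STATE_TYPES:
            raise ValueError(f"unsupported state type: {info['state_type']}")
        descriptors.append(
            StateDescriptorSketch(info["name"], info["state_type"], info["serializer"])
        )
    return descriptors


if __name__ == "__main__":
    metadata = {
        "first_information": [
            {"name": "count", "state_type": "value", "serializer": "LongSerializer"},
            {"name": "events", "state_type": "list", "serializer": "StringSerializer"},
        ]
    }
    for descriptor in descriptors_from_metadata(metadata):
        print(descriptor)
```

Because everything needed is read back from the metadata, no descriptor has to be written by hand for each job.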

Step S523: the State Processor API is called for each first ID according to the state descriptor to obtain the first key value.

After the state descriptor needed for calling the API has been generated, Flink may call the State Processor API for each first ID according to the state descriptor, so as to obtain the corresponding first key value.

The current State Processor API requires the user to construct a state descriptor manually and pass it to the corresponding method, which is inefficient. In this embodiment, the first information added in advance is obtained from the metadata file, the state descriptor is generated according to the first information, and the State Processor API is called for each first ID according to the state descriptor to obtain the first key value. Because the state descriptor is generated automatically from the first information added in advance, the efficiency of obtaining the first key value with the State Processor API is effectively improved.

Please refer to FIG. 7, which illustrates the principle of the data recovery method. Specifically, the first information may be added to the metadata file (metadata) in advance (corresponding to writesetDescriptor in the figure), and a checkpoint coordinator writes the IDs and state handles of the operators into the metadata file (corresponding to writeOperatoridStandHandles in the figure). The JobManager (job manager) reads the first information from the metadata file and generates the corresponding state descriptor according to it, obtains the IDs of the operators from the metadata file, and calls the State Processor API with the state descriptor for the ID of each operator to obtain the original key value. After the key group to which the original key value belongs has been re-determined, the re-determined key group is written into the metadata file and a new checkpoint is generated according to the re-determined key group, so that recovery can be performed from the new checkpoint, which improves the recoverability of the checkpoint.

In an exemplary embodiment, the data recovery method may further include: and adjusting the calculation mode of the maximum parallelism of the operator to increase the maximum parallelism of the operator.

Optionally, Flink's algorithm for the maximum parallelism of an operator sets a minimum value for the maximum parallelism of the operator, so adjusting the way the maximum parallelism of the operator is calculated may include raising this minimum value. For example, if the original minimum value of the maximum parallelism of the operator is 128, it can be adjusted to 1024. In addition, the maximum parallelism of the operator can be increased to N times the original by multiplying by N (N greater than 1) in the formula for the maximum parallelism of the operator; for example, multiplying by 10 in the formula increases the maximum parallelism of the operator to 10 times the original.
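Both adjustments can be sketched with one function. The base formula used here (1.5 times the parallelism, rounded up to a power of two, then clamped between a lower and an upper bound) is assumed to resemble Flink's default calculation; the upper bound of 32768, the `factor` parameter and the function names are assumptions of this sketch and are not taken from the text.

```python
def round_up_to_power_of_two(x: int) -> int:
    """Smallest power of two that is greater than or equal to x (for x >= 1)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def max_parallelism_for(
    parallelism: int,
    lower_bound: int = 128,
    upper_bound: int = 32768,
    factor: int = 1,
) -> int:
    """Derive an operator's maximum parallelism from its parallelism.

    Raising `lower_bound` (for example from 128 to 1024) or setting `factor`
    above 1 corresponds to the two adjustments described in the text.
    """
    candidate = round_up_to_power_of_two(parallelism + parallelism // 2) * factor
    return min(max(candidate, lower_bound), upper_bound)


if __name__ == "__main__":
    print(max_parallelism_for(10))                    # 128
    print(max_parallelism_for(10, lower_bound=1024))  # 1024
    print(max_parallelism_for(200))                   # 512
    print(max_parallelism_for(200, factor=10))        # 5120
```

With a larger starting value, later increases in parallelism are less likely to exceed the maximum parallelism and force a change.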

It will be appreciated that when the maximum parallelism of an operator is small, traffic changes or complexity changes may often make it necessary to change the maximum parallelism, which in turn requires a new checkpoint to be generated for recovery. In this embodiment, the maximum parallelism is increased by adjusting the way the maximum parallelism of the operator is calculated, so that the job can run with a more suitable maximum parallelism from the start. This reduces the number of cases in which the maximum parallelism has to be changed because of traffic or complexity changes, a new checkpoint has to be generated, and recovery has to be performed from the new checkpoint. In practical applications, the approach of increasing the maximum parallelism can be adopted for new jobs, and the approach of the foregoing embodiments can be adopted for existing jobs to perform checkpoint recovery.

In an exemplary embodiment, the data recovery method may further include: generating the ID of the operator according to a preset rule when the preset condition is met, wherein the preset rule includes not chaining the upstream operator together with the downstream operator.

Currently, Flink typically runs in the following cluster environments: a Yarn cluster deployed on pure physical machines, a Yarn cluster co-deployed with Kafka, and a Yarn cluster co-deployed with K8S. If the maximum parallelism of a job changes, the connection between the KafkaSource operator and its downstream operators may change. The Flink community's native operator ID generation algorithm computes the ID of an operator from the operator together with the downstream operators that can be chained with it. When the maximum parallelism of the job changes, the KafkaSource operator may go from being chainable with a downstream operator to not being chainable, so the operator ID computed by the native algorithm changes and the operator's state cannot be recovered from the original checkpoint.

Alternatively, the ID of the operator may be generated according to the preset rule regardless of whether the preset condition is satisfied. In addition, when the ID of an upstream operator is generated, the operators chained with it are not taken into account; the specific algorithm may be set according to actual needs and is not limited here. For example, StreamGraphHasherV2, the Flink community's native operator ID generation algorithm, may be extended so that the ID of an operator is generated without chaining the upstream operator together with its downstream operators.

In this embodiment, when the preset condition is satisfied, the ID of the operator is generated according to the preset rule, and the preset rule includes not chaining the upstream operator together with the downstream operator. As a result, the generated ID does not change when the connection between the operator and its downstream operators changes, so a stable operator ID is obtained, which facilitates checkpointing and the corresponding recovery based on the operator ID.
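The difference between a chain-aware ID and a stable ID can be illustrated with a small sketch. It is deliberately simplified and does not reproduce the actual logic of StreamGraphHasherV2: the `Operator` class, the use of MD5 over operator names, and both ID functions are hypothetical and serve only to show why leaving chained downstream operators out of the hash keeps the ID unchanged.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Operator:
    name: str
    upstream_ids: list = field(default_factory=list)
    # Names of downstream operators currently chainable with this one.
    chained_downstream: list = field(default_factory=list)


def _digest(parts) -> str:
    """Hash a sequence of strings, separated so boundaries are unambiguous."""
    h = hashlib.md5()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


def chain_aware_id(op: Operator) -> str:
    """ID that depends on chainable downstream operators (unstable)."""
    return _digest([op.name, *op.upstream_ids, *op.chained_downstream])


def stable_id(op: Operator) -> str:
    """ID that ignores chained downstream operators (the preset rule)."""
    return _digest([op.name, *op.upstream_ids])


# The same source operator before and after chaining becomes impossible.
chained = Operator("KafkaSource", chained_downstream=["Map"])
unchained = Operator("KafkaSource")
```

Here `stable_id` returns the same value for `chained` and `unchained`, whereas `chain_aware_id` does not, which mirrors the situation described for the KafkaSource operator.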

In the above embodiment, the ID of the operator is generated according to the preset rule. When this is applied to a new job, checkpoint operations proceed normally. For an existing job, however, the operator IDs before and after the change may differ and conflict, so that checkpoint operations cannot be performed normally. In an exemplary embodiment, as shown in Fig. 8, the data recovery method may further include steps S610 to S630, which are specifically as follows:

Step S610, when the ID of the operator changes, acquiring the operator name and a second ID of the operator in the Directed Acyclic Graph (DAG) of the current job.

Step S620, the first ID and the second ID are associated according to the operator name.

Step S630, according to the association between the first ID and the second ID, the state data of the original checkpoint is distributed to the operators of the directed acyclic graph.

Specifically, when the ID of an operator changes, Flink acquires the name and the second ID of each operator in the DAG of the current job. Because operator names are rarely duplicated, the first ID and the second ID can be associated through the operator name, so that the state data of the original checkpoint can be distributed to the operators of the DAG and a new checkpoint can subsequently be generated. In practical applications, some operators hold no state data, and the names of operators that do hold state data are usually distinct, so it is sufficient to associate by name only the operators that hold state data and distribute the state data to them.
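Steps S610 to S630 can be sketched as a name-based join. The function names, the dictionary shapes, and the sample IDs below are hypothetical; the sketch assumes that the first IDs and names of the stateful operators are already known from the original checkpoint.

```python
def associate_ids(stateful_original: dict, current_dag: list) -> dict:
    """Associate first IDs with second IDs through operator names (S620).

    stateful_original: {first_id: operator name} for operators holding state.
    current_dag: list of (second_id, operator name) from the current job.
    """
    by_name = {}
    for second_id, name in current_dag:
        by_name.setdefault(name, []).append(second_id)

    mapping = {}
    for first_id, name in stateful_original.items():
        candidates = by_name.get(name, [])
        if len(candidates) != 1:
            # A missing or duplicated name makes the association ambiguous.
            raise ValueError(f"cannot uniquely match operator {name!r}")
        mapping[first_id] = candidates[0]
    return mapping


def distribute_state(original_state: dict, mapping: dict) -> dict:
    """Assign original state data to operators of the current DAG (S630)."""
    return {mapping[first_id]: state
            for first_id, state in original_state.items()}


# Hypothetical job: two stateless "Map" operators and two stateful operators.
original_names = {"a1": "KafkaSource", "b2": "WindowAgg"}
original_state = {"a1": {"offset": 42}, "b2": {"count": 7}}
dag = [("x9", "KafkaSource"), ("m1", "Map"), ("m2", "Map"), ("y8", "WindowAgg")]

mapping = associate_ids(original_names, dag)
new_state = distribute_state(original_state, mapping)
```

Only the stateful operators take part in the join, so the duplicated name "Map" on the stateless operators causes no ambiguity.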

In this embodiment, when the ID of an operator changes, the operator name and the second ID of the operator in the directed acyclic graph of the current job are obtained, the first ID and the second ID are associated according to the operator name, and the state data of the original checkpoint is then distributed to the operators of the directed acyclic graph according to the association between the first ID and the second ID.

Example two

Fig. 9 schematically shows a block diagram of a data recovery apparatus 700 according to the second embodiment of the present application, where the data recovery apparatus 700 may be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to implement the second embodiment of the present application. The program modules referred to in the embodiments of the present application refer to a series of computer program instruction segments that can perform specific functions, and the following description will specifically describe the functions of the program modules in the embodiments.

As shown in Fig. 9, the data recovery apparatus 700 may include an obtaining module 710, a determining module 720, and a recovery module 730.

An obtaining module 710, configured to obtain an original key value of an original check point when a preset condition is met, where the preset condition at least includes that a maximum parallelism of a job changes;

a determining module 720, configured to re-determine a key group to which the original key value belongs under a preset condition, and generate a new check point according to the re-determined key group;

and a recovery module 730, configured to perform recovery according to the new checkpoint if recovery is required.

In an exemplary embodiment, the obtaining module 710 is further configured to: acquire a first key value of the original checkpoint from an original state file, wherein the first key value is the serialized original key value; and deserialize the key in the first key value and take the deserialized result as the original key value.

In an exemplary embodiment, the obtaining module 710 is further configured to: acquire the first key value of the original checkpoint from the original state file through the state processor API.

In an exemplary embodiment, the obtaining module 710 is further configured to: read the first IDs of all operators of the original checkpoint from the metadata file; and call the state processor API for each first ID in sequence to acquire the first key value.

In an exemplary embodiment, the data recovery apparatus 700 further includes an adding module (not shown) configured to add first information to the metadata file, and the obtaining module 710 is further configured to: acquire the first information from the metadata file; generate a state descriptor according to the first information; and call the state processor API for each first ID according to the state descriptor to acquire the first key value.

In an exemplary embodiment, the data recovery apparatus 700 further includes an adjustment module (not shown), wherein the adjustment module is configured to: adjust the calculation mode of the maximum parallelism of the operator to increase the maximum parallelism of the operator.

In an exemplary embodiment, the data recovery apparatus 700 further includes a generating module (not shown), wherein the generating module is configured to: generate the ID of the operator according to a preset rule when the preset condition is met, wherein the preset rule includes not chaining the upstream operator together with the downstream operator.

In an exemplary embodiment, the data recovery apparatus 700 further includes an association module (not shown), wherein the association module is configured to: when the ID of the operator changes, acquire the operator name and a second ID of the operator in the directed acyclic graph of the current job; associate the first ID and the second ID according to the operator name; and distribute the state data of the original checkpoint to the operators of the directed acyclic graph according to the association between the first ID and the second ID.

Example three

Fig. 10 schematically shows a hardware architecture diagram of a computer device 800 suitable for the data recovery method according to the third embodiment of the present application. The computer device 800 may be a device capable of automatically performing numerical calculations and/or data processing according to instructions set or stored in advance. For example, the computer device 800 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), a gateway, and the like. As shown in Fig. 10, the computer device 800 includes at least, but is not limited to, a memory 810, a processor 820, and a network interface 830, which may be communicatively linked to each other by a system bus. Wherein:

The memory 810 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 810 may be an internal storage module of the computer device 800, such as a hard disk or a memory of the computer device 800. In other embodiments, the memory 810 may also be an external storage device of the computer device 800, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 800. Of course, the memory 810 may also include both internal and external storage modules of the computer device 800. In this embodiment, the memory 810 is generally used for storing an operating system installed in the computer device 800 and various application software, such as the program code of the data recovery method. In addition, the memory 810 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 820 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 820 generally serves to control overall operation of the computer device 800, such as performing control and processing related to data interaction or communication with the computer device 800. In this embodiment, the processor 820 is used to execute program codes stored in the memory 810 or process data.

The network interface 830 may include a wireless network interface or a wired network interface, and is typically used to establish communication links between the computer device 800 and other computer devices. For example, the network interface 830 is used to connect the computer device 800 with an external terminal through a network, and to establish a data transmission channel and a communication link between the computer device 800 and the external terminal. The network may be a wireless or wired network such as an intranet (Intranet), the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It is noted that FIG. 10 only shows a computer device having components 810-830, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

In this embodiment, the data recovery method stored in the memory 810 may be further divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 820) to complete the embodiments of the present application.

Example four

Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the data recovery method in the embodiments are implemented.

In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in the computer device, for example, the program codes of the data recovery method in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.

It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above can be implemented by a general-purpose computing device. They can be centralized on a single computing device or distributed over a network composed of a plurality of computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from that shown or described; alternatively, the modules or steps can each be fabricated as an individual integrated circuit module, or several of them can be fabricated as a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application. All equivalent structures or equivalent processes derived from the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present application.

Claims

1. A method for data recovery, comprising:

under the condition that a preset condition is met, acquiring an original key value of an original checkpoint, wherein the preset condition at least comprises that the maximum parallelism of a job changes;

re-determining the key group to which the original key value belongs under the preset condition, and generating a new check point according to the re-determined key group;

and under the condition that recovery is needed, recovering according to the new checkpoint.

2. The data recovery method of claim 1, wherein the acquiring the original key value of the original checkpoint comprises:

acquiring a first key value of an original check point from an original state file, wherein the first key value is the original key value which is serialized;

and deserializing the keys in the first key value, and taking the deserialized result as the original key value.

3. The data recovery method of claim 2, wherein the obtaining the first key value of the original checkpoint from the original state file comprises:

and acquiring a first key value of the original check point from the original state file through the state processor API.

4. The data recovery method of claim 2, wherein the obtaining the first key value of the original checkpoint from the original state file comprises:

reading first IDs of all operators of an original check point from a metadata file;

and calling a state processor API for each first ID in sequence to acquire the first key value.

5. The data recovery method of claim 4, further comprising, before the sequentially calling a state processor API for each of the first IDs to acquire the first key value:

adding first information into the metadata file;

the sequentially calling a state processor API for each first ID to acquire the first key value comprises:

acquiring the first information from the metadata file;

generating a state descriptor according to the first information;

and calling the state processor API for each first ID according to the state descriptor to acquire the first key value.

6. The data recovery method of any one of claims 1-5, further comprising:

and adjusting the calculation mode of the maximum parallelism of the operator to increase the maximum parallelism of the operator.

7. The data recovery method of any one of claims 1-5, further comprising:

and generating the ID of the operator according to a preset rule under the condition that the preset condition is met, wherein the preset rule comprises not chaining the upstream operator together with the downstream operator.

8. The data recovery method of claim 7, further comprising:

under the condition that the ID of the operator is changed, acquiring the name of the operator and a second ID of the operator in the directed acyclic graph of the current job;

associating the first ID and the second ID according to the operator name;

and distributing the state data of the original checkpoint to an operator of the directed acyclic graph according to the association between the first ID and the second ID.

9. A data recovery apparatus, comprising:

an acquisition module, configured to acquire an original key value of an original checkpoint under the condition that a preset condition is met, wherein the preset condition at least comprises that the maximum parallelism of a job changes;

a determining module, configured to re-determine the key group to which the original key value belongs under the preset condition, and to generate a new checkpoint according to the re-determined key group;

and a recovery module, configured to perform recovery according to the new checkpoint under the condition that recovery is needed.

10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the data recovery method according to any one of claims 1 to 8.

11. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor to cause the at least one processor to perform the steps of the data recovery method of any one of claims 1 to 8.