
CN114140363B - Video deblurring method and device and video deblurring model training method and device - Google Patents

Video deblurring method and device and video deblurring model training method and device

Info

Publication number: CN114140363B
Authority: CN (China)
Prior art keywords: gradient; reconstruction; video frame; branch model; current video
Legal status: Active (granted)
Application number: CN202210117459.4A
Other languages: Chinese (zh)
Other versions: CN114140363A
Inventors: 江邦睿; 谢植淮; 李松南
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202210117459.4A

Classifications

    • G06T 5/73: Deblurring; Sharpening (under G06T 5/00, Image enhancement or restoration; G06T, Image data processing or generation, in general; G06, Computing; Calculating or counting; G, Physics)
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F 18/21, Design or setup of recognition systems or techniques; G06F 18/20, Analysing; G06F 18/00, Pattern recognition)
    • G06N 3/08: Learning methods (under G06N 3/02, Neural networks; G06N 3/00, Computing arrangements based on biological models)
    • G06T 2207/10016: Video; Image sequence (under G06T 2207/10, Image acquisition modality; G06T 2207/00, Indexing scheme for image analysis or image enhancement)
    • G06T 2207/20081: Training; Learning (under G06T 2207/20, Special algorithmic details)


Abstract

The application discloses a video deblurring method, a training method for a video deblurring model, corresponding devices, a medium, and a computer device, which can be applied to video deblurring scenarios, video programs, or applets. The method comprises the following steps: acquiring a current video frame of a video and a plurality of adjacent video frames before and after the current video frame; processing the current video frame with a gradient branch model in a trained video deblurring model to generate gradient information of the current video frame at multiple scales; and reconstructing the current video frame with a reconstruction branch model in the trained video deblurring model according to the multi-scale gradient information and the plurality of adjacent video frames, to obtain a deblurred target video frame corresponding to the current video frame, wherein the gradient branch model and the reconstruction branch model are both encoder-decoder models.

Description

Video deblurring method and device and video deblurring model training method and device
Technical Field
The application relates to the field of computer technology, and in particular to a video deblurring method and device, a training method and device for a video deblurring model, a medium, and computer equipment.
Background
With the popularization of handheld photographic equipment such as mobile phones, more and more users record video with handheld devices while using products or programs. However, owing to camera shake, object motion, and the like, parts of the video are inevitably blurred, which degrades the user's experience of the device or program.
Disclosure of Invention
The embodiment of the application provides a video deblurring method, a video deblurring device, a training method and a training device for a video deblurring model, a medium and computer equipment.
In one aspect, a video deblurring method is provided and includes:
acquiring a current video frame of a video and a plurality of adjacent video frames at times before and after the current video frame;
processing the current video frame according to a gradient branch model in a trained video deblurring model to generate gradient information of multiple scales of the current video frame;
and reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the multi-scale gradient information and the plurality of adjacent video frames, to obtain a deblurred target video frame corresponding to the current video frame, wherein the gradient branch model and the reconstruction branch model are encoder-decoder models.
In another aspect, a training method for a video deblurring model is provided, where the method includes:
acquiring historical training data, wherein the historical training data comprises a historical current video frame of a video and a plurality of historical adjacent video frames of the historical current video frame;
performing gradient extraction on the historical current video frame by using a gradient branch model to generate historical gradient information and a training gradient map of multiple scales of the historical current video frame;
reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames to obtain a deblurred training target video frame corresponding to the historical current video frame;
acquiring a label gradient map and a label video frame of the historical current video frame in the historical training data;
establishing a first loss function for the label gradient map and the training gradient map, and performing first optimization on the gradient branch model according to the first loss function;
establishing a second loss function for the label video frame and the training target video frame, and performing second optimization on the reconstruction branch model according to the second loss function;
and determining the model as a trained video deblurring model according to the gradient branch model after the first optimization and the reconstruction branch model after the second optimization.
In another aspect, a video deblurring apparatus is provided, including:
an acquisition unit, used for acquiring a current video frame of a video and a plurality of adjacent video frames at times before and after the current video frame;
the gradient processing unit is used for processing the current video frame according to a gradient branch model in the trained video deblurring model to generate gradient information of multiple scales of the current video frame;
and a reconstruction processing unit, used for reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the multi-scale gradient information and the plurality of adjacent video frames, to obtain a deblurred target video frame corresponding to the current video frame, wherein the gradient branch model and the reconstruction branch model are encoder-decoder models.
In another aspect, an apparatus for training a video deblurring model is provided, including:
a first acquisition unit, used for acquiring historical training data, wherein the historical training data comprises a historical current video frame of a video and a plurality of historical adjacent video frames of the historical current video frame;
the gradient processing unit is used for performing gradient extraction on the historical current video frame by using a gradient branch model to generate historical gradient information and a training gradient map of multiple scales of the historical current video frame;
the reconstruction processing unit is used for reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames so as to obtain a deblurred training target video frame corresponding to the historical current video frame;
the second acquisition unit is used for acquiring a label gradient map and a label video frame of the historical current video frame in the historical training data;
the first optimization unit is used for establishing a first loss function for the label gradient map and the training gradient map and performing first optimization on the gradient branch model according to the first loss function;
the second optimization unit is used for establishing a second loss function for the label video frame and the training target video frame and carrying out second optimization on the reconstruction branch model according to the second loss function;
and the determining unit is used for determining the model as a trained video deblurring model according to the first optimized gradient branch model and the second optimized reconstruction branch model.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, the computer program being adapted to be loaded by a processor to perform the steps of the method according to any of the embodiments above.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having a computer program stored therein, the processor being configured to perform the steps of the method according to any of the above embodiments by calling the computer program stored in the memory.
In the present application, a current video frame of a video and a plurality of adjacent video frames at times before and after it are obtained; the current video frame is processed by the gradient branch model in the trained video deblurring model to generate gradient information of the current video frame at multiple scales; and the current video frame is reconstructed by the reconstruction branch model in the trained video deblurring model according to the multi-scale gradient information and the plurality of adjacent video frames, obtaining the deblurred target video frame corresponding to the current video frame. On the one hand, the current video frame is processed with the gradient branch model to obtain multi-scale gradient information, which comprises both local and global information of the gradient. This gradient information guides the reconstruction branch model in reconstructing the current video frame, providing structural prior information for video-frame reconstruction. Meanwhile, compared with the prior art, the present application provides an end-to-end video deblurring model: by fusing gradient information, the reconstructed video frame has a clearer outline, no optical-flow method is needed to align adjacent frames, the computational complexity is low, and the memory footprint is small. On the other hand, compared with prior art that performs deblurring reconstruction from only a single current video frame, the present application takes the current video frame and a plurality of adjacent video frames, i.e., a plurality of consecutive video frames, as input; the gradient branch model reconstructs a gradient map of the intermediate frame, the reconstruction branch model reconstructs the intermediate frame, and the reconstruction result of the intermediate frame is finally obtained. Using information from adjacent frames to assist reconstruction effectively improves the deblurring effect.
For training, historical training data is obtained, comprising a historical current video frame of a video and a plurality of historical adjacent video frames of that frame. Gradient extraction is performed on the historical current video frame with the gradient branch model to generate historical gradient information at multiple scales and a training gradient map of the historical current video frame. The historical current video frame is then reconstructed with the reconstruction branch model according to the multi-scale historical gradient information and the plurality of historical adjacent video frames, obtaining the deblurred training target video frame corresponding to the historical current video frame. A label gradient map and a label video frame of the historical current video frame are obtained from the historical training data; a first loss function is established between the label gradient map and the training gradient map, and the gradient branch model is optimized according to the first loss function; a second loss function is established between the label video frame and the training target video frame, and the reconstruction branch model is optimized according to the second loss function; the model combining the first-optimized gradient branch model and the second-optimized reconstruction branch model is determined as the trained video deblurring model. The two branch models of the video deblurring model can each be optimized with an L1 loss function, thereby optimizing the parameters of the network modules in both branches, improving the deblurring effect in application while keeping computational complexity and memory usage low.
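As an illustration of this two-loss optimization, the following is a minimal training-step sketch, assuming a PyTorch-style implementation; the interfaces of `gradient_branch` and `reconstruction_branch` and the joint single-optimizer update are assumptions for illustration, not the patent's exact procedure:

```python
import torch
import torch.nn.functional as F

def train_step(gradient_branch, reconstruction_branch, optimizer, batch):
    # batch: historical current frame, its adjacent frames, and the two labels
    cur_frame, neighbors, label_grad_map, label_frame = batch

    # Gradient branch: multi-scale gradient info and a training gradient map
    grad_info, train_grad_map = gradient_branch(cur_frame)

    # Reconstruction branch: deblurred training target frame
    train_target = reconstruction_branch(cur_frame, neighbors, grad_info)

    # First loss: L1 between label gradient map and training gradient map
    loss_grad = F.l1_loss(train_grad_map, label_grad_map)
    # Second loss: L1 between label video frame and training target frame
    loss_rec = F.l1_loss(train_target, label_frame)

    # Here the two L1 losses are summed for one joint update; the patent
    # describes optimizing each branch with its own loss.
    loss = loss_grad + loss_rec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```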
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of video deblurring based on full-range region correlation;
fig. 2 is a schematic flowchart of a video deblurring method according to an embodiment of the present application;
fig. 3 is another schematic flow chart of a video deblurring method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a residual block used in a video deblurring method according to an embodiment of the present disclosure;
FIG. 5 is a system framework diagram of a video deblurring method provided by an embodiment of the present application;
FIG. 6 is a block diagram of another system framework of a video deblurring method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating a video deblurring method according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of another system framework for a video deblurring method according to an embodiment of the present application;
FIGS. 9a and 9b are diagrams of another example of a video deblurring method according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a video deblurring apparatus provided in an embodiment of the present application;
FIG. 11 is a schematic flow chart illustrating a video deblurring method according to an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a training apparatus according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is to be understood that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
First, some terms or expressions appearing in the course of describing the embodiments of the present application are explained as follows:
the codec structure: the encoding and decoding structure comprises an encoder and a decoder, and the encoder is used for learning to obtain a characteristic map of an input image through a neural network after the input image is given; and the decoder gradually realizes the class marking of each pixel after the encoder provides the feature map. For example, SegNet encoder structure model, SegNet encoder structure and decoder structure are in one-to-one correspondence, i.e., one decoder has the same space size and channel number as its corresponding encoder. For the basic SegNet structure, there are 13 convolutional layers for each, where the convolutional layers of the encoder correspond to the first 13 convolutional layers in the VGG16 network structure.
The blockchain system: a distributed system formed by a client and a plurality of nodes (any form of computing device in the access network, such as servers and user terminals) connected via network communication. The nodes form a peer-to-peer (P2P) network; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In such a distributed system, any machine such as a server or a terminal can join and become a node, and a node comprises a hardware layer, an intermediate layer, an operating system layer, and an application layer.
A convolutional neural network: a convolutional neural network is mainly composed of the following layers: an input layer, convolutional layers, ReLU layers, pooling layers, and fully-connected layers (the fully-connected layer is the same as in a conventional neural network). Stacking these layers yields a complete convolutional neural network.
Convolutional layer: convolutional layers are the core layers for building convolutional neural networks and produce most of the computation in the network. Each convolutional layer consists of several convolution units whose parameters are optimized by the back-propagation algorithm. The convolution operation extracts different input features: the first convolutional layer may only extract low-level features such as edges, lines, and corners, while deeper networks iteratively extract more complex features from these low-level features.
Residual module (ResNet): a residual learning module comprises several convolutional layers that transform the module's input, while the original input skips over these layers and is passed directly to a later layer; the sum of the two is then passed through an activation function to produce the module's output. What the stacked layers learn is essentially the difference between the output and the input, i.e., the residual.
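A minimal sketch of such a residual module, matching the fig. 4 structure described later (two 3 × 3 convolutions plus a cross-layer link), assuming a PyTorch-style implementation:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The convolutions learn the residual; the original input skips
        # over them via the cross-layer link and is added back.
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)
```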
At present, in the field of video deblurring, one technique deblurs a video frame by fusing similar regions across the temporal and spatial domains. For example, the paper "ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring" proposes a video deblurring model that fuses similar regions in the temporal and spatial domains. The model takes three adjacent frames as input, first aligns the adjacent frames with the intermediate frame using an optical-flow network, and extracts features from each of the three frames. It then constructs a feature pyramid by downsampling so that features at different scales are considered. Next, for the features at each scale, the model uses attention mechanisms in the temporal and spatial domains to compute region similarity, and fuses the features of similar regions into the current region. Finally, the fused features are fed into a reconstruction network to obtain the deblurred intermediate frame. FIG. 1 shows the structure of this video deblurring model based on full-range region correlation.
It is understood that, when a dynamic scene is photographed, a video inevitably generates a blur due to shaking of a photographing apparatus, a rapid movement of an object, and the like. The video deblurring aims to restore a clear frame sequence from the blurred video, and the clear frame sequence is beneficial to subsequent tasks of object detection, video understanding and the like. However, the above-described deblurring technique has the following disadvantages in practice:
on the one hand, this scheme first uses an optical-flow model to align adjacent frames with the current frame in the image domain; once the scene changes quickly and over long distances, the optical-flow field is estimated incorrectly, so the alignment result is defective, which ultimately harms the reconstruction effect of the subsequent network;
on the other hand, this scheme uses attention mechanisms to compute region similarity separately in the temporal and spatial domains, which has high computational complexity and large memory usage; moreover, since similar regions account for only a small fraction of the whole, computing similarity against dissimilar regions is wasted work.
The embodiment of the application provides a video deblurring method, a video deblurring device, a video deblurring medium and video deblurring equipment. The video frame obtained by reconstruction has a clearer outline by fusing gradient information, the deblurring effect of the video is improved, and the video quality is improved. The embodiment of the application can be applied to a video deblurring scene and a video program or an applet, for example, a micro-vision video, and is applied to processing a blurred video with low quality so as to improve the quality of the video.
Specifically, the method of the embodiment of the present application may be executed by a computer device, where the computer device may be a terminal or a server. The user terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an aircraft, and the like.
In order to better understand the technical solutions provided in the embodiments of the present application, some brief descriptions of applicable application scenarios are given below; it should be noted that the application scenarios described below are only used to illustrate the embodiments of the present application and are not limiting. The video deblurring method is described as being performed by a computer device, where the computer device may be a terminal, a server, or another device.
The embodiments of the application can be implemented in combination with cloud technology or blockchain network technology. In the video deblurring method disclosed in the embodiments of the present application, the data involved can be stored on a blockchain; for example, the current video frame, the adjacent video frames, the video deblurring model, the gradient branch model, the reconstruction branch model, and the target video frame can all be stored on the blockchain.
To facilitate storage and querying of the current video frame, the adjacent video frames, the video deblurring model, the gradient branch model, the reconstruction branch model, and the target video frame, the video deblurring method optionally further includes: sending these items to a blockchain network, so that a node of the blockchain network fills them into a new block, and when consensus is reached on the new block, it is appended to the end of the blockchain. In this way, the current video frame, the adjacent video frames, the models, and the target video frame are stored on-chain as a recorded backup; when a target video frame is needed, it can be retrieved directly and quickly from the blockchain, improving the processing efficiency of video deblurring.
The following are detailed descriptions. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
Embodiments of the present application provide a video deblurring method; the embodiments are described with a computer device as the executing entity.
Referring to fig. 2, a flow chart of a video deblurring method according to an embodiment of the present disclosure is shown, where the method includes:
step 210: the method comprises the steps of obtaining a current video frame of a video and a plurality of adjacent video frames before and after the current video frame.
Specifically, the video may be in any format. First, video frames need to be extracted from the video; they can be extracted as pictures in various formats, such as RGB. The current video frame is determined, and a plurality of adjacent video frames at times before and after it are acquired; for example, if one frame before and one frame after the current video frame It are acquired, the adjacent video frames are It-1 and It+1. It can be understood that the current video frame is blurred or needs to be made sharper, and a clearer video frame is obtained after it is processed by the video deblurring method.
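For example, extracting the current frame and its two neighbors from a video file might look like the following OpenCV-based sketch; the function name and the choice of one neighbor on each side are illustrative:

```python
import cv2

def get_frame_triplet(video_path, t):
    """Return frames I(t-1), I(t), I(t+1) of a video as RGB pictures."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in (t - 1, t, t + 1):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the frame index
        ok, bgr = cap.read()
        if not ok:
            raise ValueError(f"cannot read frame {idx}")
        frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames  # [It-1, It, It+1]
```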
Step 220: and processing the current video frame according to a gradient branch model in the trained video deblurring model to generate gradient information of multiple scales of the current video frame.
Specifically, the video deblurring model adopts a structure similar to an encoding-decoding structure, in which the input is first sent through a series of convolutional layers and downsampling layers for processing, and then through a series of convolutional layers and upsampling layers.
Optionally, please refer to fig. 3, which is another schematic flow diagram of a video deblurring method according to an embodiment of the present application, including:
step 310: acquiring a current video frame of a video and a plurality of adjacent video frames before and after the current video frame;
the method comprises the following steps of processing a current video frame according to a gradient branch model in a trained video deblurring model, and generating gradient information of multiple scales of the current video frame, wherein the steps comprise:
step 320: sending the current video frame and a plurality of adjacent video frames to a reconstruction branch model for processing to obtain reconstruction characteristics;
step 330: sending the reconstruction characteristics to a gradient branch model so that the gradient branch model generates gradient information of a plurality of scales according to the reconstruction characteristics and the current video frame, and sending the gradient information to the reconstruction branch model;
step 340: and reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of a plurality of scales and a plurality of adjacent video frames so as to obtain the deblurred target video frame corresponding to the current video frame.
Specifically, in step 320, the current video frame and a plurality of adjacent video frames are sent to the reconstruction branch model for processing to obtain the reconstruction features. The current video frame and the adjacent video frames can be concatenated to obtain a concatenated multi-frame input; for example, the current video frame It and the adjacent video frames It-1 and It+1 at the two neighboring times are extracted, and the frames It-1, It, and It+1 are concatenated.
The reconstruction branch model processes the concatenated It-1, It, and It+1 frames to obtain the reconstruction features. The processing includes processing with convolutional layers (Convolutional Layer), residual blocks (Residual Block), or dense blocks of densely connected convolutional networks (Dense Block).
Please refer to fig. 4, which shows the basic structure of a residual block (Residual Block), the basic building unit of the ResNet model; it adds a cross-layer link on top of two 3 × 3 convolutional layers.
In step 330, the reconstruction characteristics are sent to the gradient branch model, so that the gradient branch model generates gradient information of multiple scales according to the reconstruction characteristics and the current video frame, and sends the gradient information to the reconstruction branch model.
The reconstruction features obtained from the current video frame and the plurality of adjacent video frames contain rich texture and structure information of the video frames, which is helpful to the processing of the gradient branch model. The reconstruction branch model can feed the reconstruction features back to the gradient branch model by concatenation, so that the gradient branch model generates gradient information of multiple scales according to the reconstruction features and the current video frame and sends the gradient information back to the reconstruction branch model.
In this way, the intermediate results of the reconstruction branch model, i.e., the reconstruction features, are fed into the gradient branch model by concatenation, providing image structure information to the gradient branch model and assisting it in reconstructing the gradient information. Since the reconstruction features contain abundant texture and structure information, they further benefit the gradient branch's reconstruction of the gradient information.
Optionally, the gradient branch model includes M gradient encoding modules and M gradient decoding modules, where M is a positive integer, and the step of processing the current video frame according to the gradient branch model in the trained video deblurring model to generate gradient information of multiple scales of the current video frame includes:
The current video frame is sent to the gradient branch model for processing to obtain a blur gradient map of the current video frame.
Here the processing comprises extraction of a gradient map. It can be understood that the gradient direction is the direction in which the function changes fastest: where the image has an edge, there must be a large gradient value; conversely, in smoother parts of the image the gray-level change is small and the corresponding gradient is small. In image processing, the modulus of the gradient is often referred to simply as the gradient, and the image formed by the per-pixel gradients is called a gradient image. To extract the gradient map, the gray-level change in a neighborhood of each pixel of the image is considered and, using the first- or second-derivative behavior near edges, a gradient operator such as the Sobel operator, the Robinson operator, or the Laplace operator is applied in a neighborhood of the pixels of the original image.
For example, the current video frame is subjected to gradient-map extraction; using simple differences, the extraction can be expressed as:

Gx(x, y) = I(x+1, y) - I(x, y)
Gy(x, y) = I(x, y+1) - I(x, y)

where x is the horizontal coordinate axis, y is the vertical coordinate axis, Gx(x, y) is the gradient of pixel (x, y) in the x direction, Gy(x, y) is the gradient of pixel (x, y) in the y direction, and I(·) is the pixel value. The gradient value at pixel (x, y) is then:

G(x, y) = sqrt(Gx(x, y)^2 + Gy(x, y)^2)

where Gx is the directional gradient of pixel (x, y) in the x direction and Gy is the directional gradient in the y direction. The gradient values of all pixels in the video frame form the blur gradient map.
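A sketch of this extraction with NumPy; the forward-difference scheme shown is one possible choice, and a Sobel operator could be substituted:

```python
import numpy as np

def blur_gradient_map(img):
    """img: 2-D grayscale array; returns the per-pixel gradient magnitude."""
    img = img.astype(np.float32)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]  # Gx(x, y) = I(x+1, y) - I(x, y)
    gy[:-1, :] = img[1:, :] - img[:-1, :]  # Gy(x, y) = I(x, y+1) - I(x, y)
    return np.sqrt(gx ** 2 + gy ** 2)      # G(x, y) = sqrt(Gx^2 + Gy^2)
```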
Convolution processing is performed on the blur gradient map to obtain blur gradient features of the blur gradient map.
Specifically, features can be extracted from the blur gradient map through convolutional layers to obtain the features of the blur gradient map.
The reconstruction features and the blur gradient features are then downsampled M times by the M gradient encoding modules.
Here M may be, for example, 2, 3, or 4; the number of gradient encoding modules can be chosen according to the requirements on performance and speed. A larger number gives better video-frame reconstruction but lower speed.
The gradient encoding module may employ residual blocks (Residual Block) plus downsampling, where the residual block may be replaced by other network blocks with the same function, such as a Dense Block. For example, two residual operations and one downsampling operation form one gradient encoding module.
The results of the M downsampling operations are then upsampled M times by the M gradient decoding modules to generate gradient information of multiple scales, and the gradient information is sent to the reconstruction branch model.
The M gradient decoding modules match the M gradient encoding modules in number, e.g., M = 2, 3, or 4, and the number can likewise be chosen according to the requirements on performance and speed: the larger the number, the better the reconstructed video frame and the slower the speed.
Like the gradient encoding module, the gradient decoding module may also employ residual blocks plus upsampling, where the residual block may be replaced by other network blocks with the same function, such as a Dense Block. For example, one upsampling operation and two residual operations form one gradient decoding module.
The M downsampling operations and the M upsampling operations correspond one to one. Each sampling step gives the gradient features a different scale, and different scales attend to different information: each downsampling shrinks the feature map, and each upsampling enlarges it again, restoring the scale reduced by downsampling. The larger-scale layers attend to local information such as texture, while the smaller-scale layers attend to global information such as the overall structure. Using downsampling and upsampling thus lets the gradient branch model process local and global gradient information simultaneously.
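A toy demonstration of how M = 2 downsampling and upsampling steps change the feature-map scale, assuming PyTorch, with pooling and interpolation standing in for the learned modules:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 128, 128)        # feature map at the largest scale
d1 = F.avg_pool2d(x, 2)                 # first downsampling  -> 64 x 64
d2 = F.avg_pool2d(d1, 2)                # second downsampling -> 32 x 32 (global info)
u1 = F.interpolate(d2, scale_factor=2)  # first upsampling    -> 64 x 64
u2 = F.interpolate(u1, scale_factor=2)  # second upsampling   -> 128 x 128 (local info)
print(x.shape, d1.shape, d2.shape, u1.shape, u2.shape)
```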
Step 230: and reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of a plurality of scales and a plurality of adjacent video frames so as to obtain the deblurred target video frame corresponding to the current video frame.
Specifically, the multi-scale gradient information includes global and local information of the gradient, and it is used to guide the image-information-bearing features of the reconstruction branch model in reconstructing the current video frame, thereby obtaining the deblurred target video frame corresponding to the current video frame.
Referring to fig. 5, a system framework diagram of a video deblurring method according to an example of the present application is shown. On the one hand, the current video frame It is extracted and sent to the gradient branch model of the video deblurring model for processing, generating gradient information of multiple scales that is fed back to the reconstruction branch for processing. On the other hand, the current video frame It and the adjacent video frames It-1 and It+1 are extracted and sent to the reconstruction branch model for processing, so that the reconstruction branch model reconstructs the current video frame according to the gradient information and its own processing result on the consecutive video frames, obtaining the deblurred target video frame.
Optionally, the reconstructing branch model includes M reconstructing encoding modules and M reconstructing decoding modules, and the step of reconstructing the current video frame by using the reconstructing branch model in the trained video deblurring model according to the gradient information of multiple scales and multiple adjacent video frames to obtain the deblurred target video frame corresponding to the current video frame includes:
and performing feature extraction on the current video frame and a plurality of adjacent video frames to obtain frame structure features.
The current video frame and the plurality of adjacent video frames can be concatenated to obtain a concatenated multi-frame input; for example, the current video frame It and the adjacent video frames It-1 and It+1 at the two neighboring times are extracted, and the frames It-1, It, and It+1 are concatenated.
The reconstruction branch model processes the concatenated It-1, It, and It+1 frames to obtain the frame structure features. The processing includes processing with convolutional layers (Convolutional Layer), residual blocks (Residual Block), or dense blocks of densely connected convolutional networks (Dense Block).
Performing down-sampling processing on the frame structure characteristics for M times through M reconstruction coding modules to obtain reconstruction intermediate characteristics, and sending the reconstruction intermediate characteristics to M gradient coding modules;
wherein the M reconstruction coding modules include a plurality of reconstruction coding modules, such as M =2, 3, 4, and the number of reconstruction coding modules can be selected according to the requirements for performance and speed. The greater the number, the better the reconstructed video frame, and the slower the speed.
The reconstruction encoding module may employ residual blocks (Residual Block) plus downsampling, where the residual block may be replaced by other network blocks with the same function, such as a Dense Block. For example, two residual operations and one downsampling operation form one reconstruction encoding module.
Gradient information of multiple scales is acquired from the gradient branch, and the M reconstruction decoding modules process the multi-scale gradient information together with the reconstruction features to reconstruct the current video frame, obtaining the deblurred target video frame corresponding to the current video frame.
After the gradient branch model calculates gradient information of multiple scales of the current video frame, the gradient information of the multiple scales is sent to M reconstruction decoding modules of the reconstruction branch model, and the gradient information and the reconstruction characteristics of the multiple scales are processed.
Wherein the M reconstruction decoding modules include a plurality of reconstruction decoding modules, such as M =2, 3, 4, and the number of reconstruction decoding modules can be selected according to requirements for performance and speed. The greater the number, the better the reconstructed video frame, and the slower the speed.
The reconstruction decoding module may employ residual blocks (Residual Block) plus upsampling, where the residual block may be replaced by other network blocks with the same function, such as a Dense Block. For example, one upsampling operation and two residual operations form one reconstruction decoding module.
The reconstruction branch model adopts a plurality of coding modules and a plurality of decoding modules to carry out down-sampling and up-sampling processing for a plurality of continuous video frames for a plurality of times, so that the reconstruction branch model can reconstruct the current video frame by utilizing gradient information of a plurality of scales and the characteristics of the video frame to obtain the deblurred target video frame corresponding to the current video frame.
Optionally, M =2, the reconstruction branch model includes 2 reconstruction encoding modules and 2 reconstruction decoding modules, and the method for obtaining the reconstructed intermediate feature by performing downsampling processing on the frame structure feature M times through the M reconstruction encoding modules includes:
sending the frame structure characteristics to a gradient branch model;
sending the frame structure characteristics to a first reconstruction coding module for processing to obtain first reconstruction intermediate characteristics, and sending the first reconstruction intermediate characteristics to a gradient branch model;
and sending the first reconstruction intermediate feature to a second reconstruction coding module for processing to obtain a second reconstruction intermediate feature, and sending the second reconstruction intermediate feature to the gradient branch model.
Specifically, M =2, the reconstruction branch model includes 2 reconstruction encoding modules and 2 reconstruction decoding modules, that is, the reconstruction branch model may include a first reconstruction encoding module, a second reconstruction encoding module, a first reconstruction decoding module, and a second reconstruction decoding module.
The above four modules may adopt residual blocks (Residual Block) plus down- or upsampling, where the residual block may be replaced by other network modules with the same function, such as a Dense Block. For example, two residual operations and one sampling operation form one module.
And sending the frame structure characteristics, the first reconstruction intermediate characteristics and the second reconstruction intermediate characteristics extracted from the continuous video frames by the reconstruction branch model to the gradient branch model, wherein the gradient branch model can reconstruct a gradient map of the current video frame according to the characteristics and generate gradient information of a plurality of scales.
Optionally, M = 2 and the gradient branch model includes 2 gradient encoding modules, 2 gradient decoding modules, and 1 residual module, and the method of downsampling the reconstruction features and the blur gradient features M times through the M gradient encoding modules includes:
sending the blur gradient features and the frame structure features to a first gradient encoding module for processing through the gradient branch model, to obtain first gradient intermediate features;
specifically, M =2, the gradient branch model includes 2 gradient encoding modules and 2 gradient decoding modules, that is, the gradient branch model may include a first gradient encoding module, a second gradient encoding module, a first gradient decoding module, and a second gradient decoding module.
The gradient branch model can concatenate the blur gradient features with the frame structure features, and the concatenation is sent to the first gradient encoding module for convolution and downsampling to obtain the first gradient intermediate features.
Sending the first gradient intermediate feature and the first reconstruction intermediate feature to a second gradient coding module for processing to obtain a second gradient intermediate feature;
and cascading the first gradient intermediate characteristic and the first reconstruction intermediate characteristic through a gradient branch model, and sending the cascaded first gradient intermediate characteristic and the first reconstruction intermediate characteristic to a second gradient coding module for convolution and downsampling processing to obtain a second gradient intermediate characteristic.
And sending the second gradient intermediate feature and the second reconstruction intermediate feature to a residual error module for processing so as to finish the down-sampling processing for 2 times.
And cascading the second gradient intermediate feature and the second reconstruction intermediate feature through a gradient branch model, and sending the cascaded second gradient intermediate feature and the second reconstruction intermediate feature to a residual error module for convolution and cross-layer link processing. The implementation of the residual error module is the same as above, and will not be described herein again.
Optionally, the performing, by the M gradient decoding modules, the M upsampling on the result of the M downsampling processes to generate gradient information of multiple scales, and sending the gradient information to the reconstruction branch model includes:
determining the result processed by the residual error module as first scale gradient information, and sending the first scale gradient information to a reconstruction branch model;
specifically, 2 gradient encoding modules perform downsampling processing for 2 times to obtain intermediate features of 3 scales, and further 2 gradient decoding modules perform upsampling processing for 2 times to obtain gradient information of 3 scales, namely first scale gradient information, second scale gradient information and third scale gradient information.
And in the gradient branch model, the second gradient intermediate feature and the second reconstruction intermediate feature are cascaded and then sent to a residual error module for convolution and cross-layer link processing, and the processing result is first scale gradient information.
Sending the first scale gradient information to a first gradient decoding module for processing to obtain second scale gradient information, and sending the second scale gradient information to a reconstruction branch model;
and the first scale gradient information is sent to a first gradient decoding module for convolution and up-sampling processing to obtain second scale gradient information, and the second scale gradient information is sent to a reconstruction branch model.
And sending the second scale gradient information to a second gradient decoding module for processing to obtain third scale gradient information, and sending the third scale gradient information to the reconstruction branch model.
Similarly, the second scale gradient information is sent to the second gradient decoding module for convolution and upsampling to obtain third scale gradient information, and the third scale gradient information is sent to the reconstruction branch model.
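Putting these steps together, a forward-pass sketch of this M = 2 gradient branch follows, assuming a PyTorch-style module; `enc1`, `enc2`, `res`, `dec1`, and `dec2` are hypothetical names for the two gradient encoding modules, the residual module, and the two gradient decoding modules:

```python
import torch

def gradient_branch_forward(self, blur_grad_feat, frame_feat, rec_feat1, rec_feat2):
    # First gradient encoding: concatenate the blur gradient features with
    # the frame structure features fed back from the reconstruction branch.
    g1 = self.enc1(torch.cat([blur_grad_feat, frame_feat], dim=1))
    # Second gradient encoding: concatenate with the first reconstruction
    # intermediate features.
    g2 = self.enc2(torch.cat([g1, rec_feat1], dim=1))
    # Residual module over the concatenation with the second reconstruction
    # intermediate features: the result is the first scale gradient information.
    s1 = self.res(torch.cat([g2, rec_feat2], dim=1))
    s2 = self.dec1(s1)  # second scale gradient information
    s3 = self.dec2(s2)  # third scale gradient information
    return s1, s2, s3   # each scale is also sent to the reconstruction branch
```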
Referring to FIG. 6, an exemplary structure of the gradient branch model is shown. Labeled in the figure are the blur gradient features, the first gradient intermediate features, and the second gradient intermediate features of the gradient branch, together with the frame structure features, the first reconstruction intermediate features, and the second reconstruction intermediate features fed in from the reconstruction branch.
Referring to fig. 7, another flow chart of a video deblurring method according to an embodiment of the present disclosure is shown, where the method includes:
step 710: acquiring a current video frame of a video and a plurality of adjacent video frames before and after the current video frame;
step 720: processing the current video frame according to a gradient branch model in the trained video deblurring model to generate gradient information of multiple scales of the current video frame;
the gradient information of a plurality of scales can be obtained according to any one of the above-described methods for generating gradient information of a plurality of scales.
Step 730: performing feature fusion processing on the gradient information and the reconstruction features of multiple scales by utilizing spatial feature transformation to obtain fusion features;
among them, Spatial Feature Transform (SFT) can perform not only Feature manipulation but also Spatial transformation. Compared with cascade feedback, the method has the advantage that gradient prior can be better provided for image reconstruction by utilizing the spatial feature transformation structure to feed back to a plurality of reconstruction decoders of the reconstruction branch model.
Specifically, the inputs of the spatial feature transform are the multi-scale gradient information and the reconstruction features. First, the gradient information G is sent to two different convolutional layers (for example, two different 3 × 3 convolutional layers) to obtain α and β; then α is multiplied element-wise (position by position) with the reconstruction features F output by the reconstruction branch model, and finally β is added, giving the fusion features. The expression is:

SFT(F, G) = α ⊙ F + β, where α = Conv_α(G) and β = Conv_β(G)

and ⊙ denotes element-wise multiplication.
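A minimal sketch of such a spatial feature transform layer, assuming PyTorch and following the 3 × 3 kernel example above; the class and attribute names are hypothetical:

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    def __init__(self, grad_ch, feat_ch):
        super().__init__()
        # Two different convolutions map the gradient information to alpha, beta
        self.conv_alpha = nn.Conv2d(grad_ch, feat_ch, 3, padding=1)
        self.conv_beta = nn.Conv2d(grad_ch, feat_ch, 3, padding=1)

    def forward(self, feat, grad_info):
        alpha = self.conv_alpha(grad_info)
        beta = self.conv_beta(grad_info)
        # Element-wise multiplication with the reconstruction features, then add beta
        return alpha * feat + beta
```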
Step 740: and processing the fusion characteristics by using a reconstruction branch model in the trained video deblurring model to obtain a deblurred target video frame corresponding to the current video frame.
Optionally, the step of processing the fusion feature by using a reconstruction branch model in the trained video deblurring model to obtain a deblurred target video frame corresponding to the current video frame includes:
performing spatial feature transformation on the second reconstruction intermediate feature and the first scale gradient information to obtain a first fusion feature;
performing spatial feature transformation on the first reconstruction intermediate feature and the second scale gradient information to obtain a second fusion feature;
performing spatial feature transformation on the frame structure feature and the third scale gradient information to obtain a third fusion feature;
and performing upsampling processing on the first fusion feature, the second fusion feature, and the third fusion feature through the reconstruction decoding modules to obtain the target video frame.
Please refer to fig. 8, which is another schematic flow chart of the video deblurring model performing deblurring processing on the current video frame. The upper part of the figure is the gradient branch model, and the lower part is the reconstruction branch model. In the gradient branch model, gradient extraction is performed on the blurred current video frame I_t to obtain a blurred gradient map, and the blurred gradient map then enters the convolution layer for feature extraction to obtain the blurred gradient feature G_0; the first gradient intermediate feature is denoted G_1 and the second gradient intermediate feature G_2. In the reconstruction branch model, the frame structure feature is denoted R_0, the first reconstruction intermediate feature R_1, and the second reconstruction intermediate feature R_2.
In the reconstruction branch model shown in the figure, R_2 and the first scale gradient information undergo spatial feature transformation to obtain the first fusion feature C_1; R_1 and the second scale gradient information undergo spatial feature transformation to obtain the second fusion feature C_2; and R_0 and the third scale gradient information undergo spatial feature transformation to obtain the third fusion feature C_3. The first fusion feature C_1 enters the first reconstruction decoding module of the reconstruction branch model to obtain D_1; D_1 is cascaded with the second fusion feature C_2 and the cascaded result is sent to the second reconstruction decoding module to obtain D_2; D_2 is cascaded with the third fusion feature C_3 and the cascaded result is sent to a convolution layer, or processed by a residual module, to restore the features to the original video frame, yielding the deblurred target video frame Î_t.
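The cascade just described can be sketched as follows; the transposed-convolution upsampling, channel widths, and module names are assumptions, and only the wiring (C_1 through the first decoder, concatenation with C_2 and then C_3, a final convolution back to a frame) follows the text.

```python
import torch
import torch.nn as nn

class ReconstructionDecoder(nn.Module):
    """Sketch of the two reconstruction decoding modules of fig. 8.
    Upsampling blocks and channel counts are assumptions."""
    def __init__(self, c: int):
        super().__init__()
        # Each decoding module doubles the spatial resolution (upsampling stage).
        self.decode1 = nn.Sequential(
            nn.ConvTranspose2d(c, c, kernel_size=4, stride=2, padding=1), nn.ReLU())
        self.decode2 = nn.Sequential(
            nn.ConvTranspose2d(2 * c, c, kernel_size=4, stride=2, padding=1), nn.ReLU())
        # Final convolution restores a 3-channel video frame.
        self.to_frame = nn.Conv2d(2 * c, 3, kernel_size=3, padding=1)

    def forward(self, c1: torch.Tensor, c2: torch.Tensor, c3: torch.Tensor) -> torch.Tensor:
        d1 = self.decode1(c1)                         # C_1 -> first decoding module
        d2 = self.decode2(torch.cat([d1, c2], 1))     # cascade with C_2 -> second module
        return self.to_frame(torch.cat([d2, c3], 1))  # cascade with C_3 -> target frame
```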
Effectiveness testing was performed on 10 test videos, with the experimental results as follows:
[Results table: per-video PSNR and SSIM values comparing ARVo with the method of the present application; the table image is not reproduced here.]
Here, ARVo is the existing video deblurring model based on all-range regional correlation shown in fig. 1.
In terms of the two indexes of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), the video deblurring method of the present application obtains better index values than ARVo, and therefore a better deblurring effect on blurred video frames.
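For reference, the two reported indexes can be computed as in the sketch below; this is a generic NumPy/scikit-image version, not the evaluation code behind the table above, and the input arrays are hypothetical.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a clear reference frame and a restored frame."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# ref_frame and restored_frame are hypothetical H x W x 3 uint8 arrays.
# score_psnr = psnr(ref_frame, restored_frame)
# score_ssim = structural_similarity(ref_frame, restored_frame, channel_axis=2)
# (older scikit-image versions use multichannel=True instead of channel_axis)
```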
Please refer to fig. 9a and 9b, which illustrate the deblurring effect of the present application on a portion of a test video, where 9a is a current video frame before deblurring, and 9b is a target video frame obtained by applying the deblurring method of the present application.
Optionally, the reconstructing branch model includes a first residual module and a second residual module, and the step of reconstructing the current video frame by using the reconstructing branch model in the trained video deblurring model according to the gradient information of the multiple scales and the multiple adjacent video frames to obtain the deblurred target video frame corresponding to the current video frame further includes:
cascading a current video frame and a plurality of adjacent video frames;
sending the cascaded result to a first residual error module for processing to obtain frame structure characteristics;
and sending the third fusion characteristics to a second residual error module for processing to obtain a deblurred target video frame corresponding to the current video frame.
Specifically, at the input of the reconstruction branch model, the current video frame and the plurality of adjacent video frames can be cascaded, and the cascaded result is then sent to the first residual module for feature extraction to obtain the frame structure feature. The first residual module may include two convolutional layers and a cross-layer link; please refer to fig. 4 again.
Meanwhile, at the output of the reconstruction branch model, the third fusion feature can be sent to the second residual module for processing to obtain the deblurred target video frame corresponding to the current video frame. Correspondingly, the second residual module mirrors the first residual module, restoring the obtained third fusion feature to the corresponding video frame.
Compared with single-layer convolution processing, the residual module better extracts features of the structural information in video frames, and is therefore more effective for feature extraction in the reconstruction branch model, which carries image structure information.
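A residual module of the kind described, two convolutional layers plus a cross-layer link, can be sketched as follows; the kernel sizes and the activation are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a cross-layer (skip) connection,
    as in the first residual module of the reconstruction branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The cross-layer link adds the input back onto the convolved features.
        return x + self.conv2(self.act(self.conv1(x)))

# Hypothetical input: the current frame cascaded with its neighbors along channels.
# frames = torch.cat([prev2, prev1, current, next1, next2], dim=1)
```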
It should be noted that, in the present application, each of the residual module, the gradient encoding modules, the gradient decoding modules, the reconstruction encoding modules, and the reconstruction decoding modules may adopt a residual block (Residual Block) as a constituent module; in some embodiments, these modules may be replaced by other network modules having the same function, such as a dense block (Dense Block).
All the above technical methods can be combined arbitrarily to form an optional embodiment of the present application, and are not described herein again.
Therefore, the current video frame of the video and a plurality of adjacent video frames before and after the current video frame are obtained; processing the current video frame according to a gradient branch model in the trained video deblurring model to generate gradient information of multiple scales of the current video frame; and reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of a plurality of scales and a plurality of adjacent video frames so as to obtain the deblurred target video frame corresponding to the current video frame. The method has the following beneficial effects:
1. The current video frame is processed using the gradient branch model to obtain gradient information of multiple scales, which includes both local and global gradient information. The gradient information guides the reconstruction branch model in reconstructing the current video frame, providing structural prior information for video frame reconstruction.
2. Compared with the prior art, the method and the device have the advantages that an end-to-end video deblurring model is provided, the reconstructed video frame has a clearer outline by fusing gradient information, an optical flow method is not needed for aligning adjacent frames, the calculation complexity is low, and the memory occupation is small.
3. Compared with the prior art that the deblurring reconstruction is carried out only through a single current video frame, the method and the device have the advantages that the current video frame and a plurality of adjacent video frames, namely a plurality of continuous video frames are used as input, the gradient branch model is used for reconstructing a gradient map of an intermediate frame, the reconstruction branch model is used for reconstructing the intermediate frame, the reconstruction result of the intermediate frame is finally obtained, the information of the adjacent frames is used for assisting in reconstruction, and the deblurring processing effect can be effectively improved.
4. The reconstruction branch model and the gradient branch model both adopt a coder-decoder structure, down sampling is performed before up sampling, the design effectively improves the generation capacity of the network, and simultaneously considers information of a plurality of scales and gives consideration to global characteristics and local characteristics.
5. The blurred structural information in the reconstruction branch is sent into the encoder of the gradient branch to assist reconstruction of the gradient map, and after processing by the gradient branch the accurate structural information is sent back to the reconstruction branch to assist reconstruction. This better supports the gradient branch in reconstructing the gradient information.
6. Compared with the gradient information processing in the prior art, the method and the device have the advantages that the spatial feature transformation structure is fed back to the multiple reconstruction decoders of the reconstruction branch model, and the gradient prior can be better provided for image reconstruction.
In order to better implement the video deblurring method of the embodiment of the present application, the embodiment of the present application further provides a video deblurring apparatus. Please refer to fig. 10, which is a schematic structural diagram of a video deblurring apparatus according to an embodiment of the present disclosure. The video deblurring apparatus 1000 may include:
an obtaining unit 1100, configured to obtain a current video frame of a video and a plurality of neighboring video frames before and after the current video frame;
the gradient processing unit 1200 is configured to process the current video frame according to a gradient branch model in the trained video deblurring model, and generate gradient information of multiple scales of the current video frame;
and the reconstruction processing unit 1300 is configured to reconstruct the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of the multiple scales and the multiple adjacent video frames, so as to obtain a deblurred target video frame corresponding to the current video frame.
Optionally, the gradient processing unit 1200 may be configured to send the current video frame and the plurality of neighboring video frames to the reconstruction branch model for feature extraction, so as to obtain a reconstruction feature; and sending the reconstruction characteristics to the gradient branch model so that the gradient branch model generates gradient information of a plurality of scales according to the reconstruction characteristics and the current video frame, and sending the gradient information to the reconstruction branch model.
Optionally, the gradient processing unit 1200 may be further configured to send the current video frame to the gradient branch model for processing, so as to obtain a fuzzy gradient map of the current video frame; performing convolution processing on the fuzzy gradient map to obtain fuzzy gradient characteristics of the fuzzy gradient map; carrying out down-sampling treatment on the reconstruction characteristics and the fuzzy gradient characteristics for M times through M gradient coding modules; and performing up-sampling processing on the results of the down-sampling processing for M times through M gradient decoding modules to generate gradient information of multiple scales, and sending the gradient information to the reconstruction branch model.
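The blurred gradient map itself can be obtained with simple finite differences; the sketch below uses first-order horizontal and vertical differences, a common choice, though the text does not fix the gradient operator.

```python
import torch
import torch.nn.functional as F

def gradient_map(frame: torch.Tensor) -> torch.Tensor:
    """Approximate the gradient map of a (B, C, H, W) frame with
    first-order finite differences; padding keeps the input size."""
    dx = frame[..., :, 1:] - frame[..., :, :-1]   # horizontal differences
    dy = frame[..., 1:, :] - frame[..., :-1, :]   # vertical differences
    dx = F.pad(dx, (0, 1, 0, 0))                  # pad width back to W
    dy = F.pad(dy, (0, 0, 0, 1))                  # pad height back to H
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-8)   # gradient magnitude
```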
Optionally, the reconstruction processing unit 1300 may be configured to perform feature extraction on the current video frame and multiple adjacent video frames to obtain frame structure features; performing down-sampling processing on the frame structure characteristics for M times through M reconstruction coding modules to obtain reconstruction intermediate characteristics, and sending the reconstruction intermediate characteristics to M gradient coding modules; and acquiring gradient information of multiple scales of the gradient branches, and processing the gradient information and reconstruction characteristics of the multiple scales through M reconstruction decoding modules to reconstruct the current video frame so as to obtain a deblurred target video frame corresponding to the current video frame.
Optionally, the reconstruction processing unit 1300 may be configured to send the frame structure feature to the gradient branch model; sending the frame structure characteristics to a first reconstruction coding module for processing to obtain first reconstruction intermediate characteristics, and sending the first reconstruction intermediate characteristics to a gradient branch model; and sending the first reconstruction intermediate feature to a second reconstruction coding module for processing to obtain a second reconstruction intermediate feature, and sending the second reconstruction intermediate feature to the gradient branch model.
Optionally, the gradient processing unit 1200 may be further configured to send the blurred gradient feature and the frame structure feature to the first gradient encoding module through the gradient branch model for processing to obtain a first gradient intermediate feature; sending the first gradient intermediate feature and the first reconstruction intermediate feature to a second gradient coding module for processing to obtain a second gradient intermediate feature; and sending the second gradient intermediate feature and the second reconstruction intermediate feature to a residual error module for processing so as to finish the down-sampling processing for 2 times.
Optionally, the gradient processing unit 1200 may be further configured to determine a result of processing by the residual error module as first scale gradient information, and send the first scale gradient information to the reconstruction branch model; sending the first scale gradient information to a first gradient decoding module for processing to obtain second scale gradient information, and sending the second scale gradient information to a reconstruction branch model; and sending the second scale gradient information to a second gradient decoding module for processing to obtain third scale gradient information, and sending the third scale gradient information to the reconstruction branch model.
Optionally, the reconstruction processing unit 1300 may be further configured to perform feature fusion processing on the gradient information and the reconstruction features of multiple scales by using spatial feature transformation to obtain a fusion feature; and processing the fusion characteristics by using a reconstruction branch model in the trained video deblurring model to obtain a deblurred target video frame corresponding to the current video frame.
Optionally, the reconstruction processing unit 1300 may be further configured to perform spatial feature transformation on the second reconstructed intermediate feature and the first scale gradient information to obtain a first fusion feature; performing spatial feature transformation on the first reconstruction intermediate feature and the second scale gradient information to obtain a second fusion feature; performing spatial feature transformation on the frame structure feature and the third scale gradient information to obtain a third fusion feature; and performing upsampling processing on the first fusion feature, the second fusion feature and the third fusion feature according to the reconstruction decoding module to obtain a target video frame.
Optionally, the reconstruction processing unit 1300 may be further configured to send the current video frame and the multiple neighboring video frames to the first residual error module for processing, so as to obtain a frame structure feature; and sending the third fusion characteristics to a second residual error module for processing to obtain a deblurred target video frame corresponding to the current video frame.
It should be noted that, for the functions of each module in the video deblurring apparatus 1000 in this embodiment, reference may be made to the specific implementation manner of any embodiment in the foregoing method embodiments, and details are not described here again.
The various elements of the video deblurring apparatus 1000 described above may be implemented in whole or in part by software, hardware, and combinations thereof. The units may be embedded in hardware or independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor invokes and executes operations corresponding to the units.
The video deblurring apparatus 1000 may be integrated into a terminal or server having a memory and a processor mounted thereon and having computing capabilities, for example, or the video deblurring apparatus 1000 may be the terminal or server. The terminal can be a smart phone, a tablet Computer, a notebook Computer, a smart television, a smart speaker, a wearable smart device, a Personal Computer (PC), and the like, and the terminal can further include a client, which can be a video client, a browser client, an instant messaging client, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The application also provides a training method of the video deblurring model; please refer to fig. 11, which is a schematic flow chart of the training method. The method comprises the following steps:
step 1110: acquiring historical training data, wherein the historical training data comprises a historical current video frame of a video and a plurality of historical adjacent video frames of the historical current video frame;
step 1120: and performing gradient extraction on the historical current video frame by using a gradient branch model to generate historical gradient information and a training gradient map of multiple scales of the historical current video frame.
Step 1130: and reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames to obtain the deblurred training target video frame corresponding to the historical current video frame.
Step 1140: acquiring a sample gradient map and a sample video frame of a historical current video frame in historical training data;
step 1150: establishing a first loss function for the sample gradient map and the training gradient map, and performing first optimization on the gradient branch model according to the first loss function;
step 1160: establishing a second loss function for the sample video frame and the training target video frame, and performing second optimization on the reconstruction branch model according to the second loss function;
step 1170: and determining a video deblurring model according to the gradient branch model after the first optimization and the reconstruction branch model after the second optimization.
Specifically, historical training data is obtained; the historical training data may comprise blurred video frames and the corresponding clear sample video frames.
The historical current video frame is processed using the gradient branch model according to any of the video deblurring methods described above to generate the historical gradient information of multiple scales; the processed third scale gradient information may be subjected to deconvolution processing to restore its feature map to a gradient map, which is determined as the training gradient map.
And processing the historical current video frame and a plurality of historical adjacent video frames of the historical current video frame by using any video deblurring method for reconstructing the branch model to obtain the deblurred training target video frame corresponding to the historical current video frame.
The model is then optimized according to a loss function between the training values and the true target values.
Establishing a first loss function for the sample gradient map and the training gradient map, and performing first optimization on the gradient branch model according to the first loss function; and establishing a second loss function for the sample video frame and the training target video frame, and performing second optimization on the reconstruction branch model according to the second loss function. The optimization of the gradient branch model and the reconstruction branch model can be separately and independently carried out, and the two optimizations can also be linearly superposed to carry out simultaneous optimization.
The first loss function L_grad may employ the minimum absolute deviation (LAD), i.e., the L1 loss function, as follows:
L_grad = || Ĝ_t − ∇I_gt ||_1
where I_gt is the clear sample video frame, ∇I_gt is the sample gradient map of the clear sample video frame, and Ĝ_t is the training gradient map generated by the gradient branch model.
Like the first loss function L_grad, the second loss function L_rec may employ the minimum absolute deviation (LAD), i.e., the L1 loss function, as follows:
L_rec = || Î_t − I_gt ||_1
where I_gt is the clear sample video frame and Î_t is the training target video frame reconstructed by the reconstruction branch model.
The first loss function and the second loss function may be combined into a total loss function L for optimization; in some embodiments, a weighted superposition may be performed:
L = L_rec + λ · L_grad
where L_grad is the first loss function, L_rec is the second loss function, and λ is a weighting coefficient.
Parameters of each convolution layer in the M gradient encoding modules, gradient decoding modules, reconstruction encoding modules and reconstruction decoding modules contained in the gradient branch model and the reconstruction branch model, as well as the up-sampling and down-sampling parameters, are optimized according to the loss function, so that the loss values between the training target video frame reconstructed by the model and the sample video frame, and between the training gradient map and the sample gradient map, fall within the training threshold range. The video deblurring model is then determined from the gradient branch model after the first optimization and the reconstruction branch model after the second optimization.
In some embodiments, other optimization methods such as Stochastic Gradient Descent (SGD) may be used to optimize the Gradient branch model and the reconstruction branch model.
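Putting the two L1 losses and the weighted superposition together, one joint optimization step could look like the sketch below; the stand-in branch models, their call signatures, the Adam optimizer, and the weight lambda_g are all assumptions (SGD, as noted above, would work the same way), and nn.L1Loss realizes the LAD criterion described earlier.

```python
import torch
import torch.nn as nn

# Stand-in branch models; the real networks are the encoder-decoder branches
# described above. These trivial stubs only make the sketch self-contained.
grad_branch = nn.Conv2d(15, 1, 3, padding=1)        # 5 RGB frames -> gradient map
recon_core = nn.Conv2d(16, 3, 3, padding=1)         # frames + gradient map -> frame

def recon_branch(frames: torch.Tensor, grad_map: torch.Tensor) -> torch.Tensor:
    return recon_core(torch.cat([frames, grad_map], dim=1))

l1 = nn.L1Loss()
optimizer = torch.optim.Adam(
    list(grad_branch.parameters()) + list(recon_core.parameters()), lr=1e-4)
lambda_g = 0.5  # assumed weighting coefficient for the gradient loss

def train_step(blurred_frames, sample_frame, sample_gradient):
    """One step over the weighted total loss L = L_rec + lambda_g * L_grad."""
    optimizer.zero_grad()
    pred_gradient = grad_branch(blurred_frames)               # training gradient map
    pred_frame = recon_branch(blurred_frames, pred_gradient)  # training target frame
    loss = l1(pred_frame, sample_frame) + lambda_g * l1(pred_gradient, sample_gradient)
    loss.backward()      # backpropagate the total loss through both branches
    optimizer.step()     # update both branch models jointly
    return loss.item()
```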
Therefore, historical training data are obtained, and the historical training data comprise historical current video frames of videos and a plurality of historical adjacent video frames of the historical current video frames; and performing gradient extraction on the historical current video frame by using a gradient branch model to generate historical gradient information and a training gradient map of multiple scales of the historical current video frame. And reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames to obtain the deblurred training target video frame corresponding to the historical current video frame. Acquiring a label gradient map and a label video frame of a historical current video frame in historical training data; establishing a first loss function for the label gradient map and the training gradient map, and performing first optimization on the gradient branch model according to the first loss function; establishing a second loss function for the label video frame and the training target video frame, and performing second optimization on the reconstruction branch model according to the second loss function; and determining the model as a trained video deblurring model according to the gradient branch model after the first optimization and the reconstruction branch model after the second optimization. The two branch models of the video deblurring model can be optimized by using an L1 loss function, so that parameters in network modules in the two branch models are optimized, the deblurring effect in deblurring application is improved, the calculation complexity is low, and the occupied memory is small.
In order to better implement the video deblurring method according to the embodiment of the present application, an apparatus 2000 for training a video deblurring model is further provided in the embodiment of the present application. Please refer to fig. 12, which is a schematic structural diagram of a training apparatus 2000 according to an embodiment of the present application. Among them, the training apparatus 2000 may include:
a first obtaining unit 2100, configured to obtain historical training data, where the historical training data includes a historical current video frame of a video and a plurality of historical neighboring video frames of the historical current video frame;
the gradient processing unit 2200 is configured to perform gradient extraction on the historical current video frame by using the gradient branch model, and generate historical gradient information and a training gradient map of multiple scales of the historical current video frame.
The reconstruction processing unit 2300 is configured to reconstruct the historical current video frame by using the reconstruction branch model according to the historical gradient information of the multiple scales and the multiple historical neighboring video frames, so as to obtain a deblurred training target video frame corresponding to the historical current video frame.
A second obtaining unit 2400, configured to obtain a sample gradient map of a historical current video frame in historical training data and a sample video frame;
the first optimization unit 2500 is configured to establish a first loss function for the sample gradient map and the training gradient map, and perform first optimization on the gradient branch model according to the first loss function;
the second optimization unit 2600 is configured to establish a second loss function for the sample video frame and the training target video frame, and perform second optimization on the reconstruction branch model according to the second loss function;
the determining unit 2700 is configured to determine the video deblurring model according to the first optimized gradient branch model and the second optimized reconstruction branch model.
It should be noted that, for the functions of each module in the training apparatus 2000 in this embodiment, reference may be made to the specific implementation manner of any embodiment in the foregoing method embodiments, and details are not described here again.
The various elements of the exercise apparatus 2000 described above may be implemented in whole or in part by software, hardware, and combinations thereof. The units may be embedded in hardware or independent from a processor in the computer device, or may be stored in a memory in the computer device in software, so that the processor can call and execute operations corresponding to the units.
The training apparatus 2000 may be integrated, for example, in a terminal or server equipped with a memory and a processor and having computing capability, or the training apparatus 2000 may itself be the terminal or server. The terminal can be a smart phone, a tablet Computer, a notebook Computer, a smart television, a smart speaker, a wearable smart device, a Personal Computer (PC), and the like, and the terminal can further include a client, which can be a video client, a browser client, an instant messaging client, and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Fig. 13 is a schematic structural diagram of a computer device 3000 provided in an embodiment of the present application, and as shown in the drawing, the computer device 3000 may include: communication interface 3001, memory 3002, processor 3003 and communication bus 3004. The communication interface 3001, the memory 3002, and the processor 3003 communicate with each other via a communication bus 3004. The communication interface 3001 is used for the computer device 3000 to perform data communication with external devices. The memory 3002 may be used to store software programs and modules, and the processor 3003 may operate by executing the software programs and modules stored in the memory 3002, such as the software programs of the corresponding operations in the foregoing method embodiments.
Optionally, the processor 3003 may also call the software programs and modules stored in the memory 3002 to perform the following operations:
acquiring a current video frame of a video and a plurality of adjacent video frames before and after the current video frame;
processing the current video frame according to a gradient branch model in the trained video deblurring model to generate gradient information of multiple scales of the current video frame;
and reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of a plurality of scales and a plurality of adjacent video frames to obtain a deblurred target video frame corresponding to the current video frame, wherein the gradient branch model and the reconstruction branch model are models of the structure of the coder and the decoder.
Alternatively, the processor 3003 may invoke software programs and modules stored in the memory 3002 to perform the following operations:
acquiring historical training data, wherein the historical training data comprises a historical current video frame of a video and a plurality of historical adjacent video frames of the historical current video frame;
and performing gradient extraction on the historical current video frame by using a gradient branch model to generate historical gradient information and a training gradient map of multiple scales of the historical current video frame.
And reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames to obtain the deblurred training target video frame corresponding to the historical current video frame.
Acquiring a label gradient map and a label video frame of a historical current video frame in historical training data;
establishing a first loss function for the label gradient map and the training gradient map, and performing first optimization on the gradient branch model according to the first loss function;
establishing a second loss function for the label video frame and the training target video frame, and performing second optimization on the reconstruction branch model according to the second loss function;
and determining the model as a trained video deblurring model according to the gradient branch model after the first optimization and the reconstruction branch model after the second optimization.
Alternatively, the computer device 3000 may be integrated in a terminal or a server having a memory and a processor mounted thereon and having an arithmetic capability, or the computer device 3000 may be the terminal or the server. The terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart sound box, a wearable smart device, a personal computer and the like. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platform and the like.
The present application also provides a computer-readable storage medium for storing a computer program. The computer-readable storage medium can be applied to a computer device, and the computer program enables the computer device to execute the corresponding process in the video deblurring method in the embodiment of the present application, which is not described herein again for brevity.
The present application also provides a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes the corresponding process in the video deblurring method in the embodiment of the present application, which is not described herein again for brevity.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical method of the present application, or the part thereof that contributes beyond the prior art, can be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer or a server) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of deblurring video, comprising:
acquiring a current video frame of a video and a plurality of adjacent video frames at the front moment and the rear moment of the current video frame;
sending the current video frame and the plurality of adjacent video frames to a reconstruction branch model in a trained video deblurring model for feature extraction to obtain reconstruction features;
sending the reconstruction characteristics to a gradient branch model in the trained video deblurring model, so that the gradient branch model generates gradient information of multiple scales of the current video frame according to the reconstruction characteristics and the current video frame, and sending the gradient information to the reconstruction branch model;
and reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of the scales and the adjacent video frames to obtain a deblurred target video frame corresponding to the current video frame, wherein the gradient branch model and the reconstruction branch model are models of codec structures.
2. The method of claim 1, wherein the gradient branch model comprises M gradient encoding modules and M gradient decoding modules, where M is a positive integer, and wherein sending the reconstruction features to a gradient branch model in the trained video deblurring model, so that the gradient branch model generates gradient information of multiple scales of the current video frame according to the reconstruction features and the current video frame, and sending the gradient information to the reconstruction branch model comprises:
sending the current video frame to the gradient branch model for processing to obtain a fuzzy gradient map of the current video frame;
performing convolution processing on the fuzzy gradient map to obtain fuzzy gradient characteristics of the fuzzy gradient map;
performing downsampling processing on the reconstruction characteristics and the fuzzy gradient characteristics for M times through the M gradient coding modules;
and performing up-sampling processing on the result of the down-sampling processing for M times through the M gradient decoding modules to generate gradient information of multiple scales, and sending the gradient information to the reconstruction branch model.
3. The method according to claim 2, wherein the reconstruction branch model includes M reconstruction encoding modules and M reconstruction decoding modules, and the reconstructing the current video frame by using the reconstruction branch model in the trained video deblurring model according to the gradient information of the plurality of scales and the plurality of neighboring video frames to obtain the deblurred target video frame corresponding to the current video frame includes:
performing feature extraction on the current video frame and the plurality of adjacent video frames to obtain frame structure features;
carrying out down-sampling processing on the frame structure characteristics for M times through the M reconstruction coding modules to obtain reconstruction intermediate characteristics, and sending the reconstruction intermediate characteristics to the M gradient coding modules;
and acquiring the gradient information of the gradient branches in multiple scales, and processing the gradient information of the multiple scales and the reconstruction characteristics through the M reconstruction decoding modules to reconstruct the current video frame so as to obtain the deblurred target video frame corresponding to the current video frame.
4. The method according to claim 3, wherein M = 2, the reconstruction branch model includes 2 reconstruction encoding modules and 2 reconstruction decoding modules, and the downsampling processing is performed M times on the frame structure features by the M reconstruction encoding modules to obtain reconstruction intermediate features, and the sending of the reconstruction intermediate features to the M gradient encoding modules includes:
sending the frame structure features to the gradient branch model;
sending the frame structure characteristics to a first reconstruction coding module for processing to obtain first reconstruction intermediate characteristics, and sending the first reconstruction intermediate characteristics to the gradient branch model;
and sending the first reconstruction intermediate feature to a second reconstruction coding module for processing to obtain a second reconstruction intermediate feature, and sending the second reconstruction intermediate feature to the gradient branch model.
5. The method of claim 4, wherein the gradient branch model comprises 2 gradient encoding modules, 2 gradient decoding modules and 1 residual module, and wherein the downsampling the reconstructed features and the blurred gradient features M times by the M gradient encoding modules comprises:
sending the fuzzy gradient feature and the frame structure feature to a first gradient coding module for processing through the gradient branch model to obtain a first gradient intermediate feature;
sending the first gradient intermediate feature and the first reconstruction intermediate feature to a second gradient coding module for processing to obtain a second gradient intermediate feature;
and sending the second gradient intermediate feature and the second reconstruction intermediate feature to the residual error module for processing so as to complete 2 times of downsampling processing.
6. The method of claim 5, wherein the performing M times of upsampling on the result of the M times of downsampling by the M gradient decoding modules to generate the gradient information of the plurality of scales and sending the gradient information to the reconstruction branch model comprises:
determining the result processed by the residual error module as first scale gradient information, and sending the first scale gradient information to the reconstruction branch model;
sending the first scale gradient information to a first gradient decoding module for processing to obtain second scale gradient information, and sending the second scale gradient information to the reconstruction branch model;
and sending the second scale gradient information to a second gradient decoding module for processing to obtain third scale gradient information, and sending the third scale gradient information to the reconstruction branch model.
7. The method of claim 6, wherein the reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of the plurality of scales and the plurality of neighboring video frames to obtain the deblurred target video frame corresponding to the current video frame further comprises:
performing feature fusion processing on the gradient information of the multiple scales and the reconstruction features by using spatial feature transformation to obtain fusion features;
and processing the fusion characteristics by utilizing a reconstruction branch model in the trained video deblurring model to obtain a deblurred target video frame corresponding to the current video frame.
8. The method of claim 7, wherein the processing the fusion feature using the reconstruction branch model in the trained video deblurring model to obtain the deblurred target video frame corresponding to the current video frame comprises:
performing the spatial feature transformation on the second reconstructed intermediate feature and the first scale gradient information to obtain a first fusion feature;
performing the spatial feature transformation on the first reconstruction intermediate feature and the second scale gradient information to obtain a second fusion feature;
performing the spatial feature transformation on the frame structure feature and the third scale gradient information to obtain a third fusion feature;
and performing upsampling processing on the first fusion feature, the second fusion feature and the third fusion feature according to a reconstruction decoding module to obtain the target video frame.
9. The method of claim 8, wherein the reconstructing branch model comprises a first residual module and a second residual module, and the reconstructing the current video frame by using the reconstructing branch model in the trained video deblurring model according to the gradient information of the plurality of scales and the plurality of neighboring video frames to obtain the deblurred target video frame corresponding to the current video frame further comprises:
sending the current video frame and the plurality of adjacent video frames to a first residual error module for processing to obtain frame structure characteristics;
and sending the third fusion feature to the second residual error module for processing to obtain the deblurred target video frame corresponding to the current video frame.
10. A training method of a video deblurring model is characterized by comprising the following steps:
acquiring historical training data, wherein the historical training data comprises a historical current video frame of a video and a plurality of historical adjacent video frames of the historical current video frame;
performing gradient extraction on the historical current video frame by using a gradient branch model to generate a training gradient map of the current video frame;
sending the historical current video frame and the plurality of historical adjacent video frames to a reconstruction branch model for feature extraction to obtain historical reconstruction features;
sending the historical reconstruction characteristics to a gradient branch model, so that the gradient branch model generates historical gradient information of multiple scales of the historical current video frame according to the historical reconstruction characteristics and the historical current video frame, and sends the historical gradient information to the reconstruction branch model;
reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames to obtain a deblurred training target video frame corresponding to the historical current video frame;
acquiring a label gradient map and a label video frame of the historical current video frame in the historical training data;
establishing a first loss function for the label gradient map and the training gradient map, and performing first optimization on the gradient branch model according to the first loss function;
establishing a second loss function for the label video frame and the training target video frame, and performing second optimization on the reconstruction branch model according to the second loss function;
and determining the model as a trained video deblurring model according to the gradient branch model after the first optimization and the reconstruction branch model after the second optimization.
11. A video deblurring apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a current video frame of a video and a plurality of adjacent video frames at the front moment and the rear moment of the current video frame;
the gradient processing unit is used for sending the current video frame and the plurality of adjacent video frames to a reconstruction branch model in a trained video deblurring model for feature extraction so as to obtain reconstruction features;
sending the reconstruction characteristics to a gradient branch model in the trained video deblurring model, so that the gradient branch model generates gradient information of multiple scales of the current video frame according to the reconstruction characteristics and the current video frame, and sending the gradient information to the reconstruction branch model;
and the reconstruction processing unit is used for reconstructing the current video frame by using a reconstruction branch model in the trained video deblurring model according to the gradient information of the scales and the adjacent video frames to obtain a deblurred target video frame corresponding to the current video frame, wherein the gradient branch model and the reconstruction branch model are models of codec structures.
12. An apparatus for training a video deblurring model, comprising:
the video processing device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring historical training data, and the historical training data comprises a historical current video frame of a video and a plurality of historical adjacent video frames of the historical current video frame;
the gradient processing unit is used for performing gradient extraction on the historical current video frame by using a gradient branch model to generate a training gradient map of the current video frame; and
sending the historical current video frame and the plurality of historical adjacent video frames to a reconstruction branch model for feature extraction to obtain historical reconstruction features; and
sending the historical reconstruction characteristics to a gradient branch model, so that the gradient branch model generates historical gradient information of multiple scales of the historical current video frame according to the historical reconstruction characteristics and the historical current video frame, and sends the historical gradient information to the reconstruction branch model;
the reconstruction processing unit is used for reconstructing the historical current video frame by using a reconstruction branch model according to the historical gradient information of the plurality of scales and the plurality of historical adjacent video frames so as to obtain a deblurred training target video frame corresponding to the historical current video frame;
the second acquisition unit is used for acquiring a label gradient map and a label video frame of the historical current video frame in the historical training data;
the first optimization unit is used for establishing a first loss function for the label gradient map and the training gradient map and performing first optimization on the gradient branch model according to the first loss function;
the second optimization unit is used for establishing a second loss function for the label video frame and the training target video frame and carrying out second optimization on the reconstruction branch model according to the second loss function;
and the determining unit is used for determining the model as a trained video deblurring model according to the first optimized gradient branch model and the second optimized reconstruction branch model.
13. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for performing the steps of the method according to any one of claims 1-10.
14. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing a computer program, and the processor being configured to perform the steps in the method of any one of claims 1-10 by invoking the computer program stored in the memory.
CN202210117459.4A 2022-02-08 2022-02-08 Video deblurring method and device and video deblurring model training method and device Active CN114140363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210117459.4A CN114140363B (en) 2022-02-08 2022-02-08 Video deblurring method and device and video deblurring model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210117459.4A CN114140363B (en) 2022-02-08 2022-02-08 Video deblurring method and device and video deblurring model training method and device

Publications (2)

Publication Number Publication Date
CN114140363A CN114140363A (en) 2022-03-04
CN114140363B true CN114140363B (en) 2022-05-24

Family

ID=80382123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210117459.4A Active CN114140363B (en) 2022-02-08 2022-02-08 Video deblurring method and device and video deblurring model training method and device

Country Status (1)

Country Link
CN (1) CN114140363B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539884A (en) * 2020-04-21 2020-08-14 温州大学 Neural network video deblurring method based on multi-attention machine mechanism fusion
CN112488946A (en) * 2020-12-03 2021-03-12 重庆邮电大学 Single-scale motion blur image frame restoration method for cab environment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4799428B2 (en) * 2007-01-22 2011-10-26 株式会社東芝 Image processing apparatus and method
KR101810876B1 (en) * 2012-03-13 2018-01-26 삼성전자주식회사 A method and an apparatus for debluring non-uniform motion blur of a large scale input image based on a tile unit
EP3316212A1 (en) * 2016-10-28 2018-05-02 Thomson Licensing Method for deblurring a video, corresponding device and computer program product
CN110473147A (en) * 2018-05-09 2019-11-19 腾讯科技(深圳)有限公司 A kind of video deblurring method and device
CN113205464B (en) * 2021-04-30 2023-05-05 作业帮教育科技(北京)有限公司 Image deblurring model generation method, image deblurring method and electronic equipment
CN113822824B (en) * 2021-11-22 2022-02-25 腾讯科技(深圳)有限公司 Video deblurring method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539884A (en) * 2020-04-21 2020-08-14 温州大学 Neural network video deblurring method based on multi-attention machine mechanism fusion
CN112488946A (en) * 2020-12-03 2021-03-12 重庆邮电大学 Single-scale motion blur image frame restoration method for cab environment

Also Published As

Publication number Publication date
CN114140363A (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN110933429B (en) Video compression sensing and reconstruction method and device based on deep neural network
CN111263161B (en) Video compression processing method and device, storage medium and electronic equipment
CN112801901A (en) Image deblurring algorithm based on block multi-scale convolution neural network
CN112906721B (en) Image processing method, device, equipment and computer readable storage medium
CN112950471A (en) Video super-resolution processing method and device, super-resolution reconstruction model and medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN113628116B (en) Training method and device for image processing network, computer equipment and storage medium
CN117576264B (en) Image generation method, device, equipment and medium
CN117336527A (en) Video editing method and device
CN115409716B (en) Video processing method, device, storage medium and equipment
CN109949234A (en) Video restoration model training method and video restoration method based on depth network
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
CN115131196A (en) Image processing method, system, storage medium and terminal equipment
CN115115510A (en) Image processing method, system, storage medium and terminal equipment
CN114140363B (en) Video deblurring method and device and video deblurring model training method and device
WO2023133888A1 (en) Image processing method and apparatus, remote control device, system, and storage medium
CN115499666A (en) Video compression method, video decompression method, video compression device, video decompression device, and storage medium
CN115311152A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114827666A (en) Video processing method, device and equipment
CN116051662B (en) Image processing method, device, equipment and medium
CN118474323B (en) Three-dimensional image, three-dimensional video, monocular view, training data set generation method, training data set generation device, storage medium, and program product
CN115564803B (en) Animation processing method, device, equipment, storage medium and product
CN111885378B (en) Multimedia data encoding method, apparatus, device and medium
CN118552560A (en) Image processing method and device, electronic equipment and computer readable storage medium
CN118842909A (en) Image coding and decoding method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40064993

Country of ref document: HK