
Attention-based dense optical flow calculation method

Info

Publication number
CN114913196A
Authority
CN
China
Prior art keywords
feature
generate
network
optical flow
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111623934.7A
Other languages
Chinese (zh)
Inventor
张继东
吕超
曹靖城
涂娟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Digital Life Technology Co Ltd
Original Assignee
Tianyi Digital Life Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Digital Life Technology Co Ltd filed Critical Tianyi Digital Life Technology Co Ltd
Priority to CN202111623934.7A
Priority to PCT/CN2022/097531 (published as WO2023123873A1)
Publication of CN114913196A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a dense optical flow calculation method based on an attention mechanism, specifically a dense optical flow calculation method based on Unet and a Transformer. In the invention, two adjacent frames are spliced on a channel by a down-sampling module and then input into a convolution network for down-sampling; a feature processing module then performs global context feature processing on the encoded input sequence of feature maps output by the down-sampling network; and finally, an up-sampling module up-samples and reconstructs the processed feature map into an optical flow map with the same size as the input picture.

Description

Attention-based dense optical flow calculation method
Technical Field
The present invention relates to the field of image processing, and more particularly to dense optical flow computation.
Background
When a moving object is viewed by the human eye, it forms a series of continuously changing images on the retina, and this continuously changing information constantly "flows" through the retina (i.e., the image plane) like a flow of light, hence the term optical flow. Specifically, optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. The optical flow method calculates the motion of objects between adjacent frames by using the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame. Conventional methods for computing optical flow are mainly gradient-based, frequency-based, phase-based, and matching-based.
Dense optical flow is an image registration method that performs point-by-point matching over an entire image or a specified region: it calculates the offset of every point on the image to form a dense optical flow field, with which image registration can be performed at the pixel level. The Horn-Schunck algorithm and most optical flow methods based on region matching fall into the category of dense optical flow. Among deep-learning-based optical flow calculation methods, FlowNet is the most widely used in practical applications.
The patent "robust interpolation optical flow calculation method for pyramid occlusion detection block matching" (CN112509014A) discloses a robust interpolation optical flow calculation method for pyramid occlusion detection block matching, which comprises the steps of firstly carrying out pyramid occlusion detection block matching to obtain a sparse robust motion field, forming k-layer image pyramids on two continuous frames of images through downsampling factors, carrying out block matching on each layer of pyramids, and obtaining a matching result with initial occlusion; obtaining occlusion detection information through an occlusion detection algorithm based on a deformation error; obtaining an accurate sparse matching result through matching, and acquiring a dense optical flow through a robust interpolation algorithm; after obtaining the dense optical flow by a robust interpolation algorithm, optimizing the dense optical flow by global energy functional variational: and obtaining a final optical flow through global energy functional variation optimization.
The patent "an image sequence light stream estimation method based on learnable occlusion mask and secondary deformation optimization" (CN112465872A) discloses an image sequence light stream estimation method based on learnable occlusion mask and secondary deformation optimization, which comprises the steps of firstly inputting any two continuous frames of images in an image sequence, and carrying out feature pyramid downsampling and layering on the images to obtain multi-resolution two-frame features; calculating the correlation degree of the first frame feature and the second frame feature in each layer of pyramid, and constructing a shielding mask-based module by utilizing the correlation degree; then, removing the edge artifact of the deformation feature by using the obtained shielding mask to optimize the optical flow of the image motion edge blur; constructing a secondary deformation optimization module by using the optical flow after the occlusion constraint, and further optimizing the estimation of the optical flow of the image motion edge at a sub-pixel level by secondary deformation; and carrying out the same shielding mask and secondary deformation on the deformation features in each pyramid layer to obtain a residual flow to refine the optical flow, and outputting the final optimized optical flow estimation when the optical flow reaches the pyramid bottom layer.
Both of the above patents effectively improve the accuracy of optical flow estimation, but the accuracy of their dense optical flow still cannot meet the requirements of tasks such as video encoding and HDR composition. Therefore, there is a need for an improved technique to increase the accuracy of dense optical flow computation.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Compared with existing dense optical flow methods, the present method introduces a multi-head self-attention mechanism into the optical flow prediction task and exploits the global self-attention strength of the Transformer in sequence-to-sequence prediction to improve optical flow calculation. In addition, the method improves the accuracy of the dense optical flow map at key positions, while the timeliness of dense optical flow calculation is improved by reducing the depth of the Unet up-sampling and down-sampling networks.
In accordance with one embodiment of the present invention, a method for dense optical flow computation is disclosed, comprising: splicing adjacent frames on a channel to generate a spliced vector diagram; inputting the spliced vector diagram into a down-sampling network for feature extraction to generate feature vectors; mapping the generated feature vectors to a high-dimensional embedding space of a latent layer to generate a high-dimensional embedded representation sequence; inputting the high-dimensional embedded representation sequence into a feature processing network consisting of I Transformer layers to generate a hidden feature sequence; recombining the generated hidden feature sequence to generate recombined feature vectors; and inputting the recombined feature vectors into an up-sampling network for processing so as to generate a dense optical flow map.
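For concreteness, the following is a minimal PyTorch sketch of these six steps. It is an illustration only: the channel widths, the shortened three-block down/up-sampling path, the head count, and the choice of I = 4 Transformer layers are assumptions rather than values fixed by this disclosure, and positional embeddings are omitted.

```python
# Minimal sketch of the six-step method (assumed hyperparameters throughout).
import torch
import torch.nn as nn

class DenseFlowSketch(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # Step 2: down-sampling network (shortened to 3 stride-2 conv blocks
        # here; the embodiment described later uses 7 blocks, 5 at stride 2).
        self.down = nn.Sequential(
            nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Step 3: trainable linear mapping E into the embedding space.
        self.embed = nn.Linear(dim, dim)
        # Step 4: feature processing network of I Transformer layers.
        layer = nn.TransformerEncoderLayer(dim, num_heads,
                                           batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        # Step 6: up-sampling network mirroring the down-sampling path.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame1, frame2):
        x = torch.cat([frame1, frame2], dim=1)   # step 1: splice on channels
        f = self.down(x)                         # step 2: feature extraction
        b, c, h, w = f.shape
        seq = self.embed(f.flatten(2).transpose(1, 2))  # step 3: embed tokens
        z = self.transformer(seq)                # step 4: hidden features z_I
        f = z.transpose(1, 2).reshape(b, c, h, w)       # step 5: recombine
        return self.up(f)                        # step 6: dense optical flow map

flow = DenseFlowSketch()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(flow.shape)  # torch.Size([1, 3, 64, 64])
```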
According to another embodiment of the invention, a system for dense optical flow computation is disclosed that includes a downsampling module, a feature processing module, and an upsampling module. The down-sampling module is configured to: splice adjacent frames on a channel to generate a spliced vector diagram; and input the spliced vector diagram into a down-sampling network for feature extraction to generate feature vectors. The feature processing module is configured to: map the feature vectors generated by the down-sampling module to a high-dimensional embedding space of a latent layer to generate a high-dimensional embedded representation sequence; and input the high-dimensional embedded representation sequence into a feature processing network consisting of I Transformer layers to generate a hidden feature sequence. The upsampling module is configured to: recombine the hidden feature sequence generated by the feature processing module to generate recombined feature vectors; and input the recombined feature vectors into an up-sampling network for processing so as to generate a dense optical flow map.
In accordance with another embodiment of the present invention, a computing device for dense optical flow computation is disclosed, comprising: a processor; a memory storing instructions that, when executed by the processor, are capable of performing the method as described above.
These and other features and advantages will become apparent upon reading the following detailed description and upon reference to the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.
Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only some typical aspects of this invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
FIG. 1 shows a block diagram of a system 100 for dense optical flow computation according to one embodiment of the invention;
FIG. 2 illustrates a detailed view 200 of the modules 101-103 of FIG. 1 according to one embodiment of the invention;
FIG. 3 illustrates a flow diagram of a method 300 for dense optical flow computation according to one embodiment of the invention; and
FIG. 4 shows a block diagram 400 of an exemplary computing device, according to an embodiment of the invention.
Detailed Description
The present invention will be described in detail below with reference to the attached drawings, and the features of the present invention will be further apparent from the following detailed description.
The following terms are used in the present invention with the general meanings well known to those skilled in the art:
Unet: a convolutional network whose down-sampling and up-sampling paths are fully symmetric, and in which feature maps from the down-sampling side can skip the deeper layers and be spliced directly to the corresponding up-sampling side.
Transformer: a Natural Language Processing (NLP) model that employs an attention mechanism, originally proposed for the task of machine translation.
In computer vision, optical flow plays an important role and has very important applications in target object segmentation, recognition, tracking, robot navigation, shape information recovery, and the like. Optical flow calculation is widely applicable, for example to motion detection for video encoding/decoding in cloud-storage video compression, and to motion recognition and video understanding tasks such as detecting objects thrown from height and fall detection. To obtain more accurate motion estimation, dense optical flow computation is a key module in video codec technology. Traditional dense optical flow calculation methods are computationally expensive and have poor timeliness. Existing deep-learning-based optical flow methods improve timeliness, but the accuracy of their dense optical flow maps is low, which adversely affects the quality of video encoding and decoding.
The invention provides a dense optical flow calculation method based on Unet and a Transformer, which introduces a Transformer module into a Unet structure, uses the global self-attention strength of the Transformer in sequence-to-sequence prediction to improve the accuracy of dense optical flow at key positions, and reduces the depth of the Unet up-sampling and down-sampling networks to improve the timeliness of dense optical flow calculation.
FIG. 1 shows a block diagram of a system 100 for dense optical flow computation according to one embodiment of the invention. As shown in fig. 1, the system 100 is divided into modules, with communication and data exchange between the modules being performed in a manner known in the art. In the present invention, each module may be implemented by software or hardware or a combination thereof. As shown in fig. 1, the system 100 may include a downsampling module 101, a feature processing module 102, and an upsampling module 103.
According to an embodiment of the present invention, the downsampling module 101 is configured to splice two adjacent frames on a channel (e.g., a color channel) to form an input picture, which is input to a convolution network for downsampling to obtain a feature map. The feature processing module 102 is configured to perform global context feature processing on the encoded input sequence of feature maps output by the downsampling module 101. The upsampling module 103 is configured as a cascaded upsampler that upsamples the processed feature map to reconstruct an optical flow map of the same size as the input picture.
FIG. 2 illustrates a detailed view 200 of the modules 101-103 of FIG. 1 according to one embodiment of the invention.
As shown in fig. 2, the downsampling module 101 receives two adjacent frames 201, first splices the two frames 201 into an h × w × 6 vector diagram, and then inputs it to a downsampling network composed of 7 convolutional blocks, each consisting of a convolutional layer and a ReLU activation function, where the stride of 5 of the convolutional layers is 2.
Finally, the down-sampling module 101 outputs a feature map for processing by the feature processing module 102; its size is given by an equation rendered as an image in the original (Figure BDA0003439132950000051), presumably h/32 × w/32 spatially given the five stride-2 convolutions.
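A sketch of such a down-sampling network follows, assuming PyTorch. The disclosure fixes only the count of 7 convolution blocks (each a convolutional layer plus ReLU) with 5 of them at stride 2; the kernel sizes, channel widths, and exact stride placement below are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride):
    # One convolution block: a convolutional layer followed by a ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
    )

# Five stride-2 blocks and two stride-1 blocks: a 2**5 = 32x spatial reduction.
strides  = [2, 1, 2, 2, 1, 2, 2]                  # placement assumed
channels = [6, 32, 32, 64, 128, 128, 256, 256]    # widths assumed
downsampling_net = nn.Sequential(*[
    conv_block(channels[i], channels[i + 1], strides[i]) for i in range(7)
])

x = torch.rand(1, 6, 256, 256)        # spliced adjacent frames (h x w x 6)
print(downsampling_net(x).shape)      # torch.Size([1, 256, 8, 8])
```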
As shown in fig. 2, the feature processing module 102 first maps the sequence of feature maps output by the down-sampling module 101 into the high-dimensional embedding space of the latent layer using a trainable linear mapping E; the calculation is shown in equation (1), which is rendered as an image in the original (Figure BDA0003439132950000052).
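Since equation (1) is available only as an image, the following sketch shows one plausible reading of the embedding step, assuming PyTorch: the feature map is flattened into one token per spatial location and projected by a trainable linear mapping E. The widths C and D are assumed hyperparameters.

```python
import torch
import torch.nn as nn

C, D = 256, 512               # feature channels and embedding width (assumed)
E = nn.Linear(C, D)           # trainable linear mapping E

feat = torch.rand(1, C, 8, 8)             # feature map from the down-sampling network
tokens = feat.flatten(2).transpose(1, 2)  # (1, 64, C): one token per location
z0 = E(tokens)                            # (1, 64, D): high-dimensional embedded sequence
```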
The high-dimensional embedded representation sequence is then input into a feature processing network consisting of I Transformer layers. The specific structure of the Transformer layer is shown in fig. 3. Specifically, each Transformer layer is composed of Multi-head Self-Attention (MSA) and a Multi-Layer Perceptron (MLP), and the output of the i-th layer is given by equations (2) and (3):

z'_i = MSA(LN(z_{i-1})) + z_{i-1},  (2)

z_i = MLP(LN(z'_i)) + z'_i,  (3)

where LN(·) denotes the layer normalization operation. The feature processing module 102 finally outputs a hidden feature sequence z_I.
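A sketch of one such Transformer layer, implementing equations (2) and (3) directly, follows, assuming PyTorch; the embedding width, head count, MLP expansion ratio, and I = 4 are assumptions.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        # Equation (2): z'_i = MSA(LN(z_{i-1})) + z_{i-1}
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z
        # Equation (3): z_i = MLP(LN(z'_i)) + z'_i
        return self.mlp(self.ln2(z)) + z

feature_net = nn.Sequential(*[TransformerLayer() for _ in range(4)])  # I = 4 assumed
z_I = feature_net(torch.rand(1, 64, 512))  # hidden feature sequence z_I
```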
As shown in FIG. 2, the upsampling module 103 is a cascaded upsampling network that includes multiple upsampling steps to decode the final output optical flow map 202. First, the upsampling module 103 recombines the hidden feature sequence z_I output by the feature processing module 102 into a feature vector whose size is given by an equation rendered as an image in the original (Figure BDA0003439132950000053). This feature vector is then input into an upsampling network consisting of 7 deconvolution blocks, each consisting of one deconvolution layer and one ReLU activation function, where the stride of 5 of the deconvolution layers is 2. Finally, an optical flow map output of size h × w × 3 is obtained. Furthermore, the invention splices in three skip connections from the down-sampled feature vectors to enable feature aggregation (203, 204, 205) at different resolution levels, thereby refining the details of the optical flow.
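The deconvolution-plus-splice pattern can be sketched as follows, assuming PyTorch; the channel widths and the exact placement of the three skip connections are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeconvBlock(nn.Module):
    # One deconvolution block: a deconvolution layer followed by a ReLU,
    # optionally splicing in a feature map from the down-sampling path.
    def __init__(self, c_in, c_out, stride):
        super().__init__()
        k = 4 if stride == 2 else 3    # kernel chosen to double/preserve size
        self.block = nn.Sequential(
            nn.ConvTranspose2d(c_in, c_out, k, stride=stride, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip=None):
        x = self.block(x)
        if skip is not None:
            x = torch.cat([x, skip], dim=1)   # feature aggregation by splicing
        return x

up = DeconvBlock(256, 128, stride=2)
x = torch.rand(1, 256, 8, 8)          # recombined feature vector
skip = torch.rand(1, 128, 16, 16)     # matching down-sampled feature map
print(up(x, skip).shape)              # torch.Size([1, 256, 16, 16])
```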
FIG. 3 shows a flow diagram of a method 300 for dense optical flow computation according to one embodiment of the present invention.
In step 301, adjacent frames are spliced on a channel to generate a spliced vector diagram. According to one embodiment of the invention, the channel is a color channel, such as an RGB channel. According to one embodiment of the invention, the size of the vector diagram is h × w × 6.
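As a minimal illustration of this step, assuming PyTorch tensors in (batch, channel, height, width) layout, splicing two RGB frames on the channel dimension yields the h × w × 6 input:

```python
import torch

frame_prev = torch.rand(1, 3, 256, 256)   # previous frame (RGB)
frame_curr = torch.rand(1, 3, 256, 256)   # current frame (RGB)
spliced = torch.cat([frame_prev, frame_curr], dim=1)  # splice on channels
print(spliced.shape)   # torch.Size([1, 6, 256, 256]), i.e. h x w x 6
```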
In step 302, the spliced vector diagram is input into a down-sampling network for feature extraction to generate feature vectors. According to one embodiment of the invention, the downsampling network consists of 7 convolution blocks, each consisting of one convolution layer and one ReLU activation function, where the stride of 5 of the convolution layers is 2. According to one embodiment of the invention, the size of the feature vector is given by an equation rendered as an image in the original (Figure BDA0003439132950000061).
In step 303, the feature vectors generated in step 302 are mapped into the high-dimensional embedding space of the latent layer to generate a high-dimensional embedded representation sequence. According to an embodiment of the invention, the feature vectors obtained in step 302 may be mapped into the high-dimensional embedding space of the latent layer using a trainable linear mapping E.
At step 304, the high-dimensional embedded representation sequence is input into a feature processing network consisting of I Transformer layers to generate a hidden feature sequence. According to one embodiment of the invention, each Transformer layer is composed of MSA and MLP for global context feature processing.
In step 305, the hidden feature sequence z_I generated in step 304 is recombined to generate a recombined feature vector whose size is given by an equation rendered as an image in the original (Figure BDA0003439132950000062).
At step 306, the recombined feature vectors are input into the upsampling network for processing to generate a dense optical flow map. The dense optical flow map embodies the optical flow of the object motion between the two adjacent frames acquired in step 301. According to one embodiment of the invention, the upsampling network consists of 7 deconvolution blocks, each consisting of one deconvolution layer and one ReLU activation function, where the stride of 5 of the deconvolution layers is 2. According to one embodiment of the invention, the size of the dense optical flow map is h × w × 3. According to one embodiment of the invention, the upsampling network is a cascaded upsampling network enabling feature aggregation at different resolution levels, thereby refining the details of the dense optical flow.
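As a worked size trace through the method of FIG. 3, assuming a 256 × 256 input: the five stride-2 convolutions imply a 32× spatial reduction, which the five stride-2 deconvolutions of step 306 then mirror.

```python
# Size trace under the assumptions above (input resolution chosen arbitrarily).
h = w = 256                      # step 301: spliced input is 256 x 256 x 6
hh, ww = h // 2**5, w // 2**5    # step 302: five stride-2 convs -> 8 x 8
n_tokens = hh * ww               # steps 303-304: 64 tokens through I layers
print(hh, ww, n_tokens)          # 8 8 64
# steps 305-306: reshape to 8 x 8, then five stride-2 deconvs -> 256 x 256 x 3
```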
In summary, compared with the prior art, the main advantages of the invention are: (1) a multi-head self-attention mechanism is introduced into the optical flow prediction task, and the global self-attention strength of the Transformer in sequence-to-sequence prediction improves the accuracy of dense optical flow at key positions; (2) owing to the strong performance of multi-head self-attention in prediction at the feature level, the depth of the Unet up-sampling and down-sampling networks can be reduced, improving the timeliness of dense optical flow calculation.
FIG. 4 shows a block diagram 400 of an exemplary computing device, which is one example of a hardware device that may be applied to aspects of the present invention, according to one embodiment of the present invention. Computing device 400 may be any machine that may be configured to implement processing and/or computing, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smart phone, an in-vehicle computer, or any combination thereof. Computing device 400 may include components that are connected to or communicate with one another via one or more interfaces and a bus 402. For example, computing device 400 may include a bus 402, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (e.g., dedicated processing chips). Input device 406 may be any type of device capable of inputting information to a computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. Output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. Computing device 400 may also include or be connected to non-transitory storage device 410, which may be any storage device that is non-transitory and enables data storage, and which may include, but is not limited to, a disk drive, an optical storage device, a solid-state memory, a floppy disk, a flexible disk, a hard disk, a tape, or any other magnetic medium, an optical disk or any other optical medium, a ROM (read only memory), a RAM (random access memory), a cache memory, and/or any memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. Non-transitory storage device 410 may be detached from the interface. The non-transitory storage device 410 may have data/instructions/code for implementing the above-described methods and steps. Computing device 400 may also include a communication device 412. The communication device 412 may be any type of device or system capable of communicating with internal apparatus and/or with a network and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth device, an IEEE 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computing device 400 may also include a working memory 414, which working memory 414 may be any type of working memory capable of storing instructions and/or data that facilitate the operation of processor 404 and may include, but is not limited to, random access memory and/or read only memory devices.
Software components may be located in the working memory 414 including, but not limited to, an operating system 416, one or more application programs 418, drivers, and/or other data and code. Instructions for implementing the above-described methods and steps of the invention may be contained within the one or more applications 418, and the instructions of the one or more applications 418 may be read and executed by the processor 404 to implement the above-described method 300 of the invention.
It should also be appreciated that variations may be made according to particular needs. For example, customized hardware might be used, and/or particular components might be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. In addition, connections to other computing devices, such as network input/output devices, may be employed. For example, some or all of the disclosed methods and apparatus can be implemented by programming hardware (e.g., programmable logic circuitry including Field Programmable Gate Arrays (FPGAs) and/or Programmable Logic Arrays (PLAs)) in an assembly language or a hardware programming language (e.g., Verilog, VHDL, C++) with the logic and algorithms of the present invention.
Although aspects of the present invention have been described with reference to the accompanying drawings, the above-described methods and apparatuses are merely examples, and the scope of the present invention is not limited to these aspects but only by the appended claims and their equivalents. Various components may be omitted or replaced with equivalent components. The steps may also be performed in a different order than described in the present invention, and the various components may be combined in various ways. It is also important to note that, as technology develops, many of the described components may be replaced by equivalent components that appear later.

Claims (10)

1. A method for dense optical flow computation, comprising:
splicing adjacent frames on a channel to generate a spliced vector diagram;
inputting the spliced vector diagram into a down-sampling network for feature extraction to generate a feature vector;
mapping the generated feature vectors to a high-dimensional embedding space of a latent layer to generate a high-dimensional embedded representation sequence;
inputting the high-dimensional embedded representation sequence into a feature processing network consisting of I Transformer layers to generate a hidden feature sequence;
recombining the generated hidden feature sequence to generate recombined feature vectors; and
inputting the recombined feature vectors into an up-sampling network for processing so as to generate a dense optical flow map.
2. The method of claim 1, wherein the downsampling network consists of 7 convolutional blocks, each convolutional block consisting of one convolutional layer and one ReLU activation function, wherein the stride of 5 of the convolutional layers is 2.
3. The method of claim 1, wherein the Transformer layer consists of a multi-head self-attention mechanism and a multi-layer perceptron.
4. The method of claim 1, wherein the upsampling network is a cascaded upsampling network and consists of 7 deconvolution blocks, each deconvolution block consisting of one deconvolution layer and one ReLU activation function, wherein the stride of 5 of the deconvolution layers is 2.
5. The method of claim 1, wherein mapping the generated feature vectors to a high-dimensional embedding space of the potential layer to generate a high-dimensional embedded representation sequence further comprises: the feature vectors are mapped into the high-dimensional embedding space of the latent layer using a trainable linear mapping E.
6. A system for dense optical flow computation, comprising:
a downsampling module configured to:
splicing adjacent frames on a channel to generate a spliced vector diagram;
inputting the spliced vector diagram into a down-sampling network for feature extraction to generate a feature vector;
a feature processing module configured to:
mapping the feature vectors generated by the down-sampling module to a high-dimensional embedding space of a latent layer to generate a high-dimensional embedded representation sequence;
inputting the high-dimensional embedded representation sequence into a feature processing network consisting of I Transformer layers to generate a hidden feature sequence;
an upsampling module configured to:
recombining the hidden feature sequences generated by the feature processing module to generate recombined feature vectors; and
inputting the recombined feature vectors into an upsampling network for processing so as to generate a dense optical flow map.
7. The system of claim 6, wherein the downsampling network consists of 7 convolution blocks, each convolution block consisting of one convolution layer and one ReLU activation function, wherein the stride of 5 of the convolution layers is 2;
wherein the upsampling network is a cascaded upsampling network and consists of 7 deconvolution blocks, each deconvolution block consisting of one deconvolution layer and one ReLU activation function, wherein the stride of 5 of the deconvolution layers is 2.
8. The system of claim 6, wherein the Transformer layer consists of a multi-head self-attention mechanism and a multi-layer perceptron.
9. The system of claim 6, wherein mapping the generated feature vectors to a high-dimensional embedding space of the potential layer to generate a high-dimensional embedded representation sequence further comprises: the feature vectors are mapped into the high-dimensional embedding space of the latent layer using a trainable linear mapping E.
10. A computing device for dense optical flow computation, comprising:
a processor;
a memory storing instructions that, when executed by the processor, are capable of performing the method of any of claims 1-5.
CN202111623934.7A 2021-12-28 2021-12-28 Attention-based dense optical flow calculation method Pending CN114913196A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111623934.7A CN114913196A (en) 2021-12-28 2021-12-28 Attention-based dense optical flow calculation method
PCT/CN2022/097531 WO2023123873A1 (en) 2021-12-28 2022-06-08 Dense optical flow calculation method employing attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111623934.7A CN114913196A (en) 2021-12-28 2021-12-28 Attention-based dense optical flow calculation method

Publications (1)

Publication Number Publication Date
CN114913196A (en) 2022-08-16

Family

ID=82763430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111623934.7A Pending CN114913196A (en) 2021-12-28 2021-12-28 Attention-based dense optical flow calculation method

Country Status (2)

Country Link
CN (1) CN114913196A (en)
WO (1) WO2023123873A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107A (en) * 2023-06-21 2023-07-25 南昌航空大学 Optical flow calculation method, system, equipment and medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2722816A3 (en) * 2012-10-18 2017-04-19 Thomson Licensing Spatio-temporal confidence maps
CN111724360B (en) * 2020-06-12 2023-06-02 深圳技术大学 Lung lobe segmentation method, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164429A1 (en) * 2020-02-21 2021-08-26 京东方科技集团股份有限公司 Image processing method, image processing apparatus, and device
CN113610031A (en) * 2021-08-14 2021-11-05 北京达佳互联信息技术有限公司 Video processing method and video processing device
CN113709455A (en) * 2021-09-27 2021-11-26 北京交通大学 Multilevel image compression method using Transformer

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIENENG CHEN: "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation", arXiv, 8 February 2021 (2021-02-08), pages 1-11 *
LI SEN; XU HONGKE: "Video Frame Prediction Model Based on Spatio-temporal Modeling" (基于时空建模的视频帧预测模型), Internet of Things Technology (物联网技术), no. 02, 20 February 2020 (2020-02-20) *
LI YAOQIAN: "Semi-supervised Spatio-temporal Transformer Network for Semantic Segmentation of Surgical Instruments" (面向手术器械语义分割的半监督时空Transformer网络), Journal of Software (软件学报), vol. 33, no. 4, 26 October 2021 (2021-10-26), pages 1501-1515 *
YANG JIANCHENG; NI BINGBING: "Medical 3D Computer Vision: Research Progress and Challenges" (医学3D计算机视觉：研究进展和挑战), Journal of Image and Graphics (中国图象图形学报), no. 10, 16 October 2020 (2020-10-16) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107A (en) * 2023-06-21 2023-07-25 南昌航空大学 Optical flow calculation method, system, equipment and medium
CN116486107B (en) * 2023-06-21 2023-09-05 南昌航空大学 Optical flow calculation method, system, equipment and medium

Also Published As

Publication number Publication date
WO2023123873A1 (en) 2023-07-06

Similar Documents

Publication Publication Date Title
WO2020177651A1 (en) Image segmentation method and image processing device
Xie et al. Edge-guided single depth image super resolution
CN113066017A (en) Image enhancement method, model training method and equipment
CN110910437B (en) Depth prediction method for complex indoor scene
WO2022104026A1 (en) Consistency measure for image segmentation processes
WO2020146911A2 (en) Multi-stage multi-reference bootstrapping for video super-resolution
CN112308866A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113807361A (en) Neural network, target detection method, neural network training method and related products
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN112991254A (en) Disparity estimation system, method, electronic device, and computer-readable storage medium
Pang et al. Lightweight multi-scale aggregated residual attention networks for image super-resolution
CN114359361A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
Liu et al. Residual-guided multiscale fusion network for bit-depth enhancement
Jin et al. Light field reconstruction via deep adaptive fusion of hybrid lenses
CN114913196A (en) Attention-based dense optical flow calculation method
CN118570054B (en) Training method, related device and medium for image generation model
CN117151987A (en) Image enhancement method and device and electronic equipment
CN111507950B (en) Image segmentation method and device, electronic equipment and computer-readable storage medium
WO2024032331A9 (en) Image processing method and apparatus, electronic device, and storage medium
US11790633B2 (en) Image processing using coupled segmentation and edge learning
CN116630744A (en) Image generation model training method, image generation device and medium
CN116486009A (en) Monocular three-dimensional human body reconstruction method and device and electronic equipment
CN112927250B (en) Edge detection system and method based on multi-granularity attention hierarchical network
CN115272906A (en) Video background portrait segmentation model and algorithm based on point rendering
CN117523560A (en) Semantic segmentation method, semantic segmentation device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination