
WO2023103378A1 - Video frame interpolation model training method and apparatus, and computer device and storage medium

Info

Publication number
WO2023103378A1
Authority
WO
WIPO (PCT)
Prior art keywords
image frame
group
training
result
alignment
Application number
PCT/CN2022/105652
Other languages
French (fr)
Chinese (zh)
Inventor
周昆
李文博
蒋念娟
沈小勇
吕江波
Original Assignee
深圳思谋信息科技有限公司
上海思谋科技有限公司
Application filed by 深圳思谋信息科技有限公司 and 上海思谋科技有限公司
Publication of WO2023103378A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/01: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0127: Conversion of standards processed at pixel level, by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Definitions

  • the present application relates to the technical field of image processing, in particular to a video frame interpolation model training method, device, computer equipment and storage medium.
  • video frame interpolation model training technology has emerged.
  • The main purpose of video frame interpolation is to increase the frame rate of a video and thereby improve the smoothness of the picture.
  • Today, video frame interpolation technology has been applied in various fields. For example, with the development of mobile phone hardware, screen refresh rates have increased greatly, and existing video content needs a higher frame rate to match the highest refresh rate supported by the hardware.
  • a video frame interpolation method is also required, which can obtain a smoother video clip based on a small number of key image frames.
  • the embodiment of the present application provides a video frame interpolation model training method, including:
  • acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group is used as the label intermediate image frame corresponding to that group;
  • the first image frame and the third image frame in each training image frame group are input to the video frame interpolation model, and the estimated intermediate image frame corresponding to each training image frame group is output;
  • adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the label intermediate image frame and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame and the estimated intermediate image frame corresponding to that group, until training stops when the training stop condition is met; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • the first image frame and the third image frame in each training image frame group are input to the video frame interpolation model, and the estimated intermediate image frames corresponding to each training image frame group are output, including:
  • for any training image frame group, taking the first image frame and the third image frame in that group as the first image frame and the third image frame respectively, and adjusting the first image frame and the third image frame to the same resolution at the same time; wherein n-1 adjustments are performed in total, the resolution used for each adjustment is different, and n is a positive integer not less than 2;
  • performing feature extraction on the two image frames after each adjustment, where each image frame feature group is composed of the features extracted from the two image frames after one adjustment, and the image frame feature groups form the image frame feature group set;
  • a reconstruction process is performed on the result of bidirectional information fusion to obtain an estimated intermediate image frame corresponding to each training image frame group.
  • the resolutions corresponding to the image frame feature groups in the image frame feature group set are sequentially increased;
  • the features corresponding to the first image frame in the image frame feature group set are cross-scale aligned toward the features corresponding to the third image frame in the set to obtain the alignment result of the first image frame, including:
  • for the i-th image frame feature group: if i is 1, aligning the features of the first image frame in the i-th image frame feature group with the features of the third image frame in the i-th image frame feature group to obtain the i-th alignment processing result; if i is not 1, performing cross-scale fusion processing on the first i-1 bilinear interpolation calculation results and the features of the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion processing result, and aligning the i-th cross-scale fusion processing result with the features of the first image frame in the i-th image frame feature group to obtain the i-th alignment processing result; repeating this process for each image frame feature group until all image frame feature groups are processed, and using the n-th alignment processing result as the alignment result of the first image frame; wherein i is a positive integer not less than 1 and not greater than n;
  • the j-th bilinear interpolation calculation result is obtained by performing i-j consecutive bilinear interpolation calculations on the j-th alignment processing result, where j is a positive integer not less than 1 and less than i.
  • the resolutions corresponding to the image frame feature groups in the image frame feature group set are sequentially increased;
  • the features corresponding to the third image frame in the image frame feature group set are cross-scale aligned toward the features corresponding to the first image frame in the set to obtain the alignment result of the third image frame, including:
  • for the i-th image frame feature group: if i is 1, aligning the features of the third image frame in the i-th image frame feature group with the features of the first image frame in the i-th image frame feature group to obtain the i-th alignment processing result; if i is not 1, performing cross-scale fusion processing on the first i-1 bilinear interpolation calculation results and the features of the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion processing result, and aligning the i-th cross-scale fusion processing result with the features of the third image frame in the i-th image frame feature group to obtain the i-th alignment processing result; repeating this process for each image frame feature group until all image frame feature groups are processed, and using the n-th alignment processing result as the alignment result of the third image frame; wherein i is a positive integer not less than 1 and not greater than n;
  • the j-th bilinear interpolation calculation result is obtained by performing i-j consecutive bilinear interpolation calculations on the j-th alignment processing result, where j is a positive integer not less than 1 and less than i.
  • two-way information fusion is performed on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result, including:
  • the alignment result of the first image frame and the alignment result of the third image frame are fused to obtain a bidirectional information fusion result.
  • both the first difference and the third difference are similarities; the determination process of the first difference and the third difference includes:
  • for any training image frame group, selecting any t*t pixel block from the estimated intermediate image frame corresponding to that group; according to the position of the center pixel of the t*t pixel block in the estimated intermediate image frame, determining a t*t first target pixel block in the first image frame of that group and a t*t third target pixel block in the third image frame of that group; wherein t is an odd number not equal to 1;
  • determining, according to the first character set, the second character set, and the third character set, the similarity between the selected pixel block and the first target pixel block as the first difference, and the similarity between the selected pixel block and the third target pixel block as the third difference.
  • the process of determining the second difference includes:
  • for any training image frame group, determining, according to the RGB values of all pixels in the second image frame of that group and the RGB values of all pixels in the estimated intermediate image frame corresponding to that group, the RGB value difference between the second image frame and the estimated intermediate image frame, and using this difference as the second difference.
  • the embodiment of the present application also provides a video frame interpolation model training device, including:
  • an acquisition module, used to obtain training image frame groups, where each training image frame group is formed by three consecutive image frames of a video arranged in order, and the second image frame in each group is used as the label intermediate image frame corresponding to that group;
  • the video frame interpolation module is used to input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the corresponding estimated intermediate image frame of each training image frame group;
  • an adjustment module, configured to adjust the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the label intermediate image frame in each group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each group and the estimated intermediate image frame corresponding to that group, until training stops when the training stop condition is satisfied; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • the embodiment of the present application further provides a computer device.
  • the computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
  • acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each group is used as the label intermediate image frame corresponding to that group;
  • adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the label intermediate image frame and the estimated intermediate image frame, and the third difference between the third image frame and the estimated intermediate image frame, until training stops when the training stop condition is met; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:
  • acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each group is used as the label intermediate image frame corresponding to that group;
  • adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the label intermediate image frame and the estimated intermediate image frame, and the third difference between the third image frame and the estimated intermediate image frame, until training stops when the training stop condition is met; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • the embodiment of the present application further provides a computer program product.
  • Said computer program product comprises a computer program which, when executed by a processor, implements the following steps:
  • acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each group is used as the label intermediate image frame corresponding to that group;
  • adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the label intermediate image frame and the estimated intermediate image frame, and the third difference between the third image frame and the estimated intermediate image frame, until training stops when the training stop condition is met; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • Adjusting the parameters in the video frame interpolation model in this way makes the texture of the intermediate image frames output by the model clearer and closer to the texture structure of the input image frames, avoiding the generation of blurred content with unclear texture.
  • FIG. 1 is an application environment diagram of the video frame interpolation model training method in an embodiment of the present application.
  • FIG. 2 is a schematic flow diagram of a video frame interpolation model training method in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the reconstruction process of the video frame interpolation model training method in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the cross-scale alignment processing of the video frame interpolation model training method in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the matching process of the video frame interpolation model training method in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the training process of the video frame interpolation model training method in an embodiment of the present application.
  • FIG. 7a is a comparative evaluation result diagram of single-frame video interpolation in an embodiment of the present application.
  • FIG. 7b is a comparative evaluation result diagram of multi-frame video interpolation in an embodiment of the present application.
  • FIG. 7c is a comparative evaluation result diagram of single-frame video extrapolation in an embodiment of the present application.
  • FIG. 7d is a comparison diagram of visual effects after integrating the trained video frame interpolation model into a video super-resolution model in an embodiment of the present application.
  • FIG. 7e is a visual comparison diagram of single-frame video interpolation in an embodiment of the present application.
  • FIG. 7f is a visual comparison diagram of multi-frame video interpolation in an embodiment of the present application.
  • FIG. 7g is a visual comparison diagram of single-frame video extrapolation in an embodiment of the present application.
  • FIG. 7h is a comparison diagram of the impact of single-frame video interpolation on video super-resolution in an embodiment of the present application.
  • FIG. 7i is a single visual comparison diagram with the TCL loss function added in an embodiment of the present application.
  • FIG. 7j shows multiple visual comparison diagrams with the TCL loss function added in an embodiment of the present application.
  • FIG. 8 is a structural block diagram of a video frame interpolation model training device in an embodiment of the present application.
  • FIG. 9 is an internal structure diagram of a computer device in an embodiment of the present application.
  • The terms "first" and "second" used in the embodiments of the present application may be used to describe various technical terms herein, but unless otherwise specified, these technical terms are not limited by such terms; the terms are only used to distinguish one term from another.
  • the third preset threshold and the fourth preset threshold may be the same or different.
  • the video frame interpolation model training method provided in the embodiment of the present application can be applied to the application environment shown in FIG. 1 .
  • the terminal 101 communicates with the server 102 through a network.
  • the data storage system can store data that needs to be processed by the server 102 .
  • the data storage system can be integrated on the server 102, or placed on the cloud or other network servers.
  • the terminal 101 acquires the training image frame group, and the server processes the training image frame group.
  • the processing function of the server 102 can also be integrated directly into the terminal 101; that is, the terminal 101 acquires the training image frames and processes them to obtain a trained video frame interpolation model.
  • the terminal 101 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, IoT devices and portable wearable devices.
  • the server 102 can be implemented by an independent server or a server cluster composed of multiple servers.
  • a video frame interpolation model training method is provided.
  • the method is applied to the terminal 101 in FIG. 1 as an example for illustration, including the following steps:
  • each training image frame group is composed of three consecutive image frames arranged in sequence in the video, and the second image frame in each training image frame group is used as the label intermediate image frame corresponding to that group.
  • Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each group and the estimated intermediate image frame corresponding to that group, the parameters in the video frame interpolation model are adjusted until training ends when the training stop condition is satisfied; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • A training image frame group is obtained by performing image frame extraction on a video and taking every three consecutive extracted image frames as one training image frame group.
  • the three image frames in each training image frame group are arranged in order of their appearance time in the video.
  • The source may be a single video or multiple different videos, so the obtained training image frame groups can come from one video or from several videos (see the sketch below).
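  • The sketch below illustrates this grouping, assuming the frames are already decoded into an in-memory list in playback order; the function and variable names are illustrative, not from the patent.

```python
# Illustrative sketch: build training image frame groups from a list of
# decoded video frames. Every three consecutive frames form one group;
# the middle (second) frame is kept as the label intermediate frame.
def build_training_groups(frames):
    groups = []
    for k in range(len(frames) - 2):
        first, label, third = frames[k], frames[k + 1], frames[k + 2]
        groups.append({"input": (first, third), "label": label})
    return groups
```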
  • Because the second image frame in each image frame group is the intermediate image frame between the first image frame and the third image frame of that group, its content forms the connecting content between the first and third image frames. Therefore, in this embodiment the second image frame is used as the label intermediate image frame corresponding to each training image frame group, and the label intermediate image frame is used as the supervision image of each group, so that the video frame interpolation model can be trained with supervision.
  • the estimated intermediate image frame corresponding to each training image frame group will be obtained.
  • The content of the estimated intermediate image frame is obtained by processing the contents of the first image frame and the third image frame, and it is similar to the content of the label intermediate image frame corresponding to each training image frame group.
  • The second image frame of each training image frame group is only one possible intermediate frame between the first image frame and the third image frame of that group.
  • For example, suppose the video captures a ball moving from point A to point E through points B and C. If the first image frame of a training image frame group shows the ball at point A and the third image frame shows the ball at point E, the second image frame may show the ball at point B. Yet the moving ball also passed through point C; that position was simply not captured, because a video consists of still image frames. A video therefore cannot reflect the continuous motion of the ball in time; the motion it records only shows that the ball was at a certain position at a certain moment.
  • The training stop condition is as follows: the video frame interpolation model continuously adjusts its parameters during training, and when the rate of change of the parameters no longer exceeds a predetermined range, the model satisfies the training stop condition.
  • To supervise training, a supervisory function is added in this embodiment, so that the video frame interpolation model adjusts its parameters during training and is continuously optimized.
  • The supervisory function is divided into two parts. The first part is the first loss function, which is determined by the second difference between the label intermediate image frame corresponding to each training image frame group and the estimated intermediate image frame corresponding to that group.
  • The second part is the texture consistency loss function (Texture Consistency Loss, TCL), which is determined by the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each group and the estimated intermediate image frame corresponding to that group.
  • That the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment means that, in the supervisory function, the first loss function has a greater influence on the parameter adjustment than the texture consistency loss function.
  • The supervisory function can be expressed as formula (1):

    L = L1(Î0, I0) + λ · Lp(Î0, I-1, I1)    (1)

  • In formula (1), Î0 represents the estimated intermediate image frame corresponding to each training image frame group, I0 represents the label intermediate image frame corresponding to each group, I-1 represents the first image frame in each group, I1 represents the third image frame in each group, λ is an adjustable coefficient, L1 is the first loss function, and Lp is the texture consistency loss function.
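  • A minimal sketch of formula (1) as a training objective follows, assuming PyTorch (the patent does not name a framework); texture_consistency_loss is a placeholder for Lp, and the default weight lam is an illustrative assumption.

```python
import torch.nn.functional as F

def supervisory_loss(pred_mid, label_mid, first_frame, third_frame,
                     texture_consistency_loss, lam=0.5):
    # First part: the first loss function L1, driven by the second difference
    # between the estimated and the label intermediate image frames.
    l1 = F.l1_loss(pred_mid, label_mid)
    # Second part: texture consistency loss Lp, driven by the first and third
    # differences against the first and third input image frames.
    lp = texture_consistency_loss(pred_mid, first_frame, third_frame)
    # A weight below 1 keeps the first loss more strongly correlated with
    # the parameter adjustment than the texture consistency term.
    return l1 + lam * lp
```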
  • Since the texture consistency loss function is added on top of the original supervisory function, during supervised training the video frame interpolation model considers not only the content of the label intermediate image frame corresponding to each training image frame group but also the content of the first image frame and the third image frame in each group. This alleviates the over-constraint problem in supervised training, so that the image frames output by the video frame interpolation model have higher texture definition, signal-to-noise ratio, and structural similarity, thereby increasing the frame rate of the video and the smoothness of the picture.
  • the first image frame and the third image frame in each training image frame group are input to the video frame interpolation model, and the estimated intermediate image frames corresponding to each training image frame group are output, including:
  • for any training image frame group, the first image frame and the third image frame in that group are used as the first image frame and the third image frame respectively, and the first image frame and the third image frame are adjusted to the same resolution at the same time; wherein a total of n-1 adjustments are made, the resolution used for each adjustment is different, and n is a positive integer not less than 2.
  • The first image frame and the third image frame undergo n-1 resolution adjustments; after each adjustment, the resulting first and third image frames have a lower resolution than the first and third image frames before that adjustment.
  • For example, the third resolution adjustment reduces the resolution of the first and third image frames obtained after the second resolution adjustment; therefore, the resolutions of the first and third image frames obtained after the third adjustment are smaller than those obtained after the second adjustment.
  • The number of resolution adjustments of the first image frame and the third image frame should not be less than 1.
  • The image frames after resolution adjustment are grouped, with image frames of the same resolution forming one group. Since n-1 resolution adjustments are performed, plus the original image frames without resolution adjustment, there are n groups of image frames with different resolutions. Feature extraction is then performed on the n groups of image frames to obtain the set of n image frame feature groups.
  • This embodiment does not specifically limit the method for obtaining image frame feature groups of different resolutions. Besides the implementation of steps 301 and 302 above, another option is: for any training image frame group, the first image frame and the third image frame in that group are used as the first image frame and the third image frame respectively, features are extracted from the first and third image frames at the same resolution at the same time, the features extracted from the two image frames after each adjustment form one image frame feature group, and the image frame feature groups form the image frame feature group set.
  • In this option, features are extracted n times in total, the resolution of the features extracted each time is different, and n is a positive integer not less than 2.
  • Convolution may be used to perform resolution adjustment and feature extraction on the first image frame and the third image frame simultaneously, so as to obtain the image frame feature group set of step 302 above; a sketch of this idea follows.
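  • The sketch below assumes PyTorch; the channel width and the use of stride-2 convolutions for each resolution adjustment are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class FeaturePyramid(nn.Module):
    """Extracts n feature maps at successively halved resolutions."""
    def __init__(self, n=3, channels=64):
        super().__init__()
        # The first level keeps the input resolution; each further level
        # halves it with a stride-2 convolution (n-1 adjustments in total).
        layers = [nn.Conv2d(3, channels, 3, stride=1, padding=1)]
        layers += [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                   for _ in range(n - 1)]
        self.levels = nn.ModuleList(layers)

    def forward(self, frame):
        feats, x = [], frame
        for conv in self.levels:
            x = torch.relu(conv(x))
            feats.append(x)
        return feats  # feats[0]: highest resolution; feats[-1]: lowest

# Applied to both the first and the third image frame; features of equal
# resolution from the two frames form one image frame feature group.
```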
  • The order of obtaining the alignment result of the first image frame and the alignment result of the third image frame is not specifically limited in the embodiment of the present application: the alignment result of the first image frame may be obtained first and then that of the third image frame, the alignment result of the third image frame may be obtained first and then that of the first image frame, or the two alignment results may be obtained at the same time.
  • The process of cross-scale aligning the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the set is the same as the process of cross-scale aligning the features corresponding to the third image frame toward the features corresponding to the first image frame.
  • The reconstruction process refers to regressing the estimated intermediate image frame from the bidirectional information fusion result. Specifically, the bidirectional information fusion result is processed first, the processing result is then passed through a single-layer convolution, and finally the estimated intermediate image frame is output.
  • The bidirectional information fusion result F0 is first input to the first layer (Layer1) for processing, the processing result is then input to the second layer (Layer2) for single-layer convolution, and finally the estimated intermediate image frame is output.
  • “40 ⁇ RB(128)” indicates that 40 "RB(128)” are used, and RB(128) indicates a residual block with a channel dimension of 128.
  • Conv(128,3,3,1) represents a single-layer convolution whose input channel number is 128, output channel number is 3, convolution kernel size is 3, and convolution stride is 1.
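  • The following sketch restates this reconstruction in PyTorch; the internal layout of each residual block is an assumption, since the text above only fixes the block count (40), the channel width (128), and the final Conv(128,3,3,1).

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)  # residual connection

class Reconstruction(nn.Module):
    """Layer1: 40 x RB(128); Layer2: Conv(128 -> 3, kernel 3, stride 1)."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(*[ResidualBlock(128) for _ in range(40)])
        self.layer2 = nn.Conv2d(128, 3, 3, stride=1, padding=1)

    def forward(self, fusion_result):  # bidirectional fusion result F0
        return self.layer2(self.layer1(fusion_result))  # estimated frame
```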
  • the parameters of the video frame interpolation model can be adjusted, thereby improving the quality of the output image frames of the video frame interpolation model.
  • The resolutions corresponding to the image frame feature groups in the image frame feature group set increase sequentially; performing cross-scale alignment (Cross-scale Pyramid Alignment) processing of the features corresponding to the first image frame in the set toward the features corresponding to the third image frame in the set to obtain the alignment result of the first image frame includes:
  • for the i-th image frame feature group: if i is 1, the features of the first image frame in the i-th image frame feature group and the features of the third image frame in the i-th image frame feature group undergo alignment (Alignment Block, AB) processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation (Bilinear Upsampling, BU) calculation results and the features of the third image frame in the i-th image frame feature group undergo cross-scale fusion (Cross-scale Fusion, CSF) processing to obtain the i-th cross-scale fusion processing result, and the i-th cross-scale fusion processing result is aligned with the features of the first image frame in the i-th image frame feature group to obtain the i-th alignment processing result; this process is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment processing result is used as the alignment result of the first image frame; wherein i is a positive integer not less than 1 and not greater than n;
  • the j-th bilinear interpolation calculation result is obtained by performing i-j consecutive bilinear interpolation calculations on the j-th alignment processing result, where j is a positive integer not less than 1 and less than i.
  • This embodiment does not specifically limit the resolutions corresponding to the image frame feature groups in the image frame feature group set.
  • the number of alignment processes is the same as the number of image frame feature groups in the set of image frame feature groups in step 302 above.
  • For example, if the image frame feature group set contains four image frame feature groups, then four alignment processes are required when cross-scale aligning the features corresponding to the first image frame in the set toward the features corresponding to the third image frame in the set.
  • The number of alignment processes in this cross-scale alignment should not be less than 2.
  • The following description takes an image frame feature group set containing three image frame feature groups as an example.
  • In (a) of FIG. 4, the first image frame feature group is the one with the highest resolution in the set, i.e., its resolution is the same as the resolution of the image frames before resolution adjustment; the second image frame feature group is the middle-resolution group; and the third image frame feature group is the one with the smallest resolution in the set.
  • FIG. 4 also shows the three features extracted from the first image frame after two resolution adjustments, the three features extracted from the third image frame after two resolution adjustments, and the alignment result of the first image frame.
  • By aligning features of the same resolution and adding a cross-scale fusion process, the method provided in the embodiment of the present application can extract effective reconstruction signals from image frames of multiple scales, improving the accuracy of the output alignment result of the first image frame and comprehensively and effectively utilizing multi-scale information. A structural sketch of this alignment loop follows.
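  • The sketch below walks the feature groups from the lowest to the highest resolution, assuming PyTorch; align (the AB step) and cross_scale_fuse (the CSF step) are placeholder callables, since their internals are not reproduced in the text above, and a single resize stands in for the i-j consecutive bilinear interpolations.

```python
import torch.nn.functional as F

def cross_scale_pyramid_alignment(src_feats, ref_feats, align, cross_scale_fuse):
    """src_feats: features of the frame being aligned (index 0 = lowest
    resolution); ref_feats: features of the other frame. Returns the n-th
    alignment processing result, i.e. the alignment result of src."""
    results = []
    for i, (src, ref) in enumerate(zip(src_feats, ref_feats), start=1):
        if i == 1:
            aligned = align(src, ref)  # AB on the lowest-resolution group
        else:
            # Bilinearly upsample each earlier result to the current
            # resolution, then fuse them with the reference features (CSF).
            ups = [F.interpolate(r, size=ref.shape[-2:], mode="bilinear",
                                 align_corners=False) for r in results]
            fused = cross_scale_fuse(ups, ref)
            aligned = align(src, fused)  # AB against the fused reference
        results.append(aligned)
    return results[-1]
```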
  • Performing cross-scale alignment processing of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the set to obtain the alignment result of the third image frame includes:
  • for the i-th image frame feature group: if i is 1, the features of the third image frame in the i-th image frame feature group and the features of the first image frame in the i-th image frame feature group undergo alignment processing to obtain the i-th alignment processing result; if i is not 1, the first i-1 bilinear interpolation calculation results and the features of the first image frame in the i-th image frame feature group undergo cross-scale fusion processing to obtain the i-th cross-scale fusion processing result, and the i-th cross-scale fusion processing result is aligned with the features of the third image frame in the i-th image frame feature group to obtain the i-th alignment processing result; this process is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment processing result is used as the alignment result of the third image frame.
  • The processing method for obtaining the alignment result of the third image frame is the same as that for obtaining the alignment result of the first image frame, so it is not described here again.
  • For the specific processing procedure of obtaining the alignment result of the third image frame, refer to the above processing procedure of obtaining the alignment result of the first image frame.
  • The alignment result of the third image frame can be obtained by performing cross-scale alignment processing of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the set.
  • two-way information fusion is performed on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result, including:
  • This embodiment does not specifically limit the calculation method used to process the convolution result, which includes but is not limited to the Sigmoid function.
  • Specifically, the alignment result of the first image frame and the alignment result of the third image frame are subjected to a single-layer convolution to obtain a convolution processing result; the convolution processing result is then passed through an activation function to obtain the fusion weight; and the alignment result of the first image frame and the alignment result of the third image frame are then fused according to formula (2) using the fusion weight to obtain the bidirectional information fusion result.
  • In formula (2), M is the fusion weight and F0 is the bidirectional information fusion result.
  • By performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain the bidirectional information fusion result, the method provided in the embodiment of the present application improves the quality of the image frames output by the video frame interpolation model. A sketch of this fusion step follows.
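  • The sketch below assumes PyTorch and assumes the common blending form F0 = M * A + (1 - M) * B for formula (2), whose exact expression is not reproduced above; the channel width is also an assumption.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        # Single-layer convolution over the concatenated alignment results.
        self.conv = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, aligned_first, aligned_third):
        conv_out = self.conv(torch.cat([aligned_first, aligned_third], dim=1))
        m = torch.sigmoid(conv_out)  # fusion weight M via the activation
        # Assumed form of formula (2): blend the two alignment results.
        return m * aligned_first + (1 - m) * aligned_third
```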
  • both the first difference and the third difference are similarities; the determination process of the first difference and the third difference includes:
  • for any training image frame group, any t*t pixel block is selected from the estimated intermediate image frame corresponding to that group; according to the position of the center pixel of the t*t pixel block in the estimated intermediate image frame, a t*t first target pixel block is determined in the first image frame of that group and a t*t third target pixel block is determined in the third image frame of that group; wherein t is an odd number not equal to 1.
  • According to the first character set, the second character set, and the third character set, the similarity between the selected pixel block and the first target pixel block is determined as the first difference, and the similarity between the selected pixel block and the third target pixel block is determined as the third difference.
  • Specifically, any t*t pixel block is selected from the estimated intermediate image frame, where t is an odd number other than 1 (for example, 3, 5, or 7) and x is the two-dimensional coordinate of the center pixel of the pixel block. Then, according to the two-dimensional coordinate x of the center pixel of the pixel block, the t*t first target pixel block is determined in the first image frame of the training image frame group and the t*t third target pixel block is determined in the third image frame of that group.
  • The t*t first target pixel block and the t*t third target pixel block are used as the pixels to be matched.
  • The texture consistency loss of the center pixel at x of the selected t*t pixel block is calculated by the texture consistency loss function (Texture Consistency Loss, TCL), and the video frame interpolation model is trained according to this texture consistency loss.
  • The texture consistency loss of the center pixel is calculated after the best matching pixel corresponding to the center pixel is selected from the t*t first target pixel block and the t*t third target pixel block.
  • The texture consistency loss is determined by comparing the RGB value of the center pixel of the selected t*t pixel block with that of the best matching pixel.
  • The pixel at the two-dimensional coordinate x is the center pixel, and all pixels f_y^t to be matched are obtained from the first image frame I-1 and the third image frame I1 within a certain range d.
  • The value of d is an odd number not less than 3, such as 3, 5, or 7; the superscript t ∈ {-1, 1} indicates whether the pixel to be matched comes from the first image frame or the third image frame; and y represents the two-dimensional coordinate of the pixel f_y^t to be matched.
  • The formula for determining the two-dimensional coordinate ψ(x) is shown in formula (3), where f_x(x) is the RGB value of the pixel at the center position of any 3*3 pixel block, f_x(x + x_n) is the RGB value of another pixel to be matched, x is the coordinate of the center pixel, taken as (0, 0), x_n is the two-dimensional offset of another pixel to be matched, and R is the set of offsets of the eight pixels other than the center pixel: R = {(-1,-1), (-1,0), (-1,1), (1,-1), (1,1), (1,0), (0,1), (0,-1)}. L2 is a matching function used for similarity matching.
  • Because the texture consistency loss function is used, the method provided in the embodiment of the present application can alleviate the over-constraint problem caused by the motion ambiguity of objects in the image frames, so that the texture of the image frames output by the trained video frame interpolation model is clearer and closer to the texture structure of the input image frames, avoiding blurred content with unclear texture. A per-patch sketch of this loss follows.
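  • The per-patch sketch below illustrates the matching and loss computation for t = 3, assuming PyTorch; the neighbor-comparison descriptor standing in for the character sets and the L1 comparison of RGB values are assumptions consistent with the description above, not the patent's exact formulas.

```python
import torch

# Offsets R of the eight pixels around the center of a 3*3 patch.
R = [(-1, -1), (-1, 0), (-1, 1), (1, -1), (1, 1), (1, 0), (0, 1), (0, -1)]

def descriptor(patch):
    """patch: (3, 3, 3) RGB patch. Returns the differences between the
    center pixel and its eight neighbors (a census-style descriptor)."""
    c = patch[:, 1, 1]  # center pixel, offset (0, 0)
    return torch.stack([c - patch[:, 1 + dy, 1 + dx] for dy, dx in R])

def texture_consistency_loss(pred_patch, candidates_first, candidates_third):
    """pred_patch: t*t patch around x in the estimated frame; candidates_*:
    non-empty lists of t*t target patches from I-1 / I1 within range d."""
    desc_pred = descriptor(pred_patch)
    best_rgb, best_dist = None, None
    for cand in candidates_first + candidates_third:
        dist = torch.norm(desc_pred - descriptor(cand))  # L2 matching
        if best_dist is None or dist < best_dist:
            best_dist, best_rgb = dist, cand[:, 1, 1]
    # Compare RGB values of the predicted center pixel and the best match.
    return torch.abs(pred_patch[:, 1, 1] - best_rgb).mean()
```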
  • the process of determining the second difference includes:
  • the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group is used as the second difference.
  • For any training image frame group, before determining the second difference, the RGB values of all pixels in the label intermediate image frame corresponding to that group and the RGB values of all pixels in the estimated intermediate image frame corresponding to that group are determined. The RGB values of pixels with the same two-dimensional coordinates in the label intermediate image frame and the estimated intermediate image frame are then compared one by one to determine the differences between the RGB values of all pixels in the two frames. The differences of the RGB values of all pixels are summed and then averaged, and the average value is used as the second difference.
  • This difference can be used to realize supervised training of the video frame interpolation model, which improves the accuracy of the image frames output by the model and thereby improves the fluency and clarity of the video. A minimal sketch of this computation follows.
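  • A minimal sketch of the second difference, assuming the two frames are PyTorch tensors of identical shape (3, H, W):

```python
import torch

def second_difference(label_mid, pred_mid):
    # Per-pixel RGB differences, summed and averaged over all pixels.
    return torch.abs(label_mid - pred_mid).mean()
```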
  • In one embodiment, after the video frame interpolation model is trained, the method further includes the following usage process.
  • the training process of the video frame interpolation model is shown in Figure 6.
  • The usage process is as follows: a video for which frame interpolation is required is obtained, image frame extraction is performed on the video, and two image frames are selected from the extracted image frames. The two image frames are input into the trained video frame interpolation model, and after processing by the model, the intermediate image frame of the two image frames is output.
  • The video frame interpolation model trained in this embodiment can complete not only single-frame video interpolation and extrapolation but also multi-frame video interpolation. That is, the trained model can be used to generate an intermediate image frame between two image frames, to generate a future image frame placed after two image frames, or to generate an intermediate image frame from multiple image frames. A usage sketch follows.
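  • An illustrative usage sketch follows, assuming a trained PyTorch model object; frame decoding is omitted and the two-input model interface is an assumption.

```python
import torch

@torch.no_grad()
def interpolate_pair(model, frame_a, frame_b):
    """frame_a, frame_b: (3, H, W) tensors of the two selected key frames.
    Returns the estimated intermediate image frame between them."""
    model.eval()
    mid = model(frame_a.unsqueeze(0), frame_b.unsqueeze(0))  # add batch dim
    return mid.squeeze(0)
```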
  • the output results of the video frame interpolation model can effectively improve the performance of video super-resolution.
  • the comparison diagrams of the obtained image frames are shown in Fig. 7a to Fig. 7j.
  • Fig. 7a is a comparison and evaluation result diagram of single-frame video interpolation, wherein, the video frame interpolation model has two input image frames and outputs one intermediate image frame.
  • Fig. 7b is a diagram of comparative evaluation results of multi-frame video interpolation, where the video frame interpolation model has 4 input image frames and outputs 1 intermediate image frame.
  • Fig. 7c is a comparative evaluation result diagram of single-frame video extrapolation, in which, the video frame interpolation model takes 2 input image frames and outputs 1 future image frame.
  • Fig. 7d is a comparison diagram of visual effect after integrating the trained video frame interpolation model into a video super-resolution model.
  • Fig. 7e is a visual comparison diagram of single-frame video interpolation.
  • Fig. 7f is a visual comparison diagram of multi-frame video interpolation.
  • Figure 7g is a visual comparison diagram of single-frame video extrapolation.
  • Fig. 7h is a comparison diagram of the impact of single-frame video interpolation on video super-resolution.
  • Figure 7i is a single visualization comparison diagram with TCL loss function added.
  • Figure 7j is a comparison of multiple visualizations with TCL loss function added.
  • The method provided by this application uses the trained video frame interpolation model to process the video to be processed and can output high-definition image frames, thereby effectively improving the performance of video super-resolution.
  • the method provided in this embodiment achieves the highest peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR) and structural similarity (Structural Similarity, SSIM).
  • Although the steps in the flowcharts involved in the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; the execution order of these sub-steps or stages is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
  • an embodiment of the present application also provides a video frame interpolation model training device for implementing the above-mentioned video frame interpolation model training method.
  • Since the solution to the problem provided by the device is similar to the implementation described in the above method, for the specific limitations in one or more embodiments of the video frame interpolation model training device provided below, reference may be made to the limitations of the video frame interpolation model training method above, which will not be repeated here.
  • a video frame interpolation model training device is provided, including: an acquisition module, a video frame interpolation module, and an adjustment module, wherein:
  • the acquisition module 801 is used to obtain training image frame groups, where each training image frame group is formed by three consecutive image frames of a video arranged in order, and the second image frame in each group is used as the label intermediate image frame corresponding to that group;
  • the video frame interpolation module 802 is used to input the first image frame and the third image frame in each training image frame group to the video frame interpolation model, and output the estimated intermediate image frame corresponding to each training image frame group;
  • the adjustment module 803 is configured to adjust the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the label intermediate image frame corresponding to each group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each group and the estimated intermediate image frame corresponding to that group, until training ends when the training stop condition is satisfied; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • the video frame interpolation module 802 includes:
  • an adjustment sub-module, used, for any training image frame group, to take the first image frame and the third image frame in that group as the first image frame and the third image frame respectively, and to adjust the first image frame and the third image frame to the same resolution at the same time; wherein a total of n-1 adjustments are made, the resolution used for each adjustment is different, and n is a positive integer not less than 2;
  • a feature extraction sub-module, used to perform feature extraction on the two image frames after each adjustment, where each image frame feature group is composed of the features extracted from the two image frames after one adjustment, and the image frame feature groups constitute the image frame feature group set;
  • a first alignment sub-module, used to perform cross-scale alignment processing of the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the set, to obtain the alignment result of the first image frame;
  • a second alignment sub-module, used to perform cross-scale alignment processing of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the set, to obtain the alignment result of the third image frame;
  • the two-way information fusion sub-module is used to perform two-way information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a two-way information fusion result;
  • the reconstruction module performs reconstruction processing on the two-way information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
  • the first alignment submodule includes:
  • a first repeating unit, used, for the i-th image frame feature group, to: if i is 1, align the features of the first image frame in the i-th image frame feature group with the features of the third image frame in the i-th image frame feature group to obtain the i-th alignment processing result; if i is not 1, perform cross-scale fusion processing on the first i-1 bilinear interpolation calculation results and the features of the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the features of the first image frame in the i-th image frame feature group to obtain the i-th alignment processing result, and repeat this process for each image frame feature group until all image frame feature groups are processed, using the n-th alignment processing result as the alignment result of the first image frame; wherein i is a positive integer not less than 1 and not greater than n;
  • the j-th bilinear interpolation calculation result is obtained by performing i-j consecutive bilinear interpolation calculations on the j-th alignment processing result, where j is a positive integer not less than 1 and less than i.
  • the second alignment submodule includes:
  • a second repeating unit, used, for the i-th image frame feature group, to: if i is 1, align the features of the third image frame in the i-th image frame feature group with the features of the first image frame in the i-th image frame feature group to obtain the i-th alignment processing result; if i is not 1, perform cross-scale fusion processing on the first i-1 bilinear interpolation calculation results and the features of the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion processing result, align the i-th cross-scale fusion processing result with the features of the third image frame in the i-th image frame feature group to obtain the i-th alignment processing result, and repeat this process for each image frame feature group until all image frame feature groups are processed, using the n-th alignment processing result as the alignment result of the third image frame; wherein i is a positive integer not less than 1 and not greater than n;
  • the j-th bilinear interpolation calculation result is obtained by performing i-j consecutive bilinear interpolation calculations on the j-th alignment processing result, where j is a positive integer not less than 1 and less than i.
  • the two-way information fusion submodule includes:
  • a first acquisition unit, configured to convolve the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;
  • a second acquisition unit, used to calculate the convolution result to obtain the fusion weight;
  • the first processing unit is configured to perform fusion processing on the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a two-way information fusion result.
  • the adjustment module 803 includes:
  • a first determining unit, used, for any training image frame group, to select any t*t pixel block from the estimated intermediate image frame corresponding to that group and, according to the position of the center pixel of the t*t pixel block in the estimated intermediate image frame corresponding to that group, determine a t*t first target pixel block in the first image frame of that group and a t*t third target pixel block in the third image frame of that group; wherein t is an odd number not equal to 1;
  • a second determining unit, used to determine the first character set according to the t*t first target pixel block, determine the third character set according to the t*t third target pixel block, and determine the second character set according to the selected t*t pixel block;
  • a third determining unit, used to determine, according to the first character set, the second character set, and the third character set, the similarity between the selected pixel block and the first target pixel block as the first difference, and the similarity between the selected pixel block and the third target pixel block as the third difference.
  • the adjustment module 803 further includes:
  • the fourth determining unit is used for: for any training image frame group, determining, according to the RGB values of all pixels in the label intermediate image frame corresponding to the group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the group, the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.
  • the comparison module is used to compare the first difference with the third difference to determine the best matching pixel for the center pixel of the t*t block, and to compute, according to the best matching pixel, the texture consistency loss of the center pixel of the t*t block through a texture consistency loss function; where the texture consistency loss is used to train the video frame interpolation model.
  • the image frame acquisition module is configured to acquire two image frames to be processed in a video to be processed;
  • the input module is used to input the two image frames to be processed into the trained video frame interpolation model to obtain an intermediate image frame of the two image frames to be processed.
  • Each module in the above video frame interpolation model training device can be realized in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure may be as shown in FIG. 9.
  • the computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. Wherein, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, mobile cellular network, NFC (Near Field Communication) or other technologies.
  • When the computer program is executed by the processor, a video frame interpolation model training method is implemented.
  • the display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen;
  • the input device of the computer device may be a touch layer covering the display screen, a button, a trackball or a touchpad provided on the casing of the computer device, or an external keyboard, touchpad or mouse.
  • FIG. 9 is only a block diagram of a part of the structure related to the embodiment of the application, and does not constitute a limitation on the computer equipment to which the embodiment of the application is applied.
  • the computer device may include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, a computer program is stored in the memory, and the processor implements the following steps when executing the computer program:
  • training image frame groups are acquired, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group is used as the label intermediate image frame corresponding to each training image frame group;
  • the first image frame and the third image frame in each training image frame group are input into the video frame interpolation model, and the estimated intermediate image frame corresponding to each training image frame group is output;
  • based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the second difference between the label intermediate image frame corresponding to each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the parameters in the video frame interpolation model are adjusted, and the training is ended once the training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • for any training image frame group, the first image frame and the third image frame in the group are taken as the first image frame and the third image frame respectively, and the first image frame and the third image frame are adjusted simultaneously using the same resolution; the adjustment is performed n-1 times in total, a different resolution is used each time, and n is a positive integer not less than 2;
  • feature extraction is performed on each of the two image frames after each adjustment; each image frame feature group is composed of the features extracted from the two adjusted image frames, and the image frame feature group set is composed of the image frame feature groups;
  • a reconstruction process is performed on the result of bidirectional information fusion to obtain an estimated intermediate image frame corresponding to each training image frame group.
  • for the i-th image frame feature group, if i is 1, the feature corresponding to the first image frame in the i-th image frame feature group is aligned with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, cross-scale fusion is performed on the first i-1 bilinear interpolation results and the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and the i-th cross-scale fusion result is aligned with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; the above processing is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment result is taken as the alignment result of the first image frame; where i is a positive integer not less than 1 and not greater than n;
  • among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, where j is a positive integer not less than 1 and less than i.
  • for the i-th image frame feature group, if i is 1, the feature corresponding to the third image frame in the i-th image frame feature group is aligned with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, cross-scale fusion is performed on the first i-1 bilinear interpolation results and the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and the i-th cross-scale fusion result is aligned with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; the above processing is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment result is taken as the alignment result of the third image frame; where i is a positive integer not less than 1 and not greater than n;
  • among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, where j is a positive integer not less than 1 and less than i.
  • the alignment result of the first image frame and the alignment result of the third image frame are fused to obtain a bidirectional information fusion result.
  • for any training image frame group, any t*t block of pixels is selected from the estimated intermediate image frame corresponding to the group; according to the position of the center pixel of the t*t block in the estimated intermediate image frame corresponding to the group, a t*t block of first target pixels is determined in the first image frame of the group and a t*t block of third target pixels is determined in the third image frame of the group; where t is an odd number not equal to 1;
  • according to the first character set, the second character set and the third character set, the similarity between the selected pixel and the first target pixel is determined as the first difference, and the similarity between the selected pixel and the third target pixel is determined as the third difference.
  • the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group is used as the second difference.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:
  • training image frame groups are acquired, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group is used as the label intermediate image frame corresponding to each training image frame group;
  • the first image frame and the third image frame in each training image frame group are input into the video frame interpolation model, and the estimated intermediate image frame corresponding to each training image frame group is output;
  • based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the second difference between the label intermediate image frame corresponding to each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the parameters in the video frame interpolation model are adjusted, and the training is ended once the training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • for any training image frame group, the first image frame and the third image frame in the group are taken as the first image frame and the third image frame respectively, and the first image frame and the third image frame are adjusted simultaneously using the same resolution; the adjustment is performed n-1 times in total, a different resolution is used each time, and n is a positive integer not less than 2;
  • feature extraction is performed on each of the two image frames after each adjustment; each image frame feature group is composed of the features extracted from the two adjusted image frames, and the image frame feature group set is composed of the image frame feature groups;
  • cross-scale alignment is performed from the feature corresponding to the first image frame in the image frame feature group set toward the feature corresponding to the third image frame in the image frame feature group set to obtain the alignment result of the first image frame;
  • a reconstruction process is performed on the result of bidirectional information fusion to obtain an estimated intermediate image frame corresponding to each training image frame group.
  • for the i-th image frame feature group, if i is 1, the feature corresponding to the first image frame in the i-th image frame feature group is aligned with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, cross-scale fusion is performed on the first i-1 bilinear interpolation results and the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and the i-th cross-scale fusion result is aligned with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; the above processing is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment result is taken as the alignment result of the first image frame; where i is a positive integer not less than 1 and not greater than n;
  • among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, where j is a positive integer not less than 1 and less than i.
  • for the i-th image frame feature group, if i is 1, the feature corresponding to the third image frame in the i-th image frame feature group is aligned with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, cross-scale fusion is performed on the first i-1 bilinear interpolation results and the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and the i-th cross-scale fusion result is aligned with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; the above processing is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment result is taken as the alignment result of the third image frame; where i is a positive integer not less than 1 and not greater than n;
  • among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, where j is a positive integer not less than 1 and less than i.
  • the alignment result of the first image frame and the alignment result of the third image frame are fused to obtain a bidirectional information fusion result.
  • for any training image frame group, any t*t block of pixels is selected from the estimated intermediate image frame corresponding to the group; according to the position of the center pixel of the t*t block in the estimated intermediate image frame corresponding to the group, a t*t block of first target pixels is determined in the first image frame of the group and a t*t block of third target pixels is determined in the third image frame of the group; where t is an odd number not equal to 1;
  • according to the first character set, the second character set and the third character set, the similarity between the selected pixel and the first target pixel is determined as the first difference, and the similarity between the selected pixel and the third target pixel is determined as the third difference.
  • the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group is used as the second difference.
  • a computer program product comprising a computer program that, when executed by a processor, implements the following steps:
  • training image frame groups are acquired, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group is used as the label intermediate image frame corresponding to each training image frame group;
  • the first image frame and the third image frame in each training image frame group are input into the video frame interpolation model, and the estimated intermediate image frame corresponding to each training image frame group is output;
  • based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the second difference between the label intermediate image frame corresponding to each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to each training image frame group, the parameters in the video frame interpolation model are adjusted, and the training is ended once the training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
  • for any training image frame group, the first image frame and the third image frame in the group are taken as the first image frame and the third image frame respectively, and the first image frame and the third image frame are adjusted simultaneously using the same resolution; the adjustment is performed n-1 times in total, a different resolution is used each time, and n is a positive integer not less than 2;
  • feature extraction is performed on each of the two image frames after each adjustment; each image frame feature group is composed of the features extracted from the two adjusted image frames, and the image frame feature group set is composed of the image frame feature groups;
  • a reconstruction process is performed on the result of bidirectional information fusion to obtain an estimated intermediate image frame corresponding to each training image frame group.
  • for the i-th image frame feature group, if i is 1, the feature corresponding to the first image frame in the i-th image frame feature group is aligned with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, cross-scale fusion is performed on the first i-1 bilinear interpolation results and the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and the i-th cross-scale fusion result is aligned with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; the above processing is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment result is taken as the alignment result of the first image frame; where i is a positive integer not less than 1 and not greater than n;
  • among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, where j is a positive integer not less than 1 and less than i.
  • for the i-th image frame feature group, if i is 1, the feature corresponding to the third image frame in the i-th image frame feature group is aligned with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, cross-scale fusion is performed on the first i-1 bilinear interpolation results and the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and the i-th cross-scale fusion result is aligned with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; the above processing is repeated for each image frame feature group until all image frame feature groups are processed, and the n-th alignment result is taken as the alignment result of the third image frame; where i is a positive integer not less than 1 and not greater than n;
  • among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, where j is a positive integer not less than 1 and less than i.
  • the alignment result of the first image frame and the alignment result of the third image frame are fused to obtain a bidirectional information fusion result.
  • for any training image frame group, any t*t block of pixels is selected from the estimated intermediate image frame corresponding to the group; according to the position of the center pixel of the t*t block in the estimated intermediate image frame corresponding to the group, a t*t block of first target pixels is determined in the first image frame of the group and a t*t block of third target pixels is determined in the third image frame of the group; where t is an odd number not equal to 1;
  • according to the first character set, the second character set and the third character set, the similarity between the selected pixel and the first target pixel is determined as the first difference, and the similarity between the selected pixel and the third target pixel is determined as the third difference.
  • the RGB value difference between the label intermediate image frame corresponding to any training image frame group and the estimated intermediate image frame corresponding to any training image frame group is used as the second difference.
  • the user information involved in the embodiments of the present application (including but not limited to user device information, user personal information, etc.) and the data involved (including but not limited to data used for analysis, stored data, displayed data, etc.) are information and data authorized by the user or fully authorized by all parties.
  • any reference to memory, storage, database or other media used in the various embodiments provided in the embodiments of the present application may include at least one of non-volatile and volatile memory.
  • Non-volatile memory can include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, etc.
  • the volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory, etc.
  • RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
  • the databases involved in the various embodiments provided in the embodiments of the present application may include at least one of a relational database and a non-relational database.
  • Non-relational databases may include blockchain-based distributed databases, etc., but are not limited thereto.
  • the processors involved in the various embodiments provided in the embodiments of the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., but are not limited thereto.


Abstract

The present application relates to a video frame interpolation model training method and apparatus, and a computer device, a storage medium and a computer program product. The method comprises: acquiring training image frame groups; inputting the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and outputting an estimated intermediate image frame corresponding to each training image frame group; and adjusting parameters in the video frame interpolation model on the basis of a first difference, a second difference and a third difference in each training image frame group until a training stop condition is met, and then ending the training. By using the method in the present application, a high-quality video frame can be effectively generated, thereby improving the frame rate of a video, and improving the fluency of a picture.

Description

Video frame interpolation model training method, device, computer equipment and storage medium
This application claims priority to the Chinese patent application with application number 202111477500.0, entitled "Video frame interpolation model training method, device, computer equipment and storage medium", filed with the China National Intellectual Property Administration on December 06, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of image processing, and in particular to a video frame interpolation model training method, device, computer equipment and storage medium.
Background Art
With the development of image processing technology, the demand for high-quality video with a high refresh rate is growing rapidly, and video frame interpolation model training techniques have therefore emerged. The main purpose of video frame interpolation is to improve the smoothness of the picture by increasing the frame rate. Today, video frame interpolation has been applied in various fields. For example, with the development of mobile phone hardware, screen refresh rates have been greatly improved, and earlier video content needs its frame rate increased to match the highest refresh rate the hardware can support. In animation production, a video frame interpolation method is likewise needed to obtain a smoother video clip from a small number of key image frames.
In the related art, it is difficult to accurately capture the temporal correspondence of objects with large displacements, which easily produces blurred frame interpolation results. In addition, the related art relies on supervised learning for model training, while the supervision image is only one possible solution. There is therefore a one-to-many mapping between the input and the output of the video frame interpolation model, and pixel-level one-to-one supervision leads to an over-constraint problem: the output tends toward averaged content, resulting in over-smoothed images and unclear textures in the generated intermediate image frames.
Summary of the Invention
On this basis, it is necessary to provide, in view of the above technical problems, a video frame interpolation model training method, device, computer equipment, computer-readable storage medium and computer program product.
In a first aspect, an embodiment of the present application provides a video frame interpolation model training method, including:
acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and ending the training when a training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
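One way to realize the asymmetric influence of the three differences on the parameter update is to combine them into a single weighted objective. The following is a minimal sketch of that idea; the concrete weight values are illustrative assumptions, since the application only requires the second difference to carry the greater degree of correlation with the parameter adjustment.
```python
def total_loss(first_diff, second_diff, third_diff, w_label=1.0, w_side=0.1):
    """Weighted combination of the three differences. The label (second)
    difference dominates the gradient, matching the requirement that its
    correlation with the parameter adjustment is the greatest; the values
    1.0 and 0.1 are illustrative assumptions."""
    assert w_label > w_side
    return w_label * second_diff + w_side * (first_diff + third_diff)
```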
In some embodiments, inputting the first image frame and the third image frame in each training image frame group into the video frame interpolation model and outputting the estimated intermediate image frame corresponding to each training image frame group includes:
for any training image frame group, taking the first image frame and the third image frame in the group as the first image frame and the third image frame respectively, and adjusting the first image frame and the third image frame simultaneously using the same resolution, where the adjustment is performed n-1 times in total, a different resolution is used each time, and n is a positive integer not less than 2;
performing feature extraction on each of the two image frames after each adjustment, where each image frame feature group is composed of the features extracted from the two adjusted image frames, and the image frame feature group set is composed of the image frame feature groups (see the sketch following these steps);
performing cross-scale alignment from the feature corresponding to the first image frame in the image frame feature group set toward the feature corresponding to the third image frame in the image frame feature group set to obtain the alignment result of the first image frame;
performing cross-scale alignment from the feature corresponding to the third image frame in the image frame feature group set toward the feature corresponding to the first image frame in the image frame feature group set to obtain the alignment result of the third image frame;
performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result;
performing reconstruction processing on the bidirectional information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
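A minimal sketch of the multi-scale adjustment and feature extraction step is given below. The halving of the resolution at each adjustment and the shared encoder are assumptions; the application only fixes that n-1 differently sized adjustments are made and that one feature group is extracted per scale.
```python
import torch.nn.functional as F

def build_feature_groups(frame1, frame3, encoder, n=4):
    """Build the image frame feature group set: resize both frames and
    extract features from each pair with a shared `encoder` (any network
    mapping (B, 3, H, W) -> (B, C, H, W)). Scales are ordered coarse to
    fine so that resolutions increase with the group index, as required
    by the cross-scale alignment step."""
    groups = []
    for k in range(n - 1, -1, -1):  # n - 1 resized versions plus the original
        if k == 0:
            f1, f3 = frame1, frame3
        else:
            f1 = F.interpolate(frame1, scale_factor=0.5 ** k,
                               mode='bilinear', align_corners=False)
            f3 = F.interpolate(frame3, scale_factor=0.5 ** k,
                               mode='bilinear', align_corners=False)
        groups.append((encoder(f1), encoder(f3)))
    return groups
```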
In some embodiments, the resolutions corresponding to the image frame feature groups in the image frame feature group set increase in order; performing cross-scale alignment from the feature corresponding to the first image frame in the image frame feature group set toward the feature corresponding to the third image frame in the image frame feature group set to obtain the alignment result of the first image frame includes:
for the i-th image frame feature group, if i is 1, aligning the feature corresponding to the first image frame in the i-th image frame feature group with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results and the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups are processed, and taking the n-th alignment result as the alignment result of the first image frame, where i is a positive integer not less than 1 and not greater than n;
where, among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
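The coarse-to-fine loop below sketches this cross-scale alignment for one direction. The channel counts, the factor-of-2 relation between adjacent scales, and the use of plain convolutions for the `fuse` and `align` operators are all assumptions; in particular, the alignment operator in the actual model may be more elaborate (for example, deformable convolution).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAlign(nn.Module):
    """Minimal sketch of the cross-scale alignment, assuming C channels
    at every scale and a factor of 2 between adjacent scales."""
    def __init__(self, c, n_scales):
        super().__init__()
        self.align = nn.ModuleList(
            nn.Conv2d(2 * c, c, 3, padding=1) for _ in range(n_scales))
        self.fuse = nn.ModuleList(
            nn.Conv2d(i * c + c, c, 3, padding=1) for i in range(1, n_scales))

    def forward(self, feats_a, feats_b):
        # feats_a / feats_b: lists of n features per frame, coarse -> fine.
        results = []
        for i, (fa, fb) in enumerate(zip(feats_a, feats_b)):
            if i == 0:
                x = fb  # first scale: align fa directly toward fb
            else:
                # the j-th earlier result is bilinearly interpolated
                # (i - j) consecutive times to reach the i-th resolution
                ups = [F.interpolate(r, scale_factor=2 ** (i - j),
                                     mode='bilinear', align_corners=False)
                       for j, r in enumerate(results)]
                x = self.fuse[i - 1](torch.cat(ups + [fb], dim=1))
            results.append(self.align[i](torch.cat([x, fa], dim=1)))
        return results[-1]  # n-th result = alignment result of frame a
```
Swapping `feats_a` and `feats_b` gives the alignment result of the third image frame, mirroring the two alignment directions described above.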
In some embodiments, the resolutions corresponding to the image frame feature groups in the image frame feature group set increase in order; performing cross-scale alignment from the feature corresponding to the third image frame in the image frame feature group set toward the feature corresponding to the first image frame in the image frame feature group set to obtain the alignment result of the third image frame includes:
for the i-th image frame feature group, if i is 1, aligning the feature corresponding to the third image frame in the i-th image frame feature group with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results and the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups are processed, and taking the n-th alignment result as the alignment result of the third image frame, where i is a positive integer not less than 1 and not greater than n;
where, among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by performing bilinear interpolation on the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result includes:
convolving the alignment result of the first image frame with the alignment result of the third image frame to obtain a convolution result;
computing a fusion weight from the convolution result;
fusing the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain the bidirectional information fusion result.
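A minimal sketch of this weighted fusion follows. The sigmoid used to turn the convolution result into a per-pixel weight is an assumption; the application only states that the fusion weight is computed from the convolution result.
```python
import torch
import torch.nn as nn

class BiFusion(nn.Module):
    """Two-way information fusion sketch: a convolution over the two
    alignment results produces a per-pixel weight that blends them."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, aligned_first, aligned_third):
        w = torch.sigmoid(
            self.conv(torch.cat([aligned_first, aligned_third], dim=1)))
        return w * aligned_first + (1.0 - w) * aligned_third
```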
In some embodiments, the first difference and the third difference are both similarities; the process of determining the first difference and the third difference includes:
for any training image frame group, selecting any t*t block of pixels from the estimated intermediate image frame corresponding to the group, and, according to the position of the center pixel of the t*t block in the estimated intermediate image frame corresponding to the group, determining a t*t block of first target pixels in the first image frame of the group and a t*t block of third target pixels in the third image frame of the group, where t is an odd number not equal to 1;
determining a first character set according to the t*t first target pixels, a third character set according to the t*t third target pixels, and a second character set according to the selected t*t pixels;
determining, according to the first character set, the second character set and the third character set, the similarity between the selected pixel and the first target pixel as the first difference, and the similarity between the selected pixel and the third target pixel as the third difference.
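One concrete way to build such per-pixel character sets is a census-style binary descriptor over each t*t neighbourhood; the similarity is then the distance between descriptors. This is a sketch under that assumption, since the application does not fix the exact encoding.
```python
import torch
import torch.nn.functional as F

def census_descriptor(img, t=7):
    """Character set for every pixel: the sign of the difference between
    each pixel in its t*t neighbourhood and the centre pixel.
    img: (B, 1, H, W) grayscale; returns (B, t*t, H, W) in {0, 1}.
    t = 7 is an illustrative choice of the odd patch size."""
    b, _, h, w = img.shape
    patches = F.unfold(img, kernel_size=t, padding=t // 2)  # (B, t*t, H*W)
    patches = patches.view(b, t * t, h, w)
    return (patches > img).float()

def descriptor_distance(d_pred, d_ref):
    """Per-pixel distance between two descriptor maps; smaller means more
    similar (used as the first and third differences)."""
    return (d_pred - d_ref).abs().sum(dim=1)  # (B, H, W)
```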
In some embodiments, the process of determining the second difference includes:
for any training image frame group, determining, according to the RGB values of all pixels in the second image frame of the group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the group, the RGB value difference between the second image frame of the group and the estimated intermediate image frame corresponding to the group as the second difference.
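A standard realization of this RGB value difference is a pixel-wise reconstruction loss; the L1 norm below is an assumption, since the application does not fix the exact norm.
```python
def second_difference(label_mid, pred_mid):
    """Mean absolute RGB difference between the label intermediate image
    frame and the estimated intermediate image frame.
    Both tensors: (B, 3, H, W)."""
    return (label_mid - pred_mid).abs().mean()
```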
In some embodiments, the method further includes:
comparing the first difference with the third difference to determine the best matching pixel for the center pixel of the t*t block, and computing, according to the best matching pixel, the texture consistency loss of the center pixel of the t*t block through a texture consistency loss function, where the texture consistency loss is used to train the video frame interpolation model.
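Building on the census-style sketch above, the texture consistency loss can be realized as follows: for every pixel, whichever of the two input frames matches the estimated frame better (smaller first or third difference) supplies the supervision target. The single-position comparison and the L1 penalty are assumptions; the actual method may search a neighbourhood around each position.
```python
import torch

def texture_consistency_loss(pred, first, third, t=7):
    """Texture consistency loss sketch (reuses census_descriptor and
    descriptor_distance from the sketch above): penalise the distance
    between the estimated frame and, per pixel, its best matching input
    frame. All inputs: (B, 3, H, W); grayscale by channel mean."""
    gray = lambda x: x.mean(dim=1, keepdim=True)
    d_pred = census_descriptor(gray(pred), t)
    diff1 = descriptor_distance(d_pred, census_descriptor(gray(first), t))
    diff3 = descriptor_distance(d_pred, census_descriptor(gray(third), t))
    # pick, per pixel, the input frame whose texture matches better
    best = torch.where((diff1 <= diff3).unsqueeze(1), first, third)
    return (pred - best.detach()).abs().mean()
```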
In some embodiments, the method further includes:
acquiring two image frames to be processed in a video to be processed;
inputting the two image frames to be processed into the trained video frame interpolation model to obtain an intermediate image frame of the two image frames to be processed.
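At inference time this amounts to a single forward pass; a hypothetical usage sketch follows, where the names `model`, `frame_a` and `frame_b` are placeholders rather than identifiers from the application.
```python
import torch

# frame_a, frame_b: two image frames to be processed, as (1, 3, H, W)
# tensors; `model` is the trained video frame interpolation model.
model.eval()
with torch.no_grad():
    mid_frame = model(frame_a, frame_b)  # estimated intermediate frame
```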
In a second aspect, an embodiment of the present application further provides a video frame interpolation model training device, including:
an acquisition module, used to acquire training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
a video frame interpolation module, used to input the first image frame and the third image frame in each training image frame group into a video frame interpolation model and output the estimated intermediate image frame corresponding to each training image frame group;
an adjustment module, used to adjust the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and to end the training when a training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
In a third aspect, an embodiment of the present application further provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and ending the training when a training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:
acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and ending the training when a training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
In a fifth aspect, an embodiment of the present application further provides a computer program product. The computer program product includes a computer program which, when executed by a processor, implements the following steps:
acquiring training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting the parameters in the video frame interpolation model based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and ending the training when a training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
Compared with the related art, which adjusts the parameters in the video frame interpolation model only by comparing the difference between the second image frame and the estimated intermediate image frame, additionally comparing the first image frame and the third image frame with the estimated intermediate image frame when adjusting the parameters makes the texture of the intermediate image frame output by the video frame interpolation model clearer and closer to the texture structure of the input image frames, avoiding the generation of blurred content with unclear textures.
Brief Description of the Drawings
FIG. 1 is a diagram of an application environment of a video frame interpolation model training method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a video frame interpolation model training method in an embodiment of the present application;
FIG. 3 is a schematic diagram of the reconstruction processing of the video frame interpolation model training method in an embodiment of the present application;
FIG. 4 is a schematic diagram of the cross-scale alignment processing of the video frame interpolation model training method in an embodiment of the present application;
FIG. 5 is a schematic diagram of the matching process of the video frame interpolation model training method in an embodiment of the present application;
FIG. 6 is a schematic diagram of the training process of the video frame interpolation model training method in an embodiment of the present application;
FIG. 7a shows comparative evaluation results of single-frame video interpolation in an embodiment of the present application;
FIG. 7b shows comparative evaluation results of multi-frame video interpolation in an embodiment of the present application;
FIG. 7c shows comparative evaluation results of single-frame video extrapolation in an embodiment of the present application;
FIG. 7d is a comparison of visual effects after integrating the trained video frame interpolation model into a video super-resolution model in an embodiment of the present application;
FIG. 7e is a visual comparison of single-frame video interpolation in an embodiment of the present application;
FIG. 7f is a visual comparison of multi-frame video interpolation in an embodiment of the present application;
FIG. 7g is a visual comparison of single-frame video extrapolation in an embodiment of the present application;
FIG. 7h is a comparison of the influence of single-frame video interpolation on video super-resolution in an embodiment of the present application;
FIG. 7i is a single visual comparison with the TCL loss function added in an embodiment of the present application;
FIG. 7j shows multiple visual comparisons with the TCL loss function added in an embodiment of the present application;
FIG. 8 is a structural block diagram of a video frame interpolation model training device in an embodiment of the present application;
FIG. 9 is an internal structure diagram of a computer device in an embodiment of the present application.
Detailed Description of the Embodiments
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the embodiments of the present application and are not intended to limit them.
It can be understood that the terms "first", "second" and the like used in the embodiments of the present application may be used herein to describe various technical terms, but unless otherwise specified, these technical terms are not limited by the terms, which are only used to distinguish one term from another. For example, without departing from the scope of the embodiments of the present application, the third preset threshold and the fourth preset threshold may be the same or different.
The video frame interpolation model training method provided in the embodiments of the present application can be applied to the application environment shown in FIG. 1, where the terminal 101 communicates with the server 102 through a network. The data storage system may store the data that the server 102 needs to process, and may be integrated on the server 102 or placed on the cloud or another network server. The terminal 101 acquires the training image frame groups, and the server processes them. Of course, in an actual implementation, the processing function of the server 102 may also be integrated directly into the terminal 101; that is, the terminal 101 acquires the training image frames and processes them to obtain the trained video frame interpolation model. The terminal 101 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an Internet-of-Things device or a portable wearable device. The server 102 may be implemented as an independent server or as a server cluster composed of multiple servers.
In some embodiments, as shown in FIG. 2, a video frame interpolation model training method is provided. The method being applied to the terminal 101 in FIG. 1 is taken as an example for description, and the method includes the following steps:
201. Acquire training image frame groups, where each training image frame group is composed of three consecutive image frames of a video arranged in order, and the second image frame in each training image frame group serves as the label intermediate image frame corresponding to that training image frame group.
202. Input the first image frame and the third image frame in each training image frame group into a video frame interpolation model, and output the estimated intermediate image frame corresponding to each training image frame group.
203. Based on the first difference between the first image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, the second difference between the second image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, and the third difference between the third image frame in each training image frame group and the estimated intermediate image frame corresponding to that group, adjust the parameters in the video frame interpolation model, and end the training when a training stop condition is satisfied; where the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
In step 201 above, a training image frame group refers to a group formed, after image frames are extracted from the video, by every three consecutive extracted image frames. The three image frames in each training image frame group are arranged in the order in which they appear in the video. In addition, the video is not limited to a single video and may also be multiple different videos, so the obtained training image frame groups may come from one video or from multiple videos.

The second image frame in each image frame group is the intermediate image frame between the first and third image frames of that group, and its content forms the connecting link between them. Therefore, in this embodiment the second image frame is used as the label intermediate image frame corresponding to each training image frame group; with the label intermediate image frame serving as the supervision image for each group, the video frame interpolation model can be trained in a supervised manner.
In step 202 above, after the first image frame and the third image frame of each training image frame group are input into the video frame interpolation model, an estimated intermediate image frame corresponding to that group is obtained. The content of the estimated intermediate image frame is derived by processing the contents of the first and third image frames, and it is similar to the content of the label intermediate image frame corresponding to that group.

It is worth mentioning that the second image frame of each training image frame group is only one possible solution for the transition from the first image frame to the third image frame of that group. For example, suppose a video captures a small ball moving from point A through points B and C to point E. If the first image frame of a certain training image frame group shows the ball at point A and the third image frame shows it at point E, the second image frame may show the ball at point B; yet the ball also passed through point C while moving, and its position at point C was not captured, because a video is composed of discrete still image frames. A video therefore cannot reflect the temporally continuous movement of the ball; the captured movement only reflects that the ball was at a certain position at a certain moment.
In step 203 above, the training stop condition refers to the following: during training, the parameters of the video frame interpolation model are continually adjusted, and when the rate of change of these parameters no longer exceeds a predetermined range, the model satisfies the training stop condition.

Specifically, when the video frame interpolation model is trained on each training image frame group, a supervision function is added in this embodiment so that the model adjusts its parameters during training and is continuously optimized. The supervision function is divided into two parts. The first part is the first loss function, determined by the second difference between the label intermediate image frame corresponding to each training image frame group and the corresponding estimated intermediate image frame. The second part is the texture consistency loss function (Texture Consistency Loss, TCL), determined by the first difference between the first image frame in each training image frame group and the corresponding estimated intermediate image frame, and by the third difference between the third image frame in each group and the corresponding estimated intermediate image frame.
Furthermore, in step 203 above, the statement that the degree of association between the second difference and the parameter adjustment is greater than the degree of association between the first or third difference and the parameter adjustment means that, within the supervision function, the first loss function has a greater degree of association with the parameter adjustment than the texture consistency loss function does.

The supervision function can be expressed as formula (1):
$$\mathcal{L} = L_1\big(\hat{I}_0,\, I_0\big) + \alpha\, L_p\big(\hat{I}_0,\, I_{-1},\, I_1\big) \tag{1}$$

In formula (1), $\hat{I}_0$ denotes the estimated intermediate image frame corresponding to each training image frame group, $I_0$ denotes the label intermediate image frame corresponding to each group, $I_{-1}$ denotes the first image frame in each group, $I_1$ denotes the third image frame in each group, $\alpha$ is an adjustable coefficient, $L_1$ is the first loss function, and $L_p$ is the texture consistency loss function.
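For illustration only, the following is a minimal PyTorch-style sketch of how the supervision of formula (1) might be assembled. The function and parameter names, the choice of a pixel-wise L1 term for the first loss function, and the example value of `alpha` are assumptions made for this sketch and are not fixed by the disclosure.

```python
import torch.nn.functional as F

def supervision_loss(pred_mid, label_mid, frame_first, frame_third,
                     tcl_fn, alpha=0.5):
    # First loss function L1: driven by the second difference, i.e. the
    # gap between the estimated and the label intermediate image frame.
    l1 = F.l1_loss(pred_mid, label_mid)
    # Texture consistency loss Lp: driven by the first and third differences,
    # comparing the estimate against the two input frames; tcl_fn is a
    # placeholder for the census-based matching described later.
    lp = tcl_fn(pred_mid, frame_first, frame_third)
    # A coefficient below 1 keeps the label term dominant, matching the
    # requirement that the second difference associates more strongly with
    # the parameter adjustment than the first or third difference.
    return l1 + alpha * lp
```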
In the method provided by the embodiments of the present application, a texture consistency loss function is added on top of the original supervision function, so that during supervised training the video frame interpolation model considers not only the content of the label intermediate image frame corresponding to each training image frame group but also the content of the first and third image frames in each group. This mitigates the over-constraint problem in supervised training, yielding output image frames with higher texture clarity, signal-to-noise ratio, and structural similarity, thereby raising the frame rate of the video and improving the smoothness of the picture.
In some embodiments, inputting the first image frame and the third image frame of each training image frame group into the video frame interpolation model and outputting the estimated intermediate image frame corresponding to each group includes:

301. For any training image frame group, take the first image frame and the third image frame of the group as the first image frame and the third image frame respectively, and adjust the first and third image frames simultaneously using the same resolution; a total of n-1 adjustments are performed, with a different resolution each time, where n is a positive integer not less than 2.

302. Perform feature extraction on the two image frames after each adjustment, with the features extracted from the two adjusted image frames forming one image frame feature group, and the image frame feature groups together forming an image frame feature group set.

303. Perform cross-scale alignment processing from the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the set, to obtain the alignment result of the first image frame.

304. Perform cross-scale alignment processing from the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the set, to obtain the alignment result of the third image frame.

305. Perform bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame, to obtain a bidirectional information fusion result.

306. Perform reconstruction processing on the bidirectional information fusion result, to obtain the estimated intermediate image frame corresponding to each training image frame group.
Specifically, for any training image frame group, before feature extraction is performed on the first and third image frames, n-1 resolution adjustments are applied to them; each adjustment yields first and third image frames of lower resolution than before that adjustment.

For example, the third resolution adjustment of the first and third image frames reduces the resolution of the first and third image frames obtained after the second adjustment, so the resolutions obtained after the third adjustment are smaller than those obtained after the second. In addition, the number of resolution adjustments applied to the first and third image frames should be no less than 1.

The resolution-adjusted image frames are then grouped, with frames of the same resolution forming one group. Since n-1 resolution adjustments are performed, and the original, unadjusted image frame group is added, there are n groups of image frames of different resolutions in total. Feature extraction is then performed on each of the n groups, yielding a set of n image feature groups.
In addition, this embodiment does not specifically limit the method of obtaining the image frame feature group set at different resolutions. Besides the implementation of steps 301 and 302 above, the following is also possible: for any training image frame group, take its first and third image frames as the first image frame and the third image frame respectively, and extract features from both at the same resolution simultaneously, with the features extracted from the two image frames at each step forming one image frame feature group and the groups forming the image frame feature group set; features are extracted n times in total, each time at a different resolution, where n is a positive integer not less than 2. Specifically, convolution may be used to perform resolution adjustment and feature extraction on the first and third image frames at the same time, thereby obtaining the image frame feature group set of step 302 above, as sketched below.
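As a concrete illustration of the convolutional variant, here is a minimal PyTorch sketch; the channel count, the number of levels, and the use of stride-2 convolutions for the resolution adjustments are assumptions of the sketch rather than details fixed by the disclosure.

```python
import torch.nn as nn

class PyramidExtractor(nn.Module):
    """Extracts an n-level feature pyramid from one image frame; applying it
    to both the first and the third frame yields the image frame feature
    group set (one group per resolution)."""
    def __init__(self, channels=64, n_levels=3):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, stride=1, padding=1)
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(n_levels - 1))

    def forward(self, frame):
        feats = [self.head(frame)]         # features at the original resolution
        for conv in self.down:             # n-1 resolution adjustments
            feats.append(conv(feats[-1]))  # each level halves the resolution
        return feats[::-1]                 # ordered lowest to highest resolution
```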
The order in which the alignment result of the first image frame and the alignment result of the third image frame are obtained is not specifically limited in the embodiments of the present application: the alignment result of the first image frame may be obtained first and then that of the third image frame, or vice versa, or both alignment results may be obtained at the same time.

In addition, the process of cross-scale alignment from the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame is the same as the process of cross-scale alignment from the features corresponding to the third image frame toward the features corresponding to the first image frame.
In step 306 above, reconstruction processing refers to regressing the estimated intermediate image frame from the bidirectional information fusion result. Specifically, the bidirectional information fusion result is processed first, the processing result is then fed into a single-layer convolution, and the estimated intermediate image frame is finally output.
For example, as shown in FIG. 3, the bidirectional information fusion result $F_0$ is first input into the first layer (Layer1) for processing, the processing result is then input into the second layer (Layer2) for single-layer convolution, and the estimated intermediate image frame $\hat{I}_0$ is finally output. In FIG. 3, "40×RB(128)" indicates that 40 residual blocks "RB(128)" are used, where RB(128) denotes a residual block with a channel dimension of 128. "Conv(128,3,3,1)" denotes a single-layer convolution with 128 input channels, 3 output channels, a convolution kernel of 3, and a convolution stride of 1.
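A minimal PyTorch sketch of this reconstruction head follows, mirroring the "40×RB(128)" and "Conv(128,3,3,1)" description above; the internal layout of each residual block is an assumption of the sketch.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class ReconstructionHead(nn.Module):
    """Layer1: 40 residual blocks with channel dimension 128.
    Layer2: a single Conv(128, 3, 3, 1) regressing the RGB frame."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(*[ResidualBlock(128) for _ in range(40)])
        self.layer2 = nn.Conv2d(128, 3, kernel_size=3, stride=1, padding=1)

    def forward(self, fused):  # fused: bidirectional information fusion result F_0
        return self.layer2(self.layer1(fused))
```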
In the method provided by the embodiments of the present application, the first image frame and the third image frame of each training image group are input into the video frame interpolation model and the estimated intermediate image frame corresponding to each training image group is output; training the video frame interpolation model in this way allows its parameters to be adjusted, thereby improving the quality of the image frames the model outputs.
In combination with the content of the above embodiments, in some embodiments the resolutions corresponding to the image frame feature groups in the image frame feature group set increase in order. Performing cross-scale alignment (Cross-scale Pyramid Alignment) processing from the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame, to obtain the alignment result of the first image frame, includes:

For the i-th image frame feature group: if i is 1, perform alignment (Alignment Block, AB) processing between the features corresponding to the first image frame in the i-th group and the features corresponding to the third image frame in the i-th group, to obtain the i-th alignment processing result; if i is not 1, perform cross-scale fusion (Cross-scale Fusion, CSF) processing on the first i-1 bilinear upsampling (Bilinear Upsampling, BU) results and the features corresponding to the third image frame in the i-th group, to obtain the i-th cross-scale fusion result, then perform alignment processing between the i-th cross-scale fusion result and the features corresponding to the first image frame in the i-th group, to obtain the i-th alignment processing result. Repeat the above processing for each image frame feature group until all groups have been processed, and take the n-th alignment processing result as the alignment result of the first image frame, where i is a positive integer not less than 1 and not greater than n.

Here, for the j-th bilinear interpolation result among the first i-1 bilinear interpolation results, the j-th result is obtained by performing i-j consecutive bilinear interpolation computations on the j-th alignment processing result, where j is a positive integer not less than 1 and less than i.
Specifically, this embodiment does not specifically limit the resolutions corresponding to the image frame feature groups in the image frame feature group set. Moreover, during cross-scale alignment, the number of alignment operations equals the number of image frame feature groups in the set obtained in step 302 above.

For example, if the image frame feature group set contains 4 image frame feature groups, then 4 alignment operations are required when performing cross-scale alignment from the features corresponding to the first image frame toward the features corresponding to the third image frame.

In addition, the number of alignment operations in this cross-scale alignment process should be no less than 2. A sketch of the iteration follows.
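The following sketch illustrates one way the iteration above could be driven, assuming the feature lists are ordered from lowest to highest resolution and that `align_block` and `csf_block` stand in for the AB and CSF operations; collapsing the i-j consecutive bilinear interpolations into a single resize is a simplification of this sketch.

```python
import torch.nn.functional as F

def cross_scale_align(src_feats, ref_feats, align_block, csf_block):
    """Returns the n-th alignment processing result, i.e. the alignment
    result of the source frame's features toward the reference frame's."""
    results = []
    for i, (src, ref) in enumerate(zip(src_feats, ref_feats)):
        if i == 0:
            # i == 1 in the text: plain alignment at the lowest resolution
            aligned = align_block(src, ref)
        else:
            # bilinearly upsample the previous i-1 alignment results to scale i
            ups = [F.interpolate(r, size=ref.shape[-2:], mode='bilinear',
                                 align_corners=False) for r in results]
            fused = csf_block(ups, ref)       # i-th cross-scale fusion result
            aligned = align_block(src, fused) # i-th alignment processing result
        results.append(aligned)
    return results[-1]
```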
The process of cross-scale alignment from the features corresponding to the first image frame toward the features corresponding to the third image frame is illustrated with an image frame feature group set containing 3 image frame feature groups. As shown in (a) of FIG. 4, $\{f_{-1}^0, f_1^0\}$ is the image frame feature group with the highest resolution in the set, i.e., its resolution is the same as that of the image frames before resolution adjustment; $\{f_{-1}^1, f_1^1\}$ is the group with the second-highest resolution; and $\{f_{-1}^2, f_1^2\}$ is the group with the lowest resolution. Here $f_{-1}^0$, $f_{-1}^1$ and $f_{-1}^2$ denote the three features extracted from the first image frame over the course of its two resolution adjustments, and $f_1^0$, $f_1^1$ and $f_1^2$ denote the three features extracted from the third image frame over the course of its two resolution adjustments. $\hat{f}_{-1}$ denotes the alignment result of the first image frame.
The alignment processing is shown in (b) of FIG. 4. First, the two input image frame features are concatenated (Concatenation); the concatenation result is then fed sequentially into a single-layer convolution "Conv3×3", 5 serial residual blocks "Res.block×5", and another convolution layer "Conv3×3", obtaining the weight tensors $\Delta p^l$ and $\Delta m^l$ (in this notation, the offsets and modulation weights of a deformable convolution). Finally, deformable convolution processing is applied to obtain the result of this alignment, $\hat{f}^l$, where $l$ is the number of resolution adjustment operations.
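A minimal sketch of such an alignment block is given below, using torchvision's deformable convolution. The channel width, the splitting of the predicted tensor into offsets and sigmoid-activated modulation masks, and the residual block layout are assumptions of the sketch rather than details fixed by the disclosure.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

def res_block(c):
    # minimal residual block used inside the offset/mask predictor
    class RB(nn.Module):
        def __init__(self):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 3, padding=1))
        def forward(self, x):
            return x + self.body(x)
    return RB()

class AlignmentBlock(nn.Module):
    """Concatenate the two input features, pass through Conv3x3 ->
    5 residual blocks -> Conv3x3 to predict offsets and modulation masks,
    then warp the source feature with a deformable convolution."""
    def __init__(self, c=64, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1),
            *[res_block(c) for _ in range(5)],
            nn.Conv2d(c, 3 * k * k, 3, padding=1))  # 2*k*k offsets + k*k masks
        self.dcn = DeformConv2d(c, c, k, padding=k // 2)

    def forward(self, src, ref):
        t = self.pred(torch.cat([src, ref], dim=1))
        offset = t[:, : 2 * self.k * self.k]
        mask = torch.sigmoid(t[:, 2 * self.k * self.k :])
        return self.dcn(src, offset, mask)          # aligned source feature
```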
In the method provided by the embodiments of the present application, aligning features of the same resolution and adding a cross-scale fusion step allows effective reconstruction signals to be extracted from image frames at multiple scales, improving the precision of the output alignment result of the first image frame and making comprehensive, effective use of multi-scale information.
In some embodiments, performing cross-scale alignment processing from the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame, to obtain the alignment result of the third image frame, includes:

For the i-th image frame feature group: if i is 1, perform alignment processing between the features corresponding to the third image frame in the i-th group and the features corresponding to the first image frame in the i-th group, to obtain the i-th alignment processing result; if i is not 1, perform cross-scale fusion processing on the first i-1 bilinear interpolation results and the features corresponding to the first image frame in the i-th group, to obtain the i-th cross-scale fusion result, then perform alignment processing between the i-th cross-scale fusion result and the features corresponding to the third image frame in the i-th group, to obtain the i-th alignment processing result. Repeat the above processing for each image frame feature group until all groups have been processed, and take the n-th alignment processing result as the alignment result of the third image frame.
It should be noted that the processing for obtaining the alignment result of the third image frame is the same as that for obtaining the alignment result of the first image frame, so it is not described again here; for the specific processing, refer to the above process of obtaining the alignment result of the first image frame.
With the method provided by the embodiments of the present application, the alignment result of the third image frame can be obtained by performing cross-scale alignment processing from the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame.
In combination with the content of the above embodiments, in some embodiments, performing bidirectional information fusion (Attention-based Fusion) on the alignment result of the first image frame and the alignment result of the third image frame, to obtain the bidirectional information fusion result, includes:

401. Convolve the alignment result of the first image frame and the alignment result of the third image frame, to obtain a convolution result.

402. Compute on the convolution result, to obtain a fusion weight.

403. According to the fusion weight, perform fusion processing on the alignment result of the first image frame and the alignment result of the third image frame, to obtain the bidirectional information fusion result.
In step 402 above, this embodiment does not specifically limit the computation applied to the convolution result, which includes but is not limited to the Sigmoid function.

Specifically, the alignment result of the first image frame and the alignment result of the third image frame are first passed through a single-layer convolution to obtain the convolution result; the convolution result is then activated by a function to obtain the fusion weight; and the alignment results of the first and third image frames are then combined according to the fusion weight as in formula (2), yielding the bidirectional information fusion result.
$$F_0 = M \odot \hat{f}_{-1} + (1 - M) \odot \hat{f}_1 \tag{2}$$

In formula (2), $M$ is the fusion weight, $\hat{f}_{-1}$ is the alignment result of the first image frame, $\hat{f}_1$ is the alignment result of the third image frame, $F_0$ is the bidirectional information fusion result, and $\odot$ denotes element-wise multiplication.
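For illustration, a minimal PyTorch sketch of this attention-based fusion follows; the channel count and kernel size are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Single-layer convolution over both alignment results, Sigmoid
    activation to obtain the fusion weight M, then the blend of formula (2)."""
    def __init__(self, c=64):
        super().__init__()
        self.conv = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, aligned_first, aligned_third):
        m = torch.sigmoid(self.conv(
            torch.cat([aligned_first, aligned_third], dim=1)))
        # F_0 = M * f_hat(-1) + (1 - M) * f_hat(1), element-wise
        return m * aligned_first + (1.0 - m) * aligned_third
```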
With the method provided by the embodiments of the present application, the bidirectional information fusion result can be obtained by performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame, which improves the quality of the image frames output by the video frame interpolation model.
In combination with the content of the above embodiments, in some embodiments both the first difference and the third difference are similarities, and the process of determining them includes:

501. For any training image frame group, select any t*t patch of pixels from the estimated intermediate image frame corresponding to the group and, according to the position of the patch's central pixel in that estimated intermediate image frame, determine t*t first target pixels in the first image frame of the group and t*t third target pixels in the third image frame of the group, where t is an odd number not equal to 1.

502. Determine a first character set according to the t*t first target pixels; determine a third character set according to the t*t third target pixels; and determine a second character set according to the t*t patch of pixels.

503. According to the first, second, and third character sets, determine the similarity between the patch and the first target pixels as the first difference, and the similarity between the patch and the third target pixels as the third difference.
Specifically, for any training image frame group, after the video frame interpolation model outputs the corresponding estimated intermediate image frame $\hat{I}_0$, any t*t pixel block $\hat{f}_x$ is selected from $\hat{I}_0$, where t is an odd number other than 1 (for example, 3, 5, 7, etc.) and x is the two-dimensional coordinate of the central pixel of the block. Then, according to the two-dimensional coordinate x of the central pixel of the block $\hat{f}_x$, the first target pixels are determined in the first image frame of the training image frame group and the third target pixels are determined in the third image frame of the group.
That is, based on the two-dimensional coordinate x, the t*t first target pixels are determined in the first image frame of the training image frame group and the t*t third target pixels are determined in the third image frame; the t*t first target pixels and the t*t third target pixels serve as the pixels to be matched.
Figure PCTCN2022105652-appb-000022
和待匹配像素经过CT(Census Transform)变换,确定第一字符集合、第一字符集合及第三字符集合。最后,根据第一字符集合、第二字符集合及第三字符集合,确定任一t*t的像素与第一目标像素之间的相似度,将其作为第一差异,确定任一t*t的像素与第三目标像素之间的相似度,将其作为第三差异。
Will
Figure PCTCN2022105652-appb-000022
and the pixels to be matched undergo CT (Census Transform) transformation to determine the first character set, the first character set and the third character set. Finally, according to the first character set, the second character set and the third character set, determine the similarity between any t*t pixel and the first target pixel, and use it as the first difference to determine any t*t The similarity between the pixel of and the third target pixel is taken as the third difference.
The first difference and the third difference are compared to determine the best matching pixel for the central pixel of the t*t block; that is, based on the first and third differences, the best match for the central pixel is selected from the t*t first target pixels and the t*t third target pixels. Then, according to the best matching pixel, the texture consistency loss of the central pixel at x is computed with the texture consistency loss function (Texture Consistency Loss, TCL), and the video frame interpolation model is trained under this supervision.

In combination with the above, once the best matching pixel corresponding to the central pixel of the t*t block has been selected from the t*t first target pixels and the t*t third target pixels, the texture consistency loss for that central pixel can be computed.

Here, the texture consistency loss is determined by comparing the RGB values of the central pixel of the t*t block and the best matching pixel.
Taking the selection of any 3*3 pixel block $\hat{f}_x$ from the estimated intermediate image frame $\hat{I}_0$ as an example, the method provided by the embodiments of the present application is described as follows:

(1) For any 3*3 pixel block $\hat{f}_x$ in $\hat{I}_0$ (x denotes the two-dimensional coordinate of the centre of this image block), the best matching pixel $f_{y^*}^{t^*}$ must be found from the first image frame $I_{-1}$ and the third image frame $I_1$ through a matching algorithm.

(2) The best matching pixel $f_{y^*}^{t^*}$ is used to supervise the estimated $\hat{f}_x$, where $t^* \in \{-1, 1\}$ denotes the label of the image frame, i.e., whether the best matching pixel comes from the first image frame or the third image frame, and $y^*$ denotes the two-dimensional coordinate of the best matching pixel.
The matching process is shown in FIG. 5 and is divided into 4 steps:
1. For any training image frame group, input any 3*3 pixel block $\hat{f}_x$ of the estimated intermediate image frame $\hat{I}_0$ corresponding to the group, the first image frame $I_{-1}$, and the third image frame $I_1$.
2. Taking the pixel at the central two-dimensional coordinate x of the 3*3 block $\hat{f}_x$ as the centre pixel, obtain within a certain range d all the pixels to be matched $f_y^t$ from the first image frame $I_{-1}$ and from the third image frame $I_1$ respectively. Here d takes an odd value not less than 3 (for example, 3, 5, 7, etc.), $t \in \{-1, 1\}$ indicates whether a pixel to be matched comes from the first or the third image frame, and y denotes the two-dimensional coordinate of the pixel to be matched $f_y^t$. The set of candidate two-dimensional coordinates $\varphi(x)$ is determined as in formula (3):

$$\varphi(x) = \{\, y \mid |y - x| < d \,\} \tag{3}$$
3. Put $\hat{f}_x$ and all the 3*3 blocks to be matched through the CT (Census Transform), yielding the second character string $\hat{s}_x$ and the character strings $s_y^t$, from which the first character string and the third character string are drawn. The CT transformation formula is:

$$s(x_n) = \begin{cases} 1, & f_x(x + x_n) > f_x(x) \\ 0, & \text{otherwise} \end{cases}, \quad x_n \in R \tag{4}$$

In formula (4), $f_x(x)$ is the RGB value of the pixel at the central position of the 3*3 block $\hat{f}_x$, $f_x(x + x_n)$ is the RGB value of another pixel to be matched, x is the coordinate of the central pixel, taken as (0, 0), $x_n$ ranges over the two-dimensional coordinates of the other pixels to be matched, and R is the set of coordinates of the eight pixels other than the central one:

R = {(-1,-1), (-1,0), (-1,1), (1,-1), (1,1), (1,0), (0,1), (0,-1)}.
4. After each pixel block has undergone the CT transformation, similarity matching is performed according to formula (5), yielding the two-dimensional coordinate y* of the best matching pixel and the label t* of the image frame it comes from.
$$(y^*,\, t^*) = \mathop{\arg\min}_{y \in \varphi(x),\; t \in \{-1,1\}} L2\big(\hat{s}_x,\, s_y^t\big) \tag{5}$$
In formula (5), L2 is the matching function used for similarity matching.
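The following sketch illustrates the census transform of formula (4) and the window search of formula (5) in dense tensor form. Comparing on a single intensity channel, wrapping at image borders via torch.roll, and returning the matched RGB values directly are simplifications of this sketch, not details of the disclosure.

```python
import torch
import torch.nn.functional as F

def census_transform(img, t=3):
    """Formula (4): compare each pixel's t*t neighbourhood with its centre
    value and emit a t*t-bit string, one channel per neighbour position."""
    gray = img.mean(dim=1, keepdim=True)           # intensity stand-in for RGB
    pad = t // 2
    patches = F.unfold(F.pad(gray, [pad] * 4, mode='replicate'), t)  # (N, t*t, H*W)
    centre = patches[:, t * t // 2 : t * t // 2 + 1]
    bits = (patches > centre).float()
    n, h, w = gray.shape[0], gray.shape[-2], gray.shape[-1]
    return bits.view(n, t * t, h, w)

def best_match(pred, first, third, d=5, t=3):
    """Formula (5): search a d-wide window phi(x) in both input frames for
    the census string with the smallest L2 distance to the prediction's,
    and return the RGB values of the best matching pixels."""
    s_pred = census_transform(pred, t)
    best_cost, best_rgb = None, None
    r = (d - 1) // 2
    for frame in (first, third):                   # candidate label t in {-1, 1}
        s = census_transform(frame, t)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                cand = torch.roll(s, shifts=(dy, dx), dims=(-2, -1))
                cost = ((s_pred - cand) ** 2).sum(dim=1, keepdim=True)
                rgb = torch.roll(frame, shifts=(dy, dx), dims=(-2, -1))
                if best_cost is None:
                    best_cost, best_rgb = cost, rgb
                else:
                    take = cost < best_cost
                    best_cost = torch.where(take, cost, best_cost)
                    best_rgb = torch.where(take, rgb, best_rgb)
    return best_rgb
```

A texture consistency term can then compare the prediction against `best_match(...).detach()` under an L1-style penalty, so the match acts as a fixed target.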
Because the method provided by the embodiments of the present application adopts the texture consistency function, the over-constraint problem caused by the motion ambiguity of objects in the image frames can be alleviated, so that the image frames output by the trained video frame interpolation model have clearer textures that are closer to the texture structure of the input image frames, avoiding the generation of blurred content with unclear textures.
In combination with the content of the above embodiments, in some embodiments the process of determining the second difference includes:

For any training image frame group, according to the RGB values of all pixels in the label intermediate image frame corresponding to the group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the group, determine the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.

Specifically, for any training image frame group, before the second difference is determined, the RGB values of all pixels in the corresponding label intermediate image frame and in the corresponding estimated intermediate image frame are first determined. The RGB values of each pixel in the label intermediate image frame are then compared one by one with those of the pixel at the same two-dimensional coordinate in the estimated intermediate image frame, yielding the RGB value differences for all pixels. These differences are summed and averaged, and the average can serve as the second difference.
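As a sketch, this second difference reduces to a mean absolute RGB difference between the two frames; the assumption here is that both inputs are torch tensors of shape (N, 3, H, W).

```python
def second_difference(pred_mid, label_mid):
    # Per-pixel RGB differences between the estimated and the label
    # intermediate image frame, summed and averaged over the whole frame.
    return (pred_mid - label_mid).abs().mean()
```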
With the method provided by the embodiments of the present application, by comparing the difference between the label intermediate image frame and the corresponding estimated intermediate image frame of any training image frame group, this difference can be used to realize supervised training of the video frame interpolation model, improving the accuracy of the image frames the model outputs and, in turn, the fluency and clarity of the video.
In combination with the content of the above embodiments, in some embodiments, after the video frame interpolation model is trained, the method includes:

601. Acquire two image frames of a video to be processed.

602. Input the two image frames into the trained video frame interpolation model, to obtain the intermediate image frame of the two image frames to be processed.

Specifically, the training process of the video frame interpolation model is shown in FIG. 6. After training is complete, the model is used as follows: acquire the video on which frame interpolation is to be performed, extract image frames from it, and select two of the extracted image frames. The two image frames are input into the trained video frame interpolation model, and after processing by the model, the intermediate image frame of the two image frames is output, as in the usage sketch below.
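A hypothetical usage sketch follows; `trained_model` stands for any trained interpolation network with the (first frame, third frame) call signature and is an illustrative name, not one from the original disclosure.

```python
import torch

def interpolate_pair(trained_model, frame_a, frame_b):
    """Produce the intermediate image frame for two extracted frames,
    each assumed to be shaped (1, 3, H, W) with values in [0, 1]."""
    trained_model.eval()
    with torch.no_grad():
        return trained_model(frame_a, frame_b)
```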
It is worth mentioning that the video frame interpolation model trained in this embodiment can perform both single-frame video interpolation and extrapolation, as well as multi-frame video interpolation. That is, the trained model can be used to generate the intermediate image frame between two image frames, to generate a future image frame following two image frames, and to generate one intermediate image frame from multiple image frames.
Compared with the frame interpolation results achieved by the related art, the image frames generated by the video frame interpolation model can effectively improve video super-resolution performance. Specifically, comparisons between the video frame interpolation model provided by the embodiments of the present application and image frames obtained by the related art are shown in FIG. 7a to FIG. 7j.

FIG. 7a shows comparative evaluation results for single-frame video interpolation, in which the video frame interpolation model takes 2 input image frames and outputs 1 intermediate image frame. FIG. 7b shows comparative evaluation results for multi-frame video interpolation, in which the model takes 4 input image frames and outputs 1 intermediate image frame. FIG. 7c shows comparative evaluation results for single-frame video extrapolation, in which the model takes 2 input image frames and outputs 1 future image frame. FIG. 7d is a visual comparison after the trained video frame interpolation model is integrated into a video super-resolution model. FIG. 7e is a visual comparison for single-frame video interpolation. FIG. 7f is a visual comparison for multi-frame video interpolation. FIG. 7g is a visual comparison for single-frame video extrapolation. FIG. 7h compares the influence of single-frame video interpolation on video super-resolution. FIG. 7i is a single visual comparison with the TCL loss function added. FIG. 7j shows multiple visual comparisons with the TCL loss function added.
With the method provided by this application, processing the video to be processed with the trained video frame interpolation model can output high-definition image frames and thereby effectively improve video super-resolution performance; compared with related technical methods, the method provided by this embodiment achieves the highest Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM).
It should be understood that, although the steps in the flowcharts involved in the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may instead be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a video frame interpolation model training apparatus for implementing the above video frame interpolation model training method. The solution provided by the apparatus is similar to that described in the above method, so for the specific limitations in the one or more apparatus embodiments provided below, reference may be made to the limitations of the video frame interpolation model training method above, which are not repeated here.
In some embodiments, as shown in FIG. 8, a video frame interpolation model training apparatus is provided, including an acquisition module, a video frame interpolation module, and an adjustment module, wherein:

the acquisition module 801 is configured to acquire training image frame groups, each training image frame group consisting of three consecutive image frames of a video arranged in order, with the second image frame in each group serving as the label intermediate image frame corresponding to that group;

the video frame interpolation module 802 is configured to input the first image frame and the third image frame of each training image frame group into the video frame interpolation model and output the estimated intermediate image frame corresponding to each group;

the adjustment module 803 is configured to adjust the parameters of the video frame interpolation model based on the first difference between the first image frame in each group and the corresponding estimated intermediate image frame, the second difference between the label intermediate image frame corresponding to each group and the corresponding estimated intermediate image frame, and the third difference between the third image frame in each group and the corresponding estimated intermediate image frame, ending the training when the training stop condition is satisfied, where the degree of association between the second difference and the parameter adjustment is greater than that between either the first difference or the third difference and the parameter adjustment.
In some embodiments, the video frame interpolation module 802 includes:

an adjustment submodule, configured to, for any training image frame group, take the first and third image frames of the group as the first image frame and the third image frame respectively and adjust them simultaneously at the same resolution, where a total of n-1 adjustments are performed with a different resolution each time, n being a positive integer not less than 2;

a feature extraction submodule, configured to perform feature extraction on the two image frames after each adjustment, the features extracted from the two adjusted image frames forming one image frame feature group and the groups forming the image frame feature group set;

a first alignment submodule, configured to perform cross-scale alignment processing from the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame, to obtain the alignment result of the first image frame;

a second alignment submodule, configured to perform cross-scale alignment processing from the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame, to obtain the alignment result of the third image frame;

a bidirectional information fusion submodule, configured to perform bidirectional information fusion on the alignment results of the first and third image frames, to obtain the bidirectional information fusion result;

a reconstruction module, configured to perform reconstruction processing on the bidirectional information fusion result, to obtain the estimated intermediate image frame corresponding to each training image frame group.
In some embodiments, the first alignment submodule includes:

a first repeating unit, configured to, for the i-th image frame feature group: if i is 1, perform alignment processing between the features corresponding to the first image frame in the i-th group and the features corresponding to the third image frame in the i-th group, to obtain the i-th alignment processing result; if i is not 1, perform cross-scale fusion processing on the first i-1 bilinear interpolation results and the features corresponding to the third image frame in the i-th group, to obtain the i-th cross-scale fusion result, then perform alignment processing between the i-th cross-scale fusion result and the features corresponding to the first image frame in the i-th group, to obtain the i-th alignment processing result; and repeat the above processing for each image frame feature group until all groups have been processed, taking the n-th alignment processing result as the alignment result of the first image frame, where i is a positive integer not less than 1 and not greater than n;

wherein, for the j-th bilinear interpolation result among the first i-1 bilinear interpolation results, the j-th result is obtained by performing i-j consecutive bilinear interpolation computations on the j-th alignment processing result, j being a positive integer not less than 1 and less than i.
In some embodiments, the second alignment submodule includes:

a second repeating unit, configured to, for the i-th image frame feature group: if i is 1, perform alignment processing between the features corresponding to the third image frame in the i-th group and the features corresponding to the first image frame in the i-th group, to obtain the i-th alignment processing result; if i is not 1, perform cross-scale fusion processing on the first i-1 bilinear interpolation results and the features corresponding to the first image frame in the i-th group, to obtain the i-th cross-scale fusion result, then perform alignment processing between the i-th cross-scale fusion result and the features corresponding to the third image frame in the i-th group, to obtain the i-th alignment processing result; and repeat the above processing for each image frame feature group until all groups have been processed, taking the n-th alignment processing result as the alignment result of the third image frame, where i is a positive integer not less than 1 and not greater than n;

wherein, for the j-th bilinear interpolation result among the first i-1 bilinear interpolation results, the j-th result is obtained by performing i-j consecutive bilinear interpolation computations on the j-th alignment processing result, j being a positive integer not less than 1 and less than i.
In some embodiments, the bidirectional information fusion submodule includes:
a first acquisition unit, configured to convolve the alignment result of the first image frame with the alignment result of the third image frame to obtain a convolution result;
a second acquisition unit, configured to compute a fusion weight from the convolution result;
a first processing unit, configured to fuse the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain the bidirectional information fusion result.
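A minimal sketch of this fusion submodule follows. The description does not fix how the fusion weight is computed from the convolution result; a sigmoid producing a per-position blending weight is assumed here.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Convolve the two alignment results, turn the convolution result into a
    fusion weight (a sigmoid is assumed), and blend the two results."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, aligned_first, aligned_third):
        conv_result = self.conv(torch.cat([aligned_first, aligned_third], dim=1))
        weight = torch.sigmoid(conv_result)  # fusion weight in (0, 1)
        return weight * aligned_first + (1.0 - weight) * aligned_third

fusion = BidirectionalFusion(16)
fused = fusion(torch.rand(1, 16, 64, 64), torch.rand(1, 16, 64, 64))
```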
In some embodiments, the adjustment module 803 includes:
a first determining unit, configured to, for any training image frame group, select any t*t block of pixels from the estimated intermediate image frame corresponding to the training image frame group, and, according to the position of the center pixel of the t*t block in that estimated intermediate image frame, determine a t*t block of first target pixels in the first image frame of the training image frame group and a t*t block of third target pixels in the third image frame of the training image frame group, where t is an odd number other than 1;
a second determining unit, configured to determine a first character set from the t*t first target pixels, a third character set from the t*t third target pixels, and a second character set from the selected t*t pixels;
a third determining unit, configured to determine, from the first character set, the second character set and the third character set, the similarity between the selected pixels and the first target pixels as the first difference, and the similarity between the selected pixels and the third target pixels as the third difference.
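The description does not pin down how the character sets are constructed from the t*t pixels; one plausible reading is a census-transform-style binary descriptor, in which each pixel of the patch is compared against the patch center and the similarity is the fraction of matching bits. The following NumPy sketch uses that reading purely for illustration.

```python
import numpy as np

def census_descriptor(patch):
    """Binary string: 1 where a pixel is at least as bright as the patch center."""
    t = patch.shape[0]
    center = patch[t // 2, t // 2]
    return (patch >= center).astype(np.uint8).ravel()

def patch_similarity(desc_a, desc_b):
    """Fraction of matching bits between two descriptors (1.0 = identical)."""
    return float(np.mean(desc_a == desc_b))

# Example with t = 3 (an odd number other than 1): a patch from the estimated
# frame versus the co-located target patches in the first and third frames.
pred  = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]], dtype=np.float32)
first = np.array([[12, 18, 33], [41, 49, 58], [72, 79, 88]], dtype=np.float32)
third = np.array([[90, 80, 70], [60, 50, 40], [30, 20, 10]], dtype=np.float32)

d_pred, d_first, d_third = map(census_descriptor, (pred, first, third))
first_difference = patch_similarity(d_pred, d_first)  # similarity to frame 1
third_difference = patch_similarity(d_pred, d_third)  # similarity to frame 3
```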
In some embodiments, the adjustment module 803 further includes:
a fourth determining unit, configured to, for any training image frame group, determine, from the RGB values of all pixels in the label intermediate image frame corresponding to the training image frame group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the training image frame group, the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.
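A minimal sketch of this second difference, assuming an L1 mean over the RGB values of all pixels (the description does not fix the norm):

```python
import torch

def second_difference(estimated, label):
    """RGB difference between the estimated and label intermediate frames,
    taken over all pixels; an L1 mean is one possible choice of norm."""
    return (estimated - label).abs().mean()

loss2 = second_difference(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```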
In some embodiments, the apparatus further includes:
a comparison module, configured to compare the first difference with the third difference to determine the best matching pixel for the center pixel of the t*t block, and to compute, from the best matching pixel, the texture consistency loss of that center pixel through a texture consistency loss function, where the texture consistency loss is used to train the video frame interpolation model.
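A hedged sketch of such a texture consistency loss: per pixel, the patch-wise better match between the first and third image frames is kept, and the distance of the predicted pixel to that best match is penalized. The patch scoring (L1 over a t*t window) and the mean reduction are assumptions, not details from this application.

```python
import torch
import torch.nn.functional as F

def texture_consistency_loss(pred, frame1, frame3, t=3):
    """Keep, per pixel, the better-matching neighbour (frame 1 vs frame 3,
    scored by mean absolute patch difference) and penalise the distance of
    the predicted pixel to that best match."""
    pad = t // 2
    def patch_dist(a, b):
        # mean absolute difference over a t*t window, evaluated at every pixel
        return F.avg_pool2d((a - b).abs().mean(1, keepdim=True),
                            t, stride=1, padding=pad)
    d1, d3 = patch_dist(pred, frame1), patch_dist(pred, frame3)
    best = torch.where(d1 <= d3, frame1, frame3)  # best-matching pixels
    return (pred - best).abs().mean()

loss_tex = texture_consistency_loss(torch.rand(1, 3, 64, 64),
                                    torch.rand(1, 3, 64, 64),
                                    torch.rand(1, 3, 64, 64))
```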
In some embodiments, the apparatus further includes:
an image frame acquisition module, configured to acquire two to-be-processed image frames from a to-be-processed video;
an input module, configured to input the two to-be-processed image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two to-be-processed image frames.
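A hypothetical usage sketch of the trained model at inference time; `DummyInterpolator` merely stands in for a trained video frame interpolation model exposing the two-frames-in, one-frame-out interface described above.

```python
import torch
import torch.nn as nn

class DummyInterpolator(nn.Module):
    """Stand-in for the trained model: simply averages the two input frames."""
    def forward(self, frame_a, frame_b):
        return 0.5 * (frame_a + frame_b)

model = DummyInterpolator().eval()
frame_a = torch.rand(1, 3, 256, 256)  # first to-be-processed image frame
frame_b = torch.rand(1, 3, 256, 256)  # second to-be-processed image frame
with torch.no_grad():
    middle = model(frame_a, frame_b)  # estimated intermediate image frame
```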
Each module in the above video frame interpolation model apparatus may be implemented wholly or partly by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In some embodiments, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a communication interface, a display screen and an input apparatus connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a video frame interpolation model training method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input apparatus of the computer device may be a touch layer covering the display screen, a key, trackball or touch pad provided on a housing of the computer device, or an external keyboard, touch pad or mouse.
Those skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of part of the structure related to the solutions of the embodiments of the present application, and does not constitute a limitation on the computer device to which those solutions are applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In some embodiments, a computer device is provided, including a memory and a processor, where a computer program is stored in the memory, and the processor implements the following steps when executing the computer program:
obtaining training image frame groups, wherein each training image frame group consists of three consecutive image frames of a video arranged in order, and the second image frame of each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame of each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting parameters in the video frame interpolation model based on a first difference between the first image frame of each training image frame group and the corresponding estimated intermediate image frame, a second difference between the label intermediate image frame corresponding to each training image frame group and the corresponding estimated intermediate image frame, and a third difference between the third image frame of each training image frame group and the corresponding estimated intermediate image frame, and ending the training when a training stop condition is satisfied; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
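One way to realize the requirement that the second difference correlates more strongly with the parameter adjustment is to weight the label-based loss term more heavily than the first/third-difference term. The following training-step sketch is illustrative only; the loss functions are passed in as callables and the weights are assumptions, not values from this application.

```python
import torch

def training_step(model, optimizer, first, third, label,
                  label_loss_fn, texture_loss_fn,
                  w_second=1.0, w_texture=0.1):
    """One hedged training step: w_second > w_texture so the second
    difference dominates the gradient driving the parameter adjustment."""
    estimated = model(first, third)
    loss = (w_second * label_loss_fn(estimated, label)
            + w_texture * texture_loss_fn(estimated, first, third))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```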
In some embodiments, the processor, when executing the computer program, further implements the following steps:
for any training image frame group, taking the first image frame and the third image frame of the training image frame group as a first image frame and a third image frame respectively, and resizing the first image frame and the third image frame simultaneously to the same resolution, where the resizing is performed n-1 times in total, a different resolution is used for each resizing, and n is a positive integer not less than 2;
performing feature extraction on each of the two image frames after each resizing, where the features extracted from the two image frames after each resizing form one image frame feature group, and the image frame feature groups form an image frame feature group set (a sketch of this construction is given after the steps below);
performing cross-scale alignment of the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the image frame feature group set to obtain an alignment result of the first image frame;
performing cross-scale alignment of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the image frame feature group set to obtain an alignment result of the third image frame;
performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result;
performing reconstruction processing on the bidirectional information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
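The following sketch, referenced in the feature-extraction step above, shows one hedged reading of the multi-resolution feature group construction: n-1 bilinear resizes at mutually distinct scales, plus the original pair, each pair passed through a shared feature extractor. Whether the original-resolution pair forms the n-th group, and the scale factors themselves, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_feature_group_set(first, third, extractor, scales=(0.25, 0.5)):
    """Resize both frames n-1 times (one distinct scale per resize), extract
    one feature group per resized pair, then add the full-resolution group,
    yielding n groups ordered from low to high resolution."""
    groups = []
    for s in scales:  # the n-1 resizing operations
        a = F.interpolate(first, scale_factor=s, mode="bilinear", align_corners=False)
        b = F.interpolate(third, scale_factor=s, mode="bilinear", align_corners=False)
        groups.append((extractor(a), extractor(b)))
    groups.append((extractor(first), extractor(third)))  # full-resolution group
    return groups  # the image frame feature group set

extractor = nn.Conv2d(3, 16, 3, padding=1)  # stand-in feature extractor
first = torch.rand(1, 3, 128, 128)
third = torch.rand(1, 3, 128, 128)
feature_group_set = build_feature_group_set(first, third, extractor)
```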
In some embodiments, the processor, when executing the computer program, further implements the following steps:
for the i-th image frame feature group: if i is 1, aligning the feature of the first image frame in the i-th image frame feature group with the feature of the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the first image frame, where i is a positive integer not less than 1 and not greater than n;
wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, the processor, when executing the computer program, further implements the following steps:
for the i-th image frame feature group: if i is 1, aligning the feature of the third image frame in the i-th image frame feature group with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the third image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the third image frame, where i is a positive integer not less than 1 and not greater than n;
wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, the processor, when executing the computer program, further implements the following steps:
convolving the alignment result of the first image frame with the alignment result of the third image frame to obtain a convolution result;
computing a fusion weight from the convolution result;
fusing the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a bidirectional information fusion result.
In some embodiments, the processor, when executing the computer program, further implements the following steps:
for any training image frame group, selecting any t*t block of pixels from the estimated intermediate image frame corresponding to the training image frame group, and, according to the position of the center pixel of the t*t block in that estimated intermediate image frame, determining a t*t block of first target pixels in the first image frame of the training image frame group and a t*t block of third target pixels in the third image frame of the training image frame group, where t is an odd number other than 1;
determining a first character set from the t*t first target pixels, a third character set from the t*t third target pixels, and a second character set from the selected t*t pixels;
determining, from the first character set, the second character set and the third character set, the similarity between the selected pixels and the first target pixels as the first difference, and the similarity between the selected pixels and the third target pixels as the third difference.
In some embodiments, the processor, when executing the computer program, further implements the following steps:
for any training image frame group, determining, from the RGB values of all pixels in the label intermediate image frame corresponding to the training image frame group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the training image frame group, the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.
In some embodiments, the processor, when executing the computer program, further implements the following steps:
comparing the first difference with the third difference to determine the best matching pixel for the center pixel of the t*t block, and computing, from the best matching pixel, the texture consistency loss of that center pixel through a texture consistency loss function, where the texture consistency loss is used to train the video frame interpolation model.
In some embodiments, the processor, when executing the computer program, further implements the following steps:
acquiring two to-be-processed image frames from a to-be-processed video;
inputting the two to-be-processed image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two to-be-processed image frames.
In some embodiments, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the following steps:
obtaining training image frame groups, wherein each training image frame group consists of three consecutive image frames of a video arranged in order, and the second image frame of each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame of each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting parameters in the video frame interpolation model based on a first difference between the first image frame of each training image frame group and the corresponding estimated intermediate image frame, a second difference between the label intermediate image frame corresponding to each training image frame group and the corresponding estimated intermediate image frame, and a third difference between the third image frame of each training image frame group and the corresponding estimated intermediate image frame, and ending the training when a training stop condition is satisfied; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for any training image frame group, taking the first image frame and the third image frame of the training image frame group as a first image frame and a third image frame respectively, and resizing the first image frame and the third image frame simultaneously to the same resolution, where the resizing is performed n-1 times in total, a different resolution is used for each resizing, and n is a positive integer not less than 2;
performing feature extraction on each of the two image frames after each resizing, where the features extracted from the two image frames after each resizing form one image frame feature group, and the image frame feature groups form an image frame feature group set;
performing cross-scale alignment of the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the image frame feature group set to obtain an alignment result of the first image frame;
performing cross-scale alignment of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the image frame feature group set to obtain an alignment result of the third image frame;
performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result;
performing reconstruction processing on the bidirectional information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for the i-th image frame feature group: if i is 1, aligning the feature of the first image frame in the i-th image frame feature group with the feature of the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the first image frame, where i is a positive integer not less than 1 and not greater than n;
wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for the i-th image frame feature group: if i is 1, aligning the feature of the third image frame in the i-th image frame feature group with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the third image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the third image frame, where i is a positive integer not less than 1 and not greater than n;
wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
convolving the alignment result of the first image frame with the alignment result of the third image frame to obtain a convolution result;
computing a fusion weight from the convolution result;
fusing the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a bidirectional information fusion result.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for any training image frame group, selecting any t*t block of pixels from the estimated intermediate image frame corresponding to the training image frame group, and, according to the position of the center pixel of the t*t block in that estimated intermediate image frame, determining a t*t block of first target pixels in the first image frame of the training image frame group and a t*t block of third target pixels in the third image frame of the training image frame group, where t is an odd number other than 1;
determining a first character set from the t*t first target pixels, a third character set from the t*t third target pixels, and a second character set from the selected t*t pixels;
determining, from the first character set, the second character set and the third character set, the similarity between the selected pixels and the first target pixels as the first difference, and the similarity between the selected pixels and the third target pixels as the third difference.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for any training image frame group, determining, from the RGB values of all pixels in the label intermediate image frame corresponding to the training image frame group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the training image frame group, the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
comparing the first difference with the third difference to determine the best matching pixel for the center pixel of the t*t block, and computing, from the best matching pixel, the texture consistency loss of that center pixel through a texture consistency loss function, where the texture consistency loss is used to train the video frame interpolation model.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
acquiring two to-be-processed image frames from a to-be-processed video;
inputting the two to-be-processed image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two to-be-processed image frames.
In some embodiments, a computer program product is provided, including a computer program, and the computer program, when executed by a processor, implements the following steps:
obtaining training image frame groups, wherein each training image frame group consists of three consecutive image frames of a video arranged in order, and the second image frame of each training image frame group serves as the label intermediate image frame corresponding to that training image frame group;
inputting the first image frame and the third image frame of each training image frame group into a video frame interpolation model, and outputting the estimated intermediate image frame corresponding to each training image frame group;
adjusting parameters in the video frame interpolation model based on a first difference between the first image frame of each training image frame group and the corresponding estimated intermediate image frame, a second difference between the label intermediate image frame corresponding to each training image frame group and the corresponding estimated intermediate image frame, and a third difference between the third image frame of each training image frame group and the corresponding estimated intermediate image frame, and ending the training when a training stop condition is satisfied; wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for any training image frame group, taking the first image frame and the third image frame of the training image frame group as a first image frame and a third image frame respectively, and resizing the first image frame and the third image frame simultaneously to the same resolution, where the resizing is performed n-1 times in total, a different resolution is used for each resizing, and n is a positive integer not less than 2;
performing feature extraction on each of the two image frames after each resizing, where the features extracted from the two image frames after each resizing form one image frame feature group, and the image frame feature groups form an image frame feature group set;
performing cross-scale alignment of the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the image frame feature group set to obtain an alignment result of the first image frame;
performing cross-scale alignment of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the image frame feature group set to obtain an alignment result of the third image frame;
performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result;
performing reconstruction processing on the bidirectional information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for the i-th image frame feature group: if i is 1, aligning the feature of the first image frame in the i-th image frame feature group with the feature of the third image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the third image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the first image frame, where i is a positive integer not less than 1 and not greater than n;
wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for the i-th image frame feature group: if i is 1, aligning the feature of the third image frame in the i-th image frame feature group with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the first image frame in the i-th image frame feature group to obtain the i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the third image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the third image frame, where i is a positive integer not less than 1 and not greater than n;
wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
convolving the alignment result of the first image frame with the alignment result of the third image frame to obtain a convolution result;
computing a fusion weight from the convolution result;
fusing the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weight to obtain a bidirectional information fusion result.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for any training image frame group, selecting any t*t block of pixels from the estimated intermediate image frame corresponding to the training image frame group, and, according to the position of the center pixel of the t*t block in that estimated intermediate image frame, determining a t*t block of first target pixels in the first image frame of the training image frame group and a t*t block of third target pixels in the third image frame of the training image frame group, where t is an odd number other than 1;
determining a first character set from the t*t first target pixels, a third character set from the t*t third target pixels, and a second character set from the selected t*t pixels;
determining, from the first character set, the second character set and the third character set, the similarity between the selected pixels and the first target pixels as the first difference, and the similarity between the selected pixels and the third target pixels as the third difference.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
for any training image frame group, determining, from the RGB values of all pixels in the label intermediate image frame corresponding to the training image frame group and the RGB values of all pixels in the estimated intermediate image frame corresponding to the training image frame group, the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
comparing the first difference with the third difference to determine the best matching pixel for the center pixel of the t*t block, and computing, from the best matching pixel, the texture consistency loss of that center pixel through a texture consistency loss function, where the texture consistency loss is used to train the video frame interpolation model.
In some embodiments, the computer program, when executed by the processor, further implements the following steps:
acquiring two to-be-processed image frames from a to-be-processed video;
inputting the two to-be-processed image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two to-be-processed image frames.
It should be noted that the user information (including but not limited to user device information, user personal information, and the like) and data (including but not limited to data used for analysis, stored data, displayed data, and the like) involved in the embodiments of the present application are all information and data authorized by the user or fully authorized by all parties.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, and the computer program may be stored in a non-volatile computer-readable storage medium. When executed, the computer program may include the processes of the embodiments of the above methods. Any reference to a memory, a database or another medium used in the embodiments provided in the present application may include at least one of a non-volatile memory and a volatile memory. The non-volatile memory may include a read-only memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a resistive random access memory (ReRAM), a magnetoresistive random access memory (MRAM), a ferroelectric random access memory (FRAM), a phase change memory (PCM), a graphene memory, and the like. The volatile memory may include a random access memory (RAM), an external cache memory, or the like. By way of illustration rather than limitation, the RAM may take many forms, such as a static random access memory (SRAM) or a dynamic random access memory (DRAM). The databases involved in the embodiments provided in the present application may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors involved in the embodiments provided in the present application may be, but are not limited to, a general-purpose processor, a central processing unit, a graphics processing unit, a digital signal processor, a programmable logic device, or a quantum-computing-based data processing logic device.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments only express several implementations of the embodiments of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the patent scope of the embodiments of the present application. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the embodiments of the present application, and these all fall within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the appended claims.

Claims (20)

  1. A video frame interpolation model training method, comprising:
    obtaining training image frame groups, wherein each training image frame group consists of three consecutive image frames of a video arranged in order, and the second image frame of each training image frame group serves as a label intermediate image frame corresponding to the training image frame group;
    inputting the first image frame and the third image frame of each training image frame group into a video frame interpolation model, and outputting an estimated intermediate image frame corresponding to each training image frame group; and
    adjusting parameters in the video frame interpolation model based on a first difference between the first image frame of each training image frame group and the estimated intermediate image frame corresponding to the training image frame group, a second difference between the label intermediate image frame corresponding to each training image frame group and the estimated intermediate image frame corresponding to the training image frame group, and a third difference between the third image frame of each training image frame group and the estimated intermediate image frame corresponding to the training image frame group, and ending the training when a training stop condition is satisfied, wherein a degree of correlation between the second difference and the parameter adjustment is greater than a degree of correlation between the first difference or the third difference and the parameter adjustment.
  2. The method according to claim 1, wherein inputting the first image frame and the third image frame of each training image frame group into the video frame interpolation model and outputting the estimated intermediate image frame corresponding to each training image frame group comprises:
    for any training image frame group, taking the first image frame and the third image frame of the training image frame group as a first image frame and a third image frame respectively, and resizing the first image frame and the third image frame simultaneously to the same resolution, wherein the resizing is performed n-1 times in total, a different resolution is used for each resizing, and n is a positive integer not less than 2;
    performing feature extraction on each of the two image frames after each resizing, wherein the features extracted from the two image frames after each resizing form one image frame feature group, and the image frame feature groups form an image frame feature group set;
    performing cross-scale alignment of features corresponding to the first image frame in the image frame feature group set toward features corresponding to the third image frame in the image frame feature group set to obtain an alignment result of the first image frame;
    performing cross-scale alignment of features corresponding to the third image frame in the image frame feature group set toward features corresponding to the first image frame in the image frame feature group set to obtain an alignment result of the third image frame;
    performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result; and
    performing reconstruction processing on the bidirectional information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
  3. The method according to claim 2, wherein the resolutions corresponding to the image frame feature groups in the image frame feature group set increase successively in order, and performing the cross-scale alignment of the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the image frame feature group set to obtain the alignment result of the first image frame comprises:
    for the i-th image frame feature group: if i is 1, aligning the feature of the first image frame in the i-th image frame feature group with the feature of the third image frame in the i-th image frame feature group to obtain an i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results together with the feature of the third image frame in the i-th image frame feature group to obtain an i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature of the first image frame in the i-th image frame feature group to obtain the i-th alignment result; and repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the first image frame, wherein i is a positive integer not less than 1 and not greater than n;
    wherein, for the j-th of the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j consecutive times, and j is a positive integer not less than 1 and less than i.
  4. The method according to claim 2, wherein the resolutions corresponding to the image frame feature groups in the image frame feature group set increase successively in order, and performing cross-scale alignment of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the image frame feature group set to obtain the alignment result of the third image frame comprises:
    for the i-th image frame feature group: if i is 1, aligning the feature corresponding to the third image frame in the i-th image frame feature group with the feature corresponding to the first image frame in the i-th image frame feature group to obtain an i-th alignment result; if i is not 1, performing cross-scale fusion on the first i-1 bilinear interpolation results and the feature corresponding to the first image frame in the i-th image frame feature group to obtain an i-th cross-scale fusion result, and aligning the i-th cross-scale fusion result with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; repeating the above processing for each image frame feature group until all image frame feature groups have been processed, and taking the n-th alignment result as the alignment result of the third image frame, wherein i is a positive integer not less than 1 and not greater than n;
    wherein, for the j-th bilinear interpolation result among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j times in succession, j being a positive integer not less than 1 and less than i.
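For illustration only (not part of the claims), the recursion of claims 3 and 4 can be sketched as follows, assuming PyTorch, a ×2 resolution step between adjacent feature groups, and hypothetical `align` and `fuse` callables standing in for the patent's alignment and cross-scale fusion processing. Claim 4 is the same procedure with the roles of the two frames swapped.

```python
import torch.nn.functional as F

def cross_scale_align(feats_src, feats_ref, align, fuse):
    """Sketch of the claim-3 recursion. feats_src / feats_ref hold per-scale
    features ordered coarse (i = 1) to fine (i = n); align and fuse are
    stand-ins for the alignment and cross-scale fusion operators."""
    aligned = []                         # aligned[j - 1] is the j-th alignment result
    n = len(feats_src)
    for i in range(1, n + 1):
        src, ref = feats_src[i - 1], feats_ref[i - 1]
        if i == 1:
            result = align(src, ref)     # coarsest scale: align the two directly
        else:
            upsampled = []
            for j in range(1, i):
                x = aligned[j - 1]       # j-th bilinear interpolation result:
                for _ in range(i - j):   # upsample the j-th alignment result
                    x = F.interpolate(x, scale_factor=2, mode="bilinear",
                                      align_corners=False)   # i - j times in a row
                upsampled.append(x)
            fused = fuse(upsampled, ref) # cross-scale fusion with the ref features
            result = align(fused, src)   # align the fusion result with the src feature
        aligned.append(result)
    return aligned[-1]                   # the n-th alignment result is the output
```

In practice the `align` stand-in is often realized with deformable convolution or flow-based warping, but the claims leave the operator open.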
  5. The method according to claim 3 or 4, wherein performing bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result comprises:
    convolving the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;
    computing fusion weights from the convolution result;
    fusing the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weights to obtain the bidirectional information fusion result.
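For illustration, a minimal sketch of claim 5's fusion step, assuming PyTorch; the channel width and the sigmoid used to turn the convolution result into fusion weights are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Sketch of claim 5: convolve both alignment results, derive per-pixel
    fusion weights, and blend. The sigmoid gate is an assumption, not taken
    from the patent."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, aligned_first, aligned_third):
        conv_result = self.conv(torch.cat([aligned_first, aligned_third], dim=1))
        weights = torch.sigmoid(conv_result)        # fusion weights in [0, 1]
        return weights * aligned_first + (1 - weights) * aligned_third
```

A complementary-weight blend of this kind lets the network favor whichever direction of alignment is more reliable at each pixel.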
  6. The method according to claim 1, wherein the first difference and the third difference are both similarities, and determining the first difference and the third difference comprises:
    for any training image frame group, selecting any t*t block of pixels from the estimated intermediate image frame corresponding to that training image frame group, and, according to the position of the center pixel of the t*t block within that estimated intermediate image frame, determining a t*t block of first target pixels in the first image frame of that training image frame group and a t*t block of third target pixels in the third image frame of that training image frame group, where t is an odd number not equal to 1;
    determining a first character set from the t*t first target pixels, a third character set from the t*t third target pixels, and a second character set from the selected t*t pixels;
    determining, according to the first character set, the second character set, and the third character set, the similarity between the selected pixels and the first target pixels as the first difference, and the similarity between the selected pixels and the third target pixels as the third difference.
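The "character sets" of claim 6 behave like census descriptors: each pixel of a t*t patch is encoded relative to the patch center, and similarity is the fraction of matching codes. The following sketch adopts that reading, assuming single-channel numpy frames and an in-bounds center position (y, x); the census encoding and the matching ratio are assumptions, as the claim does not fix the encoding.

```python
import numpy as np

def census_descriptor(patch: np.ndarray) -> np.ndarray:
    """Encode a t*t patch as a bit string comparing each pixel with the patch
    center; one plausible reading of the claim's 'character set'."""
    t = patch.shape[0]
    center = patch[t // 2, t // 2]
    return (patch.flatten() >= center).astype(np.uint8)

def patch_similarity(est, first, third, y, x, t=7):
    """Census similarity between the t*t patch of the estimated frame centered
    at (y, x) and the co-located patches of the first and third frames; these
    play the roles of the claim's first and third differences."""
    assert t % 2 == 1 and t != 1, "t must be an odd number other than 1"
    r = t // 2
    window = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    d_est = census_descriptor(est[window])
    d_first = census_descriptor(first[window])
    d_third = census_descriptor(third[window])
    sim_first = float(np.mean(d_est == d_first))   # 1.0 means identical texture
    sim_third = float(np.mean(d_est == d_third))
    return sim_first, sim_third
```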
  7. The method according to claim 1, wherein determining the second difference comprises:
    for any training image frame group, determining, from the RGB values of all pixels in the label intermediate image frame corresponding to that training image frame group and the RGB values of all pixels in the corresponding estimated intermediate image frame, the RGB value difference between the label intermediate image frame and the estimated intermediate image frame as the second difference.
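A sketch of claim 7's second difference; the patent leaves the aggregation open, so mean absolute RGB error is assumed here.

```python
import numpy as np

def rgb_difference(label_frame: np.ndarray, est_frame: np.ndarray) -> float:
    """Aggregate RGB discrepancy over all pixels (claim 7's second difference);
    mean absolute error is an assumption, as the patent does not fix the norm."""
    diff = label_frame.astype(np.float32) - est_frame.astype(np.float32)
    return float(np.mean(np.abs(diff)))
```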
  8. The method according to claim 6, further comprising:
    comparing the first difference with the third difference to determine the best matching pixel for the center pixel of the selected t*t block, and computing, according to the best matching pixel, the texture consistency loss of that center pixel through a texture consistency loss function, wherein the texture consistency loss is used to train the video frame interpolation model.
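A sketch of claim 8, reusing `patch_similarity` from the sketch after claim 6: whichever of the first and third frames matches the estimated patch better supplies the best matching pixel, and the center-pixel penalty against it is the texture consistency loss. The L1 penalty is an assumption.

```python
import numpy as np

def texture_consistency_loss(est, first, third, y, x, t=7):
    """Pick the frame whose t*t patch matches the estimated patch better (its
    center pixel is the 'best matching pixel'), then penalize the estimated
    center pixel against that best match."""
    sim_first, sim_third = patch_similarity(est, first, third, y, x, t)
    best = first if sim_first >= sim_third else third
    return float(np.mean(np.abs(est[y, x].astype(np.float32)
                                - best[y, x].astype(np.float32))))
```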
  9. The method according to any one of claims 1-8, further comprising:
    acquiring two to-be-processed image frames from a to-be-processed video;
    inputting the two to-be-processed image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two to-be-processed image frames.
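At inference time (claim 9), the trained model consumes two frames and returns the frame between them. A usage sketch assuming a scripted PyTorch model; the file name, the (N, C, H, W) tensor layout in [0, 1], and the two-argument call signature are hypothetical.

```python
import torch

# Minimal inference sketch for claim 9; "interp_model.pt" is a hypothetical
# path to a trained, scripted video frame interpolation model.
model = torch.jit.load("interp_model.pt").eval()
frame_a = torch.rand(1, 3, 256, 448)   # stand-ins for two decoded video frames
frame_b = torch.rand(1, 3, 256, 448)
with torch.no_grad():
    mid_frame = model(frame_a, frame_b)  # estimated intermediate frame
print(mid_frame.shape)
```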
  10. A video frame interpolation model training apparatus, comprising:
    an acquisition module configured to acquire training image frame groups, each training image frame group consisting of three consecutive image frames of a video arranged in order, the second image frame of each training image frame group serving as the label intermediate image frame corresponding to that training image frame group;
    a video frame interpolation module configured to input the first image frame and the third image frame of each training image frame group into a video frame interpolation model and to output the estimated intermediate image frame corresponding to each training image frame group;
    an adjustment module configured to adjust parameters of the video frame interpolation model, based on a first difference between the first image frame of each training image frame group and the corresponding estimated intermediate image frame, a second difference between the label intermediate image frame corresponding to each training image frame group and the corresponding estimated intermediate image frame, and a third difference between the third image frame of each training image frame group and the corresponding estimated intermediate image frame, until training ends when a training stop condition is satisfied, wherein the degree of correlation between the second difference and the parameter adjustment is greater than the degree of correlation between the first difference or the third difference and the parameter adjustment.
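The adjustment module's weighting can be sketched as a weighted sum in which the label-frame term dominates, matching the requirement that the second difference correlate most strongly with the parameter adjustment; the concrete weights are assumptions.

```python
def total_loss(first_diff, second_diff, third_diff, w_label=1.0, w_side=0.1):
    """Weighted training objective: the label (second) difference is weighted
    more heavily than the two side-frame differences (w_label > w_side)."""
    return w_label * second_diff + w_side * (first_diff + third_diff)
```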
  11. The apparatus according to claim 10, wherein the video frame interpolation module comprises:
    an adjustment submodule configured to, for any training image frame group, take the first image frame and the third image frame of that training image frame group as a first image frame and a third image frame respectively, and resize the first image frame and the third image frame simultaneously at the same resolution, wherein a total of n-1 resizing operations are performed and each resizing uses a different resolution, n being a positive integer not less than 2;
    a feature extraction submodule configured to perform feature extraction on each of the two image frames obtained from each resizing, the features extracted from the two image frames of each resizing forming one image frame feature group, and the image frame feature groups forming an image frame feature group set;
    a first alignment submodule configured to perform cross-scale alignment of the features corresponding to the first image frame in the image frame feature group set toward the features corresponding to the third image frame in the image frame feature group set to obtain the alignment result of the first image frame;
    a second alignment submodule configured to perform cross-scale alignment of the features corresponding to the third image frame in the image frame feature group set toward the features corresponding to the first image frame in the image frame feature group set to obtain the alignment result of the third image frame;
    a bidirectional information fusion submodule configured to perform bidirectional information fusion on the alignment result of the first image frame and the alignment result of the third image frame to obtain a bidirectional information fusion result;
    a reconstruction module configured to perform reconstruction processing on the bidirectional information fusion result to obtain the estimated intermediate image frame corresponding to each training image frame group.
  12. The apparatus according to claim 11, wherein the first alignment submodule comprises:
    a first repetition unit configured to, for the i-th image frame feature group: if i is 1, align the feature corresponding to the first image frame in the i-th image frame feature group with the feature corresponding to the third image frame in the i-th image frame feature group to obtain an i-th alignment result; if i is not 1, perform cross-scale fusion on the first i-1 bilinear interpolation results and the feature corresponding to the third image frame in the i-th image frame feature group to obtain an i-th cross-scale fusion result, and align the i-th cross-scale fusion result with the feature corresponding to the first image frame in the i-th image frame feature group to obtain the i-th alignment result; repeat the above processing for each image frame feature group until all image frame feature groups have been processed; and take the n-th alignment result as the alignment result of the first image frame, wherein i is a positive integer not less than 1 and not greater than n;
    wherein, for the j-th bilinear interpolation result among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j times in succession, j being a positive integer not less than 1 and less than i.
  13. The apparatus according to claim 11, wherein the second alignment submodule comprises:
    a second repetition unit configured to, for the i-th image frame feature group: if i is 1, align the feature corresponding to the third image frame in the i-th image frame feature group with the feature corresponding to the first image frame in the i-th image frame feature group to obtain an i-th alignment result; if i is not 1, perform cross-scale fusion on the first i-1 bilinear interpolation results and the feature corresponding to the first image frame in the i-th image frame feature group to obtain an i-th cross-scale fusion result, and align the i-th cross-scale fusion result with the feature corresponding to the third image frame in the i-th image frame feature group to obtain the i-th alignment result; repeat the above processing for each image frame feature group until all image frame feature groups have been processed; and take the n-th alignment result as the alignment result of the third image frame, wherein i is a positive integer not less than 1 and not greater than n;
    wherein, for the j-th bilinear interpolation result among the first i-1 bilinear interpolation results, the j-th bilinear interpolation result is obtained by applying bilinear interpolation to the j-th alignment result i-j times in succession, j being a positive integer not less than 1 and less than i.
  14. The apparatus according to claim 12 or 13, wherein the bidirectional information fusion submodule comprises:
    a first acquisition unit configured to convolve the alignment result of the first image frame and the alignment result of the third image frame to obtain a convolution result;
    a second acquisition unit configured to compute fusion weights from the convolution result;
    a first processing unit configured to fuse the alignment result of the first image frame and the alignment result of the third image frame according to the fusion weights to obtain the bidirectional information fusion result.
  15. The apparatus according to claim 10, wherein the adjustment module further comprises:
    a first determination unit configured to, for any training image frame group, select any t*t block of pixels from the estimated intermediate image frame corresponding to that training image frame group and, according to the position of the center pixel of the t*t block within that estimated intermediate image frame, determine a t*t block of first target pixels in the first image frame of that training image frame group and a t*t block of third target pixels in the third image frame of that training image frame group, where t is an odd number not equal to 1;
    a second determination unit configured to determine a first character set from the t*t first target pixels, a third character set from the t*t third target pixels, and a second character set from the selected t*t pixels;
    a third determination unit configured to determine, according to the first character set, the second character set, and the third character set, the similarity between the selected pixels and the first target pixels as the first difference, and the similarity between the selected pixels and the third target pixels as the third difference.
  16. The apparatus according to claim 10, wherein the adjustment module further comprises:
    a fourth determination unit configured to, for any training image frame group, determine, from the RGB values of all pixels in the label intermediate image frame corresponding to that training image frame group and the RGB values of all pixels in the corresponding estimated intermediate image frame, the RGB value difference between the two frames as the second difference.
  17. The apparatus according to any one of claims 10-16, further comprising:
    an image frame acquisition module configured to acquire two to-be-processed image frames from a to-be-processed video;
    an input module configured to input the two to-be-processed image frames into the trained video frame interpolation model to obtain an intermediate image frame of the two to-be-processed image frames.
  18. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 9 when executing the computer program.
  19. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
  20. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
PCT/CN2022/105652 2021-12-06 2022-07-14 Video frame interpolation model training method and apparatus, and computer device and storage medium WO2023103378A1 (en)

Applications Claiming Priority (2)

CN202111477500.0, priority date 2021-12-06
CN202111477500.0A (CN113891027B), priority date 2021-12-06, filed 2021-12-06: Video frame insertion model training method and device, computer equipment and storage medium

Publications (1)

Publication number: WO2023103378A1

Family ID: 79015618

Family Applications (1)

PCT/CN2022/105652 (WO2023103378A1), priority date 2021-12-06, filed 2022-07-14: Video frame interpolation model training method and apparatus, and computer device and storage medium

Country Status (2)

CN: CN113891027B
WO: WO2023103378A1

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113891027B 2021-12-06 2022-03-15 Shenzhen SmartMore Information Technology Co., Ltd. Video frame insertion model training method and device, computer equipment and storage medium
CN115103147A * 2022-06-24 2022-09-23 Mashang Consumer Finance Co., Ltd. Intermediate frame image generation method, model training method and device
KR20240103345A * 2022-12-27 2024-07-04 CJ OliveNetworks Co., Ltd. Image information generation system and method using AI

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8818175B2 (en) * 2010-03-08 2014-08-26 Vumanity Media, Inc. Generation of composited video programming
AU2018101526A4 (en) * 2018-10-14 2018-11-29 Chai, Xipeng Mr Video interpolation based on deep learning
US10896356B2 (en) * 2019-05-10 2021-01-19 Samsung Electronics Co., Ltd. Efficient CNN-based solution for video frame interpolation
CN113132664B * 2021-04-19 2022-10-04 iFLYTEK Co., Ltd. Frame interpolation generation model construction method and video frame interpolation method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210067735A1 * 2019-09-03 2021-03-04 Nvidia Corporation Video interpolation using one or more neural networks
US20210073589A1 * 2019-09-09 2021-03-11 Apple Inc. Method for Improving Temporal Consistency of Deep Neural Networks
CN111327926A * 2020-02-12 2020-06-23 Beijing Baidu Netcom Science and Technology Co., Ltd. Video frame insertion method and device, electronic equipment and storage medium
CN111464814A * 2020-03-12 2020-07-28 Tianjin University Virtual reference frame generation method based on parallax guide fusion
WO2021217653A1 * 2020-04-30 2021-11-04 BOE Technology Group Co., Ltd. Video frame insertion method and apparatus, and computer-readable storage medium
CN111898701A * 2020-08-13 2020-11-06 NetEase (Hangzhou) Network Co., Ltd. Model training, frame image generation, frame interpolation method, device, equipment and medium
CN112104830A * 2020-08-13 2020-12-18 Beijing Megvii Technology Co., Ltd. Video frame insertion method, model training method and corresponding device
CN113393562A * 2021-06-16 2021-09-14 Huanghuai University Animation middle picture intelligent generation method and system based on visual transmission
CN113891027A * 2021-12-06 2022-01-04 Shenzhen SmartMore Information Technology Co., Ltd. Video frame insertion model training method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG HAOXIAN; WANG RONGGANG; ZHAO YANG: "Multi-Frame Pyramid Refinement Network for Video Frame Interpolation", IEEE Access, vol. 7, 2019, pp. 130610-130621, XP011747003, DOI: 10.1109/ACCESS.2019.2940510 *
ZHANG QIAN; JIANG FENG: "Video interpolation based on deep learning", Intelligent Computer and Applications, vol. 9, no. 4, July 2019, pp. 252-257, 262, XP093069281 *

Also Published As

Publication number Publication date
CN113891027B (en) 2022-03-15
CN113891027A (en) 2022-01-04


Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22902813; Country of ref document: EP; Kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.