Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For ease of understanding, the frame rate standards of video sources in the related art are described first. At present, most 2D and 3D video material is still shot at a relatively low frame rate: television media widely use 25 or 30 frames per second, and some live-streaming platforms shoot and transmit video at an even lower rate of 15 frames per second. For example, when video is shot outdoors or on a mobile terminal, network bandwidth limitations generally force transmission at a lower frame rate, so as to reduce the video bit rate and save network bandwidth.
However, with advances in computer vision and multimedia technology, digital media has gradually moved from low definition to high definition and even ultra-high-definition 4K (4K video resolution can reach 4096 × 2160). Existing video sources can no longer satisfy the growing demand for viewing quality, so a better viewing experience must be provided by increasing the real-time frame rate of the video. Performing real-time frame interpolation at the video receiving end avoids problems such as unsmooth playback caused by a low original frame rate.
Current video frame interpolation methods fall mainly into three categories:
The first category comprises methods based on image interpolation, which obtain an intermediate frame through block search and motion compensation between the preceding and following video frames.
The second category comprises methods based on optical flow estimation: an optical flow vector between two adjacent video frames is first calculated, and motion compensation is then performed based on this motion vector to generate a new intermediate video frame. Depending on how the optical flow vector is calculated, these methods can be further divided into interpolation based on unidirectional motion estimation and interpolation based on bidirectional motion estimation.
The third category comprises methods based on deep learning: the convolutional structure of a Convolutional Neural Network (CNN) is used to extract features from two adjacent frames, and a fusion strategy then combines the information of the two frames to obtain an intermediate frame.
These video frame interpolation methods generally interpolate in a static manner: video of any scene and any complexity is interpolated by a preset fixed multiple, so that the same number of frames is inserted between every two adjacent frames. However, within a single video that needs interpolation, different segments have different complexity. For segments with slow inter-frame change (low complexity), inserting too many frames brings no improvement perceptible to the human eye, i.e., it does not improve the subjective viewing experience, while it does increase the video size and the required transmission bandwidth. For segments with fast inter-frame change (high complexity), interpolating by the same multiple as for slowly changing segments cannot improve video fluency to the maximum extent.
In summary, the video frame interpolation methods in the related art lack rationality and flexibility, produce poor interpolation results, and degrade the user's viewing experience. Accordingly, embodiments of the present invention provide a video frame interpolation method, an apparatus, and a server; the technology can be applied to frame interpolation of various video sources, such as 2D video and 3D video.
First, referring to a flow chart of a video frame interpolation method shown in fig. 1, the method includes the following steps:
step S100, based on the video frame sequence to be processed, determining the current forward reference frame and backward reference frame.
The video frame sequence may be a plurality of images arranged in a certain order. According to the persistence-of-vision principle, when the images change at more than 24 frames per second, the human eye no longer distinguishes individual still pictures but perceives a smooth, continuous visual effect; this is what constitutes video. In practice, frame interpolation is usually employed to increase the real-time frame rate, i.e., the number of video frames played per unit time. To this end, every two adjacent images in the video frame sequence can be used as a forward reference frame and a backward reference frame, or two images can be extracted at a preset interval to serve as the forward and backward reference frames; video frames are then interpolated between them to increase the frame rate. Typically, the forward reference frame is played earlier than the backward reference frame.
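As an illustrative aid (not a required implementation of the claimed method), reference-frame pairing over a frame sequence might be sketched in Python as follows; the function name and the step parameter are assumptions introduced here:

```python
from typing import Iterator, List, Tuple

def reference_frame_pairs(frames: List, step: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield (forward, backward) reference-frame index pairs.

    step=1 pairs every two adjacent frames; step>1 extracts frames at a
    preset interval, as described above.
    """
    for i in range(0, len(frames) - step, step):
        yield i, i + step
```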
Step S102, determining the frame similarity between the forward reference frame and the backward reference frame.
The frame similarity is mainly used to measure how similar the forward and backward reference frames are. Because both reference frames are images, determining the frame similarity between them largely coincides with determining the similarity between two images, so the frame similarity can be determined using image-similarity techniques from the related art.
Specifically, the frame similarity may be obtained by calculating the Structural Similarity (SSIM), the cosine similarity, the histogram similarity, or the like of the forward and backward reference frames. Structural similarity measures image similarity in terms of luminance, contrast, and structure; cosine similarity represents each image as a vector and characterizes the similarity of two images by the cosine distance between their vectors; histogram similarity determines the similarity of two images from the global color distributions described by their histograms.
In practical implementations, one of the above manners may be selected according to the required interpolation effect, or several of them may be computed and their results weighted to obtain the final frame similarity.
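The following minimal NumPy sketch illustrates such a weighted combination, here of cosine similarity and a grayscale-histogram similarity; the choice of measures, the histogram-intersection formula, and the equal weights are illustrative assumptions rather than requirements of the method:

```python
import numpy as np

def histogram_similarity(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """Similarity of the global gray-level histograms of two frames."""
    ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
    # Histogram intersection, normalized to [0, 1].
    return float(np.minimum(ha, hb).sum() / np.maximum(ha, hb).sum())

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of the two frames viewed as flat vectors."""
    va, vb = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

def frame_similarity(a, b, w_cos: float = 0.5, w_hist: float = 0.5) -> float:
    """Weighted combination of several similarity measures."""
    return w_cos * cosine_similarity(a, b) + w_hist * histogram_similarity(a, b)
```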
Step S104, determining the number of interpolated frames based on the frame similarity.
The number of interpolated frames can be understood as the number of frames to be inserted; it is an integer greater than or equal to zero. When the frame similarity between the forward and backward reference frames is high, a user watching the two reference frames played consecutively cannot perceive the switch between the two images: the video appears smooth and the viewing experience is good, so few or no frames need to be inserted between the two reference frames. When the frame similarity is low, the switch between the two images may be perceptible, the video appears to stutter, and the viewing experience is poor, so more frames need to be inserted between the two reference frames.
In a specific implementation, the range of the number of interpolated frames for a video frame sequence can be determined according to a preset video fluency; this range may be established from historical data or by experiment. The range typically includes an upper limit and a lower limit. The upper limit is the maximum number of interpolated frames: when this many frames are inserted between the two video frames with the lowest frame similarity in the video, playback of that segment at least reaches the preset fluency. The lower limit is the minimum number of interpolated frames: when this many frames are inserted between the two video frames with the highest frame similarity in the video, playback of that segment likewise at least reaches the preset fluency.
For example, assume the frame similarity takes values in [0, 1]: the larger the similarity of two video frames, the more alike they are, and a similarity of 1 means the two frames are identical. When the frame similarity of two adjacent frames in the sequence to be interpolated is 1, the lower limit of the number of interpolated frames can be set to 0, i.e., no frames need to be inserted between them. When the lowest frame similarity between two adjacent frames in the sequence is 0.1, the number of interpolated frames for that pair can be set to 12 according to historical experience or experiments, so that the preset video fluency is reached. For any other pair of adjacent frames, whose similarity lies in (0.1, 1), the corresponding number of interpolated frames can be obtained from a preset proportional relation, logarithmic relation, or the like; such a relation may itself be determined from historical data or experiments.
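A small sketch of such a mapping, under the assumption that the example values above apply (lowest observed similarity 0.1, up to 12 interpolated frames), might look as follows; the two modes correspond to the proportional and logarithmic relations just mentioned:

```python
import math

def frames_to_insert(similarity: float, n_max: int = 12, mode: str = "linear") -> int:
    """Map a frame similarity in [0, 1] to a number of frames to insert.

    similarity == 1.0 maps to 0 frames; similarity == 0.1 (the lowest
    observed value in the example above) maps to n_max frames.
    """
    s = min(max(similarity, 0.1), 1.0)
    if mode == "linear":                       # preset proportional relation
        frac = (1.0 - s) / 0.9
    else:                                      # preset logarithmic relation
        frac = math.log(1.0 / s) / math.log(1.0 / 0.1)
    return round(n_max * frac)
```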
Step S106, inserting video frames corresponding to the number of interpolated frames between the forward reference frame and the backward reference frame.
After the number of interpolated frames is determined, if it is zero, no interpolation is needed between the forward and backward reference frames. If it is not zero, a preset video frame interpolation manner (based on image interpolation, optical flow estimation, or deep learning) is first applied to the forward and backward reference frames to obtain a first video frame, which is then inserted between them; if the number of interpolated frames is one, interpolation between the current forward and backward reference frames is complete at this point.
If the number of interpolated frames is greater than 1, the insertion position of the next video frame must be determined, along with that frame's forward and backward reference frames. The insertion position may lie between the forward reference frame and the first video frame, or between the first video frame and the backward reference frame. In a specific implementation, the frame similarity between the forward reference frame and the first video frame, and that between the first video frame and the backward reference frame, can each be calculated and compared, and the insertion position set between the pair with the lower similarity. For example, if the similarity between the forward reference frame and the first video frame is 0.3 and that between the first video frame and the backward reference frame is 0.2, the insertion position is between the first video frame and the backward reference frame; the first video frame then serves as the forward reference frame of the next video frame, and the backward reference frame remains its backward reference frame. The preset interpolation manner is applied to the first video frame and the backward reference frame to obtain the next video frame, which is inserted between them.
While the number of video frames inserted between the initial forward and backward reference frames has not yet reached the determined number of interpolated frames, the steps of determining the insertion position of the next video frame and performing interpolation are repeated, until that number is reached.
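A minimal Python sketch of this insertion loop follows; interpolate(a, b) stands for the preset interpolation manner and similarity(a, b) for the frame-similarity function, both assumed rather than specified here:

```python
def insert_frames(forward, backward, n_frames, interpolate, similarity):
    """Insert n_frames between forward and backward, always splitting the
    adjacent pair with the lowest frame similarity, as described above.
    """
    seq = [forward, backward]
    sims = [similarity(forward, backward)]   # similarity of each adjacent pair
    for _ in range(n_frames):
        k = sims.index(min(sims))            # pair with the lowest similarity
        mid = interpolate(seq[k], seq[k + 1])
        seq.insert(k + 1, mid)
        # Only the two newly created adjacent pairs need fresh similarities.
        sims[k:k + 1] = [similarity(seq[k], mid), similarity(mid, seq[k + 2])]
    return seq[1:-1]                         # the inserted frames, in order
```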
Further, the steps of determining the current forward and backward reference frames may be continued until the determined backward reference frame is the last video frame in the sequence. After interpolation between the current forward and backward reference frames is finished, a new pair of reference frames can be determined from the video frame sequence to be processed: for example, the current backward reference frame becomes the new forward reference frame, and a video frame after it in the sequence becomes the new backward reference frame, after which the steps of determining frame similarity, determining the number of interpolated frames, and interpolating are performed again. When no video frame follows the current backward reference frame in the interpolated sequence, i.e., the current backward reference frame is the last frame in the sequence, the interpolation process ends.
The video frame interpolation method comprises the steps of firstly determining a current forward reference frame and a current backward reference frame based on a video frame sequence to be processed, determining the frame similarity between the forward reference frame and the backward reference frame, then determining the number of interpolated frames based on the frame similarity, and further inserting video frames corresponding to the number of interpolated frames between the forward reference frame and the backward reference frame. The method can reasonably determine the number of the inserted frames based on the frame similarity of the forward reference frame and the backward reference frame, improves the flexibility of frame insertion processing of the video, and improves the frame insertion effect, thereby improving the video watching experience of users.
The embodiment of the present invention further provides another video frame interpolation method implemented on the basis of the method of the foregoing embodiment. This method mainly describes the specific process of determining the frame similarity of the forward and backward reference frames, implemented by steps S202 to S206 below, and the specific process of determining the number of interpolated frames based on the frame similarity, implemented by steps S208 and S210 below; as shown in fig. 2, the method comprises the following steps:
step S200, based on the video frame sequence to be processed, determining the current forward reference frame and backward reference frame.
Step S202, extracting a first feature vector of the forward reference frame and a second feature vector of the backward reference frame through the pre-trained feature extraction network.
The feature extraction network may be built using a machine learning method or a convolutional neural network structure. In a specific implementation, the feature extraction network may consist of a convolution feature extraction module, a multi-scale feature extraction module, and a fully connected layer connected in sequence; a schematic structural diagram of the network is shown in fig. 3. The convolution feature extraction module can be implemented with one or more convolutional layers and one or more pooling layers, and performs convolution and average-pooling calculations on the input video frame to output an initial feature matrix. The multi-scale feature extraction module can be implemented with an Inception module layer or with several convolutional layers having different kernel sizes, and extracts multi-scale features from the initial feature matrix through multiple preset convolution kernels to obtain a multi-scale feature matrix. The fully connected layer performs comprehensive feature processing on the multi-scale feature matrix to obtain the feature vector of the input video frame; in particular, it converts the multi-scale feature matrix into a one-dimensional feature, i.e., the feature vector of the forward or backward reference frame described above.
In another implementation, the feature extraction network has 7 layers in total: 2 convolutional layers (conv), 2 average pooling layers (AvgPool), 2 Inception module layers, and 1 fully connected layer (FC layer); a workflow diagram of this network is shown in fig. 4. The first layer is a convolutional layer whose kernel size can be set to 7×7, with 64 feature maps and a stride of 1. The second layer is an average pooling layer with a stride of 2. The third layer is a convolutional layer whose kernel size can be set to 3×3, with 128 feature maps and a stride of 1. The fourth layer is an average pooling layer with a stride of 2. The fifth and sixth layers are Inception module layers, each with 128 feature maps. The seventh layer is a fully connected layer whose output feature dimension is 500×1.
The Inception module layers of the fifth and sixth layers are shown in fig. 5; "Previous layer" denotes the layer above: for the fifth-layer Inception module this is the pooling layer, and for the sixth-layer Inception module it is the fifth-layer Inception module. Each Inception module layer includes 4 convolutional layers with a 1×1 kernel, 1 convolutional layer with a 3×3 kernel, 1 convolutional layer with a 5×5 kernel, 1 max pooling layer (MaxPool) with a 3×3 kernel, and a filter concatenation operation.
Before feature vectors are extracted by the feature extraction network, the input image generally needs to be preprocessed. For example, the image input into the network should be a three-channel image of size 224 × 224 × 3; therefore, an image of any size must be resized to a fixed 224 × 224 × 3 before being input, so that feature vectors of the same length are obtained, which facilitates subsequent processing.
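The following PyTorch sketch is one way to realize the seven-layer network described above; the branch widths inside the Inception blocks, the padding choices, and the ReLU placements are illustrative assumptions, since the text specifies only kernel sizes, feature-map counts, strides, and the 500-dimensional output:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception-style block: parallel 1x1, 3x3, 5x5 and pooled branches
    whose outputs are concatenated. Branch widths (32/64/16/16 -> 128
    feature maps) are illustrative assumptions."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 48, 1), nn.ReLU(),
                                nn.Conv2d(48, 64, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 8, 1), nn.ReLU(),
                                nn.Conv2d(8, 16, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 16, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

class FeatureExtractor(nn.Module):
    """Seven-layer network of fig. 4: conv7x7(64) -> avg-pool ->
    conv3x3(128) -> avg-pool -> two Inception blocks -> FC(500)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
            nn.AvgPool2d(2, stride=2),                   # 224 -> 112
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(),
            nn.AvgPool2d(2, stride=2),                   # 112 -> 56
            InceptionBlock(128), nn.ReLU(),
            InceptionBlock(128), nn.ReLU(),
        )
        self.fc = nn.Linear(128 * 56 * 56, 500)          # 500x1 feature vector

    def forward(self, x):                                # x: (N, 3, 224, 224)
        return self.fc(torch.flatten(self.features(x), 1))
```

With these choices, a batch of 224 × 224 × 3 frames yields 500-dimensional feature vectors, matching the output dimension stated above.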
During feature extraction, if the forward reference frame is not the first video frame in the sequence, the second feature vector of the previous backward reference frame can be reused as the first feature vector of the current forward reference frame, and the feature extraction network only needs to extract the second feature vector of the current backward reference frame. With this scheme, feature vectors are extracted twice only for the interpolation between the first and second video frames; every subsequent interpolation operation computes the feature vector of just one video frame, which reduces computation and increases processing speed.
Step S204, calculating the feature similarity of the first feature vector and the second feature vector.
In a specific implementation, the feature similarity can be represented by parameters such as the cosine similarity or the Pearson correlation coefficient. When cosine similarity is used, let F1 be the first feature vector of the forward reference frame extracted by the feature extraction network and F2 the second feature vector of the backward reference frame; the feature similarity of F1 and F2 can then be calculated by the following formula:

Similarity = (Σ_{i=1}^{n} F1_i · F2_i) / (√(Σ_{i=1}^{n} F1_i²) · √(Σ_{i=1}^{n} F2_i²))   (1)

where · denotes a multiplication operation and n denotes the vector dimension. Similarity represents the magnitude of the cosine similarity; its calculated value normally lies in the range [0, 1].
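Formula (1) transcribes directly into NumPy; the function below assumes F1 and F2 are the feature vectors produced by the feature extraction network:

```python
import numpy as np

def feature_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity of two feature vectors, per formula (1)."""
    num = float(np.sum(f1 * f2))                                  # sum_i F1_i * F2_i
    den = float(np.sqrt(np.sum(f1 ** 2)) * np.sqrt(np.sum(f2 ** 2)))
    return num / den
```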
Step S206, determining the feature similarity as the frame similarity of the forward reference frame and the backward reference frame. Specifically, the value of Similarity may be used directly as the frame similarity of the two reference frames, whose value range is likewise [0, 1].
Step S208, determining the frame interpolation times according to a preset frame interpolation times range and the frame similarity.
Specifically, the frame interpolation times can be calculated by the following formula:

n = Round((B1 − B2) · Similarity + B2)   (2)

where n is the frame interpolation times, Round() denotes a rounding operation, B1 is the preset minimum frame interpolation times, B2 is the preset maximum frame interpolation times, and Similarity is the frame similarity.
The interval [B1, B2] is the frame interpolation times range mentioned above, with B2 greater than B1. This range can be set by the user according to the video scene and the complexity of the video to be interpolated, and its specific values can be determined from historical experience or experiments.
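A direct transcription of formula (2) follows; note that Python's built-in round() applies banker's rounding at exact .5 ties, so int(x + 0.5) can be substituted if conventional rounding is intended:

```python
def interpolation_passes(similarity: float, b1: int, b2: int) -> int:
    """Formula (2): n = Round((B1 - B2) * Similarity + B2).

    similarity = 1 (identical frames) yields the minimum B1;
    similarity = 0 yields the maximum B2.
    """
    return round((b1 - b2) * similarity + b2)
```

For example, with B1 = 1, B2 = 12, and Similarity = 0.75, the function returns Round(3.75) = 4.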
Step S210, determining the number of interpolated frames according to the frame interpolation times and a preset single-pass interpolation amount.
Specifically, the single-pass interpolation amount may be a fixed value. For example, if it is set to 1 and the frame interpolation times is 2, interpolation is performed twice between the current forward and backward reference frames, 1 video frame is inserted each time, and the number of interpolated frames is 2. The single-pass amount may instead depend on the pass index: while processing one pair of forward and backward reference frames in the sequence, the amount can be set equal to the pass number, e.g., 1 video frame inserted in the first pass, 2 in the second, and so on. As an example, when the frame interpolation times is determined to be 2, interpolation is performed twice between the current forward and backward reference frames, 1 frame is inserted the first time and 2 frames the second, and the total number of interpolated frames is 3.
For example, assume from historical experience that all motion in a ball-game video is displayed smoothly at a frame rate of 90 frames per second. If the currently obtained ball-game video has a frame rate of only 15 frames per second, the maximum number of interpolated frames can be set to 6; with a single-pass interpolation amount of 1, the corresponding maximum frame interpolation times is 6. The video also contains segments of intermission during which the picture hardly changes; for these, the minimum number of interpolated frames can be set to 0, i.e., no frames are inserted. The resulting range of the number of interpolated frames is [0, 6].
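The two single-pass policies above determine the total number of inserted frames as follows; this helper is purely illustrative:

```python
from typing import Optional

def total_inserted_frames(passes: int, fixed_amount: Optional[int] = None) -> int:
    """Total frames inserted for a given number of interpolation passes.

    With a fixed single-pass amount the total is passes * amount; when the
    amount equals the pass index (1 frame in pass 1, 2 in pass 2, ...) the
    total is the triangular number passes * (passes + 1) / 2.
    """
    if fixed_amount is not None:
        return passes * fixed_amount
    return passes * (passes + 1) // 2

# total_inserted_frames(2, fixed_amount=1) == 2; total_inserted_frames(2) == 3,
# matching the two examples above.
```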
In step S212, video frames corresponding to the number of the interpolated frames are inserted between the forward reference frame and the backward reference frame.
Step S214, judging whether the backward reference frame is the last video frame in the video frame sequence; if not, executing step S200; if so, ending.
If interpolation is performed sequentially between every two video frames in the sequence and the current backward reference frame is the last video frame, interpolation of the entire sequence can be considered complete and the process ends. If the current backward reference frame is not the last video frame, interpolation of the sequence is not yet finished, and steps S200 to S212 are repeated until it is.
In the above video frame interpolation method, a pre-trained feature extraction network extracts a first feature vector of the forward reference frame and a second feature vector of the backward reference frame; the feature similarity of the two vectors is calculated, the number of interpolated frames is derived from it, and interpolation is performed between the forward and backward reference frames to insert the corresponding number of video frames. By computing the feature similarity of the feature vectors and, from it, the number of frames to interpolate for the current reference frames, the method applies a reasonable amount of interpolation to reference frames with different rates of change, ensuring the interpolation effect while improving the flexibility of the processing.
The embodiment of the present invention further provides another video frame interpolation method implemented on the basis of the methods of the foregoing embodiments. This method mainly describes the specific process of inserting video frames corresponding to the number of interpolated frames between the forward and backward reference frames: when the number of interpolated frames is one, the process is implemented by steps S610 and S612 below; when it is greater than one, by steps S614 to S622 below; as shown in fig. 6, the method comprises the following steps:
step S600, determining a current forward reference frame and a current backward reference frame based on the video frame sequence to be processed.
Step S602, determining a frame similarity between the forward reference frame and the backward reference frame.
In step S604, the number of interpolated frames is determined based on the frame similarity.
Step S606, judging whether the number of interpolated frames is greater than zero; if yes, executing step S608; if not, executing step S626. Specifically, the number of interpolated frames is an integer greater than or equal to zero. If it is zero, no interpolation is needed between the current forward and backward reference frames, and interpolation can continue with reference frames after the current ones in the sequence to be processed; when the current backward reference frame is the last video frame of the sequence, interpolation of the whole sequence is complete.
Step S608, determining whether the number of the inserted frames is greater than one; if not, executing step S610; if so, go to step S614.
Specifically, if the number of interpolated frames is one, interpolation between the current forward and backward reference frames will be complete once that single frame is inserted; interpolation can then continue with reference frames after the current ones in the sequence to be processed, and when the current backward reference frame is the last video frame of the sequence, interpolation of the whole sequence is complete.
Step S610, inputting the forward reference frame and the backward reference frame into a preset prediction model, and outputting a first prediction frame.
The prediction model can be a video frame interpolation model obtained by training: an initial model is first established according to a deep learning principle or a neural network, multiple groups of training data are then input into it, a group comprising a forward reference frame and a backward reference frame, and the parameters of the initial model are trained iteratively until the video frame interpolation model is obtained. In this step, the forward and backward reference frames are input into the preset prediction model, which outputs a first predicted frame to serve as the frame inserted between the two reference frames.
Step S612, inserting the first predicted frame between the forward reference frame and the backward reference frame to obtain the interpolated video frame sequence, and executing step S626. Specifically, the first predicted frame is inserted between the two reference frames; during playback, the forward reference frame is played first, then the first predicted frame, and finally the backward reference frame. Since the similarity between the forward reference frame and the first predicted frame, and that between the first predicted frame and the backward reference frame, are both higher than the similarity between the two reference frames themselves, the user perceives the segment as more fluent than before interpolation. When the number of interpolated frames is one, interpolation between the current forward and backward reference frames is now complete.
Step S614, inputting the forward reference frame and the backward reference frame into a preset prediction model, and outputting a first prediction frame.
Step S616, the first predicted frame is inserted between the forward reference frame and the backward reference frame to obtain the video frame sequence after frame insertion.
Step S618, determining the frame interpolation position of the next predicted frame according to the frame similarity between every two adjacent video frames between the forward reference frame and the backward reference frame in the video frame sequence after the frame interpolation.
Specifically, after the first interpolation the sequence contains, in order, the forward reference frame, the first predicted frame, and the backward reference frame. When a second frame must be interpolated, the frame similarity between the forward reference frame and the first predicted frame, and that between the first predicted frame and the backward reference frame, can be calculated; the lower the similarity between two images, the more noticeable the stutter the user perceives when they are switched. The position between the two images with the lower frame similarity is therefore selected as the insertion position of the second predicted frame.
Similarly, when determining the insertion position of the nth predicted frame (n being an integer greater than 1), the frame similarity between every two adjacent video frames from the forward reference frame to the backward reference frame is first determined, and the pair with the lowest similarity is selected as the insertion position. In general, only the similarities between the (n−1)th predicted frame and its preceding and following video frames need to be newly calculated; the similarities of all other adjacent pairs were already obtained during previous interpolation passes.
Step S620, inputting a previous video frame and a next video frame corresponding to the position of the inserted frame into a prediction model, and outputting a current prediction frame; specifically, the previous video frame may be used as a forward reference frame of the current prediction frame, and the next video frame may be used as a backward reference frame of the current prediction frame; the implementation process of this step is similar to step S610 described above.
Step S622, inserting the current predicted frame into the above-mentioned frame insertion position; specifically, the frame interpolation position is the frame interpolation position determined in step S618.
Step S624, judging whether the number of video frames inserted between the forward reference frame and the backward reference frame equals the number of interpolated frames; if yes, executing step S626; if not, executing step S618. When the number of inserted video frames equals the number of interpolated frames, interpolation for the two current reference frames is complete; otherwise, steps S618 to S622 are repeated until it is.
Step S626, judging whether the backward reference frame is the last video frame in the video frame sequence; if not, executing step S600; if so, ending.
In the above video frame interpolation method, after the number of interpolated frames is determined from the feature similarity of the first and second feature vectors, interpolation is performed between the forward and backward reference frames. When the number of interpolated frames is one, the two reference frames are input into the preset prediction model, which outputs a first predicted frame that is inserted between them. When the number is greater than one, after the first predicted frame is obtained, the insertion position of the next predicted frame is determined from the frame similarity between every two adjacent video frames from the forward to the backward reference frame in the interpolated sequence, and interpolation continues until the full number of video frames has been inserted between the two reference frames. By determining insertion positions from frame similarity and always interpolating between the two frames with the lowest similarity, the method improves video fluency efficiently.
The embodiment of the present invention further provides another video frame interpolation method implemented on the basis of the methods of the foregoing embodiments. In the related art, the whole video is interpolated in a static manner, i.e., with a preset fixed multiple, so that the output frame rate is a fixed multiple of the input rate; this approach is inflexible and its effect unstable. The method provided by this embodiment interpolates dynamically: it adapts the interpolation to the complexity of the video, so that the video frame rate changes dynamically. The method comprises the following steps:
(1) Selecting two adjacent frame images S1 and S2 as reference frames, with S1 and S2 taken as the forward and backward reference frames, respectively; using a pre-trained feature extraction backbone network Net1 to extract features from the forward and backward reference frames, obtaining feature vectors F1 and F2; for the structure of the feature extraction backbone network, see fig. 4.
In a specific implementation, to avoid repeated calculation, the feature vector F2 of one pair of adjacent frames can be reused as the feature vector F1 of the next pair, so that, except for the first frame, only one image's feature vector needs to be computed for each similarity calculation during interpolation.
(2) Calculating the similarity of the feature vectors F1 and F2 to determine the similarity of the two adjacent frames; specifically, the similarity may be represented by the cosine similarity (see formula (1) for the calculation), by the Pearson correlation coefficient, or the like.
(3) Calculating the interpolation multiple of the frames to be interpolated according to the similarity of the video frames; see formula (2) for the specific calculation. B1 is the preset minimum number of interpolated frames and B2 the preset maximum; B1 and B2 may be chosen according to the video scene, e.g., in some scenarios B1 = 1 and B2 = 12.
(4) Performing n video frame interpolation predictions on the adjacent video frames using a trained deep-learning-based network NET2, thereby obtaining the final interpolated video sequence; NET2 can be built on a SepConv network, or on other deep learning or neural network models.
Specifically, if the calculated interpolation count n is 0, no interpolation is performed. If n = 1, M1 is obtained by interpolating from S1 and S2. If n = 2, M1 is first obtained from S1 and S2; the similarities between S1 and M1 and between M1 and S2 are then calculated as Similarity1 and Similarity2; if Similarity1 is greater than Similarity2, i.e., S1 and M1 are more alike than M1 and S2, M2 is obtained by interpolating from M1 and S2. Proceeding in the same way yields the result of all n interpolation passes.
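The whole adaptive flow of steps (1)–(4) can be sketched end to end as follows; net1 and net2 stand for the trained networks Net1 and NET2, the cosine similarity follows formula (1), the pass count follows formula (2), and the b1/b2 defaults copy the example settings above. Recomputing features inside the inner loop keeps the sketch short; a real implementation would cache them:

```python
import numpy as np

def adaptive_interpolation(frames, net1, net2, b1=1, b2=12):
    """End-to-end sketch: net1(frame) -> feature vector, net2(a, b) ->
    predicted intermediate frame; both callables are assumed."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    out = [frames[0]]
    f1 = net1(frames[0])                          # computed once, then reused
    for s2 in frames[1:]:
        f2 = net1(s2)
        n = round((b1 - b2) * cos(f1, f2) + b2)   # formula (2)
        seg = [out[-1], s2]                       # current pair, split n times
        for _ in range(n):
            sims = [cos(net1(a), net1(b)) for a, b in zip(seg, seg[1:])]
            k = sims.index(min(sims))             # least-similar adjacent pair
            seg.insert(k + 1, net2(seg[k], seg[k + 1]))
        out.extend(seg[1:])                       # inserted frames, then S2
        f1 = f2                                   # F2 of this pair becomes next F1
    return out
```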
In a specific implementation, NET2 usually needs to be trained before being used for interpolation prediction. Specifically, a current frame can be taken as the target frame and its two neighboring frames as the input data; the input is fed into the initial model of NET2, and the model parameters are learned through continuous iterative training until the final network NET2 is obtained, which can then predict an inserted frame from any two adjacent frames of an input video.
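A minimal training-loop sketch under those assumptions; the optimizer, the L1 reconstruction loss, and the hyper-parameters are illustrative choices not specified in the text:

```python
import torch
import torch.nn as nn

def train_net2(net2, triplets, epochs=10, lr=1e-4):
    """Train NET2: the middle frame of each (previous, target, next)
    triplet is the label and the two neighbours are the input."""
    opt = torch.optim.Adam(net2.parameters(), lr=lr)
    loss_fn = nn.L1Loss()                 # per-pixel reconstruction loss
    for _ in range(epochs):
        for prev_f, target, next_f in triplets:
            pred = net2(prev_f, next_f)   # predict the middle frame
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net2
```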
In the above video frame interpolation method, two adjacent frame images are first selected as reference frames, and the feature extraction backbone network is used to obtain their feature vectors; the similarity of the two feature vectors is then calculated, and the interpolation multiple is derived from the similarity of the adjacent frames; finally, a pre-trained deep learning network performs the interpolation according to this multiple, applying a low multiple (or none at all) to video segments of low complexity and a high multiple to segments of high complexity, thereby obtaining the final interpolated video sequence. The method overcomes the inability of static interpolation to maximize video fluency while also saving video transmission bandwidth.
Corresponding to the above video frame interpolation method embodiment, an embodiment of the present invention further provides a video frame interpolation apparatus, as shown in fig. 7, the apparatus includes:
a reference frame determination module 700 for determining a current forward reference frame and a current backward reference frame based on a sequence of video frames to be processed.
A frame similarity determining module 702, configured to determine frame similarity between a forward reference frame and a backward reference frame.
And an interpolated frame number determining module 704, configured to determine the number of interpolated frames based on the frame similarity.
And an inserting frame module 706, configured to insert video frames corresponding to the number of inserted frames between the forward reference frame and the backward reference frame.
The video frame interpolation device firstly determines the current forward reference frame and the current backward reference frame based on a video frame sequence to be processed, determines the frame similarity between the forward reference frame and the backward reference frame, then determines the frame interpolation quantity based on the frame similarity, and further inserts video frames corresponding to the frame interpolation quantity between the forward reference frame and the backward reference frame. The method can reasonably determine the number of the inserted frames based on the frame similarity of the forward reference frame and the backward reference frame, improves the flexibility of frame insertion processing of the video, and improves the frame insertion effect, thereby improving the video watching experience of users.
Further, the frame similarity determining module is further configured to: extract a first feature vector of the forward reference frame and a second feature vector of the backward reference frame through a pre-trained feature extraction network; calculate the feature similarity of the first and second feature vectors; and determine the feature similarity as the frame similarity of the forward and backward reference frames.
Further, the frame similarity determining module is further configured to: if the forward reference frame is the first video frame in the video frame sequence, extract the first feature vector of the forward reference frame and the second feature vector of the backward reference frame with the feature extraction network; and if the forward reference frame is a video frame other than the first, reuse the second feature vector of the previous backward reference frame as the first feature vector of the current forward reference frame, and extract only the second feature vector of the current backward reference frame with the feature extraction network.
The feature extraction network comprises a convolution feature extraction module, a multi-scale feature extraction module, and a fully connected layer connected in sequence; the convolution feature extraction module is configured to perform convolution and average-pooling calculations on the input video frame and output an initial feature matrix; the multi-scale feature extraction module is configured to extract multi-scale features from the initial feature matrix through multiple preset convolution kernels to obtain a multi-scale feature matrix; and the fully connected layer is configured to perform comprehensive feature processing on the multi-scale feature matrix to obtain the feature vector of the input video frame.
Further, the interpolated frame number determining module is further configured to: determine the frame interpolation times according to the frame similarity and the preset frame interpolation times range; and determine the number of interpolated frames according to the frame interpolation times and the preset single-pass interpolation amount.
Specifically, the interpolated frame number determining module is further configured to calculate the frame interpolation times according to the following formula:

n = Round((B1 − B2) · Similarity + B2)

where n is the frame interpolation times, Round() denotes a rounding operation, B1 is the preset minimum frame interpolation times, B2 is the preset maximum frame interpolation times, and Similarity is the frame similarity.
Further, the frame interpolation module is further configured to: if the number of interpolated frames is one, input the forward and backward reference frames into a preset prediction model and output a first predicted frame; and insert the first predicted frame between the forward and backward reference frames to obtain the interpolated video frame sequence.
Further, the frame interpolation module is further configured to: if the number of interpolated frames is greater than one, input the forward and backward reference frames into the preset prediction model, output a first predicted frame, and insert it between the two reference frames to obtain the interpolated video frame sequence; then execute the following steps in a loop until the number of video frames inserted between the forward and backward reference frames reaches the number of interpolated frames: determine the insertion position of the next predicted frame according to the frame similarity between every two adjacent video frames from the forward to the backward reference frame in the interpolated sequence; input the previous and next video frames corresponding to the insertion position into the prediction model and output a second predicted frame; and insert the second predicted frame at the insertion position.
The implementation principle and technical effects of the video frame interpolation apparatus provided in the embodiment of the present invention are the same as those of the foregoing method embodiments; for brevity, reference may be made to the corresponding content of the video frame interpolation method embodiments.
The embodiment of the present invention further provides a server, referring to fig. 8, the server includes a processor 130 and a memory 131, the memory 131 stores machine executable instructions capable of being executed by the processor 130, and the processor 130 executes the machine executable instructions to implement the video frame insertion method.
Further, the server shown in fig. 8 further includes a bus 132 and a communication interface 133, and the processor 130, the communication interface 133 and the memory 131 are connected through the bus 132.
The memory 131 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system network element and at least one other network element is implemented through at least one communication interface 133 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, or the like. The bus 132 may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in fig. 8, but this does not indicate only one bus or one type of bus.
The processor 130 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 130. The processor 130 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 131, and the processor 130 reads the information in the memory 131 and completes the steps of the methods of the foregoing embodiments in combination with its hardware.
The embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the video frame interpolation method.
The computer program product of the video frame interpolation method, apparatus, and server provided in the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments, and specific implementations may be found in the method embodiments, which are not repeated here.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.