Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a network architecture diagram of a video data processing method according to an embodiment of the present application. As shown in fig. 1, after acquiring the video to be processed, the second device 100 may determine an initial frame sequence of the video to be processed, and determine the brightness of each pixel point of each image frame in the initial frame sequence. Further, the second device 100 may perform frame extraction processing on the initial frame sequence based on the brightness of each pixel point of each image frame in the initial frame sequence, and use the frame sequence after frame extraction as the first frame sequence, that is, the second device 100 may reduce the frame rate of the video to be processed by performing frame extraction processing on the initial frame sequence.
After obtaining the first frame sequence, the second device 100 may encode the first frame sequence to obtain encoded data of the video to be processed, and then send the encoded data to the first device through the network connection. The encoded data sent by the second device 100 carries the reference frame number of each reference frame, where each reference frame is an image frame adjacent to the target frame sequence extracted in the frame extraction process.
Correspondingly, after acquiring the encoded data sent by the second device 100, the first device 200 may decode the encoded data to obtain the first frame sequence, and determine, based on the reference frame sequence numbers carried by the encoded data, each group of associated reference frames in the reference frames corresponding to the first frame sequence. Each group of associated reference frames comprises two reference frames adjacent to the target frame sequence extracted in the frame extraction process.
For each group of associated reference frames, the first device 200 may determine, based on a first reference frame and a second reference frame in the group of associated reference frames, a first predicted frame corresponding to the first reference frame and a second predicted frame corresponding to the second reference frame, as well as an occlusion weight and a reconstruction residual in the frame prediction process, and determine, based on the first predicted frame, the second predicted frame, the occlusion weight, and the reconstruction residual, a target predicted frame corresponding to the group of associated reference frames;
further, the first device 200 may perform frame interpolation on the target predicted frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtain the played video based on the second frame sequence.
The second device 100 may be a video capture device, such as a camera device, a video generation device, or a video processing device, and may be determined based on the actual application scene requirement, which is not limited herein.
The first device 200 may be a video forwarding device, a video playing device, or the like, and may be determined based on the actual application scene requirement, which is not limited herein.
Referring to fig. 2, fig. 2 is a schematic flow chart of a video data processing method according to an embodiment of the present application. When the video data processing method provided by the embodiment of the present application is applied to the second device, the method may specifically include the following steps:
step S21, determining an initial frame sequence of the video to be processed, and determining the brightness of each pixel point of each image frame in the initial frame sequence.
In some feasible embodiments, the video to be processed may be a video acquired by the second device in real time, may also be a video generated by the second device based on video processing software, may also be a video acquired by the second device from a network, a local storage space, a cloud storage space, or the like, and may specifically be determined based on requirements of an actual application scene, which is not limited herein.
In some possible embodiments, for the video to be processed, an initial frame sequence of the video to be processed may be determined, and the brightness of each pixel point of each image frame in the initial frame sequence may be determined. Specifically, for any image frame in the initial frame sequence, the pixel value of each color channel of each pixel point of the image frame may be determined, and for each pixel point of the image frame, the brightness of the pixel point may be determined based on the pixel value of each color channel of the pixel point.
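As an illustration of how the brightness of a pixel point may be derived from the pixel values of its color channels, the following sketch uses the Rec.601 luma weighting; the specific weighting is an assumption, since the embodiment does not prescribe a particular formula.

```python
import numpy as np

def pixel_brightness(frame_rgb: np.ndarray) -> np.ndarray:
    """Per-pixel brightness of an H x W x 3 RGB image frame.

    Uses the Rec.601 luma weights as one possible combination of the
    color-channel pixel values; any comparable luminance formula could
    be substituted.
    """
    r = frame_rgb[..., 0].astype(np.float32)
    g = frame_rgb[..., 1].astype(np.float32)
    b = frame_rgb[..., 2].astype(np.float32)
    return 0.299 * r + 0.587 * g + 0.114 * b
```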
Step S22, performing frame extraction processing on the initial frame sequence based on the brightness of each pixel point of each image frame in the initial frame sequence, and taking the frame sequence after frame extraction as the first frame sequence.
In some possible embodiments, based on the brightness of each pixel point of each image frame in the initial frame sequence, the initial frame sequence is subjected to frame extraction, and the frame sequence after frame extraction is taken as the first frame sequence. By performing frame extraction processing on the initial frame sequence, the initial frame sequence with a high frame rate of the video to be processed can be converted into the first frame sequence with a low frame rate.
Specifically, for any image frame in the initial frame sequence, the luminance difference between each pixel point of the image frame and the corresponding pixel point of the previous image frame may be determined, and further, the total luminance difference between the image frame and the previous image frame may be determined based on the luminance difference corresponding to each pixel point of the image frame.
For any image frame in the initial frame sequence, the brightness difference between a pixel point of the image frame and the corresponding pixel point of the previous image frame is the absolute value of the difference between the brightness values of the two pixel points.
For example, for any image frame in the initial frame sequence, the total brightness difference Δ_light between the image frame and the previous image frame is:

Δ_light = Σ_{i,j} |I_{i,j}(t) − I_{i,j}(t−1)|

where i and j represent the position of the pixel point, I_{i,j}(t) represents the luminance of the pixel point (i, j) of the t-th image frame, I_{i,j}(t−1) represents the luminance of the pixel point (i, j) of the (t−1)-th image frame, and |I_{i,j}(t) − I_{i,j}(t−1)| represents the absolute value of the luminance difference between the t-th image frame and the (t−1)-th image frame at the pixel point (i, j).
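A minimal sketch of the Δ_light computation above, assuming the per-pixel luminance maps of two consecutive frames are available as arrays of equal size:

```python
import numpy as np

def total_brightness_difference(lum_t: np.ndarray, lum_t_minus_1: np.ndarray) -> float:
    """Delta_light: sum over all pixel positions (i, j) of the absolute
    luminance difference between frame t and frame t-1."""
    return float(np.abs(lum_t - lum_t_minus_1).sum())
```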
Further, if the total brightness difference between any image frame and the previous image frame in the initial frame sequence of the video to be processed is greater than a first threshold, it indicates that the scene changes significantly from the previous image frame; therefore, an image frame whose total brightness difference is greater than the first threshold may be determined as an active frame. If the total brightness difference between any image frame and the previous image frame is less than or equal to the first threshold, it indicates that the scene changes little from the previous image frame; therefore, an image frame whose total brightness difference is less than or equal to the first threshold may be determined as a still frame.
As an example, for any image frame of an initial sequence of frames, the image frame may be marked to distinguish the image frame as either an active frame or a still frame.
The marking can be performed as follows:

K(t) = 1, if Δ_light > T1; K(t) = 0, otherwise

where K(t) represents the mark of the t-th image frame. If the total brightness difference Δ_light between the t-th image frame and the previous image frame is greater than the first threshold T1, the t-th image frame is marked as 1 to indicate that it is an active frame; if Δ_light is not greater than the first threshold T1, the t-th image frame is marked as 0 to indicate that it is a still frame.
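The marking rule can be sketched as follows; treating the first frame of the sequence (which has no predecessor) as an active frame is an assumption made only for illustration.

```python
import numpy as np

def mark_frames(luminance_frames, t1: float):
    """K(t) for each frame after the first: 1 (active) if Delta_light > T1,
    otherwise 0 (still). The first frame has no previous frame and is marked
    as active here purely by convention."""
    marks = [1]
    for prev, cur in zip(luminance_frames, luminance_frames[1:]):
        delta_light = float(np.abs(cur - prev).sum())
        marks.append(1 if delta_light > t1 else 0)
    return marks
```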
Further, after determining the active frames and the static frames in the initial frame sequence, the initial frame sequence may be decimated based on the active frames and the static frames in the initial frame sequence, so as to obtain the first frame sequence. Specifically, a continuous active frame sequence and a continuous static frame sequence in the initial frame sequence can be determined, and the continuous active frame sequence and the continuous static frame sequence in the initial frame sequence are subjected to frame extraction processing.
As an example, any single image frame and/or any number of consecutive image frames in the continuous active frame sequence, other than the first image frame and the last image frame, may be used as the target frame sequence, and the target frame sequence may be extracted from the continuous active frame sequence. Similarly, any single image frame and/or any number of consecutive image frames in the continuous still frame sequence, other than the first image frame and the last image frame, may be used as the target frame sequence, and the target frame sequence may be extracted from the continuous still frame sequence. After the target frame sequences are extracted from the continuous active frame sequences and the continuous still frame sequences in the above manner, the initial frame sequence from which the target frame sequences have been extracted may be used as the first frame sequence.
In some possible embodiments, in order to concentrate the frame extraction on video segments within the same scene, an initial frame rate corresponding to the video to be processed may be determined, the initial frame sequence of the video to be processed may be divided into at least one subframe sequence based on the initial frame rate, the continuous active frame sequences and continuous still frame sequences in each subframe sequence may then be determined, and target frame sequences may be extracted from each continuous active frame sequence and each continuous still frame sequence.
The target frame sequence extracted from each continuous active frame sequence or continuous still frame sequence in each subframe sequence is likewise any single image frame or any number of consecutive image frames, other than the first frame and the last frame, of the corresponding continuous active frame sequence or continuous still frame sequence.
For example, the duration of the video to be processed is 10s, and the initial frame rate is 24 Hz. The initial frame sequence may be divided into 10 subframe sequences of duration 1s, each subframe sequence comprising 24 frame images, and each subframe sequence may be decimated.
In some possible embodiments, for each continuous active frame sequence in each subframe sequence, if the number of active frames in the continuous active frame sequence is greater than a second threshold, the continuous active frame sequence is subjected to frame extraction processing; that is, any single image frame and/or any number of consecutive image frames in the continuous active frame sequence, other than the first image frame and the last image frame, are used as the target frame sequence, and the target frame sequence is extracted from the continuous active frame sequence. For each continuous still frame sequence in each subframe sequence, if the number of still frames in the continuous still frame sequence is greater than a third threshold, the continuous still frame sequence is subjected to frame extraction processing; that is, any single image frame and/or any number of consecutive image frames in the continuous still frame sequence, other than the first image frame and the last image frame, are used as the target frame sequence, and the target frame sequence is extracted from the continuous still frame sequence. For any subframe sequence, if the number of active frames in each continuous active frame sequence of the subframe sequence is less than or equal to the second threshold and the number of still frames in each continuous still frame sequence is less than or equal to the third threshold, no frame extraction is performed on the subframe sequence. The specific frame extraction method can be as follows:
P = I_{2,3,…,last−1}(k(t)=1), if N(k(t)=1) > T2
P = I_{2,3,…,last−1}(k(t)=0), if N(k(t)=0) > T3

where P denotes the extracted target frame sequence, N(k(t)=1) denotes the number of active frames in a continuous active frame sequence, N(k(t)=0) denotes the number of still frames in a continuous still frame sequence, T2 and T3 denote the second threshold and the third threshold, respectively, I_{2,3,…,last−1}(k(t)=1) denotes any intermediate frame in a continuous active frame sequence, and I_{2,3,…,last−1}(k(t)=0) denotes any intermediate frame in a continuous still frame sequence.
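A sketch of this extraction rule, under the illustrative assumption that all intermediate frames of a qualifying run are dropped (the embodiment also allows dropping only a single frame or a shorter consecutive stretch of a run):

```python
from itertools import groupby

def extract_target_frames(marks, t2: int, t3: int):
    """Indices of frames to drop from one subframe sequence.

    For every run of consecutive frames with the same mark, if the run is
    longer than T2 (active runs, mark 1) or T3 (still runs, mark 0), the
    intermediate frames of the run (everything except its first and last
    frame) are chosen as the target frame sequence.
    """
    dropped = []
    idx = 0
    for mark, run in groupby(marks):
        run_len = len(list(run))
        threshold = t2 if mark == 1 else t3
        if run_len > threshold:
            dropped.extend(range(idx + 1, idx + run_len - 1))
        idx += run_len
    return dropped
```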
For example, a certain subframe sequence is shown in table 1:
TABLE 1

Marking                   | 1  | 0  | 1    | 0  | 0  | 0  | 1  | 1  | 0     | 1     | 0  | 0
Frame number              | 1  | 2  | 3-20 | 21 | 22 | 23 | 24 | 25 | 26-35 | 36-65 | 66 | 67
Whether to extract frames | No | No | Yes  | No | No | No | No | No | Yes   | Yes   | No | No
In table 1, frame numbers 3-20 correspond to consecutive active frames, frame numbers 26-35 correspond to consecutive still frames, and frame numbers 36-65 correspond to consecutive active frames. If the second threshold and the third threshold are 4, any one frame image frame or continuous multi-frame image frames except the first frame and the last frame in the continuous active frames corresponding to the frame numbers 3-20, the continuous static frames corresponding to the frame numbers 26-35 and the continuous active frames corresponding to the frame numbers 36-65 can be extracted, so that the subframe sequence after frame extraction is determined as the first frame sequence.
And step S23, coding the first frame sequence to obtain coded data of the video to be processed, wherein the coded data carries the reference frame serial number of each reference frame.
In some possible embodiments, after the first frame sequence is obtained by performing the frame decimation processing on the initial frame sequence, the first frame sequence may be encoded to obtain encoded data of the video to be processed.
Specifically, the encoding modes adopted for encoding the first frame sequence include, but are not limited to, H.264, H.265, AVS2, AV1, and the like, and may be determined based on actual application scenario requirements, which is not limited herein.
The coded data of the video to be processed carries the reference frame serial number of each reference frame, and each reference frame is an image frame adjacent to the target frame sequence extracted in the frame extraction process. That is, after the target frame sequence is extracted from the initial frame sequence to obtain the first frame sequence, two image frames adjacent to the extracted target frame sequence in the first frame sequence may be determined as reference frames, and the frame number of each reference frame may be determined.
As shown in table 1, in the case where the target frame sequence extracted from the active frame sequence of frame number 3 to frame number 20 consists of the image frames of frame number 4 to frame number 5, the image frames adjacent to the target frame sequence are the image frame of frame number 3 and the image frame of frame number 6, which are determined as two reference frames.
When the first frame sequence is encoded, the reference frame sequence number of each reference frame can be encoded together with the first frame sequence, so that the encoded data of the video to be processed carries each reference frame sequence number. Or after the first frame sequence is encoded to obtain the encoded data of the video to be processed, the reference frame sequence numbers and the encoded data are further processed, so that the encoded data of the video to be processed carries the reference frame sequence numbers.
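The embodiment does not prescribe how the reference frame sequence numbers are attached to the encoded data; the following sketch simply prepends them as a length-prefixed JSON header in front of the codec bitstream (H.264/H.265/AVS2/AV1, etc.), purely as an illustration of one possible packaging.

```python
import json
import struct

def package_encoded_video(bitstream: bytes, reference_frame_numbers: list[int]) -> bytes:
    """Illustrative container only: prepend the reference frame numbers as a
    length-prefixed JSON header so the encoded data carries them."""
    header = json.dumps({"reference_frames": reference_frame_numbers}).encode("utf-8")
    return struct.pack(">I", len(header)) + header + bitstream

def unpack_encoded_video(payload: bytes) -> tuple[bytes, list[int]]:
    """Recover the codec bitstream and the reference frame numbers."""
    (header_len,) = struct.unpack(">I", payload[:4])
    header = json.loads(payload[4:4 + header_len])
    return payload[4 + header_len:], header["reference_frames"]
```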
Step S24, sending the encoded data to the first device, so that the first device determines a second frame sequence based on the encoded data and the reference frames corresponding to the reference frame numbers, and obtains the played video based on the second frame sequence.
In some possible embodiments, after obtaining the encoded data carrying the reference frame sequence numbers of the reference frames, the encoded data may be sent to the first device 200, so that the first device 200 may determine the second frame sequence based on the encoded data and the reference frames corresponding to the reference frame sequence numbers, and obtain the played video based on the second frame sequence.
Specifically, the manner in which the second device 100 sends the encoded data to the first device 200 includes, but is not limited to, Content Delivery Network (CDN) transmission technology, Peer-to-Peer (P2P) network transmission technology, and PCDN transmission technology combining CDN and P2P.
The second device 100 transmits the encoded data to the first device 200, and also transmits the frame numbers of the reference frames to the first device 200.
In the embodiment of the application, by performing frame extraction processing on the initial frame sequence of the video to be processed, the initial frame sequence with a high frame rate can be converted into the first frame sequence with a low frame rate, so that the size of the video data can be greatly reduced by encoding the first frame sequence, and the data traffic consumed by data transmission is correspondingly reduced, thereby achieving the effect of saving bandwidth cost.
In some possible embodiments, after the second device 100 sends the encoded data carrying the reference frame numbers to the first device 200, a specific processing manner of the encoded data by the first device 200 may be as shown in fig. 3, where fig. 3 is another schematic flow diagram of the video data processing method provided by the embodiment of the present application. When the video data processing method provided in the embodiment of the present application is applied to the first device 200, the method may specifically include the following steps:
step S31, obtaining the encoded data sent by the second device 100, and decoding the encoded data to obtain a first frame sequence.
In some possible embodiments, after acquiring the encoded data sent by the second device 100, the first device 200 may decode the encoded data based on the encoding technique adopted by the second device 100 to obtain the first frame sequence. The first frame sequence is a frame sequence obtained by performing frame extraction on an initial frame sequence of the video to be processed by the second device 100, that is, a residual frame sequence after a target frame sequence is extracted from the initial frame sequence.
Step S32, determining each group of associated reference frames in the reference frames corresponding to the first frame sequence based on each reference frame sequence number carried by the encoded data.
In some possible embodiments, the sequence number of each image frame in the first frame sequence obtained by the first device 200 after decoding the encoded data is the frame sequence number in the initial frame sequence corresponding to the video to be processed. Based on this, the first device 200 may determine the reference frames in the first frame sequence based on the reference frame sequence numbers after acquiring the reference frame sequence numbers.
Further, the first device 200 may determine groups of associated reference frames from the reference frames in the first frame sequence, where each group of associated reference frames includes two reference frames adjacent to the target frame sequence extracted by the second device 100 during the frame extraction process performed on the initial frame sequence. The first device 200 may further determine a target prediction frame corresponding to each group of associated reference frames, and perform frame interpolation on the target prediction frame to obtain a second frame sequence. The specific implementation manner of determining the target prediction frame corresponding to each group of associated reference frames and performing interpolation processing on the target prediction frame to obtain the second frame sequence by the first device 200 is described below, and will not be described herein.
Step S33, for each group of associated reference frames, determining, based on a first reference frame and a second reference frame in the group of associated reference frames, a first predicted frame corresponding to the first reference frame and a second predicted frame corresponding to the second reference frame, as well as an occlusion weight and a reconstruction residual in the frame prediction process, and determining, based on the first predicted frame, the second predicted frame, the occlusion weight, and the reconstruction residual, a target predicted frame corresponding to the group of associated reference frames.
In some possible embodiments, for each set of associated reference frames, a first predicted frame corresponding to the first reference frame and a second predicted frame corresponding to the second reference frame may be determined based on the first reference frame and the second reference frame in the set of associated reference frames.
For each group of associated reference frames, the first reference frame in the group of associated reference frames is a reference frame with a smaller reference frame number, and the second reference frame is a reference frame with a larger reference frame number. The first predicted frame and the second predicted frame are both image frames between the first reference frame and the second reference frame.
Specifically, for each set of associated reference frames, a first optical flow field corresponding to the first reference frame and a second optical flow field corresponding to the second reference frame may be determined based on the first reference frame and the second reference frame in the set of associated reference frames.
The optical flow field is a two-dimensional instantaneous velocity field formed by all pixel points in an image; it uses the change of pixels in the time domain and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame.
For each group of associated reference frames, when the optical flow fields corresponding to the first reference frame and the second reference frame are determined, feature extraction can be performed on the first reference frame to obtain a first initial feature, feature extraction can be performed on the second reference frame to obtain a second initial feature, and then the associated features corresponding to the first reference frame and the second reference frame are obtained based on the first initial feature and the second initial feature.
When feature extraction is performed on the first reference frame and the second reference frame, feature extraction can be performed on each reference frame based on a neural network and the like to obtain corresponding initial features. After the first initial feature and the second initial feature are obtained, the associated features of the first initial feature and the second initial feature may be obtained based on feature splicing, feature fusion, or further processing based on other neural network models, and a specific implementation manner may be determined based on requirements of an actual application scenario, which is not limited herein.
For the first reference frame, a first context feature of the first reference frame may be determined, and based on the first context feature and the association feature, a first optical flow field corresponding to the first reference frame may be determined. For the second reference frame, a second context feature of the second reference frame may be determined, and based on the second context feature and the above-mentioned associated feature, a second optical flow field corresponding to the second reference frame may be determined.
The context features of each reference frame may be determined based on a context feature extraction network. Referring to fig. 4, fig. 4 is a schematic diagram of a scenario for determining context features according to an embodiment of the present application. The context feature extraction network shown in fig. 4 includes a plurality of convolutional layer and activation function combinations connected in series. For either of the first reference frame and the second reference frame, the reference frame may be convolved by the first convolution layer to obtain a first convolution feature, and the first convolution feature is processed by the first activation function to obtain a first feature map of the reference frame. The second convolution layer then performs convolution processing on the feature map obtained by the previous activation function to obtain a second convolution feature, and the second convolution feature is processed by the second activation function to obtain a second feature map of the reference frame. By analogy, the feature maps obtained by the activation functions in fig. 4 can together be determined as the context features of the reference frame.
The number of convolutional layers and activation functions in the feature extraction network shown in fig. 4 may be specifically determined based on the requirements of the actual application scenario, and is not limited herein.
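A sketch of such a context feature extraction network, written with PyTorch; the channel widths, strides, and number of convolution layer/activation pairs are illustrative assumptions rather than values fixed by the embodiment.

```python
import torch
import torch.nn as nn

class ContextFeatureExtractor(nn.Module):
    """Chain of convolution layer + activation function pairs; the feature map
    produced by every activation is kept as one level of the context feature."""

    def __init__(self, in_channels: int = 3, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for width in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, width, kernel_size=3, stride=2, padding=1),
                nn.PReLU(width),
            ))
            prev = width

    def forward(self, reference_frame: torch.Tensor) -> list[torch.Tensor]:
        features = []
        x = reference_frame
        for stage in self.stages:
            x = stage(x)          # convolution followed by activation
            features.append(x)    # each activation output is one context feature map
        return features
```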
When determining the Optical Flow fields corresponding to the first reference frame and the second reference frame of each group of associated reference frames, the determination can be performed based on an Optical Flow Field estimation model (RAFT).
As an example, referring to fig. 5, fig. 5 is a schematic structural diagram of the optical flow field estimation model provided by the embodiment of the present application. As shown in fig. 5, for any group of associated reference frames, the first reference frame I_a and the second reference frame I_d in the group of associated reference frames are input into a feature encoding module, feature extraction is performed on the first reference frame I_a and the second reference frame I_d respectively based on the feature encoding module to obtain a first initial feature and a second initial feature, and feature association is then performed based on the first initial feature and the second initial feature to obtain the associated feature. Here, I represents a reference frame, and a and d are respectively the position information (such as frame number, time-domain position, etc.) of the two reference frames.
For the first reference frame I_a, the first context feature C_a of the first reference frame I_a may be determined based on a context feature extraction network, where C_a consists of the feature maps obtained by the convolutional layers and activation functions of that network. The first reference frame I_a, the first context feature C_a, and the associated feature are input into a recurrent neural network to obtain the first optical flow field F_{b→a} of the first reference frame I_a, where b is the position information of the target predicted frame corresponding to the group of associated reference frames, and a < b < d.
Similarly, for the second reference frame I_d, the second context feature C_d of the second reference frame I_d may be determined by the context feature extraction network, where C_d consists of the feature maps obtained by the convolutional layers and activation functions of that network. The second reference frame I_d, the second context feature C_d, and the associated feature are input into the recurrent neural network to obtain the second optical flow field F_{b→d} of the second reference frame I_d, where b is the position information of the target predicted frame corresponding to the group of associated reference frames, and a < b < d.
Further, the first reference frame is backward mapped based on the first optical flow field to obtain the first predicted frame corresponding to the first reference frame, and the second reference frame is backward mapped based on the second optical flow field to obtain the second predicted frame corresponding to the second reference frame. That is, the first reference frame I_a is backward mapped based on the first optical flow field F_{b→a} to obtain the first predicted frame Î_{b→a}, and the second reference frame I_d is backward mapped based on the second optical flow field F_{b→d} to obtain the second predicted frame Î_{b→d}.
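A common way to realize the backward mapping is to sample the reference frame at positions displaced by the optical flow, for example with grid_sample; the interpolation and padding modes below are assumptions.

```python
import torch
import torch.nn.functional as F

def backward_warp(reference_frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward mapping: sample the reference frame I_a at positions displaced
    by the optical flow F_{b->a} to form the predicted frame at position b.
    reference_frame: (N, C, H, W); flow: (N, 2, H, W) in pixel units."""
    n, _, h, w = reference_frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    # Absolute sampling positions = base grid + flow, normalized to [-1, 1].
    x_pos = (xs.unsqueeze(0) + flow[:, 0]) / (w - 1) * 2 - 1
    y_pos = (ys.unsqueeze(0) + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((x_pos, y_pos), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(reference_frame, grid,
                         mode="bilinear", padding_mode="border",
                         align_corners=True)
```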
In some possible embodiments, for each set of associated reference frames, occlusion weights and reconstruction residuals in the frame prediction process are determined based on the first reference frame and the second reference frame in the set of associated reference frames. The reconstruction residual error is used for reducing the gradient descent problem in the frame prediction process, and the shielding weight is used for reducing the influence caused by the shaking of a moving object, the edge blurring and the like in the frame prediction process.
Specifically, a third contextual characteristic of the first reference frame and a fourth contextual characteristic of the second reference frame may be determined first. The third contextual feature of the first reference frame and the fourth contextual feature of the second reference frame may be determined based on the manner shown in fig. 4, or may be determined based on other contextual feature extraction networks, and may specifically be determined based on requirements of an actual application scenario, which is not limited herein.
Further, the occlusion weight and the reconstruction residual in the frame prediction process may be determined based on the first optical flow field, the second optical flow field, the first predicted frame, the second predicted frame, the third context feature, and the fourth context feature. For example, the first optical flow field, the second optical flow field, the first predicted frame, the second predicted frame, the third context feature, and the fourth context feature may be input into a deep neural network to obtain the occlusion weight and the reconstruction residual in the frame prediction process. The deep neural network includes, but is not limited to, FusionNet and U-Net, and may be determined based on the requirements of the actual application scenario, which is not limited herein.
In some possible embodiments, when determining the occlusion weight and the reconstructed residual in the frame prediction process based on the first optical flow field, the second optical flow field, the first predicted frame, the second predicted frame, the third context feature, and the fourth context feature, the residual feature may be determined based on the first optical flow field, the second optical flow field, the first predicted frame, and the second predicted frame.
As an example, referring to fig. 6, fig. 6 is a schematic view of a scenario for determining residual features according to an embodiment of the present application. As shown in fig. 6, the first optical flow field, the second optical flow field, the first prediction frame, and the second prediction frame are input to the convolution layer, processed by the convolutional neural network and the activation function, and then input to the residual block to obtain the residual features.
Further, a fusion feature may be determined based on the third context feature, the fourth context feature, and the residual feature. Referring to fig. 7, fig. 7 is a schematic view of a scenario for determining a fusion feature according to an embodiment of the present application. As shown in fig. 7, the residual feature is spliced with the first feature maps of the third context feature and the fourth context feature to obtain a first splicing feature, and the first splicing feature is input into a convolution layer for downsampling convolution processing to obtain a first convolution feature. The first convolution feature is spliced with the second feature maps of the third context feature and the fourth context feature to obtain a second splicing feature, and the second splicing feature is input into a convolution layer for downsampling convolution processing to obtain a second convolution feature. The second convolution feature is spliced with the third feature maps of the third context feature and the fourth context feature to obtain a third splicing feature, and the third splicing feature is input into a convolution layer for downsampling convolution processing to obtain a third convolution feature. The third convolution feature is spliced with the fourth feature maps of the third context feature and the fourth context feature to obtain a fourth splicing feature, and the fourth splicing feature is input into a convolution layer for downsampling convolution processing to obtain a fourth convolution feature. The fourth convolution feature is spliced with the fifth feature maps of the third context feature and the fourth context feature to obtain a fifth splicing feature, and the fifth splicing feature is input into a convolution layer for upsampling convolution processing to obtain a fifth convolution feature.
Further, the fifth convolution feature is spliced with the third convolution feature to obtain a sixth splicing feature, and the sixth splicing feature is input into a convolution layer for upsampling processing to obtain a sixth convolution feature. The sixth convolution feature is spliced with the second convolution feature to obtain a seventh splicing feature, and the seventh splicing feature is input into a convolution layer for upsampling processing to obtain a seventh convolution feature. The seventh convolution feature is spliced with the first convolution feature to obtain an eighth splicing feature, and the eighth splicing feature is input into a convolution layer for upsampling processing to obtain an eighth convolution feature. The eighth convolution feature is spliced with the residual feature to obtain the fusion feature.
In the method of determining the fusion feature shown in fig. 7, the number of convolution layers for performing the upsampling process is the same as that of convolution layers for performing the downsampling process, and the specific number is the same as that of the feature map in the third context feature or the fourth context feature.
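A rough PyTorch sketch of this splice-and-downsample / splice-and-upsample structure; the channel counts, the number of levels, and the resizing of context feature maps to each stage's working resolution are assumptions made only so the sketch runs end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionFeatureNet(nn.Module):
    """Residual feature is concatenated level by level with the context feature
    maps of the two reference frames while downsampling, brought back up with
    skip connections, and finally concatenated with the residual feature."""

    def __init__(self, residual_ch: int = 32, ctx_chs=(16, 32, 64)):
        super().__init__()
        self.downs = nn.ModuleList()
        chans, prev = [], residual_ch
        for ch in ctx_chs:
            out = 2 * ch
            self.downs.append(nn.Sequential(
                nn.Conv2d(prev + 2 * ch, out, 3, stride=2, padding=1),
                nn.PReLU(out)))
            chans.append(out)
            prev = out
        self.ups = nn.ModuleList()
        for skip_ch in reversed(chans[:-1]):
            self.ups.append(nn.Sequential(
                nn.Conv2d(prev + skip_ch, skip_ch, 3, padding=1),
                nn.PReLU(skip_ch)))
            prev = skip_ch

    def forward(self, residual_feat, ctx_a, ctx_d):
        x, skips = residual_feat, []
        for down, fa, fd in zip(self.downs, ctx_a, ctx_d):
            fa = F.interpolate(fa, size=x.shape[-2:], mode="bilinear", align_corners=False)
            fd = F.interpolate(fd, size=x.shape[-2:], mode="bilinear", align_corners=False)
            x = down(torch.cat([x, fa, fd], dim=1))   # splice, then downsample
            skips.append(x)
        for up, skip in zip(self.ups, reversed(skips[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = up(torch.cat([x, skip], dim=1))        # splice with skip connection
        x = F.interpolate(x, size=residual_feat.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat([x, residual_feat], dim=1)    # final splice with the residual feature
```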
In some feasible embodiments, when determining the fusion feature based on the third context feature, the fourth context feature and the residual feature, each feature map in the third context feature may be backward mapped based on the first optical flow field to obtain a fifth context feature, and each feature map in the fourth context feature may be backward mapped based on the second optical flow field to obtain a sixth context feature. Furthermore, the fusion feature is determined based on the fifth context feature, the sixth context feature and the residual feature, and the specific determination manner is the same as the implementation manner shown in fig. 7, and will not be described here.
As an example, referring to fig. 8, fig. 8 is a schematic diagram of another scenario for determining a contextual feature provided in an embodiment of the present application. As shown in fig. 8, after each feature map of the first reference frame is determined based on the combinations of convolution layers and activation functions to obtain the third context feature, each feature map in the third context feature may be backward mapped based on the first optical flow field to obtain a mapping feature map corresponding to each feature map, and the mapping feature maps may then be determined as the fifth context feature.
Optionally, since the sizes of the feature maps in the third context feature of the first reference frame and the fourth context feature of the second reference frame are different, for each feature map in the third context feature, the optical flow field weight corresponding to the feature map may be determined, so as to determine, based on the optical flow field weight and the first optical flow field, a new optical flow field corresponding to when the feature map is subjected to backward mapping. And then, for each feature map in the third context feature, the feature map can be mapped backwards based on the new optical flow field corresponding to the feature map, so as to obtain a mapping feature map corresponding to the feature map. And determining a fifth context feature based on the mapping feature map corresponding to each feature map in the third context feature.
As an example, referring to fig. 9, fig. 9 is a schematic diagram of another scenario for determining a contextual feature provided in an embodiment of the present application. As shown in fig. 9, after the third context feature of the first reference frame is obtained, the optical flow field weights, such as 1, 0.5, 0.25, 0.125, 0.0625, and the like, corresponding to each feature map in the third context feature may be determined, and then new optical flow fields, such as optical flow field 1, optical flow field 2, optical flow field 3, optical flow field 4, and optical flow field 5, corresponding to each feature map may be determined based on the first optical flow field and the optical flow field weights. And then mapping each feature map backwards based on the new optical flow field corresponding to each feature map to obtain a mapping feature map corresponding to each feature map, and determining each mapping feature map as a fifth context feature.
Similarly, for each feature map in the fourth context feature, the optical flow field weight corresponding to the feature map may be determined, so as to determine a new optical flow field corresponding to backward mapping of the feature map based on the optical flow field weight and the second optical flow field. And then, for each feature map in the fourth context feature, the feature map may be mapped backwards based on the new optical flow field corresponding to the feature map, so as to obtain a mapping feature map corresponding to the feature map. And determining a sixth context feature based on the mapping feature map corresponding to each feature map in the fourth context feature.
The optical flow field weights corresponding to the feature maps in the third context feature and the fourth context feature may be specifically determined based on an actual application scenario, which is not limited herein.
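Building on the backward_warp sketch above, the per-level warping with scaled optical flow fields can be sketched as follows; the listed weights simply follow the fig. 9 example and are not fixed by the embodiment.

```python
import torch.nn.functional as F

def warp_context_features(context_maps, flow, weights=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    """Backward-map each context feature map with its own scaled optical flow.
    For level k, the flow is resized to that map's resolution and multiplied by
    the level's optical flow field weight."""
    warped = []
    for feat, w in zip(context_maps, weights):
        level_flow = F.interpolate(flow, size=feat.shape[-2:],
                                   mode="bilinear", align_corners=False) * w
        warped.append(backward_warp(feat, level_flow))  # reuses the warp sketch above
    return warped
```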
In some possible embodiments, after the fusion feature is determined based on the residual feature, the fusion feature may be further processed to obtain a target feature. Specifically, as shown in fig. 10, fig. 10 is a scene schematic diagram for determining a reconstruction residual and an occlusion weight according to an embodiment of the present application. The fusion feature may be input to a convolution layer for further processing, and sub-pixel convolution may be performed on the processed result to obtain a high-resolution target feature. The occlusion weight and the reconstruction residual in the frame prediction process are then determined based on the target feature.
Specifically, when determining the occlusion weight and the reconstruction residual in the frame prediction process based on the target feature, the number of channels corresponding to the target feature and the feature value corresponding to each channel may be determined. The feature value of the last channel is determined as the occlusion weight in the frame prediction process, and the reconstruction residual in the frame prediction process is determined based on the feature values of the other channels, for example by splicing the feature values corresponding to the channels other than the last channel to obtain the reconstruction residual in the frame prediction process.
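A sketch of this head: a convolution layer followed by a sub-pixel convolution (PixelShuffle) produces the high-resolution target feature, whose last channel is read as the occlusion weight and whose remaining channels as the reconstruction residual; the channel counts and the sigmoid on the occlusion weight are assumptions.

```python
import torch
import torch.nn as nn

class OcclusionResidualHead(nn.Module):
    """Fusion feature -> convolution -> sub-pixel convolution -> target feature;
    last channel is the occlusion weight M, remaining channels form the
    reconstruction residual A."""

    def __init__(self, fusion_ch: int = 64, residual_ch: int = 3, upscale: int = 2):
        super().__init__()
        out_ch = (residual_ch + 1) * upscale * upscale
        self.conv = nn.Conv2d(fusion_ch, out_ch, 3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, fusion_feat: torch.Tensor):
        target_feat = self.shuffle(self.conv(fusion_feat))
        occlusion = torch.sigmoid(target_feat[:, -1:])   # last channel -> occlusion weight M
        residual = target_feat[:, :-1]                   # other channels -> reconstruction residual A
        return occlusion, residual
```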
The manner in which the occlusion weight and the reconstruction residual in the frame prediction process are determined based on the first optical flow field, the second optical flow field, the first predicted frame, the second predicted frame, the third context feature, and the fourth context feature is further described below with reference to fig. 11. Fig. 11 is a schematic diagram of another scene for determining the occlusion weight and the reconstruction residual according to an embodiment of the present application. That is, the residual feature is determined based on the first optical flow field, the second optical flow field, the first predicted frame, and the second predicted frame in the manner shown in fig. 6, the fusion feature is determined based on the residual feature and the feature maps in the third context feature and the fourth context feature in the manner shown in fig. 7, and the reconstruction residual and the occlusion weight in the frame prediction process are determined based on the fusion feature in the manner shown in fig. 10.
In some possible embodiments, after determining the occlusion weight and the reconstruction residual in the frame prediction process, the target predicted frame corresponding to the group of associated reference frames may be determined based on the first predicted frame, the second predicted frame, the occlusion weight, and the reconstruction residual. The specific determination method can be as follows:

Î_b = M ⊙ Î_{b→a} + (1 − M) ⊙ Î_{b→d} + A

where Î_{b→a} represents the first predicted frame, Î_{b→d} represents the second predicted frame, M represents the occlusion weight, A represents the reconstruction residual, Î_b represents the target predicted frame, and ⊙ represents the dot product (element-wise multiplication) operation.
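In code, the blending above reduces to an element-wise combination of the two predicted frames, the occlusion weight, and the reconstruction residual; this is only a direct transcription of the formula as reconstructed here.

```python
def fuse_target_frame(pred_a, pred_d, occlusion_weight, residual):
    """Target predicted frame from the two predicted frames, the occlusion
    weight M and the reconstruction residual A, using element-wise products."""
    return occlusion_weight * pred_a + (1.0 - occlusion_weight) * pred_d + residual
```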
And step S34, performing frame interpolation processing on the target predicted frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtaining a played video based on the second frame sequence.
In some possible embodiments, for each group of associated reference frames, the target predicted frame corresponding to the group is a predicted image frame located between the first reference frame and the second reference frame of the group. Based on this, frame interpolation processing may be performed on the target predicted frames corresponding to the groups of associated reference frames, that is, the target predicted frame corresponding to each group of associated reference frames is inserted between the first reference frame and the second reference frame of that group, so as to obtain the second frame sequence based on the first frame sequence.
Further, after obtaining the second frame sequence, the first device may obtain the played video based on the second frame sequence, that is, the second frame sequence is the frame sequence corresponding to the video played by the first device.
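A minimal sketch of this frame interpolation step, assuming the decoded frames are held in a list together with their original frame numbers and that exactly one target predicted frame is inserted per group of associated reference frames (the embodiment allows more general interpolation):

```python
def insert_predicted_frames(first_frame_sequence, frame_numbers, predicted_frames):
    """Build the second frame sequence by inserting each group's target
    predicted frame between its two reference frames. `predicted_frames`
    maps (first_ref_number, second_ref_number) to the predicted frame."""
    second_sequence = []
    for idx, (frame, number) in enumerate(zip(first_frame_sequence, frame_numbers)):
        second_sequence.append(frame)
        if idx + 1 < len(frame_numbers):
            key = (number, frame_numbers[idx + 1])
            if key in predicted_frames:
                second_sequence.append(predicted_frames[key])
    return second_sequence
```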
A scene diagram of determining a target prediction frame according to the embodiment of the present application is provided below with reference to fig. 12. Fig. 12 is a schematic diagram of a scene of determining a target prediction frame according to an embodiment of the present application. As shown in fig. 12, a first optical flow field corresponding to a first reference frame and a second optical flow field corresponding to a second reference frame are determined by using a RAFT model, the first reference frame is backward mapped based on the first optical flow field to obtain a first predicted frame, and the second reference frame is backward mapped based on the second optical flow field to obtain a second predicted frame.
And respectively determining a third context feature corresponding to the first reference frame and a fourth context feature corresponding to the second reference frame through a context feature extraction network (ContextNet), carrying out backward mapping on each feature graph in the third context feature based on the first optical flow field to obtain a fifth context feature, and carrying out backward mapping on each feature graph in the fourth context feature based on the second optical flow field to obtain a sixth context feature.
And inputting the fifth context feature, the sixth context feature, the first optical flow field, the second optical flow field, the first predicted frame and the second predicted frame into the U-NET network to obtain a reconstructed residual error and an occlusion weight in the frame prediction process, and determining the target predicted frame based on the reconstructed residual error, the occlusion weight, the first predicted frame and the second predicted frame.
In the embodiment of the application, for each group of associated reference frames in a first frame sequence obtained by decoding, by determining a first predicted frame and a second predicted frame corresponding to a first reference frame and a second reference frame in each group of associated reference frames, a first optical flow field and a second optical flow field, and an occlusion weight and a reconstruction residual in a frame prediction process, occlusion information in the frame prediction process, detail information of each image frame, and optical flow field information can be fully considered, and problems of object jitter, edge blur and the like in the frame prediction process can be effectively solved, so that video definition is improved, and video viewing experience is improved.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application. The video data processing apparatus provided by the embodiment of the present application includes:
a brightness determining module 41, configured to determine an initial frame sequence of a video to be processed, and determine brightness of each pixel point of each image frame in the initial frame sequence;
a frame sequence determining module 42, configured to perform frame extraction processing on the initial frame sequence based on brightness of each pixel point of each image frame in the initial frame sequence, and use the frame sequence after frame extraction as a first frame sequence;
an encoding module 43, configured to encode the first frame sequence to obtain encoded data of the video to be processed, where the encoded data carries a reference frame number of each reference frame, and each reference frame is an image frame adjacent to a target frame sequence extracted in the frame extraction process;
a sending module 44, configured to send the encoded data to a first device, so that the first device determines a second frame sequence based on the encoded data and the reference frames corresponding to the reference frame numbers, and obtains the played video based on the second frame sequence.
In some possible embodiments, the frame sequence determining module 42 is configured to:
for any image frame in the initial frame sequence, determining the brightness difference between each pixel point of the image frame and the corresponding pixel point of the previous image frame, and determining the total brightness difference between the image frame and the previous image frame based on the brightness difference corresponding to each pixel point of the image frame;
determining image frames with the corresponding total brightness difference larger than a first threshold value in the initial frame sequence as active frames, and determining image frames with the corresponding total brightness difference smaller than or equal to the first threshold value as static frames;
and performing frame extraction processing on the initial frame sequence based on the active frame and the static frame.
In some possible embodiments, the frame sequence determining module 42 is configured to:
determining a continuous active frame sequence and a continuous static frame sequence in the initial frame sequence;
and performing frame extraction processing on the continuous active frame sequence and the continuous static frame sequence in the initial frame sequence.
In some possible embodiments, the frame sequence determining module 42 is configured to:
determining an initial frame rate corresponding to the video to be processed, and dividing the initial frame sequence into at least one subframe sequence based on the initial frame rate;
determining a continuous active frame sequence and a continuous static frame sequence in each subframe sequence;
for each continuous active frame sequence, if the number of active frames in the continuous active frame sequence is greater than a second threshold, performing frame extraction processing on the continuous active frame sequence;
for each of the consecutive still frame sequences, if the number of still frames in the consecutive still frame sequence is greater than a third threshold, performing frame decimation on the consecutive still frame sequence.
In a specific implementation, the video data processing apparatus may execute the implementation manners provided in the steps in fig. 2 through the built-in functional modules, which may specifically refer to the implementation manners provided in the steps, and are not described herein again.
Referring to fig. 14, fig. 14 is a schematic structural diagram of another video data processing apparatus according to an embodiment of the present application. The video data processing apparatus provided by the embodiment of the present application includes:
a decoding module 51, configured to acquire encoded data sent by a second device, and decode the encoded data to obtain a first frame sequence, where the first frame sequence is a frame sequence obtained by performing frame extraction processing on an initial frame sequence of a video to be processed by the second device;
a reference frame determining module 52, configured to determine, based on the reference frame sequence numbers carried in the encoded data, groups of associated reference frames in the reference frames corresponding to the first frame sequence, where each group of associated reference frames includes two reference frames adjacent to a target frame sequence extracted in the frame extraction process;
a frame prediction module 53, configured to determine, for each group of associated reference frames, a first predicted frame corresponding to the first reference frame and a second predicted frame corresponding to the second reference frame, and an occlusion weight and a reconstructed residual in a frame prediction process based on a first reference frame and a second reference frame in the group of associated reference frames, and determine a target predicted frame corresponding to the group of associated reference frames based on the first predicted frame, the second predicted frame, the occlusion weight, and the reconstructed residual;
and a video determining module 54, configured to perform frame interpolation on the target prediction frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtain a played video based on the second frame sequence.
In some possible embodiments, for each group of the associated reference frames, the frame prediction module 53 is configured to:
determining a first optical flow field corresponding to the first reference frame and a second optical flow field corresponding to the second reference frame based on a first reference frame and a second reference frame in the group of associated reference frames;
and performing backward mapping on the first reference frame based on the first optical flow field to obtain a first prediction frame corresponding to the first reference frame, and performing backward mapping on the second reference frame based on the second optical flow field to obtain a second prediction frame corresponding to the second reference frame.
In some possible embodiments, for each group of the associated reference frames, the frame prediction module 53 is configured to:
performing feature extraction on a first reference frame in the group of associated reference frames to obtain first initial features, performing feature extraction on a second reference frame in the group of associated reference frames to obtain second initial features, and determining associated features corresponding to the first reference frame and the second reference frame based on the first initial features and the second initial features;
determining a first context feature of the first reference frame, and determining a first optical flow field corresponding to the first reference frame based on the first context feature and the association feature;
and determining a second context feature of the second reference frame, and determining a second optical flow field corresponding to the second reference frame based on the second context feature and the association feature.
In some possible embodiments, for each group of the associated reference frames, the frame prediction module 53 is configured to:
determining a third contextual characteristic of the first reference frame and a fourth contextual characteristic of the second reference frame;
and determining the occlusion weight and the reconstruction residual in the frame prediction process based on the first optical flow field, the second optical flow field, the first prediction frame, the second prediction frame, the third context feature and the fourth context feature.
In some possible embodiments, the frame prediction module 53 is configured to:
determining residual features based on the first optical flow field, the second optical flow field, the first predicted frame, and the second predicted frame;
determining a fusion feature based on the third context feature, the fourth context feature, and the residual feature;
and determining the occlusion weight and the reconstruction residual error in the frame prediction process based on the fusion characteristics.
In some possible embodiments, the third contextual feature and the fourth contextual feature comprise a plurality of feature maps; the frame prediction module 53 is configured to:
determining an optical flow field weight corresponding to each feature map in the third context features, and mapping the feature map backwards based on the optical flow field weight corresponding to the feature map and the first optical flow field to obtain a mapping feature map corresponding to the feature map;
determining a mapping feature map corresponding to each feature map in the third context features as a fifth context feature of the first reference frame;
determining an optical flow field weight corresponding to each feature map in the fourth context feature, and performing backward mapping on the feature map based on the optical flow field weight corresponding to the feature map and the second optical flow field to obtain a mapping feature map corresponding to the feature map;
determining a mapping feature map corresponding to each feature map in the fourth context features as a sixth context feature of the second reference frame;
and determining a fusion feature based on the fifth context feature, the sixth context feature, and the residual feature.
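One plausible reading of the per-feature-map optical flow field weight, offered only as an assumption and not asserted by the application, is a scale factor that adapts the full-resolution flow to each feature map, for example across pyramid levels. A minimal sketch under that assumption, reusing backward_warp from the earlier example, follows; the function name and the example weights are illustrative.

import torch.nn.functional as F

def warp_context_maps(context_maps, flow, flow_weights, warp_fn):
    # context_maps: list of (N, C_i, H_i, W_i) feature maps of one context feature;
    # flow: (N, 2, H, W) optical flow at full resolution;
    # flow_weights: one scalar weight per feature map (assumed interpretation:
    #   the weight rescales the flow, e.g. 1.0, 0.5, 0.25 for halved resolutions);
    # warp_fn: a backward-mapping routine such as backward_warp above.
    warped = []
    for fmap, weight in zip(context_maps, flow_weights):
        scaled_flow = F.interpolate(flow, size=fmap.shape[-2:],
                                    mode="bilinear", align_corners=True) * weight
        warped.append(warp_fn(fmap, scaled_flow))
    return warped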
In some possible embodiments, the frame prediction module 53 is configured to:
performing feature processing on the fusion feature to obtain a target feature, and determining the number of channels corresponding to the target feature;
determining the feature value of the target feature corresponding to the last channel as the occlusion weight;
and determining the reconstruction residual based on the feature values of the target feature corresponding to the other channels.
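This channel split can be sketched as follows, purely for illustration; the function name is an assumption, and the sigmoid used to squash the last channel into [0, 1] is likewise an assumption rather than something stated in the embodiment.

import torch

def split_target_features(target_features):
    # target_features: (N, C, H, W); the last channel is read as the occlusion
    # weight and the remaining C-1 channels as the reconstruction residual.
    occlusion_weight = torch.sigmoid(target_features[:, -1:, :, :])  # squashing to [0, 1] is an assumption
    reconstruction_residual = target_features[:, :-1, :, :]
    return occlusion_weight, reconstruction_residual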
In a specific implementation, the video data processing apparatus may execute, through its built-in functional modules, the implementation manners provided in the steps in fig. 3; for details, reference may be made to the implementation manners provided in the steps, which are not described herein again.
Referring to fig. 15, fig. 15 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 15, the electronic device 1000 in the present embodiment may include: a processor 1001, a network interface 1004, and a memory 1005. In addition, the electronic device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 15, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the electronic device 1000 shown in fig. 15, the network interface 1004 may provide a network communication function; the user interface 1003 mainly provides an interface for user input; and the processor 1001 may be configured to invoke the device control application stored in the memory 1005 to implement the video data processing method performed by the first device and/or the second device.
It should be understood that in some possible embodiments, the processor 1001 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In a specific implementation, the electronic device 1000 may execute, through its built-in functional modules, the implementation manners provided in each step in fig. 2 and/or fig. 3; for details, reference may be made to the implementation manners provided in the steps, which are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method provided in each step in fig. 2 and/or fig. 3; for details, reference may be made to the implementation manners provided in the steps, which are not described herein again.
The computer-readable storage medium may be an internal storage unit of the video data processing apparatus and/or the electronic device provided in any of the foregoing embodiments, for example, a hard disk or a memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device. The computer-readable storage medium may further include a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), and the like. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application further provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes them to cause the electronic device to perform the methods provided in the steps of fig. 2 and/or fig. 3.
The terms "first", "second", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or electronic device that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or electronic device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two, and that, to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not intended to limit the scope of the present application, which is defined by the appended claims.