Background
The enormous volume of 3D video data poses unprecedented challenges for video storage and transmission, so realizing efficient 3D video coding has important theoretical significance and practical application value. The latest 3D video coding standard is 3D-HEVC (3D High Efficiency Video Coding), developed by JCT-3V, the joint collaborative team on 3D video coding extension development formed by the international standardization groups MPEG (Moving Picture Experts Group) and VCEG (Video Coding Experts Group). 3D-HEVC introduces new prediction techniques and coding tools for dependent views and depth videos, adds technology suited to multi-view and depth video coding, and improves coding efficiency. However, the inter-view prediction techniques of multi-view video coding still leave considerable room for improvement.
Current research focuses mainly on disparity vector derivation and disparity-compensated prediction. Chen et al. derive the disparity vector of the current block from the correlation between the disparity vectors of its spatially and temporally neighboring coded blocks and its own, reducing the dependence of disparity vector derivation on depth-map coding. Zhang et al. propose obtaining the disparity vector of the current coding block only from its spatial and temporal neighboring blocks and performing inter-view prediction with it, which improves coding efficiency. Woontack et al. propose a stereo image coding scheme with overlapped-block disparity compensation and adaptive windows. Wong et al. analyze in detail the horizontal scaling and shearing that occur between views and adopt an efficient sub-sampled block-matching technique to implement disparity-compensated prediction based on HSS (horizontal scaling and shearing) with minimal complexity, effectively saving bits in multi-view video coding.
Deep learning has achieved excellent results on tasks such as image classification and super-resolution reconstruction. Building on this, researchers at home and abroad have in recent years studied deep-learning-based coding of 2D color video, and coding performance has improved markedly in intra-frame prediction, inter-frame prediction, post-filtering, end-to-end image coding, and other areas.
Cui et al. propose a convolutional neural network for intra prediction that takes the coded blocks surrounding the predicted block as input; it fully learns the texture information of the reference blocks and provides accurate predictions. Yan et al. convert fractional-pixel motion compensation into a regression problem between images, solve it with a convolutional neural network, and improve coding efficiency. Liu et al. propose a CNN (convolutional neural network)-based motion compensation enhancement algorithm that not only performs motion compensation on the current block but also exploits the neighboring reconstructed region of the current coding block to further improve prediction accuracy. Li et al. propose a frame-level post-processing method with dynamic metadata: video frames are first classified by content complexity and quality, the classification flags are transmitted to the decoder as side information, and the frames of each class are then enhanced with a 20-layer convolutional neural network, improving filtering performance. However, little research has applied deep learning to improving 3D video coding performance, and in particular no intelligent inter-view coding method exists at present; proposing one is the fundamental idea of the present invention.
In the process of implementing the invention, the inventor finds that at least the following disadvantages and shortcomings exist in the prior art:
existing algorithms that model the disparity relationship can apply operations such as stretching and shearing to the reference image; although this effectively improves prediction accuracy, their applicability is poor and they perform badly on videos with complex textures and rich motion. Moreover, an intelligent inter-view coding method is currently lacking.
Disclosure of Invention
The invention provides an intelligent inter-view coding method. By fully exploiting the high similarity between adjacent views and utilizing the prediction information produced during 3D video coding, it designs an intelligent inter-view prediction model based on a convolutional neural network to obtain a more accurate prediction of the current coding block, thereby further improving coding efficiency. The details are described as follows:
a method of intelligent inter-view coding, the method comprising the steps of:
respectively acquiring a first prediction block obtained in the neighboring-block disparity vector mode and a second prediction block obtained in the conventional disparity-compensated prediction mode;
constructing a residual-learning convolutional neural network with variable convolution kernel sizes, where the concatenation of the luminance-component channels of the two prediction blocks serves as the input of the network and the output of the network is an enhanced prediction block of the original image block;
training four different network models according to four different quantization parameters;
respectively extracting the luminance components of the first prediction block and the second prediction block, concatenating them along the channel dimension, and invoking the corresponding network model on the concatenated result according to the size of the current coding block and the quantization parameter value;
and finally, calculating the rate-distortion cost of the enhanced prediction block and comparing it with the best-mode cost of the current coding block; if the rate-distortion cost is smaller, selecting this method as the optimal coding mode and coding and transmitting the corresponding flag to the decoder.
The acquisition of the first prediction block in the neighboring-block disparity vector mode specifically includes:
if the disparity vector derived in the neighboring-block disparity vector mode exists, setting a flag to record the position of the disparity vector in the final candidate list, storing the prediction block obtained by compensation with that disparity vector, and outputting a prediction image after a complete frame has been coded;
and marking as 1 each block for which the disparity vector exists and is used for prediction, and as 0 otherwise; outputting the flags in the form of a table, cropping the prediction image, and keeping only the prediction blocks that use the neighboring-block disparity vector mode as training data.
Further, the acquisition of the second prediction block in the conventional disparity-compensated prediction mode specifically includes:
finding the inter-view reference picture by checking the view index of each reference picture, performing disparity compensation with that reference picture under rate-distortion optimization, and storing the best prediction result over the different PU partition modes as the DCP prediction block;
and outputting a prediction image after the complete frame has been coded, cropping the prediction image with the above table, and outputting and storing the cropped blocks in order.
Wherein, the construction of the residual-learning convolutional neural network with variable convolution kernel sizes can be expressed as:

$$f^{*}=\arg\min_{f}\, l\{\,f(P_{1}\oplus P_{2}),\,Y\,\}$$

where $f(\cdot)$ is the prediction function, $\oplus$ denotes channel concatenation, and $l\{\cdot,\cdot\}$ is the loss function that measures the error between the predicted value and the true value; the first prediction block is denoted $P_{1}$, the second prediction block is denoted $P_{2}$, and the original image block serves as the label, denoted $Y$.
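For concreteness, the following minimal NumPy sketch illustrates this objective; the block contents are hypothetical, and a plain average of the two channels stands in for the actual network $f(\cdot)$:

```python
import numpy as np

def mse_loss(pred, label):
    """l{., .}: mean squared error between prediction and ground truth."""
    return np.mean((pred - label) ** 2)

# Hypothetical 64x64 luma blocks (real values would come from the encoder).
P1 = np.random.rand(64, 64).astype(np.float32)  # NBDV-mode prediction block
P2 = np.random.rand(64, 64).astype(np.float32)  # DCP-mode prediction block
Y  = np.random.rand(64, 64).astype(np.float32)  # original image block (label)

x = np.stack([P1, P2], axis=0)   # channel concatenation P1 (+) P2, shape (2, 64, 64)
pred = x.mean(axis=0)            # placeholder for the CNN f(P1 (+) P2)
print("l{f(P1 (+) P2), Y} =", mse_loss(pred, Y))
```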
The technical scheme provided by the invention has the beneficial effects that:
1. The method makes full use of inter-view prediction information and introduces deep learning into 3D video inter-view predictive coding; by exploiting the learning capacity of the convolutional neural network, it generates a more accurate prediction block that is closer to the original image block, which reduces the residual between the original image block and the prediction block, lowers the bit overhead of coded transmission, and further improves coding efficiency;
2. The invention adopts a data-driven approach and trains the convolutional neural network on abundant and diverse data, so it generalizes well.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
The embodiment of the invention provides an intelligent inter-view coding method that fuses, during encoding, the prediction block obtained in the NBDV (neighboring-block disparity vector) mode with the prediction block obtained in the conventional DCP (disparity-compensated prediction) mode and generates a more accurate prediction block by exploiting the learning capacity of a convolutional neural network. The specific implementation steps are as follows:
First, acquisition of training data
NBDV-based prediction and DCP are two different inter-view predictive coding modes in 3D video coding; however, some blocks to be coded have no NBDV vector. Therefore, during training-data acquisition, the prediction block produced by the NBDV process and its corresponding DCP prediction block must be acquired separately. The prediction blocks obtained in the embodiment of the present invention are of size 64 × 64; smaller prediction blocks are tested subsequently.
1) Acquisition of NBDV data
The NBDV mode exists within the Merge mode: the disparity vector derived by the NBDV process (denoted IvDC) can serve as a prediction candidate in the Merge candidate list, and each candidate in the list is traversed during encoding. If IvDC exists at that point, an extra flag is set to record the position of the disparity vector in the final candidate list, and the prediction block obtained by compensation with the IvDC vector is stored.
However, not all blocks to be coded can obtain a disparity vector via NBDV, nor can every IvDC vector enter the final candidate list; therefore each block for which IvDC exists and is used for prediction is marked 1, and 0 otherwise. After a complete frame has been coded, the prediction image is output and the flags are output in the form of a table. Finally, MATLAB is used together with the corresponding table to crop the output prediction result and output the blocks in order, keeping only the prediction blocks that use the NBDV mode as training data.
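A minimal NumPy sketch of this cropping step follows (the embodiment used MATLAB; the flag-table layout and array shapes here are assumptions for illustration):

```python
import numpy as np

BLOCK = 64  # prediction block size used in this embodiment

def crop_nbdv_blocks(pred_image, flag_table):
    """Keep only the 64x64 blocks whose flag is 1 (predicted with IvDC).

    pred_image: 2D array (H, W), the prediction image of one coded frame.
    flag_table: 2D 0/1 array with one entry per 64x64 block.
    Returns the kept blocks in raster-scan order.
    """
    blocks = []
    for by in range(flag_table.shape[0]):
        for bx in range(flag_table.shape[1]):
            if flag_table[by, bx] == 1:
                y, x = by * BLOCK, bx * BLOCK
                blocks.append(pred_image[y:y + BLOCK, x:x + BLOCK])
    return blocks

# Example: a 128x192 prediction image -> a 2x3 grid of 64x64 blocks.
pred = np.random.rand(128, 192).astype(np.float32)
flags = np.array([[1, 0, 1],
                  [0, 0, 1]])
training_blocks = crop_nbdv_blocks(pred, flags)
print(len(training_blocks), "NBDV prediction blocks kept")
```

The same table is reused later to crop the corresponding DCP prediction blocks, so the two datasets stay aligned block for block.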
2) Acquisition of DCP data
Disparity-compensated prediction and motion-compensated prediction are similar in concept; both can be understood as forms of inter prediction, but their reference frames differ essentially. The reference frames of motion-compensated prediction (MCP) are coded frames of the same view at different time instants, whereas disparity-compensated prediction (DCP) references coded frames of different views at the same time instant. Because DCP achieves essentially the same effect as MCP by different means, the DCP mode is mixed with the MCP mode within inter-frame prediction. The reference picture list is traversed during the motion estimation of inter-frame/inter-view prediction; the inter-view reference picture is found by checking the view index of each reference picture, and the prediction block obtained by compensation from that reference picture is stored.
In addition, considering that the Merge mode applies only to prediction blocks with a PU size of 2N × 2N, for a coding unit (CU) of the same size the inter prediction mode traverses all PU partition modes, including 2N × 2N, 2N × N, and N × N. To obtain higher-quality prediction blocks, the embodiment of the present invention employs rate-distortion optimization to compare the prediction results under the different PU partition modes and stores the best result as the DCP prediction block, as sketched below.
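This rate-distortion comparison can be sketched as follows; `predict`, `distortion`, `rate`, and `lam` are placeholders for quantities the actual encoder computes internally when evaluating the cost J = D + λ·R:

```python
def best_dcp_prediction(partitions, predict, distortion, rate, lam):
    """Pick the DCP prediction with minimal RD cost J = D + lambda * R.

    partitions: e.g. ["2Nx2N", "2NxN", "NxN"]; predict(mode) returns the
    prediction block for that PU partition; distortion/rate evaluate it.
    """
    best_mode, best_block, best_cost = None, None, float("inf")
    for mode in partitions:
        block = predict(mode)
        cost = distortion(block) + lam * rate(block)  # RD cost of this partition
        if cost < best_cost:
            best_mode, best_block, best_cost = mode, block, cost
    return best_mode, best_block
```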
To find the DCP prediction block corresponding to each NBDV prediction block in the output prediction image, the table obtained during NBDV data acquisition is reused: the prediction image is cropped with it and the blocks are output and stored in order.
Second, construction of the prediction network
The embodiment of the invention is based on the network structure of VRCNN (a residual-learning convolutional neural network with variable convolution kernel sizes). NBDV data and DCP data are obtained with the method of the previous step; the prediction block obtained in the NBDV mode is denoted $P_{1}$, the prediction block obtained in the DCP mode is denoted $P_{2}$, and the corresponding original image block serves as the label, denoted $Y$. The network produces a better prediction, bringing the prediction block as close to the original image as possible:

$$f^{*}=\arg\min_{f}\, l\{\,f(P_{1}\oplus P_{2}),\,Y\,\}$$

where $f(\cdot)$ is the prediction function, $\oplus$ denotes channel concatenation, and $l\{\cdot,\cdot\}$ is the loss function that measures the error between the predicted value and the true value. From a regression point of view, the formula can be written as:

$$\theta^{*}=\arg\min_{\theta}\, l\{\,r(P_{i};\theta),\,Y_{i}\,\}$$

where $r(\cdot)$ denotes the convolutional neural network, $\theta$ its parameters, $P_{i}$ the input data of the network, and $Y_{i}$ the corresponding label. The loss function $l\{\cdot,\cdot\}$ can be written specifically as the mean squared error over a batch of $N$ training samples:

$$l\{\,r(P_{i};\theta),\,Y_{i}\,\}=\frac{1}{2N}\sum_{i=1}^{N}\bigl\|r(P_{i};\theta)-Y_{i}\bigr\|_{2}^{2}$$
three, network model training
The embodiment of the invention takes the channel-wise concatenation of the NBDV prediction block and the DCP prediction block as the network input; the network output is an enhanced prediction block closer to the original image block. The network comprises four convolutional layers in total. Every layer except the last uses the ReLU, f(x) = max(0, x), as its nonlinear activation function; the second and third layers each split into two branches that extract features with convolution kernels of different sizes and then fuse them, the fused features serving as the input of the next layer. Finally, residual learning is adopted: the two input channels are linearly combined and added to the output of the last layer to obtain the final enhanced prediction block, which aids network convergence and improves prediction accuracy. The specific network structure is shown in Fig. 2.
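A minimal PyTorch-style sketch of such a network follows, assuming VRCNN's published layer configuration (conv1: 64 kernels of 5×5; conv2: 16 of 5×5 plus 32 of 3×3; conv3: 16 of 3×3 plus 32 of 1×1; conv4: one 3×3 kernel) and an equal-weight blend of the two inputs for the residual path; the embodiment's exact structure is given by Fig. 2 and was implemented in Caffe:

```python
import torch
import torch.nn as nn

class InterViewVRCNN(nn.Module):
    """Four conv layers; layers 2 and 3 use parallel variable-size kernels;
    a residual connection adds a linear blend of the two input channels."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(2, 64, 5, padding=2)   # input: P1 and P2 luma channels
        # Layer 2: two branches with different kernel sizes, fused by concatenation.
        self.conv2a = nn.Conv2d(64, 16, 5, padding=2)
        self.conv2b = nn.Conv2d(64, 32, 3, padding=1)
        # Layer 3: again two variable-size branches.
        self.conv3a = nn.Conv2d(48, 16, 3, padding=1)
        self.conv3b = nn.Conv2d(48, 32, 1, padding=0)
        self.conv4 = nn.Conv2d(48, 1, 3, padding=1)   # last layer: no ReLU
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                              # x: (N, 2, H, W)
        base = x.mean(dim=1, keepdim=True)             # linear blend of the two inputs
        out = self.relu(self.conv1(x))
        out = self.relu(torch.cat([self.conv2a(out), self.conv2b(out)], dim=1))
        out = self.relu(torch.cat([self.conv3a(out), self.conv3b(out)], dim=1))
        return self.conv4(out) + base                  # residual learning
```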
The embodiment of the invention trains on the Caffe platform. Four different models (25_34, 30_39, 35_42, and 40_45) are trained for the four different QP settings, with training blocks of size 64 × 64; further models are then trained for blocks of other sizes, such as 32 × 32, 16 × 16, and 8 × 8. Training and testing in the embodiment of the present invention are performed on the luminance component only.
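A sketch of the per-QP training loop, shown in PyTorch for brevity (the embodiment trained on Caffe, whose EuclideanLoss matches the MSE objective above; the batch size, learning rate, and epoch count here are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_qp_model(model, nbdv, dcp, labels, epochs=50, lr=1e-4):
    """nbdv, dcp, labels: (N, 1, 64, 64) float tensors for one QP setting."""
    data = TensorDataset(torch.cat([nbdv, dcp], dim=1), labels)  # 2-channel input
    loader = DataLoader(data, batch_size=64, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # error between enhanced block and label
            loss.backward()
            opt.step()
    return model

# One model per QP setting, e.g. keys "25_34", "30_39", "35_42", "40_45".
```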
Fourth, embedding into the coding framework
The coding platform used in the embodiment of the invention is HTM16.2, into which the network is integrated. First, the luminance components of the prediction block obtained in the NBDV mode and of the corresponding DCP prediction block are extracted and concatenated along the channel dimension. Then the corresponding network model is invoked on the concatenated result according to the size of the current block and the QP (quantization parameter) value. Finally, the rate-distortion cost of the prediction block output by the network is calculated and compared with the best-mode cost of the current coding block; if it is smaller, the method provided by the invention is selected as the optimal coding mode and the corresponding flag is coded and transmitted to the decoder.
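The encoder-side decision can be sketched as follows; in HTM16.2 this logic lives in the C++ mode-decision code, and the model lookup table, cost functions, and flag handling shown here are illustrative assumptions:

```python
import torch

def try_network_mode(p1_luma, p2_luma, cu_size, qp_pair, models,
                     rd_cost, best_mode_cost):
    """Return (use_network_flag, prediction) after the RD comparison."""
    x = torch.stack([p1_luma, p2_luma], dim=0).unsqueeze(0)  # (1, 2, H, W)
    model = models[(cu_size, qp_pair)]   # e.g. (64, "25_34") -> trained network
    with torch.no_grad():
        enhanced = model(x).squeeze()    # enhanced prediction block
    cost = rd_cost(enhanced)
    if cost < best_mode_cost:            # smaller RD cost wins
        return True, enhanced            # flag "1" is coded and transmitted
    return False, None                   # keep the original best mode
```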
Wherein, the mode with the minimum cost is taken as the optimal mode.
In the embodiment of the present invention, the models of the devices are not limited except where specifically stated, as long as each device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.