
Deep Learning-Based Intra Mode Derivation for Versatile Video Coding

Published: 17 February 2023

Abstract

In intra coding, Rate Distortion Optimization (RDO) is performed to select the optimal intra mode from a pre-defined candidate list. Besides the residual signal, the optimal intra mode must also be encoded and transmitted to the decoder side, which consumes a considerable number of coding bits. To further improve the performance of intra coding in Versatile Video Coding (VVC), an intelligent intra mode derivation method is proposed in this paper, termed Deep Learning based Intra Mode Derivation (DLIMD). Specifically, the process of intra mode derivation is formulated as a multi-class classification task, which aims to skip the module of intra mode signaling and thus reduce coding bits. The architecture of DLIMD is designed to adapt to different quantization parameter settings and to variable coding blocks, including non-square ones, with only one single trained model. Different from existing deep learning based classification approaches, hand-crafted features are fed into the intra mode derivation network in addition to the features learned by the feature learning network. To compete with traditional methods, one additional binary flag is utilized in the video codec to indicate the selected scheme with RDO. Extensive experimental results reveal that the proposed method achieves 2.28%, 1.74%, and 2.18% bit rate reduction on average for the Y, U, and V components on the VVC test model platform, outperforming the state-of-the-art works.

1 Introduction

With the rapid development of information technology, videos have been applied to entertainment, surveillance, education, and many other fields. To support more applications in daily life, videos have evolved along various dimensions in the last decade, including High Definition (HD), Wide Color Gamut (WCG) [14], High Dynamic Range (HDR) [14], Multi-view Video plus Depth (MVD) [28], 360 degree video [43], light field image/video [11], and dynamic point cloud [33]. Unfortunately, from low dimension to high dimension, the dramatically increased video data challenges the limited storage space and transmission bandwidth. From H.264/Advanced Video Coding (AVC) [39] and High Efficiency Video Coding (HEVC) [35] to the state-of-the-art Versatile Video Coding (VVC) [7] issued in 2020, large compression ratios have been achieved, yet they still cannot catch up with the growth of video data. Advanced video compression algorithms are therefore always desired to maximize visual quality under a given bandwidth budget.
In the framework of existing hybrid video coding, the modules mainly consist of intra/inter prediction, transform, quantization, entropy encoding and in-loop filtering. To improve compression efficiency, a variety of novel coding tools have been developed in the issued standards, including QuadTree plus Multi-Type Tree (QT+MTT) structure [17] for coding block partition, Matrix-based Intra Prediction (MIP) [34] and Cross-Component Linear Model (CCLM) [45] for intra luma and chroma prediction, History-based Motion Vector Prediction (HMVP) [46] and Decoder-side MV Refinement (DMVR) [15] for motion estimation/compensation, Multiple Transform Selection (MTS) [8] for transform, CABAC engine with multi-hypothesis probability estimation for entropy encoding, and Sample Adaptive Offset (SAO) and Adaptive Loop Filter (ALF) for in-loop filtering. These mentioned coding tools have achieved significant coding gains.
One of the most important modules is intra prediction [19], which aims to remove as much spatial redundancy as possible. Parts of the available neighboring blocks are weighted to produce the predicted block. Traditionally, intra modes include Planar, DC, and angular modes. To achieve more accurate prediction results, various algorithms have been developed. In [4], intra prediction was analyzed in the frequency domain, and frequency components were selectively discarded to improve the performance. Li et al. [20] presented a bi-intra prediction method based on the binary combination of existing uni-intra prediction modes. Rather than regular out-block reference pixels, in-block ones were employed in [2] to perform intra prediction for screen content, together with an additional in-loop residual signal. An iterative filtering method was employed for intra prediction in addition to the traditional intra prediction in [10]. To obtain more reference pixels, a multi-line based scheme was presented in [21], where six more lines of pixels located at the above and left neighbors were collected. Different from the fixed scan order, an adaptive block coding order [49] was proposed for intra prediction to better exploit spatial correlations. Analogous to motion estimation in inter coding, Intra Block Copy (IBC) [42] was introduced for screen content, which aims to exploit long-distance correlations in an image. Two modes with high probability from the gradient histogram were combined to generate a new intra mode in [1]. In [47], local and nonlocal correlations were exploited for hybrid intra prediction, where adaptive template matching prediction, combined local and nonlocal prediction, and combined neighboring modes prediction were performed. These methods exploit spatial redundancy from neighbors with manually designed functions, which may limit the performance. Advanced schemes are desired to adapt to diverse video contents.
To further improve the compression efficiency of intra coding, the signal processing problem has been formulated as an artificial intelligence task, where powerful neural networks are adopted [25, 48] and a training database for deep video compression is provided in [24]. Specifically, the problem of intra luma prediction was formulated as an inpainting task [50], and the problem of intra chroma prediction was modeled as a colorization task [23, 52]. An iterative training strategy for neural networks was presented in [12], where training blocks were collected from the previous iteration to further improve performance. Wang et al. [38] proposed a multi-scale convolutional network based intra prediction approach, in which the neighboring reconstructed L-shape was fed to the network together with the traditional angular intra prediction result to make a more accurate prediction. With a conditional autoencoder [6], multi-mode intra prediction was performed for luma and chroma components. Sun et al. [36] proposed two enhanced intra prediction schemes with multiple neural networks, where the appending scheme was to replace the traditional modes and the substitution scheme was to replace the most and least probable traditional modes. In [16], a progressive spatial recurrent neural network was presented for intra prediction, which produces the prediction by passing information along from previous outputs. To adapt to variable coding blocks in intra prediction, fully connected and convolutional neural networks were carefully designed [13] for small and large blocks, respectively. Most of these existing learning based methods aim to make more accurate luma and chroma predictions from a regression perspective to achieve coding gains, while the module of intra mode derivation has not been exploited from a classification perspective with deep learning tools.
In intra coding, besides the residual signal, the intra mode is also required to be encoded and transmitted to the decoder side. For intra mode signaling, the Most Probable Mode (MPM) list, which is constructed from the neighboring blocks, plays an important role and saves significant coding bits. In [18], two MPM construction methods were presented for VVC, where one was extended from HEVC and the other was sorted according to the probability of each candidate. Besides the nearest neighboring lines, Chang et al. [9] extended the MPM mechanism to the Multi-Reference Line (MRL) scheme for better performance. A conditional random field model was established to reconstruct the MPM list in [22], where short and long range correlations were considered. In addition, a decision tree was utilized to exploit multiple dynamic lists for intra mode signaling [32]. By investigating the occurrences of intra modes in the neighboring blocks, a Most Frequent Mode (MFM) list [44] was derived to compete with the existing MPM list. To skip intra mode signaling and save coding bits, Xu et al. [41] proposed a predictive coding scheme, in which the angular correlation in the spatial domain was calculated with modulo-N arithmetic operations. Additionally, template based [40], histogram of gradients based [30], and texture analysis based [29] intra mode derivation methods were presented in a manual manner. For depth video coding, a coding tool [27] was presented to reduce the intra mode signaling bitrate, in which the texture intra modes were inherited for the depth intra modes. Basically, MPM list construction and intra mode derivation have been investigated with traditional statistics and experience, and can be further improved with advanced learning based schemes.
In this work, to skip the module of intra mode signaling and save coding bits, the process of intra mode derivation is formulated as a multi-class classification task. The main contributions of this work are listed as follows.
(1)
The process of intra mode derivation in intra coding is modeled as a multi-class classification task, termed Deep Learning based Intra Mode Derivation (DLIMD), which skips the module of intra mode signaling to save coding bits.
(2)
In DLIMD, the learned features and hand-crafted features are combined for intra mode derivation. Additionally, the proposed DLIMD can be applied to variable coding blocks (including non-square blocks) and different Quantization Parameter (QP) settings.
(3)
To further improve the performance, one additional binary flag is utilized to indicate the finally selected scheme from Rate Distortion (RD) cost competition. The proposed method achieves superior performance when compared with the state-of-the-art algorithms.
The remainder of this work is organized as follows. Motivation is presented in Section 2. The proposed DLIMD for video coding is discussed in detail in Section 3. The experiments are conducted and the results are analyzed in Section 4. Section 5 concludes this work.

2 Motivation

In VVC, intra coding modes/tools [31] include DC, Planar, 65 angular modes, Wide Angle Intra Prediction (WAIP), MRL, Position Dependent Prediction Combination (PDPC), MIP, Intra Sub-Partition (ISP), and CCLM. It should be mentioned that the intra mode is also required to be encoded and transmitted to the decoder side. To signal these intra modes efficiently, six candidates are derived from the intra modes of the neighbors and accommodated in the MPM list. Generally, the first entry of the MPM list is always fixed to the Planar mode and is encoded with two bits. The other five MPMs are derived from the spatial correlation with the neighbors and encoded with three to six bits. The non-MPM modes are divided into two parts containing 3 and 58 modes, which are encoded with truncated binary codes of six and seven bits, respectively. The detailed intra mode signaling is illustrated in Figure 1. In addition, statistical experiments are conducted on the platform of VVC Test Model version 5.0 (VTM 5.0) to measure the coding Bits Per intra Mode (BPM), where ten sequences with various contents from different classes are encoded under the All Intra (AI) configuration. The value of BPM is calculated as the total coding bits of intra modes divided by the number of intra blocks, where the coding bits are collected after CABAC entropy encoding. The statistical results are shown in the left columns of Table 1, and the values of BPM are 3.35, 3.48, 3.44, and 3.39 on average under the four QP settings.
Fig. 1. 67 intra mode signaling in VVC.
| Class | Sequence | \(\alpha\) (QP 22) | \(\alpha\) (QP 27) | \(\alpha\) (QP 32) | \(\alpha\) (QP 37) | \(\beta\) (QP 22) | \(\beta\) (QP 27) | \(\beta\) (QP 32) | \(\beta\) (QP 37) |
|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 2.18 | 2.90 | 2.92 | 2.97 | 5.39% | 10.9% | 14.7% | 18.3% |
| A | FoodMarket4 | 2.45 | 2.61 | 2.63 | 2.71 | 7.14% | 10.4% | 12.9% | 14.9% |
| B | BasketballDrive | 2.94 | 2.92 | 2.93 | 2.85 | 6.60% | 11.5% | 15.6% | 18.3% |
| B | BQTerrace | 3.40 | 3.57 | 3.53 | 3.44 | 5.27% | 9.46% | 14.1% | 18.9% |
| C | BQMall | 3.84 | 3.86 | 3.75 | 3.65 | 8.95% | 12.2% | 16.3% | 20.8% |
| C | BasketballDrill | 3.52 | 3.55 | 3.58 | 3.76 | 14.6% | 18.4% | 21.0% | 25.1% |
| D | BlowingBubbles | 4.34 | 4.25 | 4.17 | 3.82 | 7.88% | 11.5% | 16.3% | 20.6% |
| D | BasketballPass | 3.85 | 4.04 | 3.93 | 3.67 | 8.82% | 11.8% | 16.4% | 22.1% |
| E | FourPeople | 3.59 | 3.62 | 3.57 | 3.53 | 9.80% | 13.8% | 17.7% | 21.8% |
| E | Johnny | 3.37 | 3.43 | 3.43 | 3.50 | 8.39% | 13.1% | 18.7% | 23.2% |
| | AVERAGE | 3.35 | 3.48 | 3.44 | 3.39 | 8.28% | 12.3% | 16.4% | 20.4% |

Table 1. Statistical Results of Intra Mode Signaling (\(\alpha\): coding bits per intra mode (BPM); \(\beta\): percentage of coding bits of intra mode)
Furthermore, to demonstrate how many bits are spent on intra mode signaling, the percentage of coding bits of intra mode in a frame is collected and illustrated in the right columns of Table 1. It can be found that this percentage increases from 8.28% to 20.4% on average as the QP value increases. Under small QP settings, the percentage is limited because the coding bits of the residue (the difference between prediction and source) are much larger than those of the intra mode, while under large QP settings, the coding bits of the residue become limited, which results in a high percentage of coding bits of intra mode. From these results, we can conclude that a more advanced intra mode signaling approach can further improve the coding performance.

3 Proposed Deep Learning Based Intra Mode Derivation for Video Coding

3.1 Problem Formulation and Framework

In this work, we focus on optimizing the signaling of the DC, Planar, and 65 angular modes for the luma component, while the chroma component is not considered. According to Figure 1, a straightforward idea of improving coding performance is to predict the best intra mode among all 67 candidates and place it first in the MPM list. Such an intra mode derivation can achieve promising performance, because the intra mode signaling then consumes only two bits, fewer than in the other cases. However, it can be further improved by skipping the RD checking process and the intra mode signaling to save coding bits.
The optimal intra mode of current block is finally selected based on the minimum RD cost by checking the candidate list. This process can be represented by the following equation,
\begin{equation} n^* = \mathop {\arg \min }_{n}\big \lbrace D_n + \lambda \big (R_n^{r} + R_n^{m} + R_n^{o}\big)\big \rbrace , \end{equation}
(1)
where \(n\) indicates the index of intra mode, \(n \in [0, 66]\) for Planar, DC, and 65 angular modes, \(D_n\) is the distortion, \(\lambda\) is the Lagrange Multiplier, \(R_n^r\), \(R_n^m\), and \(R_n^o\) indicate the coding bits of residue, intra mode, and other information, respectively.
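As a concrete illustration of Equation (1), the following minimal Python sketch enumerates the 67 candidates and keeps the mode with the minimum RD cost. The evaluate() callback and the toy numbers are hypothetical stand-ins for the measurements that the real encoder would produce, not part of VTM.

```python
import random

def select_intra_mode(lam, evaluate):
    """Exhaustive RD-cost search over the 67 intra modes, as in Equation (1)."""
    best_mode, best_cost = None, float("inf")
    for n in range(67):                              # 0: Planar, 1: DC, 2..66: angular
        d, r_res, r_mode, r_other = evaluate(n)      # D_n, R_n^r, R_n^m, R_n^o
        cost = d + lam * (r_res + r_mode + r_other)  # Equation (1)
        if cost < best_cost:
            best_mode, best_cost = n, cost
    return best_mode, best_cost

def toy_evaluate(n):
    """Toy stand-in for the encoder measurements, only to make the sketch runnable."""
    random.seed(n)
    r_mode = 2 if n == 0 else 6                      # e.g., Planar costs two bits in the MPM list
    return random.uniform(50, 200), random.uniform(100, 400), r_mode, 1.0

print(select_intra_mode(lam=4.0, evaluate=toy_evaluate))
```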
According to Equation (1), the selection of the optimal mode from a pre-defined candidate list can be formulated as a multi-class classification task. Generally, the construction of the MPM list can be regarded as a manual classification scheme in which the top-6 intra modes are manually selected. To further improve the performance, we aim to solve this multi-class classification task with a deep learning approach. Specifically, the optimal intra mode can be derived directly instead of checking the candidate list, and the module of intra mode signaling is expected to be skipped to reduce coding bits. Figure 2 illustrates the framework of the proposed deep learning based intra mode derivation for video coding, where T and Q indicate transform and quantization, and T\(^{-1}\) and Q\(^{-1}\) indicate inverse transform and inverse quantization.
Fig. 2. Framework of proposed deep learning based intra mode derivation for video coding.
In the video encoder, intra mode derivation is performed, including the conventional intra mode checking from intra mode list and the proposed DLIMD. According to RD cost, only one of them will be finally selected. If DLIMD is selected, the strategy flag is set as 1 and the switch is opened for skipping the module of intra mode encoding; otherwise, the strategy flag is set as 0 and the switch is closed for activating the module of intra mode encoding. The strategy flag is always encoded and transmitted to indicate the selected scheme. It is worth mentioning that the other modules in video codec are not changed. The RD cost competition between DLIMD and traditional method (including DC, Planar, angular modes, MRL, MIP and ISP) can be represented by the following equation.
\begin{equation} S^* = \mathop {\arg \min }_{S}\big \lbrace D_S + \lambda \big (R_S^{r} + R_S^{m} + R_S^{f} + R_S^{o}\big)\big \rbrace , \end{equation}
(2)
where \(S^*\) indicates the selected scheme, i.e., \(S\in\) {DLIMD, traditional method}, \(D_S\) is the distortion under \(S\) scheme, \(\lambda\) is the Lagrange Multiplier, \(R_S^r\), \(R_S^m\), \(R_S^f\), and \(R_S^o\) indicate the coding bits of residue, intra mode, strategy flag, and other information, respectively. In addition, it should be noted that if the proposed DLIMD is selected, there are no coding bits for intra mode, i.e., \(R_S^m = 0\).
In the video decoder, the strategy flag is decoded first, before intra prediction. If this strategy flag is 0, the intra mode is decoded directly; otherwise, the intra mode is derived by the proposed DLIMD. With the intra mode, intra prediction is performed accordingly. Finally, the prediction result plus the decoded residual information produces the reconstruction.
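The RD competition of Equation (2) and the always-transmitted strategy flag can be sketched as follows. This is an illustrative Python fragment with toy cost numbers, not the actual VTM integration; it only shows how the two schemes are compared once their distortion and rate terms are known, and how DLIMD spends no bits on the intra mode itself.

```python
def select_scheme(d, r_res, r_mode, r_flag, r_other, lam):
    """RD competition of Equation (2) between DLIMD and the traditional method.
    Each argument is a dict keyed by scheme name; for DLIMD, r_mode must be 0."""
    costs = {s: d[s] + lam * (r_res[s] + r_mode[s] + r_flag[s] + r_other[s])
             for s in ("DLIMD", "traditional")}
    return min(costs, key=costs.get), costs

# Toy numbers only, to make the sketch executable: DLIMD spends no bits on the
# intra mode (R_S^m = 0), and both schemes pay for the one-bit strategy flag.
scheme, costs = select_scheme(
    d={"DLIMD": 120.0, "traditional": 118.0},
    r_res={"DLIMD": 300.0, "traditional": 298.0},
    r_mode={"DLIMD": 0.0, "traditional": 4.0},
    r_flag={"DLIMD": 1.0, "traditional": 1.0},
    r_other={"DLIMD": 2.0, "traditional": 2.0},
    lam=4.0,
)
print(scheme, costs)
```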
To estimate the upper bound of performance under the proposed framework, let \(\alpha\) and \(\beta\) be the original value of BPM and the percentage of coding bits of intra mode in a frame, whose statistical values are illustrated in Table 1, and let \(\gamma\) be the percentage of intra blocks that select the proposed scheme. One additional binary flag, encoded in context mode, is utilized to indicate the choice between the proposed scheme and the original scheme. Then, the value of BPM becomes \(- \gamma \times log_2(\gamma) + (1 - \gamma)\times {(\alpha - log_2(1 - \gamma)})\). Accordingly, the bit saving can be calculated as follows,
\begin{equation} \eta = \frac{\alpha - [- \gamma \times log_2(\gamma) + (1 - \gamma)\times {(\alpha - log_2(1 - \gamma)})]}{\alpha } \times \beta . \end{equation}
(3)
At the upper bound, \(\gamma = 100\%\) and the bit saving equals \(\beta\). As such, the upper bound of bit saving can reach 8.28%, 12.3%, 16.4%, and 20.4% when the QP value equals {22, 27, 32, 37}, respectively.
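A small Python helper for Equation (3) is given below as a sanity check. The only addition is an explicit guard for \(\gamma = 100\%\), where \(log_2(1-\gamma)\) is undefined but the limit of the expression gives a bit saving of exactly \(\beta\), and for \(\gamma = 0\), where the flag term vanishes.

```python
import math

def bit_saving(alpha, beta, gamma):
    """Equation (3): alpha is the original BPM, beta the share of intra-mode bits
    in a frame, gamma the fraction of intra blocks that select the proposed scheme."""
    if gamma >= 1.0:          # limit case: log2(1 - gamma) is undefined, saving -> beta
        return beta
    flag_bits = -gamma * math.log2(gamma) if gamma > 0.0 else 0.0
    new_bpm = flag_bits + (1.0 - gamma) * (alpha - math.log2(1.0 - gamma))
    return (alpha - new_bpm) / alpha * beta

# Table 1 averages at QP 37: alpha = 3.39, beta = 20.4%.
print(bit_saving(3.39, 0.204, 1.0))   # 0.204, i.e., the 20.4% upper bound
print(bit_saving(3.39, 0.204, 0.5))   # ~0.042, partial selection saves much less
```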
Besides the theoretical analysis, an experiment has been conducted to measure the upper bound of bit saving under the proposed framework. At the encoding stage, the optimal intra mode of the current block is obtained after rate distortion cost comparison among all the candidates. Then, this optimal intra mode is regarded as the predicted one, so intra mode signaling is skipped to reduce bits. The sequences are encoded with small QPs {11, 16, 21, 26}, normal QPs {22, 27, 32, 37}, large QPs {33, 38, 43, 48}, and the default AI configuration. The coding performance is measured by the Bjøntegaard Delta Bit Rate (BD-BR) [3] with respect to the original VTM. From Table 2, it can be found that the upper bound of bit saving can reach 5.68%, 12.3%, and 19.5% on average for the luma component under the small, normal, and large QP settings, which is close to the theoretical analysis. However, this is an ideal case that cannot be achieved in practice because the intra mode cannot always be predicted accurately.
| Class | Sequence | Y (small) | U (small) | V (small) | Y (normal) | U (normal) | V (normal) | Y (large) | U (large) | V (large) |
|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | –5.80 | –4.49 | –4.05 | –12.3 | –11.9 | –11.6 | –18.3 | –17.8 | –17.8 |
| A | FoodMarket4 | –4.12 | –5.19 | –4.76 | –10.8 | –8.99 | –8.73 | –14.2 | –13.4 | –14.9 |
| B | BasketballDrive | –3.00 | –2.31 | –3.45 | –10.4 | –9.63 | –10.7 | –18.2 | –19.0 | –21.1 |
| B | BQTerrace | –3.49 | –2.84 | –2.96 | –8.68 | –8.62 | –8.70 | –16.8 | –16.8 | –20.6 |
| C | BasketballDrill | –8.05 | –8.48 | –8.60 | –15.3 | –13.8 | –15.6 | –22.2 | –21.0 | –22.6 |
| C | BQMall | –6.00 | –5.63 | –5.80 | –11.3 | –9.59 | –10.7 | –19.1 | –20.2 | –17.3 |
| D | BasketballPass | –7.32 | –6.60 | –6.78 | –11.5 | –13.0 | –14.7 | –20.5 | –17.9 | –16.9 |
| D | BlowingBubbles | –6.53 | –6.10 | –6.56 | –12.7 | –10.8 | –13.4 | –20.7 | –19.0 | –15.9 |
| E | FourPeople | –7.14 | –6.89 | –7.33 | –15.3 | –14.5 | –13.8 | –21.8 | –21.0 | –20.2 |
| E | Johnny | –5.32 | –5.26 | –6.27 | –14.5 | –14.9 | –15.9 | –22.8 | –23.1 | –22.2 |
| | AVERAGE | –5.68 | –5.38 | –5.66 | –12.3 | –11.6 | –12.4 | –19.5 | –18.9 | –19.0 |

Table 2. Upper Bound of Bit Saving under the Proposed Framework in Terms of BD-BR (Unit: %). Small QPs: {11, 16, 21, 26}; normal QPs: {22, 27, 32, 37}; large QPs: {33, 38, 43, 48}

3.2 Deep Learning based Intra Mode Derivation

Figure 3 illustrates the architecture of the proposed deep learning based intra mode derivation scheme, which includes two neural networks: the feature learning network and the intra mode derivation network. The former extracts high-dimensional features, and the latter infers the optimal intra mode directly without RD cost checking. In particular, the hand-crafted and learned features are combined to enjoy their individual benefits. The detailed hyper-parameters of these two networks are listed in Tables 3 and 4.
Fig. 3. Architecture of deep learning based intra mode derivation.
| # | Type | Kernel | Stride | Outputs | Activation |
|---|---|---|---|---|---|
| 1a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 1b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 2a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 2b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 3a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 3b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 4 | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 5 | CNN | \(3\times 3\) | 1 | 64 | ReLU |

Table 3. Hyper-parameters of the Feature Learning Network
| # | Type | Input Size | Nodes | Activation |
|---|---|---|---|---|
| 1 | FCN | 33792 + 73 | 2048 | ReLU |
| 2 | FCN | 2048 + 73 | 2048 | ReLU |
| 3 | FCN | 2048 + 73 | 2048 | ReLU |
| 4 | FCN | 2048 + 73 | 2048 | ReLU |
| 5 | FCN | 2048 + 73 | 67 | SoftMax |

Table 4. Hyper-parameters of the Intra Mode Derivation Network
The feature learning network includes five convolutional layers, and the first three (each with two sub-layers) are placed in a parallel manner. The kernel sizes of the convolutional layers are \(1\times 1\) and \(3\times 3\), the Rectified Linear Unit (ReLU) is employed as the activation function, and the number of feature maps in each convolutional layer is 64. The intra mode derivation network includes five fully connected layers, and the number of nodes in each layer except the last one is 2048. In the last fully connected layer, the activation function is SoftMax and the number of nodes becomes 67, matching the number of intra modes. The hand-crafted features are included in the input of each fully connected layer, which is represented by,
\begin{equation} {\bf I}_i^f = {\rm {concat}}\big ({\bf O}^f_{i-1}, {\bf f}_0\big), i \in [1,5], \end{equation}
(4)
where \({\bf O}^f_{i-1}\) is the output of the \((i-1)^{th}\) fully connected layer, \({\bf f}_0\) indicates the hand-crafted features, and \({\bf O}^f_0\) is the reshaped vector of the output of the feature learning network. To avoid overfitting, dropout is performed at each fully connected layer, with the rate set as 0.5 at the training stage and 1.0 at the testing stage (i.e., all activations are kept during inference).
With the neighboring blocks and reference pixels, 73 hand-crafted features are collected, including 67 features from the gradient histogram, five features from the intra modes of neighboring blocks, and the QP value. The gradient histogram can be regarded as the probability of each candidate intra mode, and its detailed calculation can be found in [29]. Due to the high spatial correlation, the Up-Left (UL), Up (U), Up-Right (UR), Left (L), and Bottom-Left (BL) blocks provide their finally selected intra modes as five hand-crafted features, as shown in the hand-crafted feature collection module in Figure 3. The QP value balances reconstruction quality and coding bits, i.e., a lower QP value indicates better reconstruction quality and more coding bits, and vice versa, which has an impact on intra mode derivation. In addition, with the QP value as a feature, it is unnecessary to train different networks for different QP settings. At the frame boundary, the intra modes of unavailable neighbors are set as the Planar mode.
Due to lossy video coding, the neighboring blocks and reference pixels are degraded, which may affect the hand-crafted features, especially the gradient histogram. Therefore, learned features are also employed. As mentioned before, the coding block size is not fixed, and blocks can be flexibly partitioned from \(128\times 128\) to \(4\times 4\), including non-square patterns. Accordingly, the number of reference pixels differs across coding blocks. For a luma coding block, the width and height belong to {4, 8, 16, 32, 64} [31]. Therefore, it seems that 25 networks would be required, which challenges computational and storage resources. Instead, one single trained network should be applicable to variable coding blocks. In the designed architecture, the multi-line reference pixels are collected and padded to a fixed size to adapt to variable coding blocks, where the fixed memory is allocated according to the maximum available coding block \(64\times 64\). Suppose the current block has a dimension of \(4\times 4\); padding is performed on the lines that exceed the current block size, where 60 padded lines from the left and 60 padded lines from the top are always used. A matrix with size \((64 + 4 + 64)\times 4\) is fed to the feature learning network. Finally, the learned features with size \((64 + 4 + 64)\times 4\times 64 = 33792\) are produced and reshaped into a one-dimensional vector.
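A minimal Keras-style sketch of this architecture is given below. It assumes the \((64+4+64)\times 4\) single-channel reference input of the \(4\times 4\) example above, 73 hand-crafted features, and that the outputs of the two parallel branches in each of the first three layers are concatenated before the next layer (consistent with the FLOP counts reported below); it is an illustrative reconstruction under these assumptions, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dlimd(ref_shape=(132, 4, 1), num_handcrafted=73, num_modes=67):
    """Sketch of Figure 3: parallel 1x1/3x3 convolution pairs for feature learning,
    then fully connected layers with the hand-crafted features concatenated to
    every layer input, as in Equation (4)."""
    ref = layers.Input(shape=ref_shape, name="reference_pixels")
    f0 = layers.Input(shape=(num_handcrafted,), name="handcrafted_features")

    x = ref
    for _ in range(3):                    # layers 1-3: parallel 1x1 and 3x3 branches
        a = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
        b = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([a, b])
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)   # layer 4
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)   # layer 5
    x = layers.Flatten()(x)               # 132 x 4 x 64 = 33792 learned features

    for _ in range(4):                    # first four fully connected layers, 2048 nodes
        x = layers.Dense(2048, activation="relu")(layers.Concatenate()([x, f0]))
        x = layers.Dropout(0.5)(x)        # active only at training time
    out = layers.Dense(num_modes, activation="softmax")(layers.Concatenate()([x, f0]))
    return tf.keras.Model(inputs=[ref, f0], outputs=out)

model = build_dlimd()
model.summary()
```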
In addition, the number of FLoating-point OPerations (FLOPs) [26] is used to evaluate the complexity of neural network. For convolutional layer,
\begin{equation} FLOPs = 2H \times W \times (C_{in}\times K^2 + 1) \times C_{out}, \end{equation}
(5)
where \(H\), \(W\) and \(C_{in}\) are height, width and number of channels of the input feature map, \(K\) is the kernel size, and \(C_{out}\) is the number of output channels. The values of FLOPs in convolutional layers are \(1.3\times 10^5\), \(6.7\times 10^5\), \(8.7\times 10^6\), \(7.7\times 10^7\), \(8.7\times 10^6\), \(7.7\times 10^7\), \(7.7\times 10^7\), and \(3.9\times 10^7\), respectively. For fully connected layer,
\begin{equation} FLOPs = (2I - 1)\times O, \end{equation}
(6)
where \(I\) is the input dimensionality and \(O\) is the output dimensionality. The values of FLOPs in fully connected layers are \(1.3\times 10^8\), \(8.6\times 10^6\), \(8.6\times 10^6\), \(8.6\times 10^6\), and \(2.8\times 10^5\), respectively.
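Equations (5) and (6) can be checked directly with a few lines of Python. The spatial size 132\(\times\)4 below follows the \(4\times 4\)-block example above, and the 128-channel input of layer 2b assumes that the outputs of the two parallel branches of layer 1 are concatenated; the printed values are computed from the formulas, not copied from the text.

```python
def conv_flops(h, w, c_in, k, c_out):
    """Equation (5): FLOPs of one convolutional layer with unchanged spatial size."""
    return 2 * h * w * (c_in * k * k + 1) * c_out

def fc_flops(i, o):
    """Equation (6): FLOPs of one fully connected layer."""
    return (2 * i - 1) * o

print(conv_flops(132, 4, 1, 1, 64))     # layer 1a: ~1.3e5
print(conv_flops(132, 4, 128, 3, 64))   # layer 2b: ~7.8e7 (input = concatenated 128 maps)
print(fc_flops(33792 + 73, 2048))       # first fully connected layer: ~1.4e8
```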

3.3 Neural Network Training

The DIV2K database [37] with 900 images (whose resolution ranges from \(2040\times 648\) to \(2040\times 2040\)) is used to generate the training dataset. These images are all resized to \(2048\times 1536\), converted from the RGB color space to the YCbCr color space, and packed as a sequence. Then, this pseudo sequence is encoded by VTM 5.0 with the default AI configuration to collect the training samples, where the QP values are set as {22, 27, 32, 37}. During encoding, the hand-crafted features and the source reference pixels of the current block are collected together with the associated label (intra mode), regardless of the coding block size. According to Figure 3 in [51], the distribution of intra modes is uneven: the Planar, DC, horizontal, and vertical modes are selected more frequently than the other intra modes. Such unbalanced data may cause the training of the multi-class classification network to fail. Therefore, the number of training samples for each label and QP is fixed at 50,000, and the total number is \(50000 \times 4 \times 67\). In total, the volume of training data reaches about 80 GB. In addition, \(20000\times 67\) samples are selected for validation, with 20,000 validation samples for each label.
In this work, the TensorFlow package is adopted for network training on an NVIDIA GeForce 1080 Ti GPU with the AdamOptimizer. The workstation memory is 112 GB, which is able to accommodate the training samples. For this multi-class classification task, the cross entropy is utilized as the loss function,
\begin{equation} L = -\frac{1}{N}\sum ^N_{i=1} \sum ^M_{j=1} \big \lbrace y_{j}^{i} \times ln \big (x_{j}^{i}\big)\big \rbrace , \end{equation}
(7)
where \(N\) is the number of training samples in a batch, \(M\) is the number of classes, i.e., \(M=67\), \(y_{j}^{i}\) is the \(j^{th}\) element of the one-hot ground truth of the \(i^{th}\) training sample, and \(x_{j}^{i}\) is the corresponding output of the intra mode derivation network after the softmax layer. For example, if the intra mode of the \(i^{th}\) training sample is 4, then \(y^i_4 = 1\) and \(y^i_j = 0\) \((j \ne 4)\), which can be written as \({\bf y}^i = [y^i_1, y^i_2, \dots , y^i_j, \dots , y^i_{67}] = [0, 0, 0, 1, 0, \dots , 0]\). The batch size and the number of epochs are set as 1024 and 1000, respectively. The initial learning rate \(r_0\) is \(1\times 10^{-4}\), and it is updated after each epoch as \(r_0 \times 0.999^i\), where \(i\) is the index of the training epoch.
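A compact NumPy sketch of the loss in Equation (7) and of the learning-rate schedule follows; the small eps constant is added only for numerical stability and is not part of the paper's formula, and the toy batch is made up for illustration.

```python
import numpy as np

def cross_entropy(y_onehot, x_softmax, eps=1e-12):
    """Equation (7): mean cross entropy over a batch; eps only avoids log(0)."""
    return -np.mean(np.sum(y_onehot * np.log(x_softmax + eps), axis=1))

def learning_rate(epoch, r0=1e-4):
    """Learning rate after the given epoch, r0 * 0.999^i."""
    return r0 * 0.999 ** epoch

# Toy batch with N = 2 samples and M = 67 classes.
y = np.zeros((2, 67)); y[0, 3] = 1.0; y[1, 50] = 1.0   # one-hot ground truth
x = np.full((2, 67), 1.0 / 67)                          # uniform softmax output
print(cross_entropy(y, x))     # ln(67) ~ 4.20 for an uninformed predictor
print(learning_rate(1000))     # ~3.7e-5 after the last training epoch
```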

4 Experimental Results and Analyses

4.1 Coding Performance Comparison

The experiments are conducted on the VTM 5.0 platform following the default AI configuration and the Common Test Conditions (CTC) [5]. The workstation is equipped with an Intel Core i7-4790 CPU @2.60 GHz and the Windows 7 Enterprise 64-bit operating system. The original VTM 5.0 is regarded as the anchor for coding performance comparison, which is evaluated by BD-BR. Twenty-two sequences with various contents and resolutions, different from the training dataset, are utilized in the experiments.
Table 5 illustrates the values of BPM under the proposed method. Two sequences of each class, identical to those in Table 1, are utilized for this experiment. During intra coding, the total bits of intra modes and the number of intra blocks are collected for BPM calculation when the QP values are set as {22, 27, 32, 37}. Compared with Table 1, the average values of BPM change from 3.35, 3.48, 3.44, and 3.39 to 2.38, 2.35, 2.22, and 2.20 under the four QP settings, respectively. In general, the coding bit saving of intra mode, ignoring the residue and other information, can be calculated as follows,
\begin{equation} \eta ^{\prime } = \frac{\alpha - {\alpha }^{\prime }}{\alpha } \times 100 \%, \end{equation}
(8)
where \(\alpha\) is the original value of BPM and \(\alpha ^{\prime }\) is the current value of BPM. Accordingly, the bit saving of intra mode can reach 30.4%, 33.2%, 36.2%, and 35.3% on average under four QP settings, respectively.
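Equation (8) can be verified against the tables directly; for example, Tango2 at QP 22 has \(\alpha = 2.18\) in Table 1 and \(\alpha ^{\prime } = 1.02\) in Table 5, which reproduces the 53.2% saving reported there. A one-line Python helper:

```python
def intra_mode_bit_saving(alpha, alpha_prime):
    """Equation (8): relative saving of intra-mode coding bits, in percent."""
    return (alpha - alpha_prime) / alpha * 100.0

# Tango2 at QP 22: BPM drops from 2.18 (Table 1) to 1.02 (Table 5).
print(intra_mode_bit_saving(2.18, 1.02))   # ~53.2%, as reported in Table 5
```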
| Class | Sequence | \(\alpha ^{\prime }\) (QP 22) | \(\alpha ^{\prime }\) (QP 27) | \(\alpha ^{\prime }\) (QP 32) | \(\alpha ^{\prime }\) (QP 37) | \(\eta ^{\prime }\) (QP 22) | \(\eta ^{\prime }\) (QP 27) | \(\eta ^{\prime }\) (QP 32) | \(\eta ^{\prime }\) (QP 37) |
|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 1.02 | 1.46 | 1.44 | 1.71 | 53.2% | 49.7% | 50.7% | 42.4% |
| A | FoodMarket4 | 1.33 | 1.50 | 1.53 | 1.82 | 45.7% | 42.5% | 41.8% | 32.8% |
| B | BasketballDrive | 1.99 | 1.83 | 1.65 | 1.69 | 32.3% | 37.3% | 43.7% | 40.7% |
| B | BQTerrace | 2.76 | 2.53 | 2.36 | 2.25 | 18.8% | 29.1% | 33.1% | 34.6% |
| C | BQMall | 2.84 | 2.78 | 2.59 | 2.40 | 26.0% | 28.0% | 30.9% | 34.2% |
| C | BasketballDrill | 2.73 | 2.60 | 2.49 | 2.58 | 22.4% | 26.8% | 30.4% | 31.4% |
| D | BlowingBubbles | 3.09 | 2.92 | 2.71 | 2.45 | 28.8% | 31.3% | 35.0% | 35.9% |
| D | BasketballPass | 3.16 | 3.02 | 2.78 | 2.57 | 17.9% | 25.2% | 29.3% | 30.0% |
| E | FourPeople | 2.51 | 2.40 | 2.31 | 2.23 | 30.1% | 33.7% | 35.3% | 36.8% |
| E | Johnny | 2.41 | 2.45 | 2.34 | 2.32 | 28.5% | 28.6% | 31.8% | 33.7% |
| | AVERAGE | 2.38 | 2.35 | 2.22 | 2.20 | 30.4% | 33.2% | 36.2% | 35.3% |

Table 5. Coding Bits Per Intra Mode under the Proposed Method (\(\alpha ^{\prime }\): current value of BPM; \(\eta ^{\prime }\): coding bit saving of intra mode)
Three state-of-the-art works are adopted for coding performance comparison. Narsallah's scheme [29] derives the intra mode with a gradient histogram and finally selects the one with the highest probability. Abdoli's scheme [1] produces a new intra prediction result by weighting the two intra modes with the highest probabilities in the gradient histogram. Li's scheme [22] reconstructs the MPM list with short and long range correlations. These three works optimize intra mode coding from different directions and are related to the proposed method, so their coding performance is compared in Table 6.
| Class | Sequence | Y [29] | U [29] | V [29] | Y [1] | U [1] | V [1] | Y [22] | U [22] | V [22] | Y (Prop.) | U (Prop.) | V (Prop.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –0.20 | –0.41 | 0.01 | –0.69 | –0.91 | –0.16 | –0.08 | –0.04 | 0.13 | –2.50 | –2.77 | –2.20 |
| A1 | FoodMarket4 | –0.36 | 0.06 | –0.44 | –0.81 | –0.33 | –0.68 | –0.02 | –0.14 | 0.00 | –2.65 | –1.66 | –1.14 |
| A1 | Campfire | –0.18 | –0.16 | –0.34 | –0.42 | –0.20 | –0.12 | –0.08 | 0.00 | –0.19 | –2.38 | –1.14 | –1.98 |
| A2 | CatRobot1 | –0.13 | –0.29 | –0.33 | –0.33 | –0.08 | –0.36 | –0.07 | –0.08 | –0.01 | –2.69 | –1.90 | –2.44 |
| A2 | DaylightRoad2 | 0.02 | –0.29 | 0.01 | –0.29 | –0.32 | –0.10 | –0.26 | –0.13 | –0.15 | –2.70 | –2.66 | –2.63 |
| A2 | ParkRunning3 | –0.04 | –0.02 | –0.11 | –0.27 | –0.25 | –0.29 | –0.05 | –0.07 | –0.08 | –1.04 | –0.85 | –0.88 |
| B | MarketPlace | –0.12 | –0.08 | –0.09 | –0.35 | 0.00 | –0.43 | –0.10 | –0.23 | –0.20 | –2.29 | –1.51 | –1.61 |
| B | RitualDance | –0.46 | –0.41 | –0.33 | –0.65 | –0.32 | –0.31 | –0.02 | –0.23 | –0.20 | –1.81 | –1.67 | –1.55 |
| B | Cactus | –0.06 | 0.09 | 0.02 | –0.35 | 0.03 | –0.21 | –0.07 | –0.18 | –0.06 | –2.49 | –1.17 | –3.41 |
| B | BasketballDrive | –0.20 | –0.67 | –0.16 | –0.67 | –0.82 | –0.25 | –0.10 | –0.61 | –0.35 | –2.42 | –2.68 | –2.04 |
| B | BQTerrace | 0.00 | –0.28 | –0.11 | –0.27 | –0.37 | –0.26 | –0.15 | –0.31 | –0.41 | –1.88 | –1.77 | –2.40 |
| C | BasketballDrill | 0.15 | 0.04 | 0.80 | –0.27 | 0.14 | 0.89 | –0.31 | –0.01 | –0.69 | –2.29 | 1.18 | –2.62 |
| C | BQMall | –0.37 | –0.50 | 0.14 | –0.52 | –0.27 | –0.37 | –0.01 | 0.40 | –0.01 | –2.89 | –1.93 | –1.36 |
| C | PartyScene | –0.18 | –0.11 | –0.16 | –0.35 | –0.20 | –0.29 | –0.16 | 0.02 | –0.03 | –1.93 | –2.45 | –0.71 |
| C | RaceHorsesC | –0.16 | –0.16 | –0.39 | –0.49 | –0.16 | –0.11 | 0.01 | 0.08 | –0.08 | –1.78 | –0.89 | –2.01 |
| D | BasketballPass | –0.20 | –0.14 | –0.02 | –0.41 | –0.01 | –0.38 | 0.04 | –0.50 | –0.59 | –1.67 | –2.31 | –4.17 |
| D | BQSquare | –0.33 | –0.27 | –0.02 | –0.23 | –0.08 | –0.24 | –0.01 | 0.14 | 0.04 | –1.92 | –0.13 | –1.73 |
| D | BlowingBubbles | –0.27 | –0.68 | –0.95 | –0.69 | –0.53 | –0.96 | 0.03 | –0.55 | –0.34 | –2.09 | –1.36 | –1.67 |
| D | RaceHorses | –0.25 | 0.29 | 0.38 | –0.43 | –0.65 | –0.38 | –0.04 | –0.03 | 0.69 | –1.84 | –2.48 | –1.42 |
| E | FourPeople | –0.41 | –0.52 | –0.38 | –0.55 | –0.77 | 0.04 | 0.03 | –0.17 | 0.00 | –3.21 | –2.61 | –2.28 |
| E | Johnny | –0.25 | –0.84 | –0.39 | –0.45 | –1.01 | –0.54 | 0.04 | –0.16 | –0.07 | –2.31 | –3.71 | –5.30 |
| E | KristenAndSara | –0.27 | –0.32 | 0.12 | –0.48 | –0.36 | –0.47 | 0.02 | –0.43 | 0.07 | –3.31 | –1.79 | –2.44 |
| | AVERAGE | –0.19 | –0.26 | –0.12 | –0.45 | –0.34 | –0.27 | –0.06 | –0.15 | –0.12 | –2.28 | –1.74 | –2.18 |

Table 6. Performance Comparison in Terms of BD-BR with QPs {22, 27, 32, 37} (Unit: %). [29]: Narsallah's scheme; [1]: Abdoli's scheme; [22]: Li's scheme; Prop.: proposed method
Narsallah's scheme [29] reduces the bit rate by 0.19%, 0.26%, and 0.12% on average for the Y, U, and V components, respectively. Abdoli's scheme [1] saves 0.45%, 0.34%, and 0.27% of the bit rate for the Y, U, and V components. Li's scheme [22] achieves 0.06%, 0.15%, and 0.12% bit rate reduction on average for the luma and two chroma components, respectively. Regarding the proposed method, the bit rate reduction reaches 2.28%, 1.74%, and 2.18% on average for the luma and two chroma components, respectively. From this comparison, it can be observed that the proposed method outperforms the other three methods. Compared with Narsallah's scheme [29], the proposed method not only adopts the existing hand-crafted features, but also learns features in a high-dimensional space for intra mode derivation.
In addition, the test sequences are encoded under the small QP setting {11, 16, 21, 26} and the large QP setting {33, 38, 43, 48} to evaluate the performance of the proposed method. It should be noted that the neural network is not re-trained. The coding performance is shown in Table 7. The bit rate reductions reach 0.71% and 3.64% for the luma component under the small and large QP settings, respectively. Compared with the results in Table 6, the performance under the normal QP setting is slightly worse than that under the large QP setting and better than that under the small QP setting. The reason is that the percentage of coding bits of intra mode in a frame grows as the QP value increases, and vice versa. Consequently, the compression efficiency is greatly improved by the proposed method in the low bit rate scenario.
| Class | Sequence | Y (small) | U (small) | V (small) | Y (large) | U (large) | V (large) |
|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –0.76 | –0.32 | 0.11 | –3.68 | –4.01 | –3.88 |
| A1 | FoodMarket4 | –0.31 | –0.77 | –0.60 | –3.24 | –2.89 | –3.28 |
| A1 | Campfire | –1.00 | –0.66 | –0.67 | –4.37 | –2.63 | –3.27 |
| A2 | CatRobot1 | –0.79 | –0.25 | –0.42 | –4.37 | –2.63 | –3.27 |
| A2 | DaylightRoad2 | –0.44 | –0.18 | –0.03 | –5.03 | –4.61 | –5.28 |
| A2 | ParkRunning3 | –0.26 | –0.26 | –0.23 | –2.33 | –1.60 | –1.72 |
| B | MarketPlace | –0.58 | –0.57 | –0.04 | –2.97 | –5.29 | 0.23 |
| B | RitualDance | –0.29 | –0.58 | –1.14 | –4.87 | –6.58 | –4.68 |
| B | Cactus | –0.60 | –0.49 | –0.46 | –4.15 | –1.81 | –3.41 |
| B | BasketballDrive | –0.60 | 0.01 | –0.86 | –3.65 | –4.40 | –4.16 |
| B | BQTerrace | –0.63 | –0.36 | –0.47 | –3.70 | –3.57 | –6.81 |
| C | BasketballDrill | –0.87 | –1.63 | –1.20 | –3.03 | –5.43 | 0.93 |
| C | BQMall | –1.00 | –0.59 | –0.86 | –4.16 | –4.52 | –5.75 |
| C | PartyScene | –0.84 | –0.51 | –0.58 | –3.71 | –0.98 | –6.77 |
| C | RaceHorsesC | –0.70 | –0.55 | –0.57 | –3.35 | –2.35 | –4.11 |
| D | BasketballPass | –0.66 | 0.53 | –1.87 | –1.91 | –6.51 | –3.23 |
| D | BQSquare | –0.89 | –0.79 | –1.62 | –3.84 | –10.4 | –9.82 |
| D | BlowingBubbles | –0.76 | –1.34 | –0.59 | –2.88 | –0.62 | 1.48 |
| D | RaceHorses | –0.89 | –0.79 | –1.62 | –3.84 | –10.4 | –9.82 |
| E | FourPeople | –0.87 | –0.71 | –0.97 | –2.96 | –2.23 | –4.26 |
| E | Johnny | –1.21 | –1.08 | –0.89 | –4.20 | –3.56 | –4.80 |
| E | KristenAndSara | –0.62 | –0.64 | –1.15 | –4.36 | –3.06 | –2.66 |
| | AVERAGE | –0.71 | –0.57 | –0.76 | –3.64 | –4.15 | –4.00 |

Table 7. Performance Evaluation in Terms of BD-BR with Different QP Settings (Unit: %). Small QPs: {11, 16, 21, 26}; large QPs: {33, 38, 43, 48}
As shown in Figure 4, the first frames of six sequences, including BasketballPass (\(416\times 240\)), BQSquare (\(416\times 240\)), BQMall (\(832\times 480\)), BasketballDrill (\(832\times 480\)), FourPeople (\(1280\times 720\)), and Johnny (\(1280\times 720\)), are utilized to visualize the coding blocks that select the proposed DLIMD in a frame, where the QP value is 22. The selected blocks are marked in five different colors according to their size: blocks smaller than \(8\times 8\) are marked in red; blocks greater than \(8\times 8\) and smaller than \(16\times 16\) in green; blocks greater than \(16\times 16\) and smaller than \(32\times 32\) in blue; blocks greater than \(32\times 32\) and smaller than \(64\times 64\) in black; otherwise, the block is marked in white. It can be clearly observed that many coding blocks select the proposed DLIMD. Moreover, the quantitative results are presented in Table 8 under different QP settings and sequences. The percentage of DLIMD selection is calculated as the ratio of the selected area to the whole frame, which is represented by,
\begin{equation} \Omega = \frac{\sum _{i=1}^{N}C_i \times w_i \times h_i}{\sum _{i=1}^{N}w_i \times h_i} \times 100\%, \end{equation}
(9)
where \(N\) indicates the number of coding blocks in a frame, \(C_i\) indicates the DLIMD selection (\(C_i = 0\) if the current coding block does not select DLIMD), and \(w_i\) and \(h_i\) are the width and height of the current coding block. From this table, the percentage reaches 42.9%, 45.2%, 48.5%, and 50.5% on average under the four QP settings, respectively, which indicates that the coding performance can be efficiently improved. In addition, the selected blocks under the four QP settings are re-organized according to block size. There are 17 available block sizes for these sequences, i.e., \(4\times 4\), \(4\times 8\), \(4\times 16\), \(4\times 32\), \(8\times 4\), \(8\times 8\), \(8\times 16\), \(8\times 32\), \(16\times 4\), \(16\times 8\), \(16\times 16\), \(16\times 32\), \(32\times 4\), \(32\times 8\), \(32\times 16\), \(32\times 32\), and \(64\times 64\). For each block size, the ratio of the number of selected blocks to the total number of blocks is calculated, as shown in Figure 5. It can be observed that the ratio ranges from 43.0% to 54.5%.
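Equation (9) is an area-weighted ratio; the following Python sketch computes it for a toy list of blocks (the block sizes and selection flags are made up purely for illustration).

```python
def dlimd_selection_ratio(blocks):
    """Equation (9): area-weighted share of coding blocks that select DLIMD.
    `blocks` is a list of (selected, width, height) tuples, selected in {0, 1}."""
    selected_area = sum(c * w * h for c, w, h in blocks)
    total_area = sum(w * h for _, w, h in blocks)
    return selected_area / total_area * 100.0

# Toy frame with three blocks: a selected 32x32, a skipped 16x16, a selected 8x8.
print(dlimd_selection_ratio([(1, 32, 32), (0, 16, 16), (1, 8, 8)]))   # ~81.0%
```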
Fig. 4. DLIMD selected in a frame. (They are resized to the same resolution for visualization.)
Fig. 5. Percentage of selected blocks according to block size.
| Class | Sequence | QP = 22 | QP = 27 | QP = 32 | QP = 37 |
|---|---|---|---|---|---|
| A1 | Tango2 | 39.9 | 41.3 | 44.3 | 46.5 |
| A1 | FoodMarket4 | 38.5 | 37.6 | 41.4 | 44.2 |
| A1 | Campfire | 48.0 | 48.6 | 49.8 | 53.5 |
| A2 | CatRobot1 | 41.3 | 47.3 | 50.5 | 50.6 |
| A2 | DaylightRoad2 | 45.4 | 50.9 | 52.8 | 52.4 |
| A2 | ParkRunning3 | 43.6 | 46.5 | 47.2 | 47.9 |
| B | MarketPlace | 41.3 | 44.5 | 47.1 | 48.9 |
| B | RitualDance | 43.8 | 45.9 | 50.2 | 52.2 |
| B | Cactus | 40.8 | 45.1 | 48.1 | 50.7 |
| B | BasketballDrive | 43.4 | 46.1 | 50.6 | 52.6 |
| B | BQTerrace | 41.7 | 48.4 | 51.7 | 54.4 |
| C | BasketballDrill | 57.8 | 60.3 | 59.3 | 56.1 |
| C | BQMall | 43.3 | 44.9 | 48.5 | 50.5 |
| C | PartyScene | 41.8 | 45.5 | 47.8 | 52.5 |
| C | RaceHorsesC | 39.8 | 43.5 | 45.4 | 49.9 |
| D | BasketballPass | 45.5 | 40.8 | 48.8 | 51.3 |
| D | BQSquare | 39.6 | 40.6 | 45.5 | 49.8 |
| D | BlowingBubbles | 41.5 | 42.8 | 47.7 | 48.7 |
| D | RaceHorses | 38.8 | 39.7 | 45.6 | 50.3 |
| E | FourPeople | 43.2 | 45.5 | 47.5 | 48.2 |
| E | Johnny | 42.3 | 45.0 | 49.2 | 50.0 |
| E | KristenAndSara | 42.2 | 43.3 | 47.4 | 49.4 |
| | AVERAGE | 42.9 | 45.2 | 48.5 | 50.5 |

Table 8. Percentage of the Proposed Method Selection (Unit: %)

4.2 Influence of Learned and Hand-crafted Features

The individual influences of the learned features and the hand-crafted features in the proposed architecture (shown in Figure 3) are analyzed. Four cases are compared: (1) H: the feature learning module is removed and only the hand-crafted features are used for intra mode derivation; (2) L: the hand-crafted features are removed and only the learned features are used for intra mode derivation; (3) H'+L: both the hand-crafted features (excluding the gradient histogram) and the learned features are used for intra mode derivation; and (4) H+L: both the hand-crafted features and the learned features are used for intra mode derivation.
With the same training samples described in Section 3.3, three more neural networks are trained separately according to the listed cases. The training process is the same as that in Section 3.3, where an NVIDIA GeForce 1080 Ti GPU with the AdamOptimizer is adopted and the loss function is the cross entropy. Figure 6 illustrates the comparison of these four cases in terms of training loss and validation classification accuracy. The classification accuracy is calculated as follows,
\begin{equation} P = \frac{1}{N}\sum _{i=1}^N \delta _i \times 100\%, \end{equation}
(10)
where \(N\) is the number of testing samples, and \(\delta _i=1\) if the difference between the predicted label and the ground truth is no more than a pre-defined threshold \(\Delta\), i.e., \(\Vert \texttt {argmax}({\bf y}^i) - \texttt {argmax}({\bf O}^f_5)\Vert \le \Delta\); otherwise \(\delta _i=0\). argmax() returns the position of the maximum value in a vector, \({\bf y}^i\) is the ground truth represented in the one-hot manner, and \({\bf O}^f_5\) is the output of the intra mode derivation network. Here, the value of \(\Delta\) is set as 0. The cases with only hand-crafted and only learned features both converge at around 3.8 cross entropy loss and achieve about 25% validation classification accuracy, the case of H'+L converges at about 3.6 cross entropy loss and achieves about 30% validation classification accuracy, while the combination of hand-crafted and learned features converges at 3.5 cross entropy loss and achieves about 35% validation classification accuracy.
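A NumPy sketch of Equation (10) is given below; it simply compares mode indices by argmax and applies the threshold \(\Delta\), with made-up one-hot vectors for illustration.

```python
import numpy as np

def classification_accuracy(y_true_onehot, y_pred_softmax, delta=0):
    """Equation (10): a sample counts as correct when the index distance between
    the predicted label and the ground truth is at most delta."""
    diff = np.abs(np.argmax(y_true_onehot, axis=1) - np.argmax(y_pred_softmax, axis=1))
    return np.mean(diff <= delta) * 100.0

# Toy check with three samples whose predictions are exact, off by one, off by ten.
truth = np.eye(67)[[10, 20, 30]]
pred = np.eye(67)[[10, 21, 40]]
print(classification_accuracy(truth, pred, delta=0))   # 33.3%
print(classification_accuracy(truth, pred, delta=1))   # 66.7%
```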
Fig. 6. Comparison of four cases with hand-crafted and learned features.
From these results, it can be clearly observed that combining the hand-crafted and learned features achieves the best performance among the four cases. The reason is that although the CNN is able to extract high-level features and latent representations, the hand-crafted features can still provide useful information and compensate for the limitations of the learned features. For example, the intra modes of the spatial neighbors, which cannot be learned by the feature learning network, play an important role in intra mode derivation.

4.3 Ablation Study of Architecture

In addition, we further analyze the impact of the modules in the network architecture. Alternative networks are designed and illustrated in Figure 7. Different from the proposed network shown in Figure 3, the convolutional layers are placed in a serial manner; the number of feature maps in the first three layers is 128, which matches the input of convolutional layers 2a, 2b, 3a, and 3b in Figure 3; the kernel sizes of the first and third convolutional layers are set as \(1\times 1\) and the others are \(3\times 3\); and the hand-crafted features are only combined with the first fully connected layer.
Fig. 7. Alternative network. (H: hand-crafted feature, L: learned feature).
Three configurations are compared, i.e., Case A: feature learning network in Figure 3 and intra mode derivation network in Figure 3; Case B: feature learning network in Figure 3 and intra mode derivation network in Figure 7; Case C: feature learning network in Figure 7 and intra mode derivation network in Figure 3. It should be noted that Case A is the proposed one. Two more networks for Cases B and C are trained with the same samples, and the results are compared in terms of multi-class classification accuracy. Two test sequences from each class defined in the CTC [5] are employed, i.e., BasketballPass (\(416\times 240\)), BlowingBubbles (\(416\times 240\)), BQMall (\(832\times 480\)), BasketballDrill (\(832\times 480\)), FourPeople (\(1280\times 720\)), Johnny (\(1280\times 720\)), BasketballDrive (\(1920\times 1080\)), BQTerrace (\(1920\times 1080\)), Tango2 (\(3840\times 2160\)), and FoodMarket4 (\(3840\times 2160\)). These sequences are all encoded by VTM 5.0 with the default AI configuration, where the QP values are set as {22, 27, 32, 37}, and the testing samples are collected during encoding. For each sequence, \(800\times 67\times 4\) samples are selected under the four QP settings. Table 9 illustrates the experimental results, where the classification accuracy is calculated by Equation (10). Under the condition of \(\Delta = 0\), the multi-class classification accuracies are 34.8%, 31.4%, and 32.6% on average for Cases A, B, and C, respectively. As such, we can conclude that combining the hand-crafted features with each fully connected layer and placing convolutional kernels of different sizes in parallel achieve better performance.
| Class | Sequence | A (\(\Delta\)=0) | A (\(\Delta\)=1) | A (\(\Delta\)=3) | A (\(\Delta\)=5) | B (\(\Delta\)=0) | B (\(\Delta\)=1) | B (\(\Delta\)=3) | B (\(\Delta\)=5) | C (\(\Delta\)=0) | C (\(\Delta\)=1) | C (\(\Delta\)=3) | C (\(\Delta\)=5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 31.8 | 45.3 | 59.8 | 68.0 | 28.4 | 41.6 | 56.7 | 65.3 | 29.6 | 43.3 | 58.1 | 67.1 |
| A | FoodMarket4 | 34.5 | 50.9 | 67.0 | 75.3 | 31.1 | 47.8 | 64.2 | 73.3 | 31.8 | 49.1 | 65.9 | 74.4 |
| B | BasketballDrive | 37.2 | 52.6 | 65.6 | 72.1 | 34.0 | 49.6 | 62.5 | 69.5 | 34.5 | 49.3 | 63.1 | 70.2 |
| B | BQTerrace | 32.8 | 47.3 | 58.6 | 64.9 | 28.7 | 43.0 | 54.4 | 61.3 | 31.0 | 46.1 | 58.0 | 64.7 |
| C | BQMall | 35.6 | 50.9 | 62.3 | 67.5 | 32.9 | 47.4 | 58.2 | 64.1 | 33.9 | 50.0 | 60.9 | 66.8 |
| C | BasketballDrill | 37.3 | 55.0 | 66.9 | 72.4 | 33.8 | 52.4 | 64.6 | 70.0 | 35.4 | 54.0 | 66.9 | 72.5 |
| D | BlowingBubbles | 32.1 | 47.2 | 61.1 | 67.8 | 29.2 | 44.1 | 58.2 | 65.7 | 30.2 | 46.0 | 60.8 | 68.3 |
| D | BasketballPass | 34.7 | 50.8 | 63.4 | 69.3 | 32.0 | 48.0 | 59.5 | 65.8 | 32.5 | 49.9 | 62.2 | 68.2 |
| E | FourPeople | 34.1 | 49.3 | 61.5 | 67.7 | 29.8 | 44.8 | 56.8 | 62.9 | 32.2 | 47.5 | 59.7 | 66.1 |
| E | Johnny | 37.6 | 54.7 | 68.4 | 75.1 | 34.2 | 51.5 | 65.5 | 72.7 | 35.2 | 52.5 | 66.6 | 73.4 |
| | AVERAGE | 34.8 | 50.4 | 63.5 | 70.0 | 31.4 | 47.0 | 60.1 | 67.1 | 32.6 | 48.7 | 62.2 | 69.1 |

Table 9. Multi-class Classification Accuracy (Unit: %). A: Case A (proposed); B: Case B; C: Case C
In addition, the normalized confusion matrices of Case A are illustrated in Figure 8, where the horizontal axis is the predicted label and the vertical axis is the ground truth. It can be observed that the difference between the ground truth and the predicted label is limited. For the proposed Case A, the average classification accuracies under the four conditions are 34.8%, 50.4%, 63.5%, and 70.0% in Table 9, respectively. Although there are some differences between the predicted label and the ground truth under the conditions of \(\Delta = \lbrace 1, 3, 5\rbrace\), the intra prediction results may be similar, and the RDO will balance the distortion and coding bits during video coding. Therefore, coding gains can still be achieved with a limited difference between the predicted label and the ground truth.
Fig. 8. Confusion matrix of multi-class classification under Case A.

4.4 Computational Complexity Analyses

Additionally, the coding/decoding time of video codec equipped with the DLIMD is compared with that of the anchor, which is calculated by,
\begin{equation} \Delta T_m = \frac{1}{4}\sum _{i=1}^{4}{\frac{T_{\Psi }^m(QP_i)}{T_{c}^m(QP_i)}}, \end{equation}
(11)
where \(T_{c}^m(QP_i)\) is the coding/decoding time of the anchor under \(QP_i\), \(T_{\Psi }^m(QP_i)\) is the coding/decoding time of the video codec equipped with the proposed method under \(QP_i\), and \(m \in\) {coding, decoding}. Compared with the anchor, the computational complexity of the proposed method is 33.6 and 140.3 times under the CPU+GPU platform, and 231.3 and 604.8 times under the CPU platform, on average for video coding and decoding, respectively. This computational complexity is a great challenge: in the video codec, DLIMD is performed on variable coding blocks, and the convolutional/fully connected operations in the neural network result in high complexity.
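Equation (11) is a plain average of per-QP time ratios; a small sketch with hypothetical timings (not measured values) illustrates how such multipliers are obtained.

```python
def complexity_ratio(t_proposed, t_anchor):
    """Equation (11): average time ratio over the four QP settings."""
    assert len(t_proposed) == len(t_anchor) == 4
    return sum(p / a for p, a in zip(t_proposed, t_anchor)) / 4.0

# Hypothetical per-QP encoding times in seconds (proposed codec vs. anchor).
print(complexity_ratio([3400, 3100, 2900, 2700], [100, 95, 90, 85]))   # ~33x
```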
For other deep learning based schemes [12, 38] that focus on the optimization of intra prediction, the computational complexities are 9.87 and 87.4 times at the encoder side and 151.7 and 124.5 times at the decoder side with respect to the anchor. The former and latter schemes, which achieve 1.92% and 3.4% bit rate reductions for the luma component, run on CPU and CPU+GPU platforms, respectively. For the conventional schemes [1, 22, 29] whose compression efficiencies have been compared in Table 6, the encoding and decoding complexities are 109%, 111%, 101% and 104%, 105%, 100% with respect to the anchor. It can be found that the computational complexity of deep learning based schemes, including the proposed one, is much higher than that of conventional schemes.
Generally, the strategies to accelerate deep learning based schemes include SIMD optimization, neural network quantization, and parameter/layer pruning. The first two strategies require support/optimization from hardware devices. Therefore, the third one is adopted to investigate the trade-off between computational complexity and compression efficiency. One more architecture (denoted as DLIMD-L) is designed by reducing the parameters, i.e., the output of the last layer in the feature learning network is changed from 64 to 16, and the number of nodes of the hidden layers except the last one in the intra mode derivation network is changed from 2048 to 128. According to the definition of FLOPs [26], the computational complexity is thus largely reduced. DLIMD-L is trained with the same training dataset as DLIMD. The coding experiments are performed on the CPU+GPU platform and the results are presented in Table 10, where the sequences are all encoded with the default AI configuration and the QP values are set as {22, 27, 32, 37}. It can be observed that the coding efficiency of DLIMD-L is \(-\)1.17% on average for the luma component in terms of BD-BR, which is worse than that of DLIMD. The encoding and decoding complexities of DLIMD-L are 27.9 times and 40.8 times with respect to the anchor (VTM 5.0), i.e., 6.8 times of encoding complexity and 105.0 times of decoding complexity are reduced compared with DLIMD. Although the computational complexity is still high, we believe that it can be optimized in the future.
| Class | Sequence | DLIMD Y | DLIMD U | DLIMD V | DLIMD Enc. | DLIMD Dec. | DLIMD-L Y | DLIMD-L U | DLIMD-L V | DLIMD-L Enc. | DLIMD-L Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | –2.50 | –2.77 | –2.20 | \(36.8\times\) | \(144.7\times\) | –1.71 | –2.53 | –1.30 | \(28.8\times\) | \(38.1\times\) |
| A | FoodMarket4 | –2.65 | –1.66 | –1.14 | \(25.8\times\) | \(134.9\times\) | –1.92 | –1.08 | –0.39 | \(19.9\times\) | \(39.1\times\) |
| B | BasketballDrive | –2.42 | –2.68 | –2.04 | \(36.6\times\) | \(135.3\times\) | –1.55 | –1.70 | –2.27 | \(29.0\times\) | \(35.3\times\) |
| B | BQTerrace | –1.88 | –1.77 | –2.40 | \(37.0\times\) | \(167.9\times\) | –0.73 | –0.97 | –0.33 | \(30.3\times\) | \(36.2\times\) |
| C | BQMall | –2.89 | –1.93 | –1.36 | \(33.9\times\) | \(138.2\times\) | –2.11 | –0.36 | –1.14 | \(27.2\times\) | \(56.3\times\) |
| C | BasketballDrill | –2.29 | 1.18 | –2.62 | \(33.4\times\) | \(176.9\times\) | 1.40 | 0.97 | –0.26 | \(27.4\times\) | \(37.1\times\) |
| D | BlowingBubbles | –2.09 | –1.36 | –1.67 | \(30.4\times\) | \(136.8\times\) | –1.32 | –0.35 | –1.77 | \(25.9\times\) | \(38.6\times\) |
| D | BasketballPass | –1.67 | –2.31 | –4.17 | \(32.2\times\) | \(155.8\times\) | –0.46 | 1.77 | –1.65 | \(26.7\times\) | \(43.0\times\) |
| E | FourPeople | –3.21 | –2.61 | –2.28 | \(42.9\times\) | \(136.5\times\) | –2.44 | –2.81 | –1.54 | \(34.1\times\) | \(46.1\times\) |
| E | Johnny | –2.31 | –3.71 | –5.30 | \(38.1\times\) | \(131.2\times\) | –0.81 | –0.73 | –3.52 | \(30.4\times\) | \(38.2\times\) |
| | AVERAGE | –2.39 | –1.96 | –2.52 | \(34.7\times\) | \(145.8\times\) | –1.17 | –0.78 | –1.42 | \(27.9\times\) | \(40.8\times\) |

Table 10. Trade-off between Computational Complexity and Compression Efficiency on the Platform of CPU+GPU (BD-BR in %; encoding/decoding complexity relative to the anchor)

4.5 Coding Performance under the Latest VVC Test Model and Other Configurations

In addition, the proposed method is evaluated on the platform of the latest VVC test model, i.e., VTM 16.0, in which DLIMD has been implemented. Besides the AI configuration, the coding experiments are also conducted under the Low Delay P (LDP) and Random Access (RA) configurations. It should be noted that the neural network is not changed after training, as described in Section 3.3.
The experimental results are shown in Table 11, where the original VTM 16.0 is utilized as the anchor to calculate the value of BD-BR. It can be observed that the proposed method achieves 1.91%, 0.87%, and 1.15% bit rate reductions for the Y component under the AI, LDP, and RA configurations, respectively. The coding gains are slightly lower than those under VTM 5.0 shown in Table 6, because the neural network is not re-trained and intra coding has been further optimized from VTM 5.0 to VTM 16.0.
| Class | Sequence | Y (AI) | U (AI) | V (AI) | Y (LDP) | U (LDP) | V (LDP) | Y (RA) | U (RA) | V (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –2.48 | –0.33 | –1.60 | –0.83 | –0.81 | –0.46 | –1.18 | –0.96 | –0.10 |
| A1 | FoodMarket4 | –1.87 | –2.17 | –1.36 | –0.63 | –1.43 | 0.05 | –0.94 | –0.38 | –1.03 |
| A1 | Campfire | –2.37 | –1.49 | –2.72 | –1.18 | –0.95 | –0.71 | –1.82 | –1.47 | –1.32 |
| A2 | CatRobot1 | –2.54 | –1.53 | –2.14 | –1.36 | –1.95 | –1.62 | –1.53 | –1.72 | –1.32 |
| A2 | DaylightRoad2 | –2.48 | –2.06 | –2.86 | –1.84 | –2.55 | –2.29 | –2.30 | –1.69 | –2.02 |
| A2 | ParkRunning3 | –1.07 | –0.78 | –0.37 | –0.41 | –0.41 | –0.43 | –0.48 | –0.36 | –0.35 |
| B | MarketPlace | –1.88 | –1.31 | –1.99 | –0.60 | –0.55 | 0.41 | –1.11 | –1.61 | 0.88 |
| B | RitualDance | –3.48 | –4.15 | –3.44 | –0.72 | –0.69 | –1.20 | –1.04 | –0.86 | –1.53 |
| B | Cactus | –2.27 | –1.61 | –1.05 | –1.08 | –0.94 | –0.37 | –1.63 | –1.35 | –2.01 |
| B | BasketballDrive | –2.35 | –1.94 | –2.38 | –0.78 | –1.39 | –0.24 | –1.10 | –0.29 | –1.06 |
| B | BQTerrace | –1.74 | –1.63 | –0.80 | –0.78 | –1.46 | –1.23 | –1.30 | –1.12 | –0.97 |
| C | BasketballDrill | –1.31 | –2.29 | –1.05 | –0.41 | –1.03 | 0.06 | –1.22 | –2.00 | –1.41 |
| C | BQMall | –2.11 | –0.34 | –2.58 | –1.14 | –2.15 | –1.16 | –1.34 | 0.04 | –0.12 |
| C | PartyScene | –1.50 | –1.10 | –2.58 | –0.77 | –0.57 | –1.27 | –0.90 | –1.08 | –0.52 |
| C | RaceHorsesC | –1.47 | 0.18 | –0.42 | –0.25 | –1.02 | –0.67 | –0.65 | –1.23 | 0.11 |
| D | BasketballPass | –0.47 | –1.94 | –3.35 | –0.25 | –1.25 | 0.70 | 0.18 | –2.36 | –0.67 |
| D | BQSquare | –1.26 | –2.13 | 0.27 | –0.55 | –2.42 | –3.05 | –0.74 | –0.07 | –0.96 |
| D | BlowingBubbles | –1.11 | –1.05 | 0.20 | –0.32 | –0.39 | 0.71 | –1.03 | –0.21 | –0.78 |
| D | RaceHorses | –1.12 | –1.04 | 0.81 | –0.25 | –1.01 | 0.40 | –0.38 | 2.08 | 0.16 |
| E | FourPeople | –3.02 | –1.73 | –2.68 | –2.04 | –2.66 | –2.93 | –2.14 | –1.28 | –2.18 |
| E | Johnny | –1.88 | –3.26 | –2.15 | –1.23 | –0.35 | –2.26 | –1.06 | –1.87 | –0.71 |
| E | KristenAndSara | –2.22 | –4.38 | –2.66 | –1.64 | 0.44 | –2.44 | –1.50 | –1.16 | –1.95 |
| | AVERAGE | –1.91 | –1.73 | –1.68 | –0.87 | –1.16 | –0.91 | –1.15 | –0.95 | –0.90 |

Table 11. Coding Performance in Terms of BD-BR on the Latest Platform of VTM 16.0 under AI, LDP, and RA Configurations (Unit: %)

5 Conclusions

In this paper, a deep learning based intra mode derivation method is presented to skip the module of intra mode signaling and save coding bits. Instead of checking the candidate intra modes one by one to obtain the optimal one, this process is cast as a multi-class classification task, moving from signal processing to artificial intelligence. The architecture is designed so that one single trained model adapts to variable coding blocks and different QP settings. In particular, the hand-crafted and learned features are combined to compensate for their individual limitations. Rate-distortion optimization is performed between the proposed method and the traditional method, with a strategy flag signaled to indicate the selected scheme. Compared with the state-of-the-art works, the proposed method achieves significant coding gains.

References

[1]
Mohsen Abdoli, Thomas Guionnet, Mickael Raulet, Gosala Kulupana, and Saverio Blasi. 2020. Decoder-side intra mode derivation for next generation video coding. In 2020 IEEE International Conference on Multimedia and Expo (ICME). 1–6.
[2]
Mohsen Abdoli, Félix Henry, Patrice Brault, Pierre Duhamel, and Frédéric Dufaux. 2018. Short-distance intra prediction of screen content in versatile video coding (VVC). IEEE Signal Processing Letters 25, 11 (2018), 1690–1694.
[3]
Gisle Bjontegaard. 2001. Calculation of Average PSNR Differences between RD Curves. ITU-T Video Coding Experts Group, VCEG-M33.
[4]
Saverio G. Blasi, Marta Mrak, and Ebroul Izquierdo. 2015. Frequency-domain intra prediction analysis and processing for high-quality video coding. IEEE Transactions on Circuits and Systems for Video Technology 25, 5 (2015), 798–811.
[5]
Frank Bossen, Jill Boyce, Karsten Suehring, Xiang Li, and Vadim Seregin. 2019. JVET Common Test Conditions and Software Reference Configurations for SDR Video. Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, JVET-N1010-v1.
[6]
Fabian Brand, Jürgen Seiler, and André Kaup. 2021. Intra-frame coding using a conditional autoencoder. IEEE Journal of Selected Topics in Signal Processing 15, 2 (2021), 354–365.
[7]
Benjamin Bross, Jianle Chen, Jens-Rainer Ohm, Gary J. Sullivan, and Ye-Kui Wang. 2021. Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 109, 9 (2021), 1463–1493.
[8]
Xun Cai and Jae S. Lim. 2013. Algorithms for transform selection in multiple-transform video compression. IEEE Transactions on Image Processing 22, 12 (2013), 5395–5407.
[9]
Yao-Jen Chang, Hong-Jheng Jhu, Hui-Yu Jiang, Liang Zhao, Xin Zhao, Xiang Li, Shan Liu, Benjamin Bross, Paul Keydel, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. 2019. Multiple reference line coding for most probable modes in intra prediction. In 2019 Data Compression Conference (DCC). 559–559.
[10]
Haoming Chen, Tao Zhang, Ming-Ting Sun, Ankur Saxena, and Madhukar Budagavi. 2016. Improving intra prediction in high-efficiency video coding. IEEE Transactions on Image Processing 25, 8 (2016), 3671–3682.
[11]
Jie Chen, Junhui Hou, and Lap-Pui Chau. 2018. Light field compression with disparity-guided sparse coding based on structural key views. IEEE Transactions on Image Processing 27, 1 (2018), 314–324.
[12]
Thierry Dumas, Franck Galpin, and Philippe Bordes. 2021. Iterative training of neural networks for intra prediction. IEEE Transactions on Image Processing 30 (2021), 697–711.
[13]
Thierry Dumas, Aline Roumy, and Christine Guillemot. 2020. Context-adaptive neural network-based prediction for image compression. IEEE Transactions on Image Processing 29 (2020), 679–693.
[14]
Edouard François, Chad Fogg, Yuwen He, Xiang Li, Ajay Luthra, and Andrew Segall. 2016. High dynamic range and wide color gamut video coding in HEVC: Status and potential future enhancements. IEEE Transactions on Circuits and Systems for Video Technology 26, 1 (2016), 63–75.
[15]
Han Gao, Xu Chen, Semih Esenlik, Jianle Chen, and Eckehard Steinbach. 2021. Decoder-side motion vector refinement in VVC: Algorithm and hardware implementation considerations. IEEE Transactions on Circuits and Systems for Video Technology 31, 8 (2021), 3197–3211.
[16]
Yueyu Hu, Wenhan Yang, Mading Li, and Jiaying Liu. 2019. Progressive spatial recurrent neural network for intra prediction. IEEE Transactions on Multimedia 21, 12 (2019), 3024–3037.
[17]
Yu-Wen Huang, Chih-Wei Hsu, Ching-Yeh Chen, Tzu-Der Chuang, Shih-Ta Hsiang, Chun-Chia Chen, Man-Shu Chiang, Chen-Yen Lai, Chia-Ming Tsai, Yu-Chi Su, Zhi-Yi Lin, Yu-Ling Hsiao, Olena Chubach, Yu-Cheng Lin, and Shaw-Min Lei. 2020. A VVC proposal with quaternary tree plus binary-ternary tree coding block structure and advanced coding techniques. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 1311–1325.
[18]
Minqiang Jiang, Shanxi Li, Nam Ling, Jianhua Zheng, and Philipp Zhang. 2018. On derivation of most probable modes for intra prediction in video coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS). 1–4.
[19]
Jani Lainema, Frank Bossen, Woo-Jin Han, Junghye Min, and Kemal Ugur. 2012. Intra coding of the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1792–1801.
[20]
Congrui Li, Zhenghui Zhao, Junru Li, Xiang Zhang, Siwei Ma, and Chen Li. 2019. Bi-intra prediction for versatile video coding. In 2019 Data Compression Conference (DCC). 587–587.
[21]
Jiahao Li, Bin Li, Jizheng Xu, and Ruiqin Xiong. 2018. Efficient multiple-line-based intra prediction for HEVC. IEEE Transactions on Circuits and Systems for Video Technology 28, 4 (2018), 947–957.
[22]
Junru Li, Meng Wang, Li Zhang, Kai Zhang, Hongbin Liu, Shiqi Wang, Siwei Ma, and Wen Gao. 2020. Unified intra mode coding based on short and long range correlations. IEEE Transactions on Image Processing 29 (2020), 7245–7260.
[23]
Yue Li, Yan Yi, Dong Liu, Li Li, Zhu Li, and Houqiang Li. 2021. Neural-network-based cross-channel intra prediction. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 3, Article 77 (Jul. 2021), 23 pages.
[24]
Di Ma, Fan Zhang, and David Bull. 2022. BVI-DVC: A training database for deep video compression. IEEE Transactions on Multimedia 24 (2022), 3847–3858.
[25]
Siwei Ma, Xinfeng Zhang, Chuanmin Jia, Zhenghui Zhao, Shiqi Wang, and Shanshe Wang. 2020. Image and video compression with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology 30, 6 (2020), 1683–1698.
[26]
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations (ICLR). 1–17.
[27]
Elie Gabriel Mora, Joel Jung, Marco Cagnazzo, and Béatrice Pesquet-Popescu. 2014. Depth video coding based on intra mode inheritance from texture. APSIPA Transactions on Signal and Information Processing 3 (2014), 1–13.
[28]
Karsten Müller, Heiko Schwarz, Detlev Marpe, Christian Bartnik, Sebastian Bosse, Heribert Brust, Tobias Hinz, Haricharan Lakshman, Philipp Merkle, Franz Hunn Rhee, Gerhard Tech, Martin Winken, and Thomas Wiegand. 2013. 3D high-efficiency video coding for multi-view video and depth data. IEEE Transactions on Image Processing 22, 9 (2013), 3366–3378.
[29]
Anthony Nasrallah, Mohsen Abdoli, Elie Gabriel Mora, Thomas Guionnet, and Mickael Raulet. 2019. Decoder-side intra mode derivation with texture analysis in VVC test model. In 2019 IEEE International Conference on Image Processing (ICIP). 3153–3157.
[30]
Anthony Nasrallah, Elie Mora, Thomas Guionnet, and Mickael Raulet. 2019. Decoder-side intra mode derivation based on a histogram of gradients in versatile video coding. In 2019 Data Compression Conference (DCC). 597–597.
[31]
Jonathan Pfaff, Alexey Filippov, Shan Liu, Xin Zhao, Jianle Chen, Santiago De-Luxán-Hernández, Thomas Wiegand, Vasily Rufitskiy, Adarsh Krishnan Ramasubramonian, and Geert Van der Auwera. 2021. Intra prediction and mode coding in VVC. IEEE Transactions on Circuits and Systems for Video Technology 31, 10 (2021), 3834–3847.
[32]
Kevin Reuze, Wassim Hamidouche, Pierrick Philippe, and Olivier Deforges. 2019. Dynamic lists for efficient coding of intra prediction modes in the future video coding standard. In 2019 Data Compression Conference (DCC). 601–601.
[33]
Sebastian Schwarz, Marius Preda, Vittorio Baroncini, Madhukar Budagavi, Pablo Cesar, Philip A. Chou, Robert A. Cohen, Maja Krivokuća, Sébastien Lasserre, Zhu Li, Joan Llach, Khaled Mammou, Rufael Mekuria, Ohji Nakagami, Ernestasia Siahaan, Ali Tabatabai, Alexis M. Tourapis, and Vladyslav Zakharchenko. 2019. Emerging MPEG standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 133–148.
[34]
Michael Schäfer, Björn Stallenberger, Jonathan Pfaff, Philipp Helle, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. 2020. Efficient fixed-point implementation of matrix-based intra prediction. In 2020 IEEE International Conference on Image Processing (ICIP). 3364–3368.
[35]
Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1649–1668.
[36]
Heming Sun, Zhengxue Cheng, Masaru Takeuchi, and Jiro Katto. 2020. Enhanced intra prediction for video coding by using multiple neural networks. IEEE Transactions on Multimedia 22, 11 (2020), 2764–2779.
[37]
Radu Timofte and Eirikur Agustsson. 2017. NTIRE 2017 challenge on single image super-resolution: Methods and results. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 1110–1121.
[38]
Yang Wang, Xiaopeng Fan, Shaohui Liu, Debin Zhao, and Wen Gao. 2020. Multi-scale convolutional neural network-based intra prediction for video coding. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2020), 1803–1815.
[39]
Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra. 2003. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 7 (2003), 560–576.
[40]
Xiaoyu Xiu, Yuwen He, and Yan Ye. 2016. Decoder-side intra mode derivation for block-based video coding. In 2016 Picture Coding Symposium (PCS). 1–5.
[41]
Xiaozhong Xu, Robert Cohen, Anthony Vetro, and Huifang Sun. 2012. Predictive coding of intra prediction modes for high efficiency video coding. In 2012 Picture Coding Symposium. 457–460.
[42]
Xiaozhong Xu, Shan Liu, Tzu-Der Chuang, Yu-Wen Huang, Shaw-Min Lei, Krishnakanth Rapaka, Chao Pang, Vadim Seregin, Ye-Kui Wang, and Marta Karczewicz. 2016. Intra block copy in HEVC screen content coding extensions. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 4 (2016), 409–419.
[43]
Yan Ye, Jill M. Boyce, and Philippe Hanhart. 2020. Omnidirectional 360° video coding technology in responses to the joint call for proposals on video compression with capability beyond HEVC. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 1241–1252.
[44]
Yong-Uk Yoon, Do-Hyeon Park, Jae-Gon Kim, Jinho Lee, and Jung-Won Kang. 2019. Most frequent mode for intra-mode coding in video coding. Electronics Letters 55, 4 (2019), 188–190.
[45]
Kai Zhang, Jianle Chen, Li Zhang, Xiang Li, and Marta Karczewicz. 2018. Enhanced cross-component linear model for chroma intra-prediction in video coding. IEEE Transactions on Image Processing 27, 8 (2018), 3983–3997.
[46]
Li Zhang, Kai Zhang, Hongbin Liu, Hsiao Chiang Chuang, Yue Wang, Jizheng Xu, Pengwei Zhao, and Dingkun Hong. 2019. History-based motion vector prediction in versatile video coding. In 2019 Data Compression Conference (DCC). 43–52.
[47]
Tao Zhang, Xiaopeng Fan, Debin Zhao, Ruiqin Xiong, and Wen Gao. 2018. Hybrid intraprediction based on local and nonlocal correlations. IEEE Transactions on Multimedia 20, 7 (2018), 1622–1635.
[48]
Yun Zhang, Sam Kwong, and Shiqi Wang. 2020. Machine learning based video coding optimizations: A survey. Information Sciences 506 (2020), 395–423.
[49]
Amin Zheng, Yuan Yuan, Jiantao Zhou, Yuanfang Guo, Haitao Yang, and Oscar C. Au. 2016. Adaptive block coding order for intra prediction in HEVC. IEEE Transactions on Circuits and Systems for Video Technology 26, 11 (2016), 2152–2158.
[50]
Linwei Zhu, Sam Kwong, Yun Zhang, Shiqi Wang, and Xu Wang. 2020. Generative adversarial network-based intra prediction for video coding. IEEE Transactions on Multimedia 22, 1 (2020), 45–58.
[51]
Linwei Zhu, Yun Zhang, Na Li, Jinyong Pi, and Xinju Wu. 2020. Sparse representation-based intra prediction for lossless/near lossless video coding. In 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). 164–167.
[52]
Linwei Zhu, Yun Zhang, Shiqi Wang, Sam Kwong, Xin Jin, and Yu Qiao. 2021. Deep learning-based chroma prediction for intra versatile video coding. IEEE Transactions on Circuits and Systems for Video Technology 31, 8 (2021), 3168–3181.



    Author Tags

    1. Versatile video coding
    2. intra mode derivation
    3. most probable mode
    4. deep learning
    5. multi-class classification

    Funding Sources

    • National Natural Science Foundation of China
    • Shenzhen Science and Technology Program
    • Guangdong Basic and Applied Basic Research Foundation
    • Membership of Youth Innovation Promotion Association, Chinese Academy of Sciences (CAS)
    • China Postdoctoral Science Foundation
    • CAS President’s International Fellowship Initiative (PIFI)
    • Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA)
    • Hong Kong GRF-RGC General Research Fund
