
Deep Learning-Based Intra Mode Derivation for Versatile Video Coding

Published: 17 February 2023

Abstract

In intra coding, Rate Distortion Optimization (RDO) is performed to select the optimal intra mode from a pre-defined candidate list. Besides the residual signal, the optimal intra mode must also be encoded and transmitted to the decoder side, which consumes a considerable number of coding bits. To further improve the performance of intra coding in Versatile Video Coding (VVC), an intelligent intra mode derivation method is proposed in this paper, termed Deep Learning based Intra Mode Derivation (DLIMD). Specifically, the process of intra mode derivation is formulated as a multi-class classification task, which aims to skip the module of intra mode signaling and thus reduce coding bits. The architecture of DLIMD is designed to adapt to different quantization parameter settings and to variable coding blocks, including non-square ones, with only one single trained model. Different from existing deep learning based classification approaches, hand-crafted features are fed into the intra mode derivation network in addition to the features learned by the feature learning network. To compete with traditional methods, one additional binary flag is utilized in the video codec to indicate the selected scheme with RDO. Extensive experimental results reveal that the proposed method achieves 2.28%, 1.74%, and 2.18% bit rate reduction on average for the Y, U, and V components on the VVC test model platform, outperforming the state-of-the-art works.

1 Introduction

With the rapid development of information technology, videos have been applied to entertainment, surveillance, education, and many other fields. To support more applications in daily life, videos have evolved along various dimensions in the last decade, including High Definition (HD), Wide Color Gamut (WCG) [14], High Dynamic Range (HDR) [14], Multi-view Video plus Depth (MVD) [28], 360 degree video [43], light field image/video [11], and dynamic point cloud [33]. Unfortunately, from low dimension to high dimension, the dramatically increased video data challenges the limited storage space and transmission bandwidth. From H.264/Advanced Video Coding (AVC) [39] and High Efficiency Video Coding (HEVC) [35] to the state-of-the-art Versatile Video Coding (VVC) [7] issued in 2020, large compression ratios have been achieved, yet they still cannot catch up with the growth of video data. Advanced video compression algorithms are therefore always desired to maximize visual quality under a given bandwidth budget.
In the framework of existing hybrid video coding, the modules mainly consist of intra/inter prediction, transform, quantization, entropy encoding and in-loop filtering. To improve compression efficiency, a variety of novel coding tools have been developed in the issued standards, including QuadTree plus Multi-Type Tree (QT+MTT) structure [17] for coding block partition, Matrix-based Intra Prediction (MIP) [34] and Cross-Component Linear Model (CCLM) [45] for intra luma and chroma prediction, History-based Motion Vector Prediction (HMVP) [46] and Decoder-side MV Refinement (DMVR) [15] for motion estimation/compensation, Multiple Transform Selection (MTS) [8] for transform, CABAC engine with multi-hypothesis probability estimation for entropy encoding, and Sample Adaptive Offset (SAO) and Adaptive Loop Filter (ALF) for in-loop filtering. These mentioned coding tools have achieved significant coding gains.
One of the most important modules is intra prediction [19], which aims to remove as much spatial redundancy as possible. Parts of the available neighboring blocks are weighted to produce the predicted block. Traditionally, intra modes include Planar, DC, and angular modes. To achieve more accurate prediction results, various algorithms have been developed. In [4], intra prediction was analyzed in the frequency domain, and frequency components were selectively discarded to improve the performance. Li et al. [20] presented a bi-intra prediction method based on the binary combination of existing uni-intra prediction modes. Rather than regular out-block reference pixels, in-block ones were employed in [2] to perform intra prediction for screen content, together with an additional in-loop residual signal. An iterative filtering method was employed for intra prediction in addition to the traditional intra prediction in [10]. To obtain more reference pixels, a multi-line based scheme was presented in [21], where six more lines of pixels located at the above and left neighbors were collected. Different from the fixed scan order, an adaptive block coding order [49] was proposed for intra prediction to better exploit spatial correlations. Analogous to motion estimation in inter coding, Intra Block Copy (IBC) [42] was introduced for screen content, which aims to exploit long-distance correlations in an image. Two modes with high probability from the gradient histogram were combined to generate a new intra mode in [1]. In [47], local and nonlocal correlations were exploited for hybrid intra prediction, where adaptive template matching prediction, combined local and nonlocal prediction, and combined neighboring modes prediction were performed. These methods exploit spatial redundancy from neighbors with manually designed functions, which may limit the performance. Advanced schemes are desired to adapt to diverse video contents.
To further improve the compression efficiency of intra coding, the signal processing problem has been formulated as an artificial intelligence task, where powerful neural networks are adopted [25, 48] and a training database for deep video compression is provided in [24]. Specifically, the problem of intra luma prediction was formulated as an inpainting task [50], and the problem of intra chroma prediction was modeled as a colorization task [23, 52]. An iterative training strategy for neural networks was presented in [12], where training blocks were collected from the previous iteration to further improve performance. Wang et al. [38] proposed a multi-scale convolutional network based intra prediction approach, in which the neighboring reconstructed L-shape was fed to the network together with the traditional angular intra prediction result to make a more accurate prediction. With a conditional autoencoder [6], multi-mode intra prediction was performed for luma and chroma components. Sun et al. [36] proposed two enhanced intra prediction schemes with multiple neural networks, where the appending scheme was to replace the traditional modes and the substitution scheme was to replace the most and least probable traditional modes. In [16], a progressive spatial recurrent neural network was presented for intra prediction, which produces the prediction by passing information along from previous outputs. To adapt to variable coding blocks in intra prediction, fully connected and convolutional neural networks were carefully designed [13] for small and large blocks, respectively. Most of these existing learning based methods aim to make more accurate luma and chroma predictions from a regression perspective to achieve coding gains, while the module of intra mode derivation has not been exploited from a classification perspective with deep learning tools.
In intra coding, besides the residual signal, the intra mode is also required to be encoded and transmitted to the decoder side. For intra mode signaling, the Most Probable Mode (MPM) list, which is constructed from the neighboring blocks, plays an important role and saves significant coding bits. In [18], two MPM construction methods were presented for VVC, where one was extended from HEVC and the other was sorted according to the probability of each candidate. Besides the nearest neighboring lines, Chang et al. [9] extended the MPM mechanism to the Multi-Reference Line (MRL) scheme for better performance. A conditional random field model was established to reconstruct the MPM list in [22], where short and long range correlations were considered. In addition, a decision tree was utilized to exploit multiple dynamic lists for intra mode signaling [32]. By investigating the occurrences of intra modes in the neighboring blocks, a Most Frequent Mode (MFM) list [44] was derived to compete with the existing MPM list. To skip intra mode signaling and save coding bits, Xu et al. [41] proposed a predictive coding scheme, in which the angular correlation in the spatial domain was calculated with modulo-N arithmetic operations. Additionally, template based [40], histogram of gradients based [30], and texture analysis based [29] intra mode derivation methods were presented in a manual manner. For depth video coding, a coding tool [27] was presented to reduce the intra mode signaling bitrate, in which the texture intra modes were inherited for the depth intra modes. Basically, MPM list construction and intra mode derivation have been investigated with traditional statistics and experience, and can be further improved with advanced learning based schemes.
In this work, to skip the module of intra mode signaling and save coding bits, the process of intra mode derivation is formulated as a multi-class classification task. The main contributions of this work are listed as follows.
(1)
The process of intra mode derivation in intra coding is modeled as a multi-class classification task, termed Deep Learning based Intra Mode Derivation (DLIMD), which skips the module of intra mode signaling to save coding bits.
(2)
In DLIMD, the learned features and hand-crafted features are combined for intra mode derivation. Additionally, the proposed DLIMD can be applied to variable coding blocks (including non-square blocks) and different Quantization Parameter (QP) settings.
(3)
To further improve the performance, one additional binary flag is utilized to indicate the finally selected scheme from Rate Distortion (RD) cost competition. The proposed method achieves superior performance when compared with the state-of-the-art algorithms.
The remainder of this work is organized as follows. Motivation is presented in Section 2. The proposed DLIMD for video coding is discussed in detail in Section 3. The experiments are conducted and the results are analyzed in Section 4. Section 5 concludes this work.

2 Motivation

In VVC, intra coding modes/tools [31] include DC, Planar, 65 angular modes, Wide Angle Intra Prediction (WAIP), MRL, Position Dependent Prediction Combination (PDPC), MIP, Intra Sub-Partition (ISP), and CCLM. It should be mentioned that the intra mode is also required to be encoded and transmitted to the decoder side. To signal these intra modes efficiently, six candidates are derived from the intra modes of the neighbors and accommodated in the MPM list. Generally, the first entry of the MPM list is always fixed to the Planar mode and is encoded with two bits. The other five MPMs are derived from the spatial correlation with the neighbors and encoded with three to six bits. The non-MPM modes are divided into two parts containing 3 and 58 modes, which are encoded with truncated binary codes of six and seven bits, respectively. The detailed intra mode signaling is illustrated in Figure 1. In addition, statistical experiments are conducted on the platform of VVC Test Model version 5.0 (VTM 5.0) to measure the coding Bits Per intra Mode (BPM), where ten sequences with various contents from different classes are encoded under the All Intra (AI) configuration. The value of BPM is calculated as the total coding bits of intra modes divided by the number of intra blocks, where the coding bits are collected after CABAC entropy encoding. The statistical results are shown in the left columns of Table 1, and the values of BPM are 3.35, 3.48, 3.44, and 3.39 on average under the four QP settings.
Fig. 1. 67 intra mode signaling in VVC.
| Class | Sequence | \(\alpha\) (QP 22) | \(\alpha\) (QP 27) | \(\alpha\) (QP 32) | \(\alpha\) (QP 37) | \(\beta\) (QP 22) | \(\beta\) (QP 27) | \(\beta\) (QP 32) | \(\beta\) (QP 37) |
|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 2.18 | 2.90 | 2.92 | 2.97 | 5.39% | 10.9% | 14.7% | 18.3% |
| A | FoodMarket4 | 2.45 | 2.61 | 2.63 | 2.71 | 7.14% | 10.4% | 12.9% | 14.9% |
| B | BasketballDrive | 2.94 | 2.92 | 2.93 | 2.85 | 6.60% | 11.5% | 15.6% | 18.3% |
| B | BQTerrace | 3.40 | 3.57 | 3.53 | 3.44 | 5.27% | 9.46% | 14.1% | 18.9% |
| C | BQMall | 3.84 | 3.86 | 3.75 | 3.65 | 8.95% | 12.2% | 16.3% | 20.8% |
| C | BasketballDrill | 3.52 | 3.55 | 3.58 | 3.76 | 14.6% | 18.4% | 21.0% | 25.1% |
| D | BlowingBubbles | 4.34 | 4.25 | 4.17 | 3.82 | 7.88% | 11.5% | 16.3% | 20.6% |
| D | BasketballPass | 3.85 | 4.04 | 3.93 | 3.67 | 8.82% | 11.8% | 16.4% | 22.1% |
| E | FourPeople | 3.59 | 3.62 | 3.57 | 3.53 | 9.80% | 13.8% | 17.7% | 21.8% |
| E | Johnny | 3.37 | 3.43 | 3.43 | 3.50 | 8.39% | 13.1% | 18.7% | 23.2% |
| | AVERAGE | 3.35 | 3.48 | 3.44 | 3.39 | 8.28% | 12.3% | 16.4% | 20.4% |

Table 1. Statistical Results of Intra Mode Signaling (\(\alpha\): coding bits per intra mode (BPM); \(\beta\): percentage of coding bits of intra mode)
Furthermore, to demonstrate how many bits are spent on intra mode signaling, the percentage of coding bits of intra mode in a frame is collected and illustrated in the right columns of Table 1. It can be found that this percentage increases from 8.28% to 20.4% on average as the QP value increases. Under small QP settings, the percentage is limited because the coding bits of the residue (the difference between prediction and source) are much larger than those of the intra mode, while under large QP settings, the coding bits of the residue become limited, which results in a high percentage of coding bits of intra mode. From these results, we can conclude that a more advanced intra mode signaling approach can further improve the coding performance.

3 Proposed Deep Learning Based Intra Mode Derivation for Video Coding

3.1 Problem Formulation and Framework

In this work, we focus on optimizing the signaling of the DC, Planar, and 65 angular modes for the luma component, while the chroma component is not considered. According to Figure 1, a straightforward idea of improving coding performance is to predict the best intra mode among all 67 candidates and place it first in the MPM list. Such an intra mode derivation can achieve promising performance, because the intra mode signaling then consumes only two bits, fewer than in the other cases. However, it can be further improved by skipping the RD checking process and the intra mode signaling to save coding bits.
The optimal intra mode of current block is finally selected based on the minimum RD cost by checking the candidate list. This process can be represented by the following equation,
\begin{equation} n^* = \mathop {\arg \min }_{n}\big \lbrace D_n + \lambda \big (R_n^{r} + R_n^{m} + R_n^{o}\big)\big \rbrace , \end{equation}
(1)
where \(n\) indicates the index of intra mode, \(n \in [0, 66]\) for Planar, DC, and 65 angular modes, \(D_n\) is the distortion, \(\lambda\) is the Lagrange Multiplier, \(R_n^r\), \(R_n^m\), and \(R_n^o\) indicate the coding bits of residue, intra mode, and other information, respectively.
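As a concrete illustration of Equation (1), the following minimal Python sketch enumerates the 67 candidates and keeps the mode with the minimum RD cost. The evaluate() callback and the toy numbers are hypothetical stand-ins for the measurements that the real encoder would produce, not part of VTM.

```python
import random

def select_intra_mode(lam, evaluate):
    """Exhaustive RD-cost search over the 67 intra modes, as in Equation (1)."""
    best_mode, best_cost = None, float("inf")
    for n in range(67):                              # 0: Planar, 1: DC, 2..66: angular
        d, r_res, r_mode, r_other = evaluate(n)      # D_n, R_n^r, R_n^m, R_n^o
        cost = d + lam * (r_res + r_mode + r_other)  # Equation (1)
        if cost < best_cost:
            best_mode, best_cost = n, cost
    return best_mode, best_cost

def toy_evaluate(n):
    """Toy stand-in for the encoder measurements, only to make the sketch runnable."""
    random.seed(n)
    r_mode = 2 if n == 0 else 6                      # e.g., Planar costs two bits in the MPM list
    return random.uniform(50, 200), random.uniform(100, 400), r_mode, 1.0

print(select_intra_mode(lam=4.0, evaluate=toy_evaluate))
```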
According to Equation (1), the selection of the optimal mode from a pre-defined candidate list can be formulated as a multi-class classification task. Generally, the construction of the MPM list can be regarded as a manual classification scheme in which the top-6 intra modes are manually selected. To further improve the performance, we aim to solve this multi-class classification task with a deep learning approach. Specifically, the optimal intra mode can be derived directly instead of checking the candidate list, and the module of intra mode signaling is expected to be skipped to reduce coding bits. Figure 2 illustrates the framework of the proposed deep learning based intra mode derivation for video coding, where T and Q indicate transform and quantization, and T\(^{-1}\) and Q\(^{-1}\) indicate inverse transform and inverse quantization.
Fig. 2. Framework of proposed deep learning based intra mode derivation for video coding.
In the video encoder, intra mode derivation is performed, including the conventional intra mode checking from intra mode list and the proposed DLIMD. According to RD cost, only one of them will be finally selected. If DLIMD is selected, the strategy flag is set as 1 and the switch is opened for skipping the module of intra mode encoding; otherwise, the strategy flag is set as 0 and the switch is closed for activating the module of intra mode encoding. The strategy flag is always encoded and transmitted to indicate the selected scheme. It is worth mentioning that the other modules in video codec are not changed. The RD cost competition between DLIMD and traditional method (including DC, Planar, angular modes, MRL, MIP and ISP) can be represented by the following equation.
\begin{equation} S^* = \mathop {\arg \min }_{S}\big \lbrace D_S + \lambda \big (R_S^{r} + R_S^{m} + R_S^{f} + R_S^{o}\big)\big \rbrace , \end{equation}
(2)
where \(S^*\) indicates the selected scheme, i.e., \(S\in\) {DLIMD, traditional method}, \(D_S\) is the distortion under \(S\) scheme, \(\lambda\) is the Lagrange Multiplier, \(R_S^r\), \(R_S^m\), \(R_S^f\), and \(R_S^o\) indicate the coding bits of residue, intra mode, strategy flag, and other information, respectively. In addition, it should be noted that if the proposed DLIMD is selected, there are no coding bits for intra mode, i.e., \(R_S^m = 0\).
In the video decoder, the strategy flag is decoded first, before intra prediction. If this strategy flag is 0, the intra mode is decoded directly; otherwise, the intra mode is derived by the proposed DLIMD. With the intra mode, intra prediction is performed accordingly. Finally, the prediction result plus the decoded residual information produces the reconstruction.
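The RD competition of Equation (2) and the always-transmitted strategy flag can be sketched as follows. This is an illustrative Python fragment with toy cost numbers, not the actual VTM integration; it only shows how the two schemes are compared once their distortion and rate terms are known, and how DLIMD spends no bits on the intra mode itself.

```python
def select_scheme(d, r_res, r_mode, r_flag, r_other, lam):
    """RD competition of Equation (2) between DLIMD and the traditional method.
    Each argument is a dict keyed by scheme name; for DLIMD, r_mode must be 0."""
    costs = {s: d[s] + lam * (r_res[s] + r_mode[s] + r_flag[s] + r_other[s])
             for s in ("DLIMD", "traditional")}
    return min(costs, key=costs.get), costs

# Toy numbers only, to make the sketch executable: DLIMD spends no bits on the
# intra mode (R_S^m = 0), and both schemes pay for the one-bit strategy flag.
scheme, costs = select_scheme(
    d={"DLIMD": 120.0, "traditional": 118.0},
    r_res={"DLIMD": 300.0, "traditional": 298.0},
    r_mode={"DLIMD": 0.0, "traditional": 4.0},
    r_flag={"DLIMD": 1.0, "traditional": 1.0},
    r_other={"DLIMD": 2.0, "traditional": 2.0},
    lam=4.0,
)
print(scheme, costs)
```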
To estimate the upper bound of performance under the proposed framework, let \(\alpha\) and \(\beta\) be the original value of BPM and the percentage of coding bits of intra mode in a frame, whose statistical values are illustrated in Table 1, and let \(\gamma\) be the percentage of intra blocks that select the proposed scheme. One additional binary flag, encoded in context mode, is utilized to indicate the choice between the proposed scheme and the original scheme. Then, the value of BPM becomes \(- \gamma \times log_2(\gamma) + (1 - \gamma)\times {(\alpha - log_2(1 - \gamma)})\). Accordingly, the bit saving can be calculated as follows,
\begin{equation} \eta = \frac{\alpha - [- \gamma \times log_2(\gamma) + (1 - \gamma)\times {(\alpha - log_2(1 - \gamma)})]}{\alpha } \times \beta . \end{equation}
(3)
At the upper bound, \(\gamma = 100\%\) and the bit saving equals \(\beta\). As such, the upper bound of bit saving can reach 8.28%, 12.3%, 16.4%, and 20.4% when the QP value equals {22, 27, 32, 37}, respectively.
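A small Python helper for Equation (3) is given below as a sanity check. The only addition is an explicit guard for \(\gamma = 100\%\), where \(log_2(1-\gamma)\) is undefined but the limit of the expression gives a bit saving of exactly \(\beta\), and for \(\gamma = 0\), where the flag term vanishes.

```python
import math

def bit_saving(alpha, beta, gamma):
    """Equation (3): alpha is the original BPM, beta the share of intra-mode bits
    in a frame, gamma the fraction of intra blocks that select the proposed scheme."""
    if gamma >= 1.0:          # limit case: log2(1 - gamma) is undefined, saving -> beta
        return beta
    flag_bits = -gamma * math.log2(gamma) if gamma > 0.0 else 0.0
    new_bpm = flag_bits + (1.0 - gamma) * (alpha - math.log2(1.0 - gamma))
    return (alpha - new_bpm) / alpha * beta

# Table 1 averages at QP 37: alpha = 3.39, beta = 20.4%.
print(bit_saving(3.39, 0.204, 1.0))   # 0.204, i.e., the 20.4% upper bound
print(bit_saving(3.39, 0.204, 0.5))   # ~0.042, partial selection saves much less
```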
Besides the theoretical analysis, an experiment has been conducted to measure the upper bound of bit saving under the proposed framework. At the encoding stage, the optimal intra mode of the current block is obtained after rate distortion cost comparison among all the candidates. Then, this optimal intra mode is regarded as the predicted one, so intra mode signaling is skipped to reduce bits. The sequences are encoded with small QPs {11, 16, 21, 26}, normal QPs {22, 27, 32, 37}, large QPs {33, 38, 43, 48}, and the default AI configuration. The coding performance is measured by the Bjøntegaard Delta Bit Rate (BD-BR) [3] with respect to the original VTM. From Table 2, it can be found that the upper bound of bit saving can reach 5.68%, 12.3%, and 19.5% on average for the luma component under the small, normal, and large QP settings, which is close to the theoretical analysis. However, this is an ideal case that cannot be achieved in practice because the intra mode cannot always be predicted accurately.
| Class | Sequence | Y (small) | U (small) | V (small) | Y (normal) | U (normal) | V (normal) | Y (large) | U (large) | V (large) |
|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | –5.80 | –4.49 | –4.05 | –12.3 | –11.9 | –11.6 | –18.3 | –17.8 | –17.8 |
| A | FoodMarket4 | –4.12 | –5.19 | –4.76 | –10.8 | –8.99 | –8.73 | –14.2 | –13.4 | –14.9 |
| B | BasketballDrive | –3.00 | –2.31 | –3.45 | –10.4 | –9.63 | –10.7 | –18.2 | –19.0 | –21.1 |
| B | BQTerrace | –3.49 | –2.84 | –2.96 | –8.68 | –8.62 | –8.70 | –16.8 | –16.8 | –20.6 |
| C | BasketballDrill | –8.05 | –8.48 | –8.60 | –15.3 | –13.8 | –15.6 | –22.2 | –21.0 | –22.6 |
| C | BQMall | –6.00 | –5.63 | –5.80 | –11.3 | –9.59 | –10.7 | –19.1 | –20.2 | –17.3 |
| D | BasketballPass | –7.32 | –6.60 | –6.78 | –11.5 | –13.0 | –14.7 | –20.5 | –17.9 | –16.9 |
| D | BlowingBubbles | –6.53 | –6.10 | –6.56 | –12.7 | –10.8 | –13.4 | –20.7 | –19.0 | –15.9 |
| E | FourPeople | –7.14 | –6.89 | –7.33 | –15.3 | –14.5 | –13.8 | –21.8 | –21.0 | –20.2 |
| E | Johnny | –5.32 | –5.26 | –6.27 | –14.5 | –14.9 | –15.9 | –22.8 | –23.1 | –22.2 |
| | AVERAGE | –5.68 | –5.38 | –5.66 | –12.3 | –11.6 | –12.4 | –19.5 | –18.9 | –19.0 |

Table 2. Upper Bound of Bit Saving under the Proposed Framework in Terms of BD-BR (Unit: %). Small QPs: {11, 16, 21, 26}; normal QPs: {22, 27, 32, 37}; large QPs: {33, 38, 43, 48}

3.2 Deep Learning based Intra Mode Derivation

Figure 3 illustrates the architecture of the proposed deep learning based intra mode derivation scheme, which includes two neural networks: the feature learning network and the intra mode derivation network. The former extracts high-dimensional features, and the latter infers the optimal intra mode directly without RD cost checking. In particular, the hand-crafted and learned features are combined to enjoy their individual benefits. The detailed hyper-parameters of these two networks are listed in Tables 3 and 4.
Fig. 3. Architecture of deep learning based intra mode derivation.
| # | Type | Kernel | Stride | Outputs | Activation |
|---|---|---|---|---|---|
| 1a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 1b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 2a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 2b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 3a | CNN | \(1\times 1\) | 1 | 64 | ReLU |
| 3b | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 4 | CNN | \(3\times 3\) | 1 | 64 | ReLU |
| 5 | CNN | \(3\times 3\) | 1 | 64 | ReLU |

Table 3. Hyper-parameters of the Feature Learning Network
| # | Type | Input Size | Nodes | Activation |
|---|---|---|---|---|
| 1 | FCN | 33792 + 73 | 2048 | ReLU |
| 2 | FCN | 2048 + 73 | 2048 | ReLU |
| 3 | FCN | 2048 + 73 | 2048 | ReLU |
| 4 | FCN | 2048 + 73 | 2048 | ReLU |
| 5 | FCN | 2048 + 73 | 67 | SoftMax |

Table 4. Hyper-parameters of the Intra Mode Derivation Network
The feature learning network includes five convolutional layers, and the first three (each with two sub-layers) are placed in a parallel manner. The kernel sizes of the convolutional layers are \(1\times 1\) and \(3\times 3\), the Rectified Linear Unit (ReLU) is employed as the activation function, and the number of feature maps in each convolutional layer is 64. The intra mode derivation network includes five fully connected layers, and the number of nodes in each layer except the last one is 2048. In the last fully connected layer, the activation function is SoftMax and the number of nodes becomes 67, matching the number of intra modes. The hand-crafted features are included in the input of each fully connected layer, which is represented by,
\begin{equation} {\bf I}_i^f = {\rm {concat}}\big ({\bf O}^f_{i-1}, {\bf f}_0\big), i \in [1,5], \end{equation}
(4)
where \({\bf O}^f_{i-1}\) is the output of the \((i-1)^{th}\) fully connected layer, \({\bf f}_0\) indicates the hand-crafted features, and \({\bf O}^f_0\) is the reshaped vector of the output of the feature learning network. To avoid overfitting, dropout is performed at each fully connected layer, with the rate set as 0.5 at the training stage and 1.0 at the testing stage (i.e., all activations are kept during inference).
With the neighboring blocks and reference pixels, 73 hand-crafted features are collected, including 67 features from the gradient histogram, five features from the intra modes of neighboring blocks, and the QP value. The gradient histogram can be regarded as the probability of each candidate intra mode, and its detailed calculation can be found in [29]. Due to the high spatial correlation, the Up-Left (UL), Up (U), Up-Right (UR), Left (L), and Bottom-Left (BL) blocks provide their finally selected intra modes as five hand-crafted features, as shown in the hand-crafted feature collection module in Figure 3. The QP value balances reconstruction quality and coding bits, i.e., a lower QP value indicates better reconstruction quality and more coding bits, and vice versa, which has an impact on intra mode derivation. In addition, with the QP value as a feature, it is unnecessary to train different networks for different QP settings. At the frame boundary, the intra modes of unavailable neighbors are set as the Planar mode.
Due to lossy video coding, the neighboring blocks and reference pixels are degraded, which may affect the hand-crafted features, especially the gradient histogram. Therefore, learned features are also employed. As mentioned before, the coding block size is not fixed, and blocks can be flexibly partitioned from \(128\times 128\) to \(4\times 4\), including non-square patterns. Accordingly, the number of reference pixels differs across coding blocks. For a luma coding block, the width and height belong to {4, 8, 16, 32, 64} [31]. Therefore, it seems that 25 networks would be required, which challenges computational and storage resources. Instead, one single trained network should be applicable to variable coding blocks. In the designed architecture, the multi-line reference pixels are collected and padded to a fixed size to adapt to variable coding blocks, where the fixed memory is allocated according to the maximum available coding block \(64\times 64\). Suppose the current block has a dimension of \(4\times 4\); padding is performed on the lines that exceed the current block size, where 60 padded lines from the left and 60 padded lines from the top are always used. A matrix with size \((64 + 4 + 64)\times 4\) is fed to the feature learning network. Finally, the learned features with size \((64 + 4 + 64)\times 4\times 64 = 33792\) are produced and reshaped into a one-dimensional vector.
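A minimal Keras-style sketch of this architecture is given below. It assumes the \((64+4+64)\times 4\) single-channel reference input of the \(4\times 4\) example above, 73 hand-crafted features, and that the outputs of the two parallel branches in each of the first three layers are concatenated before the next layer (consistent with the FLOP counts reported below); it is an illustrative reconstruction under these assumptions, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dlimd(ref_shape=(132, 4, 1), num_handcrafted=73, num_modes=67):
    """Sketch of Figure 3: parallel 1x1/3x3 convolution pairs for feature learning,
    then fully connected layers with the hand-crafted features concatenated to
    every layer input, as in Equation (4)."""
    ref = layers.Input(shape=ref_shape, name="reference_pixels")
    f0 = layers.Input(shape=(num_handcrafted,), name="handcrafted_features")

    x = ref
    for _ in range(3):                    # layers 1-3: parallel 1x1 and 3x3 branches
        a = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
        b = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([a, b])
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)   # layer 4
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)   # layer 5
    x = layers.Flatten()(x)               # 132 x 4 x 64 = 33792 learned features

    for _ in range(4):                    # first four fully connected layers, 2048 nodes
        x = layers.Dense(2048, activation="relu")(layers.Concatenate()([x, f0]))
        x = layers.Dropout(0.5)(x)        # active only at training time
    out = layers.Dense(num_modes, activation="softmax")(layers.Concatenate()([x, f0]))
    return tf.keras.Model(inputs=[ref, f0], outputs=out)

model = build_dlimd()
model.summary()
```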
In addition, the number of FLoating-point OPerations (FLOPs) [26] is used to evaluate the complexity of neural network. For convolutional layer,
\begin{equation} FLOPs = 2H \times W \times (C_{in}\times K^2 + 1) \times C_{out}, \end{equation}
(5)
where \(H\), \(W\) and \(C_{in}\) are height, width and number of channels of the input feature map, \(K\) is the kernel size, and \(C_{out}\) is the number of output channels. The values of FLOPs in convolutional layers are \(1.3\times 10^5\), \(6.7\times 10^5\), \(8.7\times 10^6\), \(7.7\times 10^7\), \(8.7\times 10^6\), \(7.7\times 10^7\), \(7.7\times 10^7\), and \(3.9\times 10^7\), respectively. For fully connected layer,
\begin{equation} FLOPs = (2I - 1)\times O, \end{equation}
(6)
where \(I\) is the input dimensionality and \(O\) is the output dimensionality. The values of FLOPs in fully connected layers are \(1.3\times 10^8\), \(8.6\times 10^6\), \(8.6\times 10^6\), \(8.6\times 10^6\), and \(2.8\times 10^5\), respectively.
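Equations (5) and (6) can be checked directly with a few lines of Python. The spatial size 132\(\times\)4 below follows the \(4\times 4\)-block example above, and the 128-channel input of layer 2b assumes that the outputs of the two parallel branches of layer 1 are concatenated; the printed values are computed from the formulas, not copied from the text.

```python
def conv_flops(h, w, c_in, k, c_out):
    """Equation (5): FLOPs of one convolutional layer with unchanged spatial size."""
    return 2 * h * w * (c_in * k * k + 1) * c_out

def fc_flops(i, o):
    """Equation (6): FLOPs of one fully connected layer."""
    return (2 * i - 1) * o

print(conv_flops(132, 4, 1, 1, 64))     # layer 1a: ~1.3e5
print(conv_flops(132, 4, 128, 3, 64))   # layer 2b: ~7.8e7 (input = concatenated 128 maps)
print(fc_flops(33792 + 73, 2048))       # first fully connected layer: ~1.4e8
```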

3.3 Neural Network Training

The DIV2K database [37] with 900 images (whose resolution ranges from \(2040\times 648\) to \(2040\times 2040\)) is used to generate the training dataset. These images are all resized to \(2048\times 1536\), converted from the RGB color space to the YCbCr color space, and packed as a sequence. Then, this pseudo sequence is encoded by VTM 5.0 with the default AI configuration to collect the training samples, where the QP values are set as {22, 27, 32, 37}. During encoding, the hand-crafted features and the source reference pixels of the current block are collected together with the associated label (intra mode), regardless of the coding block size. According to Figure 3 in [51], the distribution of intra modes is uneven: the Planar, DC, horizontal, and vertical modes are selected more frequently than the other intra modes. Such unbalanced data may cause the training of the multi-class classification network to fail. Therefore, the number of training samples for each label and QP is fixed at 50,000, and the total number is \(50000 \times 4 \times 67\). In total, the volume of training data reaches about 80 GB. In addition, \(20000\times 67\) samples are selected for validation, with 20,000 validation samples for each label.
In this work, the TensorFlow package is adopted for network training on an NVIDIA GeForce 1080 Ti GPU with the AdamOptimizer. The workstation memory is 112 GB, which is able to accommodate the training samples. For this multi-class classification task, the cross entropy is utilized as the loss function,
\begin{equation} L = -\frac{1}{N}\sum ^N_{i=1} \sum ^M_{j=1} \big \lbrace y_{j}^{i} \times ln \big (x_{j}^{i}\big)\big \rbrace , \end{equation}
(7)
where \(N\) is the number of training samples in a batch, \(M\) is the number of classes, i.e., \(M=67\), \(y_{j}^{i}\) is the \(j^{th}\) element of the one-hot ground truth of the \(i^{th}\) training sample, and \(x_{j}^{i}\) is the corresponding output of the intra mode derivation network after the softmax layer. For example, if the intra mode of the \(i^{th}\) training sample is 4, then \(y^i_4 = 1\) and \(y^i_j = 0\) \((j \ne 4)\), which can be written as \({\bf y}^i = [y^i_1, y^i_2, \dots , y^i_j, \dots , y^i_{67}] = [0, 0, 0, 1, 0, \dots , 0]\). The batch size and the number of epochs are set as 1024 and 1000, respectively. The initial learning rate \(r_0\) is \(1\times 10^{-4}\), and it is updated after each epoch as \(r_0 \times 0.999^i\), where \(i\) is the index of the training epoch.
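A compact NumPy sketch of the loss in Equation (7) and of the learning-rate schedule follows; the small eps constant is added only for numerical stability and is not part of the paper's formula, and the toy batch is made up for illustration.

```python
import numpy as np

def cross_entropy(y_onehot, x_softmax, eps=1e-12):
    """Equation (7): mean cross entropy over a batch; eps only avoids log(0)."""
    return -np.mean(np.sum(y_onehot * np.log(x_softmax + eps), axis=1))

def learning_rate(epoch, r0=1e-4):
    """Learning rate after the given epoch, r0 * 0.999^i."""
    return r0 * 0.999 ** epoch

# Toy batch with N = 2 samples and M = 67 classes.
y = np.zeros((2, 67)); y[0, 3] = 1.0; y[1, 50] = 1.0   # one-hot ground truth
x = np.full((2, 67), 1.0 / 67)                          # uniform softmax output
print(cross_entropy(y, x))     # ln(67) ~ 4.20 for an uninformed predictor
print(learning_rate(1000))     # ~3.7e-5 after the last training epoch
```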

4 Experimental Results and Analyses

4.1 Coding Performance Comparison

The experiments are conducted on the VTM 5.0 platform following the default AI configuration and the Common Test Conditions (CTC) [5]. The workstation is equipped with an Intel Core i7-4790 CPU @2.60 GHz and the Windows 7 Enterprise 64-bit operating system. The original VTM 5.0 is regarded as the anchor for coding performance comparison, which is evaluated by BD-BR. Twenty-two sequences with various contents and resolutions, different from the training dataset, are utilized in the experiments.
Table 5 illustrates the values of BPM under the proposed method. Two sequences of each class, identical to those in Table 1, are utilized for this experiment. During intra coding, the total bits of intra modes and the number of intra blocks are collected for BPM calculation when the QP values are set as {22, 27, 32, 37}. Compared with Table 1, the average values of BPM change from 3.35, 3.48, 3.44, and 3.39 to 2.38, 2.35, 2.22, and 2.20 under the four QP settings, respectively. In general, the coding bit saving of intra mode, ignoring the residue and other information, can be calculated as follows,
\begin{equation} \eta ^{\prime } = \frac{\alpha - {\alpha }^{\prime }}{\alpha } \times 100 \%, \end{equation}
(8)
where \(\alpha\) is the original value of BPM and \(\alpha ^{\prime }\) is the current value of BPM. Accordingly, the bit saving of intra mode can reach 30.4%, 33.2%, 36.2%, and 35.3% on average under four QP settings, respectively.
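Equation (8) can be verified against the tables directly; for example, Tango2 at QP 22 has \(\alpha = 2.18\) in Table 1 and \(\alpha ^{\prime } = 1.02\) in Table 5, which reproduces the 53.2% saving reported there. A one-line Python helper:

```python
def intra_mode_bit_saving(alpha, alpha_prime):
    """Equation (8): relative saving of intra-mode coding bits, in percent."""
    return (alpha - alpha_prime) / alpha * 100.0

# Tango2 at QP 22: BPM drops from 2.18 (Table 1) to 1.02 (Table 5).
print(intra_mode_bit_saving(2.18, 1.02))   # ~53.2%, as reported in Table 5
```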
| Class | Sequence | \(\alpha ^{\prime }\) (QP 22) | \(\alpha ^{\prime }\) (QP 27) | \(\alpha ^{\prime }\) (QP 32) | \(\alpha ^{\prime }\) (QP 37) | \(\eta ^{\prime }\) (QP 22) | \(\eta ^{\prime }\) (QP 27) | \(\eta ^{\prime }\) (QP 32) | \(\eta ^{\prime }\) (QP 37) |
|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 1.02 | 1.46 | 1.44 | 1.71 | 53.2% | 49.7% | 50.7% | 42.4% |
| A | FoodMarket4 | 1.33 | 1.50 | 1.53 | 1.82 | 45.7% | 42.5% | 41.8% | 32.8% |
| B | BasketballDrive | 1.99 | 1.83 | 1.65 | 1.69 | 32.3% | 37.3% | 43.7% | 40.7% |
| B | BQTerrace | 2.76 | 2.53 | 2.36 | 2.25 | 18.8% | 29.1% | 33.1% | 34.6% |
| C | BQMall | 2.84 | 2.78 | 2.59 | 2.40 | 26.0% | 28.0% | 30.9% | 34.2% |
| C | BasketballDrill | 2.73 | 2.60 | 2.49 | 2.58 | 22.4% | 26.8% | 30.4% | 31.4% |
| D | BlowingBubbles | 3.09 | 2.92 | 2.71 | 2.45 | 28.8% | 31.3% | 35.0% | 35.9% |
| D | BasketballPass | 3.16 | 3.02 | 2.78 | 2.57 | 17.9% | 25.2% | 29.3% | 30.0% |
| E | FourPeople | 2.51 | 2.40 | 2.31 | 2.23 | 30.1% | 33.7% | 35.3% | 36.8% |
| E | Johnny | 2.41 | 2.45 | 2.34 | 2.32 | 28.5% | 28.6% | 31.8% | 33.7% |
| | AVERAGE | 2.38 | 2.35 | 2.22 | 2.20 | 30.4% | 33.2% | 36.2% | 35.3% |

Table 5. Coding Bits Per Intra Mode under the Proposed Method (\(\alpha ^{\prime }\): current value of BPM; \(\eta ^{\prime }\): coding bit saving of intra mode)
Three state-of-the-art works are adopted for coding performance comparison. Narsallah's scheme [29] derives the intra mode with a gradient histogram and finally selects the one with the highest probability. Abdoli's scheme [1] produces a new intra prediction result by weighting the two intra modes with the highest probabilities in the gradient histogram. Li's scheme [22] reconstructs the MPM list with short and long range correlations. These three works optimize intra mode coding from different directions and are related to the proposed method, so their coding performance is compared in Table 6.
| Class | Sequence | Y [29] | U [29] | V [29] | Y [1] | U [1] | V [1] | Y [22] | U [22] | V [22] | Y (Prop.) | U (Prop.) | V (Prop.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –0.20 | –0.41 | 0.01 | –0.69 | –0.91 | –0.16 | –0.08 | –0.04 | 0.13 | –2.50 | –2.77 | –2.20 |
| A1 | FoodMarket4 | –0.36 | 0.06 | –0.44 | –0.81 | –0.33 | –0.68 | –0.02 | –0.14 | 0.00 | –2.65 | –1.66 | –1.14 |
| A1 | Campfire | –0.18 | –0.16 | –0.34 | –0.42 | –0.20 | –0.12 | –0.08 | 0.00 | –0.19 | –2.38 | –1.14 | –1.98 |
| A2 | CatRobot1 | –0.13 | –0.29 | –0.33 | –0.33 | –0.08 | –0.36 | –0.07 | –0.08 | –0.01 | –2.69 | –1.90 | –2.44 |
| A2 | DaylightRoad2 | 0.02 | –0.29 | 0.01 | –0.29 | –0.32 | –0.10 | –0.26 | –0.13 | –0.15 | –2.70 | –2.66 | –2.63 |
| A2 | ParkRunning3 | –0.04 | –0.02 | –0.11 | –0.27 | –0.25 | –0.29 | –0.05 | –0.07 | –0.08 | –1.04 | –0.85 | –0.88 |
| B | MarketPlace | –0.12 | –0.08 | –0.09 | –0.35 | 0.00 | –0.43 | –0.10 | –0.23 | –0.20 | –2.29 | –1.51 | –1.61 |
| B | RitualDance | –0.46 | –0.41 | –0.33 | –0.65 | –0.32 | –0.31 | –0.02 | –0.23 | –0.20 | –1.81 | –1.67 | –1.55 |
| B | Cactus | –0.06 | 0.09 | 0.02 | –0.35 | 0.03 | –0.21 | –0.07 | –0.18 | –0.06 | –2.49 | –1.17 | –3.41 |
| B | BasketballDrive | –0.20 | –0.67 | –0.16 | –0.67 | –0.82 | –0.25 | –0.10 | –0.61 | –0.35 | –2.42 | –2.68 | –2.04 |
| B | BQTerrace | 0.00 | –0.28 | –0.11 | –0.27 | –0.37 | –0.26 | –0.15 | –0.31 | –0.41 | –1.88 | –1.77 | –2.40 |
| C | BasketballDrill | 0.15 | 0.04 | 0.80 | –0.27 | 0.14 | 0.89 | –0.31 | –0.01 | –0.69 | –2.29 | 1.18 | –2.62 |
| C | BQMall | –0.37 | –0.50 | 0.14 | –0.52 | –0.27 | –0.37 | –0.01 | 0.40 | –0.01 | –2.89 | –1.93 | –1.36 |
| C | PartyScene | –0.18 | –0.11 | –0.16 | –0.35 | –0.20 | –0.29 | –0.16 | 0.02 | –0.03 | –1.93 | –2.45 | –0.71 |
| C | RaceHorsesC | –0.16 | –0.16 | –0.39 | –0.49 | –0.16 | –0.11 | 0.01 | 0.08 | –0.08 | –1.78 | –0.89 | –2.01 |
| D | BasketballPass | –0.20 | –0.14 | –0.02 | –0.41 | –0.01 | –0.38 | 0.04 | –0.50 | –0.59 | –1.67 | –2.31 | –4.17 |
| D | BQSquare | –0.33 | –0.27 | –0.02 | –0.23 | –0.08 | –0.24 | –0.01 | 0.14 | 0.04 | –1.92 | –0.13 | –1.73 |
| D | BlowingBubbles | –0.27 | –0.68 | –0.95 | –0.69 | –0.53 | –0.96 | 0.03 | –0.55 | –0.34 | –2.09 | –1.36 | –1.67 |
| D | RaceHorses | –0.25 | 0.29 | 0.38 | –0.43 | –0.65 | –0.38 | –0.04 | –0.03 | 0.69 | –1.84 | –2.48 | –1.42 |
| E | FourPeople | –0.41 | –0.52 | –0.38 | –0.55 | –0.77 | 0.04 | 0.03 | –0.17 | 0.00 | –3.21 | –2.61 | –2.28 |
| E | Johnny | –0.25 | –0.84 | –0.39 | –0.45 | –1.01 | –0.54 | 0.04 | –0.16 | –0.07 | –2.31 | –3.71 | –5.30 |
| E | KristenAndSara | –0.27 | –0.32 | 0.12 | –0.48 | –0.36 | –0.47 | 0.02 | –0.43 | 0.07 | –3.31 | –1.79 | –2.44 |
| | AVERAGE | –0.19 | –0.26 | –0.12 | –0.45 | –0.34 | –0.27 | –0.06 | –0.15 | –0.12 | –2.28 | –1.74 | –2.18 |

Table 6. Performance Comparison in Terms of BD-BR with QPs {22, 27, 32, 37} (Unit: %). [29]: Narsallah's scheme; [1]: Abdoli's scheme; [22]: Li's scheme; Prop.: proposed method
Narsallah's scheme [29] reduces the bit rate by 0.19%, 0.26%, and 0.12% on average for the Y, U, and V components, respectively. Abdoli's scheme [1] saves 0.45%, 0.34%, and 0.27% of the bit rate for the Y, U, and V components. Li's scheme [22] achieves 0.06%, 0.15%, and 0.12% bit rate reduction on average for the luma and two chroma components, respectively. Regarding the proposed method, the bit rate reduction reaches 2.28%, 1.74%, and 2.18% on average for the luma and two chroma components, respectively. From this comparison, it can be observed that the proposed method outperforms the other three methods. Compared with Narsallah's scheme [29], the proposed method not only adopts the existing hand-crafted features, but also learns features in a high-dimensional space for intra mode derivation.
In addition, the test sequences are encoded under the small QP setting {11, 16, 21, 26} and the large QP setting {33, 38, 43, 48} to evaluate the performance of the proposed method. It should be noted that the neural network is not re-trained. The coding performance is shown in Table 7. The bit rate reductions reach 0.71% and 3.64% for the luma component under the small and large QP settings, respectively. Compared with the results in Table 6, the performance under the normal QP setting is slightly worse than that under the large QP setting and better than that under the small QP setting. The reason is that the percentage of coding bits of intra mode in a frame grows as the QP value increases, and vice versa. Consequently, the compression efficiency is greatly improved by the proposed method in the low bit rate scenario.
| Class | Sequence | Y (small) | U (small) | V (small) | Y (large) | U (large) | V (large) |
|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –0.76 | –0.32 | 0.11 | –3.68 | –4.01 | –3.88 |
| A1 | FoodMarket4 | –0.31 | –0.77 | –0.60 | –3.24 | –2.89 | –3.28 |
| A1 | Campfire | –1.00 | –0.66 | –0.67 | –4.37 | –2.63 | –3.27 |
| A2 | CatRobot1 | –0.79 | –0.25 | –0.42 | –4.37 | –2.63 | –3.27 |
| A2 | DaylightRoad2 | –0.44 | –0.18 | –0.03 | –5.03 | –4.61 | –5.28 |
| A2 | ParkRunning3 | –0.26 | –0.26 | –0.23 | –2.33 | –1.60 | –1.72 |
| B | MarketPlace | –0.58 | –0.57 | –0.04 | –2.97 | –5.29 | 0.23 |
| B | RitualDance | –0.29 | –0.58 | –1.14 | –4.87 | –6.58 | –4.68 |
| B | Cactus | –0.60 | –0.49 | –0.46 | –4.15 | –1.81 | –3.41 |
| B | BasketballDrive | –0.60 | 0.01 | –0.86 | –3.65 | –4.40 | –4.16 |
| B | BQTerrace | –0.63 | –0.36 | –0.47 | –3.70 | –3.57 | –6.81 |
| C | BasketballDrill | –0.87 | –1.63 | –1.20 | –3.03 | –5.43 | 0.93 |
| C | BQMall | –1.00 | –0.59 | –0.86 | –4.16 | –4.52 | –5.75 |
| C | PartyScene | –0.84 | –0.51 | –0.58 | –3.71 | –0.98 | –6.77 |
| C | RaceHorsesC | –0.70 | –0.55 | –0.57 | –3.35 | –2.35 | –4.11 |
| D | BasketballPass | –0.66 | 0.53 | –1.87 | –1.91 | –6.51 | –3.23 |
| D | BQSquare | –0.89 | –0.79 | –1.62 | –3.84 | –10.4 | –9.82 |
| D | BlowingBubbles | –0.76 | –1.34 | –0.59 | –2.88 | –0.62 | 1.48 |
| D | RaceHorses | –0.89 | –0.79 | –1.62 | –3.84 | –10.4 | –9.82 |
| E | FourPeople | –0.87 | –0.71 | –0.97 | –2.96 | –2.23 | –4.26 |
| E | Johnny | –1.21 | –1.08 | –0.89 | –4.20 | –3.56 | –4.80 |
| E | KristenAndSara | –0.62 | –0.64 | –1.15 | –4.36 | –3.06 | –2.66 |
| | AVERAGE | –0.71 | –0.57 | –0.76 | –3.64 | –4.15 | –4.00 |

Table 7. Performance Evaluation in Terms of BD-BR with Different QP Settings (Unit: %). Small QPs: {11, 16, 21, 26}; large QPs: {33, 38, 43, 48}
As shown in Figure 4, the first frames of six sequences, including BasketballPass (\(416\times 240\)), BQSquare (\(416\times 240\)), BQMall (\(832\times 480\)), BasketballDrill (\(832\times 480\)), FourPeople (\(1280\times 720\)), and Johnny (\(1280\times 720\)), are utilized to visualize the coding blocks that select the proposed DLIMD in a frame, where the QP value is 22. The selected blocks are marked in five different colors according to their size: blocks smaller than \(8\times 8\) are marked in red; blocks greater than \(8\times 8\) and smaller than \(16\times 16\) in green; blocks greater than \(16\times 16\) and smaller than \(32\times 32\) in blue; blocks greater than \(32\times 32\) and smaller than \(64\times 64\) in black; otherwise, the block is marked in white. It can be clearly observed that many coding blocks select the proposed DLIMD. Moreover, the quantitative results are presented in Table 8 under different QP settings and sequences. The percentage of DLIMD selection is calculated as the ratio of the selected area to the whole frame, which is represented by,
\begin{equation} \Omega = \frac{\sum _{i=1}^{N}C_i \times w_i \times h_i}{\sum _{i=1}^{N}w_i \times h_i} \times 100\%, \end{equation}
(9)
where \(N\) indicates the number of coding blocks in a frame, \(C_i\) indicates the DLIMD selection (\(C_i = 0\) if the current coding block does not select DLIMD), and \(w_i\) and \(h_i\) are the width and height of the current coding block. From this table, the percentage reaches 42.9%, 45.2%, 48.5%, and 50.5% on average under the four QP settings, respectively, which indicates that the coding performance can be efficiently improved. In addition, the selected blocks under the four QP settings are re-organized according to block size. There are 17 available block sizes for these sequences, i.e., \(4\times 4\), \(4\times 8\), \(4\times 16\), \(4\times 32\), \(8\times 4\), \(8\times 8\), \(8\times 16\), \(8\times 32\), \(16\times 4\), \(16\times 8\), \(16\times 16\), \(16\times 32\), \(32\times 4\), \(32\times 8\), \(32\times 16\), \(32\times 32\), and \(64\times 64\). For each block size, the ratio of the number of selected blocks to the total number of blocks is calculated, as shown in Figure 5. It can be observed that the ratio ranges from 43.0% to 54.5%.
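Equation (9) is an area-weighted ratio; the following Python sketch computes it for a toy list of blocks (the block sizes and selection flags are made up purely for illustration).

```python
def dlimd_selection_ratio(blocks):
    """Equation (9): area-weighted share of coding blocks that select DLIMD.
    `blocks` is a list of (selected, width, height) tuples, selected in {0, 1}."""
    selected_area = sum(c * w * h for c, w, h in blocks)
    total_area = sum(w * h for _, w, h in blocks)
    return selected_area / total_area * 100.0

# Toy frame with three blocks: a selected 32x32, a skipped 16x16, a selected 8x8.
print(dlimd_selection_ratio([(1, 32, 32), (0, 16, 16), (1, 8, 8)]))   # ~81.0%
```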
Fig. 4. DLIMD selected in a frame. (They are resized to the same resolution for visualization.)
Fig. 5. Percentage of selected blocks according to block size.
| Class | Sequence | QP = 22 | QP = 27 | QP = 32 | QP = 37 |
|---|---|---|---|---|---|
| A1 | Tango2 | 39.9 | 41.3 | 44.3 | 46.5 |
| A1 | FoodMarket4 | 38.5 | 37.6 | 41.4 | 44.2 |
| A1 | Campfire | 48.0 | 48.6 | 49.8 | 53.5 |
| A2 | CatRobot1 | 41.3 | 47.3 | 50.5 | 50.6 |
| A2 | DaylightRoad2 | 45.4 | 50.9 | 52.8 | 52.4 |
| A2 | ParkRunning3 | 43.6 | 46.5 | 47.2 | 47.9 |
| B | MarketPlace | 41.3 | 44.5 | 47.1 | 48.9 |
| B | RitualDance | 43.8 | 45.9 | 50.2 | 52.2 |
| B | Cactus | 40.8 | 45.1 | 48.1 | 50.7 |
| B | BasketballDrive | 43.4 | 46.1 | 50.6 | 52.6 |
| B | BQTerrace | 41.7 | 48.4 | 51.7 | 54.4 |
| C | BasketballDrill | 57.8 | 60.3 | 59.3 | 56.1 |
| C | BQMall | 43.3 | 44.9 | 48.5 | 50.5 |
| C | PartyScene | 41.8 | 45.5 | 47.8 | 52.5 |
| C | RaceHorsesC | 39.8 | 43.5 | 45.4 | 49.9 |
| D | BasketballPass | 45.5 | 40.8 | 48.8 | 51.3 |
| D | BQSquare | 39.6 | 40.6 | 45.5 | 49.8 |
| D | BlowingBubbles | 41.5 | 42.8 | 47.7 | 48.7 |
| D | RaceHorses | 38.8 | 39.7 | 45.6 | 50.3 |
| E | FourPeople | 43.2 | 45.5 | 47.5 | 48.2 |
| E | Johnny | 42.3 | 45.0 | 49.2 | 50.0 |
| E | KristenAndSara | 42.2 | 43.3 | 47.4 | 49.4 |
| | AVERAGE | 42.9 | 45.2 | 48.5 | 50.5 |

Table 8. Percentage of the Proposed Method Selection (Unit: %)

4.2 Influence of Learned and Hand-crafted Features

The individual influences of the learned features and the hand-crafted features in the proposed architecture (shown in Figure 3) are analyzed. Four cases are compared: (1) H: the feature learning module is removed and only the hand-crafted features are used for intra mode derivation; (2) L: the hand-crafted features are removed and only the learned features are used for intra mode derivation; (3) H'+L: both the hand-crafted features (excluding the gradient histogram) and the learned features are used for intra mode derivation; and (4) H+L: both the hand-crafted features and the learned features are used for intra mode derivation.
With the same training samples described in Section 3.3, three more neural networks are trained separately according to the listed cases. The training process is the same as that in Section 3.3, where an NVIDIA GeForce 1080 Ti GPU with the AdamOptimizer is adopted and the loss function is the cross entropy. Figure 6 illustrates the comparison of these four cases in terms of training loss and validation classification accuracy. The classification accuracy is calculated as follows,
\begin{equation} P = \frac{1}{N}\sum _{i=1}^N \delta _i \times 100\%, \end{equation}
(10)
where \(N\) is the number of testing samples, and \(\delta _i=1\) if the difference between the predicted label and the ground truth is no more than a pre-defined threshold \(\Delta\), i.e., \(\Vert \texttt {argmax}({\bf y}^i) - \texttt {argmax}({\bf O}^f_5)\Vert \le \Delta\); otherwise \(\delta _i=0\). argmax() returns the position of the maximum value in a vector, \({\bf y}^i\) is the ground truth represented in the one-hot manner, and \({\bf O}^f_5\) is the output of the intra mode derivation network. Here, the value of \(\Delta\) is set as 0. The cases with only hand-crafted and only learned features both converge at around 3.8 cross entropy loss and achieve about 25% validation classification accuracy, the case of H'+L converges at about 3.6 cross entropy loss and achieves about 30% validation classification accuracy, while the combination of hand-crafted and learned features converges at 3.5 cross entropy loss and achieves about 35% validation classification accuracy.
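A NumPy sketch of Equation (10) is given below; it simply compares mode indices by argmax and applies the threshold \(\Delta\), with made-up one-hot vectors for illustration.

```python
import numpy as np

def classification_accuracy(y_true_onehot, y_pred_softmax, delta=0):
    """Equation (10): a sample counts as correct when the index distance between
    the predicted label and the ground truth is at most delta."""
    diff = np.abs(np.argmax(y_true_onehot, axis=1) - np.argmax(y_pred_softmax, axis=1))
    return np.mean(diff <= delta) * 100.0

# Toy check with three samples whose predictions are exact, off by one, off by ten.
truth = np.eye(67)[[10, 20, 30]]
pred = np.eye(67)[[10, 21, 40]]
print(classification_accuracy(truth, pred, delta=0))   # 33.3%
print(classification_accuracy(truth, pred, delta=1))   # 66.7%
```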
Fig. 6. Comparison of four cases with hand-crafted and learned features.
From these results, it can be clearly observed that combining the hand-crafted and learned features achieves the best performance among the four cases. The reason is that although the CNN is able to extract high-level features and latent representations, the hand-crafted features can still provide useful information and compensate for the limitations of the learned features. For example, the intra modes of the spatial neighbors, which cannot be learned by the feature learning network, play an important role in intra mode derivation.

4.3 Ablation Study of Architecture

In addition, we further analyze the impact of the modules in the network architecture. Alternative networks are designed and illustrated in Figure 7. Different from the proposed network shown in Figure 3, the convolutional layers are placed in a serial manner; the number of feature maps in the first three layers is 128, which matches the input of convolutional layers 2a, 2b, 3a, and 3b in Figure 3; the kernel sizes of the first and third convolutional layers are set as \(1\times 1\) and the others are \(3\times 3\); and the hand-crafted features are only combined with the first fully connected layer.
Fig. 7. Alternative network. (H: hand-crafted feature, L: learned feature).
Three configurations are compared, i.e., Case A: feature learning network in Figure 3 and intra mode derivation network in Figure 3; Case B: feature learning network in Figure 3 and intra mode derivation network in Figure 7; Case C: feature learning network in Figure 7 and intra mode derivation network in Figure 3. It should be noted that Case A is the proposed one. Two more networks for Cases B and C are trained with the same samples, and the results are compared in terms of multi-class classification accuracy. Two test sequences from each class defined in the CTC [5] are employed, i.e., BasketballPass (\(416\times 240\)), BlowingBubbles (\(416\times 240\)), BQMall (\(832\times 480\)), BasketballDrill (\(832\times 480\)), FourPeople (\(1280\times 720\)), Johnny (\(1280\times 720\)), BasketballDrive (\(1920\times 1080\)), BQTerrace (\(1920\times 1080\)), Tango2 (\(3840\times 2160\)), and FoodMarket4 (\(3840\times 2160\)). These sequences are all encoded by VTM 5.0 with the default AI configuration, where the QP values are set as {22, 27, 32, 37}, and the testing samples are collected during encoding. For each sequence, \(800\times 67\times 4\) samples are selected under the four QP settings. Table 9 illustrates the experimental results, where the classification accuracy is calculated by Equation (10). Under the condition of \(\Delta = 0\), the multi-class classification accuracies are 34.8%, 31.4%, and 32.6% on average for Cases A, B, and C, respectively. As such, we can conclude that combining the hand-crafted features with each fully connected layer and placing convolutional kernels of different sizes in parallel achieve better performance.
| Class | Sequence | A (\(\Delta\)=0) | A (\(\Delta\)=1) | A (\(\Delta\)=3) | A (\(\Delta\)=5) | B (\(\Delta\)=0) | B (\(\Delta\)=1) | B (\(\Delta\)=3) | B (\(\Delta\)=5) | C (\(\Delta\)=0) | C (\(\Delta\)=1) | C (\(\Delta\)=3) | C (\(\Delta\)=5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | 31.8 | 45.3 | 59.8 | 68.0 | 28.4 | 41.6 | 56.7 | 65.3 | 29.6 | 43.3 | 58.1 | 67.1 |
| A | FoodMarket4 | 34.5 | 50.9 | 67.0 | 75.3 | 31.1 | 47.8 | 64.2 | 73.3 | 31.8 | 49.1 | 65.9 | 74.4 |
| B | BasketballDrive | 37.2 | 52.6 | 65.6 | 72.1 | 34.0 | 49.6 | 62.5 | 69.5 | 34.5 | 49.3 | 63.1 | 70.2 |
| B | BQTerrace | 32.8 | 47.3 | 58.6 | 64.9 | 28.7 | 43.0 | 54.4 | 61.3 | 31.0 | 46.1 | 58.0 | 64.7 |
| C | BQMall | 35.6 | 50.9 | 62.3 | 67.5 | 32.9 | 47.4 | 58.2 | 64.1 | 33.9 | 50.0 | 60.9 | 66.8 |
| C | BasketballDrill | 37.3 | 55.0 | 66.9 | 72.4 | 33.8 | 52.4 | 64.6 | 70.0 | 35.4 | 54.0 | 66.9 | 72.5 |
| D | BlowingBubbles | 32.1 | 47.2 | 61.1 | 67.8 | 29.2 | 44.1 | 58.2 | 65.7 | 30.2 | 46.0 | 60.8 | 68.3 |
| D | BasketballPass | 34.7 | 50.8 | 63.4 | 69.3 | 32.0 | 48.0 | 59.5 | 65.8 | 32.5 | 49.9 | 62.2 | 68.2 |
| E | FourPeople | 34.1 | 49.3 | 61.5 | 67.7 | 29.8 | 44.8 | 56.8 | 62.9 | 32.2 | 47.5 | 59.7 | 66.1 |
| E | Johnny | 37.6 | 54.7 | 68.4 | 75.1 | 34.2 | 51.5 | 65.5 | 72.7 | 35.2 | 52.5 | 66.6 | 73.4 |
| | AVERAGE | 34.8 | 50.4 | 63.5 | 70.0 | 31.4 | 47.0 | 60.1 | 67.1 | 32.6 | 48.7 | 62.2 | 69.1 |

Table 9. Multi-class Classification Accuracy (Unit: %). A: Case A (proposed); B: Case B; C: Case C
In addition, the normalized confusion matrices of Case A are illustrated in Figure 8, where the horizontal axis is the predicted label and the vertical axis is the ground truth. It can be observed that the difference between the ground truth and the predicted label is limited. For the proposed Case A, the average classification accuracies under the four conditions are 34.8%, 50.4%, 63.5%, and 70.0% in Table 9, respectively. Although there are some differences between the predicted label and the ground truth under the conditions of \(\Delta = \lbrace 1, 3, 5\rbrace\), the intra prediction results may be similar, and the RDO will balance the distortion and coding bits during video coding. Therefore, coding gains can still be achieved with a limited difference between the predicted label and the ground truth.
Fig. 8. Confusion matrix of multi-class classification under Case A.

4.4 Computational Complexity Analyses

Additionally, the coding/decoding time of video codec equipped with the DLIMD is compared with that of the anchor, which is calculated by,
\begin{equation} \Delta T_m = \frac{1}{4}\sum _{i=1}^{4}{\frac{T_{\Psi }^m(QP_i)}{T_{c}^m(QP_i)}}, \end{equation}
(11)
where \(T_{c}^m(QP_i)\) is the coding/decoding time of the anchor under \(QP_i\), \(T_{\Psi }^m(QP_i)\) is the coding/decoding time of the video codec equipped with the proposed method under \(QP_i\), and \(m \in\) {coding, decoding}. Compared with the anchor, the computational complexity of the proposed method is 33.6 and 140.3 times under the CPU+GPU platform, and 231.3 and 604.8 times under the CPU platform, on average for video coding and decoding, respectively. This computational complexity is a great challenge: in the video codec, DLIMD is performed on variable coding blocks, and the convolutional/fully connected operations in the neural network result in high complexity.
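Equation (11) is a plain average of per-QP time ratios; a small sketch with hypothetical timings (not measured values) illustrates how such multipliers are obtained.

```python
def complexity_ratio(t_proposed, t_anchor):
    """Equation (11): average time ratio over the four QP settings."""
    assert len(t_proposed) == len(t_anchor) == 4
    return sum(p / a for p, a in zip(t_proposed, t_anchor)) / 4.0

# Hypothetical per-QP encoding times in seconds (proposed codec vs. anchor).
print(complexity_ratio([3400, 3100, 2900, 2700], [100, 95, 90, 85]))   # ~33x
```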
For other deep learning based schemes [12, 38] that focus on the optimization of intra prediction, the computational complexities are 9.87 and 87.4 times at the encoder side and 151.7 and 124.5 times at the decoder side with respect to the anchor. The former and latter schemes, which achieve 1.92% and 3.4% bit rate reductions for the luma component, run on CPU and CPU+GPU platforms, respectively. For the conventional schemes [1, 22, 29] whose compression efficiencies have been compared in Table 6, the encoding and decoding complexities are 109%, 111%, 101% and 104%, 105%, 100% with respect to the anchor. It can be found that the computational complexity of deep learning based schemes, including the proposed one, is much higher than that of conventional schemes.
Generally, the strategies to accelerate deep learning based schemes include SIMD optimization, neural network quantization, and parameter/layer pruning. The first two strategies require support/optimization from hardware devices. Therefore, the third one is adopted to investigate the trade-off between computational complexity and compression efficiency. One more architecture (denoted as DLIMD-L) is designed by reducing the parameters, i.e., the output of the last layer in the feature learning network is changed from 64 to 16, and the number of nodes of the hidden layers except the last one in the intra mode derivation network is changed from 2048 to 128. According to the definition of FLOPs [26], the computational complexity is thus largely reduced. DLIMD-L is trained with the same training dataset as DLIMD. The coding experiments are performed on the CPU+GPU platform and the results are presented in Table 10, where the sequences are all encoded with the default AI configuration and the QP values are set as {22, 27, 32, 37}. It can be observed that the coding efficiency of DLIMD-L is \(-\)1.17% on average for the luma component in terms of BD-BR, which is worse than that of DLIMD. The encoding and decoding complexities of DLIMD-L are 27.9 times and 40.8 times with respect to the anchor (VTM 5.0), i.e., 6.8 times of encoding complexity and 105.0 times of decoding complexity are reduced compared with DLIMD. Although the computational complexity is still high, we believe that it can be optimized in the future.
| Class | Sequence | DLIMD Y | DLIMD U | DLIMD V | DLIMD Enc. | DLIMD Dec. | DLIMD-L Y | DLIMD-L U | DLIMD-L V | DLIMD-L Enc. | DLIMD-L Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | Tango2 | –2.50 | –2.77 | –2.20 | \(36.8\times\) | \(144.7\times\) | –1.71 | –2.53 | –1.30 | \(28.8\times\) | \(38.1\times\) |
| A | FoodMarket4 | –2.65 | –1.66 | –1.14 | \(25.8\times\) | \(134.9\times\) | –1.92 | –1.08 | –0.39 | \(19.9\times\) | \(39.1\times\) |
| B | BasketballDrive | –2.42 | –2.68 | –2.04 | \(36.6\times\) | \(135.3\times\) | –1.55 | –1.70 | –2.27 | \(29.0\times\) | \(35.3\times\) |
| B | BQTerrace | –1.88 | –1.77 | –2.40 | \(37.0\times\) | \(167.9\times\) | –0.73 | –0.97 | –0.33 | \(30.3\times\) | \(36.2\times\) |
| C | BQMall | –2.89 | –1.93 | –1.36 | \(33.9\times\) | \(138.2\times\) | –2.11 | –0.36 | –1.14 | \(27.2\times\) | \(56.3\times\) |
| C | BasketballDrill | –2.29 | 1.18 | –2.62 | \(33.4\times\) | \(176.9\times\) | 1.40 | 0.97 | –0.26 | \(27.4\times\) | \(37.1\times\) |
| D | BlowingBubbles | –2.09 | –1.36 | –1.67 | \(30.4\times\) | \(136.8\times\) | –1.32 | –0.35 | –1.77 | \(25.9\times\) | \(38.6\times\) |
| D | BasketballPass | –1.67 | –2.31 | –4.17 | \(32.2\times\) | \(155.8\times\) | –0.46 | 1.77 | –1.65 | \(26.7\times\) | \(43.0\times\) |
| E | FourPeople | –3.21 | –2.61 | –2.28 | \(42.9\times\) | \(136.5\times\) | –2.44 | –2.81 | –1.54 | \(34.1\times\) | \(46.1\times\) |
| E | Johnny | –2.31 | –3.71 | –5.30 | \(38.1\times\) | \(131.2\times\) | –0.81 | –0.73 | –3.52 | \(30.4\times\) | \(38.2\times\) |
| | AVERAGE | –2.39 | –1.96 | –2.52 | \(34.7\times\) | \(145.8\times\) | –1.17 | –0.78 | –1.42 | \(27.9\times\) | \(40.8\times\) |

Table 10. Trade-off between Computational Complexity and Compression Efficiency on the Platform of CPU+GPU (BD-BR in %; encoding/decoding complexity relative to the anchor)

4.5 Coding Performance under the Latest VVC Test Model and Other Configurations

In addition, the proposed method is evaluated on the platform of the latest VVC test model, i.e., VTM 16.0, in which DLIMD has been implemented. Besides the AI configuration, the coding experiments are also conducted under the Low Delay P (LDP) and Random Access (RA) configurations. It should be noted that the neural network is not changed after training, as described in Section 3.3.
The experimental results are shown in Table 11, where the original VTM 16.0 is utilized as the anchor to calculate the value of BD-BR. It can be observed that the proposed method achieves 1.91%, 0.87%, and 1.15% bit rate reductions for the Y component under the AI, LDP, and RA configurations, respectively. The coding gains are slightly lower than those under VTM 5.0 shown in Table 6, because the neural network is not re-trained and intra coding has been further optimized from VTM 5.0 to VTM 16.0.
| Class | Sequence | Y (AI) | U (AI) | V (AI) | Y (LDP) | U (LDP) | V (LDP) | Y (RA) | U (RA) | V (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | Tango2 | –2.48 | –0.33 | –1.60 | –0.83 | –0.81 | –0.46 | –1.18 | –0.96 | –0.10 |
| A1 | FoodMarket4 | –1.87 | –2.17 | –1.36 | –0.63 | –1.43 | 0.05 | –0.94 | –0.38 | –1.03 |
| A1 | Campfire | –2.37 | –1.49 | –2.72 | –1.18 | –0.95 | –0.71 | –1.82 | –1.47 | –1.32 |
| A2 | CatRobot1 | –2.54 | –1.53 | –2.14 | –1.36 | –1.95 | –1.62 | –1.53 | –1.72 | –1.32 |
| A2 | DaylightRoad2 | –2.48 | –2.06 | –2.86 | –1.84 | –2.55 | –2.29 | –2.30 | –1.69 | –2.02 |
| A2 | ParkRunning3 | –1.07 | –0.78 | –0.37 | –0.41 | –0.41 | –0.43 | –0.48 | –0.36 | –0.35 |
| B | MarketPlace | –1.88 | –1.31 | –1.99 | –0.60 | –0.55 | 0.41 | –1.11 | –1.61 | 0.88 |
| B | RitualDance | –3.48 | –4.15 | –3.44 | –0.72 | –0.69 | –1.20 | –1.04 | –0.86 | –1.53 |
| B | Cactus | –2.27 | –1.61 | –1.05 | –1.08 | –0.94 | –0.37 | –1.63 | –1.35 | –2.01 |
| B | BasketballDrive | –2.35 | –1.94 | –2.38 | –0.78 | –1.39 | –0.24 | –1.10 | –0.29 | –1.06 |
| B | BQTerrace | –1.74 | –1.63 | –0.80 | –0.78 | –1.46 | –1.23 | –1.30 | –1.12 | –0.97 |
| C | BasketballDrill | –1.31 | –2.29 | –1.05 | –0.41 | –1.03 | 0.06 | –1.22 | –2.00 | –1.41 |
| C | BQMall | –2.11 | –0.34 | –2.58 | –1.14 | –2.15 | –1.16 | –1.34 | 0.04 | –0.12 |
| C | PartyScene | –1.50 | –1.10 | –2.58 | –0.77 | –0.57 | –1.27 | –0.90 | –1.08 | –0.52 |
| C | RaceHorsesC | –1.47 | 0.18 | –0.42 | –0.25 | –1.02 | –0.67 | –0.65 | –1.23 | 0.11 |
| D | BasketballPass | –0.47 | –1.94 | –3.35 | –0.25 | –1.25 | 0.70 | 0.18 | –2.36 | –0.67 |
| D | BQSquare | –1.26 | –2.13 | 0.27 | –0.55 | –2.42 | –3.05 | –0.74 | –0.07 | –0.96 |
| D | BlowingBubbles | –1.11 | –1.05 | 0.20 | –0.32 | –0.39 | 0.71 | –1.03 | –0.21 | –0.78 |
| D | RaceHorses | –1.12 | –1.04 | 0.81 | –0.25 | –1.01 | 0.40 | –0.38 | 2.08 | 0.16 |
| E | FourPeople | –3.02 | –1.73 | –2.68 | –2.04 | –2.66 | –2.93 | –2.14 | –1.28 | –2.18 |
| E | Johnny | –1.88 | –3.26 | –2.15 | –1.23 | –0.35 | –2.26 | –1.06 | –1.87 | –0.71 |
| E | KristenAndSara | –2.22 | –4.38 | –2.66 | –1.64 | 0.44 | –2.44 | –1.50 | –1.16 | –1.95 |
| | AVERAGE | –1.91 | –1.73 | –1.68 | –0.87 | –1.16 | –0.91 | –1.15 | –0.95 | –0.90 |

Table 11. Coding Performance in Terms of BD-BR on the Latest Platform of VTM 16.0 under AI, LDP, and RA Configurations (Unit: %)

5 Conclusions

In this paper, a deep learning based intra mode derivation method is presented to skip the module of intra mode signaling and save coding bits. Instead of checking the candidate intra modes one by one to obtain the optimal one, this process is cast as a multi-class classification task, moving from signal processing to artificial intelligence. The architecture is designed so that one single trained model adapts to variable coding blocks and different QP settings. In particular, the hand-crafted and learned features are combined to compensate for their individual limitations. Rate-distortion optimization is performed between the proposed method and the traditional method, with a strategy flag signaled to indicate the selected scheme. Compared with the state-of-the-art works, the proposed method achieves significant coding gains.

References

[1]
Mohsen Abdoli, Thomas Guionnet, Mickael Raulet, Gosala Kulupana, and Saverio Blasi. 2020. Decoder-side intra mode derivation for next generation video coding. In 2020 IEEE International Conference on Multimedia and Expo (ICME). 1–6.
[2]
Mohsen Abdoli, Félix Henry, Patrice Brault, Pierre Duhamel, and Frédéric Dufaux. 2018. Short-distance intra prediction of screen content in versatile video coding (VVC). IEEE Signal Processing Letters 25, 11 (2018), 1690–1694.
[3]
Gisle Bjontegaard. 2001. Calculation of Average PSNR Differences between RD Curves. ITU-T Video Coding Experts Group, VCEG-M33.
[4]
Saverio G. Blasi, Marta Mrak, and Ebroul Izquierdo. 2015. Frequency-domain intra prediction analysis and processing for high-quality video coding. IEEE Transactions on Circuits and Systems for Video Technology 25, 5 (2015), 798–811.
[5]
Frank Bossen, Jill Boyce, Karsten Suehring, Xiang Li, and Vadim Seregin. 2019. JVET Common Test Conditions and Software Reference Configurations for SDR Video. Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, JVET-N1010-v1.
[6]
Fabian Brand, Jürgen Seiler, and André Kaup. 2021. Intra-frame coding using a conditional autoencoder. IEEE Journal of Selected Topics in Signal Processing 15, 2 (2021), 354–365.
[7]
Benjamin Bross, Jianle Chen, Jens-Rainer Ohm, Gary J. Sullivan, and Ye-Kui Wang. 2021. Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC). Proc. IEEE 109, 9 (2021), 1463–1493.
[8]
Xun Cai and Jae S. Lim. 2013. Algorithms for transform selection in multiple-transform video compression. IEEE Transactions on Image Processing 22, 12 (2013), 5395–5407.
[9]
Yao-Jen Chang, Hong-Jheng Jhu, Hui-Yu Jiang, Liang Zhao, Xin Zhao, Xiang Li, Shan Liu, Benjamin Bross, Paul Keydel, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. 2019. Multiple reference line coding for most probable modes in intra prediction. In 2019 Data Compression Conference (DCC). 559–559.
[10]
Haoming Chen, Tao Zhang, Ming-Ting Sun, Ankur Saxena, and Madhukar Budagavi. 2016. Improving intra prediction in high-efficiency video coding. IEEE Transactions on Image Processing 25, 8 (2016), 3671–3682.
[11]
Jie Chen, Junhui Hou, and Lap-Pui Chau. 2018. Light field compression with disparity-guided sparse coding based on structural key views. IEEE Transactions on Image Processing 27, 1 (2018), 314–324.
[12]
Thierry Dumas, Franck Galpin, and Philippe Bordes. 2021. Iterative training of neural networks for intra prediction. IEEE Transactions on Image Processing 30 (2021), 697–711.
[13]
Thierry Dumas, Aline Roumy, and Christine Guillemot. 2020. Context-adaptive neural network-based prediction for image compression. IEEE Transactions on Image Processing 29 (2020), 679–693.
[14]
Edouard François, Chad Fogg, Yuwen He, Xiang Li, Ajay Luthra, and Andrew Segall. 2016. High dynamic range and wide color gamut video coding in HEVC: Status and potential future enhancements. IEEE Transactions on Circuits and Systems for Video Technology 26, 1 (2016), 63–75.
[15]
Han Gao, Xu Chen, Semih Esenlik, Jianle Chen, and Eckehard Steinbach. 2021. Decoder-side motion vector refinement in VVC: Algorithm and hardware implementation considerations. IEEE Transactions on Circuits and Systems for Video Technology 31, 8 (2021), 3197–3211.
[16]
Yueyu Hu, Wenhan Yang, Mading Li, and Jiaying Liu. 2019. Progressive spatial recurrent neural network for intra prediction. IEEE Transactions on Multimedia 21, 12 (2019), 3024–3037.
[17]
Yu-Wen Huang, Chih-Wei Hsu, Ching-Yeh Chen, Tzu-Der Chuang, Shih-Ta Hsiang, Chun-Chia Chen, Man-Shu Chiang, Chen-Yen Lai, Chia-Ming Tsai, Yu-Chi Su, Zhi-Yi Lin, Yu-Ling Hsiao, Olena Chubach, Yu-Cheng Lin, and Shaw-Min Lei. 2020. A VVC proposal with quaternary tree plus binary-ternary tree coding block structure and advanced coding techniques. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 1311–1325.
[18]
Minqiang Jiang, Shanxi Li, Nam Ling, Jianhua Zheng, and Philipp Zhang. 2018. On derivation of most probable modes for intra prediction in video coding. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS). 1–4.
[19]
Jani Lainema, Frank Bossen, Woo-Jin Han, Junghye Min, and Kemal Ugur. 2012. Intra coding of the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1792–1801.
[20]
Congrui Li, Zhenghui Zhao, Junru Li, Xiang Zhang, Siwei Ma, and Chen Li. 2019. Bi-intra prediction for versatile video coding. In 2019 Data Compression Conference (DCC). 587–587.
[21]
Jiahao Li, Bin Li, Jizheng Xu, and Ruiqin Xiong. 2018. Efficient multiple-line-based intra prediction for HEVC. IEEE Transactions on Circuits and Systems for Video Technology 28, 4 (2018), 947–957.
[22]
Junru Li, Meng Wang, Li Zhang, Kai Zhang, Hongbin Liu, Shiqi Wang, Siwei Ma, and Wen Gao. 2020. Unified intra mode coding based on short and long range correlations. IEEE Transactions on Image Processing 29 (2020), 7245–7260.
[23]
Yue Li, Yan Yi, Dong Liu, Li Li, Zhu Li, and Houqiang Li. 2021. Neural-network-based cross-channel intra prediction. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 3, Article 77 (Jul. 2021), 23 pages.
[24]
Di Ma, Fan Zhang, and David Bull. 2022. BVI-DVC: A training database for deep video compression. IEEE Transactions on Multimedia 24 (2022), 3847–3858.
[25]
Siwei Ma, Xinfeng Zhang, Chuanmin Jia, Zhenghui Zhao, Shiqi Wang, and Shanshe Wang. 2020. Image and video compression with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology 30, 6 (2020), 1683–1698.
[26]
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations (ICLR). 1–17.
[27]
Elie Gabriel Mora, Joel Jung, Marco Cagnazzo, and Béatrice Pesquet-Popescu. 2014. Depth video coding based on intra mode inheritance from texture. APSIPA Transactions on Signal and Information Processing 3 (2014), 1–13.
[28]
Karsten Müller, Heiko Schwarz, Detlev Marpe, Christian Bartnik, Sebastian Bosse, Heribert Brust, Tobias Hinz, Haricharan Lakshman, Philipp Merkle, Franz Hunn Rhee, Gerhard Tech, Martin Winken, and Thomas Wiegand. 2013. 3D high-efficiency video coding for multi-view video and depth data. IEEE Transactions on Image Processing 22, 9 (2013), 3366–3378.
[29]
Anthony Nasrallah, Mohsen Abdoli, Elie Gabriel Mora, Thomas Guionnet, and Mickael Raulet. 2019. Decoder-side intra mode derivation with texture analysis in VVC test model. In 2019 IEEE International Conference on Image Processing (ICIP). 3153–3157.
[30]
Anthony Nasrallah, Elie Mora, Thomas Guionnet, and Mickael Raulet. 2019. Decoder-side intra mode derivation based on a histogram of gradients in versatile video coding. In 2019 Data Compression Conference (DCC). 597–597.
[31]
Jonathan Pfaff, Alexey Filippov, Shan Liu, Xin Zhao, Jianle Chen, Santiago De-Luxán-Hernández, Thomas Wiegand, Vasily Rufitskiy, Adarsh Krishnan Ramasubramonian, and Geert Van der Auwera. 2021. Intra prediction and mode coding in VVC. IEEE Transactions on Circuits and Systems for Video Technology 31, 10 (2021), 3834–3847.
[32]
Kevin Reuze, Wassim Hamidouche, Pierrick Philippe, and Olivier Deforges. 2019. Dynamic lists for efficient coding of intra prediction modes in the future video coding standard. In 2019 Data Compression Conference (DCC). 601–601.
[33]
Sebastian Schwarz, Marius Preda, Vittorio Baroncini, Madhukar Budagavi, Pablo Cesar, Philip A. Chou, Robert A. Cohen, Maja Krivokuća, Sébastien Lasserre, Zhu Li, Joan Llach, Khaled Mammou, Rufael Mekuria, Ohji Nakagami, Ernestasia Siahaan, Ali Tabatabai, Alexis M. Tourapis, and Vladyslav Zakharchenko. 2019. Emerging MPEG standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 133–148.
[34]
Michael Schäfer, Björn Stallenberger, Jonathan Pfaff, Philipp Helle, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. 2020. Efficient fixed-point implementation of matrix-based intra prediction. In 2020 IEEE International Conference on Image Processing (ICIP). 3364–3368.
[35]
Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22, 12 (2012), 1649–1668.
[36]
Heming Sun, Zhengxue Cheng, Masaru Takeuchi, and Jiro Katto. 2020. Enhanced intra prediction for video coding by using multiple neural networks. IEEE Transactions on Multimedia 22, 11 (2020), 2764–2779.
[37]
Radu Timofte and Eirikur Agustsson. 2017. NTIRE 2017 challenge on single image super-resolution: Methods and results. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 1110–1121.
[38]
Yang Wang, Xiaopeng Fan, Shaohui Liu, Debin Zhao, and Wen Gao. 2020. Multi-scale convolutional neural network-based intra prediction for video coding. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2020), 1803–1815.
[39]
Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra. 2003. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 7 (2003), 560–576.
[40]
Xiaoyu Xiu, Yuwen He, and Yan Ye. 2016. Decoder-side intra mode derivation for block-based video coding. In 2016 Picture Coding Symposium (PCS). 1–5.
[41]
Xiaozhong Xu, Robert Cohen, Anthony Vetro, and Huifang Sun. 2012. Predictive coding of intra prediction modes for high efficiency video coding. In 2012 Picture Coding Symposium. 457–460.
[42]
Xiaozhong Xu, Shan Liu, Tzu-Der Chuang, Yu-Wen Huang, Shaw-Min Lei, Krishnakanth Rapaka, Chao Pang, Vadim Seregin, Ye-Kui Wang, and Marta Karczewicz. 2016. Intra block copy in HEVC screen content coding extensions. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 4 (2016), 409–419.
[43]
Yan Ye, Jill M. Boyce, and Philippe Hanhart. 2020. Omnidirectional 360° video coding technology in responses to the joint call for proposals on video compression with capability beyond HEVC. IEEE Transactions on Circuits and Systems for Video Technology 30, 5 (2020), 1241–1252.
[44]
Yong-Uk Yoon, Do-Hyeon Park, Jae-Gon Kim, Jinho Lee, and Jung-Won Kang. 2019. Most frequent mode for intra-mode coding in video coding. Electronics Letters 55, 4 (2019), 188–190.
[45]
Kai Zhang, Jianle Chen, Li Zhang, Xiang Li, and Marta Karczewicz. 2018. Enhanced cross-component linear model for chroma intra-prediction in video coding. IEEE Transactions on Image Processing 27, 8 (2018), 3983–3997.
[46]
Li Zhang, Kai Zhang, Hongbin Liu, Hsiao Chiang Chuang, Yue Wang, Jizheng Xu, Pengwei Zhao, and Dingkun Hong. 2019. History-based motion vector prediction in versatile video coding. In 2019 Data Compression Conference (DCC). 43–52.
[47]
Tao Zhang, Xiaopeng Fan, Debin Zhao, Ruiqin Xiong, and Wen Gao. 2018. Hybrid intraprediction based on local and nonlocal correlations. IEEE Transactions on Multimedia 20, 7 (2018), 1622–1635.
[48]
Yun Zhang, Sam Kwong, and Shiqi Wang. 2020. Machine learning based video coding optimizations: A survey. Information Sciences 506 (2020), 395–423.
[49]
Amin Zheng, Yuan Yuan, Jiantao Zhou, Yuanfang Guo, Haitao Yang, and Oscar C. Au. 2016. Adaptive block coding order for intra prediction in HEVC. IEEE Transactions on Circuits and Systems for Video Technology 26, 11 (2016), 2152–2158.
[50]
Linwei Zhu, Sam Kwong, Yun Zhang, Shiqi Wang, and Xu Wang. 2020. Generative adversarial network-based intra prediction for video coding. IEEE Transactions on Multimedia 22, 1 (2020), 45–58.
[51]
Linwei Zhu, Yun Zhang, Na Li, Jinyong Pi, and Xinju Wu. 2020. Sparse representation-based intra prediction for lossless/near lossless video coding. In 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). 164–167.
[52]
Linwei Zhu, Yun Zhang, Shiqi Wang, Sam Kwong, Xin Jin, and Yu Qiao. 2021. Deep learning-based chroma prediction for intra versatile video coding. IEEE Transactions on Circuits and Systems for Video Technology 31, 8 (2021), 3168–3181.



    Author Tags

    1. Versatile video coding
    2. intra mode derivation
    3. most probable mode
    4. deep learning
    5. multi-class classification

    Funding Sources

    • National Natural Science Foundation of China
    • Shenzhen Science and Technology Program
    • Guangdong Basic and Applied Basic Research Foundation
    • Membership of Youth Innovation Promotion Association, Chinese Academy of Sciences (CAS)
    • China Postdoctoral Science Foundation
    • CAS President’s International Fellowship Initiative (PIFI)
    • Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA)
    • Hong Kong GRF-RGC General Research Fund
