WO2023197219A1 - Cnn-based post-processing filter for video compression with multi-scale feature representation - Google Patents
- Publication number
- WO2023197219A1 (PCT/CN2022/086686; CN2022086686W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subnetwork
- wavelet
- convolution
- convolutional layer
- input image
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/10—Image enhancement or restoration using non-spatial domain filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- the present disclosure relates to video compression schemes that can improve reconstruction performance. More specifically, the present disclosure is directed to systems and methods for providing a convolutional neural network filter used for post-processing video compression based on wavelet decomposition.
- Common image and video compression methods include those using the Joint Photographic Experts Group (JPEG) standard (e.g., for still images) as well as the High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards (e.g., for videos).
- quantization and prediction processes are performed during the coding processes, resulting in irreversible information loss and various compression artifacts in compressed images/videos, such as blocking, blurring, and banding. This drawback is especially obvious when using a high compression ratio.
- the present disclosure is related to systems and methods for improving image qualities of videos using a neural network for video compression (e.g., video processing) based on wavelet decomposition. Though the following systems and methods are described in relation to video processing, in some embodiments, the systems and methods may be used for other image processing systems and methods.
- the convolutional neural network (CNN) framework can be trained by deep learning and/or artificial intelligence schemes.
- the present disclosure provides a CNN framework for improving image qualities of an input image based on two subnetworks: a step-like subband network and a mixed enhancement network.
- the step-like subband network includes a Res2Net Group (R2NG) composed of Res2Net modules to represent multiscale features.
- the mixed enhancement network uses dilated convolution and standard convolution for an expanded receptive field without blind spots, unlike the use of dilated convolution alone.
- the CNN framework, by using the two subnetworks, has improved reconstruction performance on images compared to common reconstruction systems and methods.
- a method employs the CNN framework for image compression.
- the method receives an input image, which may be a video frame in some embodiments.
- the method decomposes the input image into a set of wavelet subbands using discrete wavelet transform.
- the method inputs the set of wavelet subbands to the CNN framework.
- the CNN framework comprises two subnetworks.
- the first subnetwork (or the step-like subband network) is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands.
- the second subnetwork (or the mixed enhancement network) is configured to expand a size of a receptive field of a signal of the restored set of wavelet subbands using mixed convolution.
- the method receives an enhanced version of the input image from the CNN framework.
- the present method can be implemented by a tangible, non-transitory, computer-readable medium having processor instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform one or more aspects/features of the method described herein.
- the present method can be implemented by a system comprising a computer processor and a non-transitory computer-readable storage medium storing instructions that when executed by the computer processor cause the computer processor to perform one or more actions of the method described herein.
- the present methods can be implemented in various data processing flows, such as in an in-loop filtering module of a video or image codec.
- the present methods can also be implemented in a data processing flow, such as in a post-filtering module of a video or image codec.
- the present methods can be implemented by a system having an encoder and a decoder.
- the encoder can encode an input video/picture based on a set of rules (e.g., a video codec) and then transmit the encoded video/picture.
- the decoder can decode the encoded video based on the set of rules and then generate decoded video as an output video.
- examples of the post filter include a CNN-based post-processing filter discussed herein.
- the post filter can be connected to the output of the decoder, and uses the output video as the post filter’s input to further process/filter the output video.
- the CNN-based filter discussed herein can be an "in-loop filter" at both the encoder and the decoder.
- the filter can be within an in-loop filtering module configured to enhance the quality of reconstructed pictures or a region of a picture (e.g., coding unit, coding tree unit, sub-picture, etc.).
- reconstructed pictures at the encoder or the decoder can serve as an input for the in-loop filtering module.
- the output of the in-loop filtering module can be connected to a buffer (e.g., a decoded picture buffer) .
- Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework in accordance with one or more implementations of the present disclosure.
- Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure.
- Fig. 2B is a schematic diagram illustrating a Res2Net module of the CNN framework in accordance with one or more implementations of the present disclosure.
- Fig. 3 is a schematic diagram of a mixed convolution group of the CNN framework in accordance with one or more implementations of the present disclosure.
- Fig. 4 is a schematic diagram of a wireless communication system in accordance with one or more implementations of the present disclosure.
- Fig. 5 is a schematic block diagram of a terminal device in accordance with one or more implementations of the present disclosure.
- Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure.
- Fig. 7 is a flowchart of a video encoder in accordance with one or more implementations of the present disclosure.
- Fig. 8 is a flowchart of a video decoder in accordance with one or more implementations of the present disclosure.
- Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework 100 in accordance with one or more implementations of the present disclosure.
- the CNN framework 100 is configured to learn, train, and/or use residual information to improve image qualities of an input image 101.
- the CNN framework 100 uses two types of subnetworks, which are a step-like subband network 102 and a mixed enhancement network 103, to reconstruct the input image.
- the CNN framework 100 may include one or more additional filters or convolutional filters not shown in Fig. 1.
- the CNN framework 100 receives an input image 101.
- the input image 101 may be a singular image that has been compressed before transmission or may be one of a set of video frames from a video.
- the CNN framework may receive the input image 101 from a computing device connected to the CNN framework or a system that employs the CNN framework via a network or direct connection.
- the signal of the input image 101 can be represented by wavelet coefficients after the application of discrete wavelet transform (DWT) 106.
- the wavelet coefficients reflect similarity between wavelet basis and image signal and form wavelet subbands.
- the CNN framework 100 uses discrete wavelet transform (DWT) 106 to decompose the input image 101 into four wavelet subbands: a low frequency feature (LL) subband 107A, a vertical feature (HL) subband 107B, a horizontal feature (LH) subband 107C, and a diagonal feature (HH) subband 107D. Most content from the input image 101 is concentrated in the LL subband 107A.
- the HL subband 107B, LH subband 107C, and HH subband 107D are high frequency components of the input image 101 and contain information on areas of the input image 101 with sharp changes in gray value (e.g., edges and texture in the input image 101) .
- the HH subband 107D includes the least amount of information from the input image 101.
- Each wavelet subband 107 represents a different direction, such that the relationships between the wavelet subbands 107 include feature location from the input image 101.
- the CNN framework 100 processes the wavelet subbands 107 in the order of: HH, HL, LH, and LL or HH, LH, HL, and LL.
- HH is processed first because it contains the least information of the wavelet subbands 107 and is easily lost during the compression process.
- LL is processed last because it contains most information and loses less information during the compression process.
- the CNN framework 100 inputs each wavelet subband 107 through 1x1 convolutional layers 113 and step-by-step to the step-like subband network 102.
- the step-like subband network 102 processes the high frequency subbands 107B-D before the low frequency subband 107A and uses the restored high frequency subbands 107B-D to aid in recovery of the low frequency subband 107A.
- the step-like subband network 102 processes the wavelet subbands 107 one-by-one from highest frequency to lowest frequency and uses the already-processed wavelet subbands 107 to aid in recovery of the wavelet subbands subsequently processed.
- Each wavelet subband 107 corresponds to a Res2NetGroup 105 in the step-like subband network 102.
- the Res2NetGroups 105 are further described in relation to Fig. 2A-B.
- the step-like subband network may be represented by Equations 1-4, where LL, HL, LH, HH represent the input wavelet subbands; LL', HL', LH', HH' represent the output of the step-like subband network for each wavelet subband 107, respectively; and R_LL, R_HL, R_LH, R_HH represent the Res2NetGroup 105 corresponding to each wavelet subband 107, respectively.
- the CNN framework puts each wavelet subband output from the Res2NetGroups 105 through 1x1 convolutional layers 113 and performs inverse discrete wavelet transform (IDWT) 108 on the wavelet subbands 107 to reconstruct the signal from the input image 101.
- the CNN framework 100 inputs the reconstructed signal to the mixed enhancement network 103 to expand the size of the receptive field of the reconstructed signal.
- the mixed enhancement network 103 uses a combination of dilated convolution and standard convolution in each mixed convolution group 109 and further includes two 3x3 convolutional layers 111, one at the beginning of the mixed enhancement network 103 and one at the end of the mixed enhancement network 103 (e.g., after the mixed convolution groups 109).
- the structure of the mixed convolution groups 109 is further described in relation to Fig. 3.
- the mixed enhancement network 103 outputs a signal representing an enhanced version of the input image 101 (e.g., an enhanced image 104) .
- Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure.
- the Res2NetGroup 105 includes a 3x3 convolutional layer 203A at its beginning and a 3x3 convolutional layer 203B at its end.
- the Res2NetGroup may include additional 3x3 convolutional layers 203 at its residual connections.
- the Res2NetGroup 105 also includes a set of Res2Net modules 201A-B.
- the Res2NetGroup 105 may include three Res2Net modules 201.
- any number of Res2Net modules 201 may be included in each Res2NetGroup 105.
- Each Res2Net module has stronger multi-feature extraction ability compared to a traditional bottleneck block, with a similar computational load. Using Res2Net modules 201 in the Res2NetGroup improves the CNN’s multi-scale representation capabilities.
- Fig. 2B is a schematic diagram illustrating a Res2Net module 201.
- the Res2Net module 201 splits its input into segments and processes them in a multi-scale manner to extract both global and local information from the wavelet subbands 107. Each segment 209 in the Res2Net module 201 is connected to the other segments and passed through a 3x3 convolutional layer 203 to fuse information from the input image 101 on different scales.
- the Res2Net module 201 includes a 1x1 convolutional layer 205A at its beginning and a 1x1 convolutional layer 205B after the 3x3 convolutional layers 203 and before a spatial attention and channel attention module 207, which is embedded to adaptively enhance the channel and spatial feature responses of the Res2Net module 201.
- Fig. 3 is a schematic diagram of a mixed convolution group 109 of the mixed enhancement network 103 in accordance with one or more implementations of the present disclosure.
- the mixed enhancement network 103 expands the size of the receptive field of the reconstructed signal 301 from the input image 101 using a set of mixed convolution groups 109.
- the mixed enhancement network 103 includes three mixed convolution groups 109 and two 3x3 convolutional layers 111.
- the mixed enhancement network 103 includes any number of mixed convolution groups 109.
- Each mixed convolution group 109 uses a combination of dilated convolution and standard convolution. Dilated convolution expands the receptive field without increasing parameters of the reconstructed signal or reducing resolution of the reconstructed signal.
- a mixed convolution group 109 receives an input 301.
- the mixed convolutional layer of the mixed enhancement network 103 includes N channels (e.g., 64 channels), where p×N channels are produced by dilated convolution and the remaining channels are produced by standard convolution (p is a convolution coefficient).
- the input 301 is the reconstructed signal that underwent IDWT 108.
- the next mixed convolution group 109B receives the output 303 of the first mixed convolution group 109A as input 301.
- the reconstructed signal is input to and output from the sequence of mixed convolution groups 109 in the mixed enhancement network 103 in a similar fashion.
- the output of the last mixed convolution group 109C is a signal representing the enhanced image 104.
- the mixed convolution group 109 of Fig. 3 includes a densely connected set of mixed convolution blocks 305 between two 3x3 convolutional layers 111.
- the set of mixed convolution blocks 305 has sequential dilated coefficients of 1, 2, 4, and 8, respectively.
- One set of comparative tests demonstrates that the CNN framework recovers objects more effectively than VTM 11.0-NNVC and retains more perceptual texture details. A number of perceptual improvements were observed in regions of a test video, including the wall in the background, the clothes of the man, and the man's outline. Another set of comparative tests shows that the CNN framework removes blocking artifacts. For example, it was observed that the CNN framework retained visible details of the necklace in a test video more effectively than VTM 11.0-NNVC.
- Tables 1-3 below show quantitative measurements of the use of the CNN framework 100, compared to VTM 11.0-NNVC, as a post-processing filter on input images 101.
- Y-PSNR represents the peak signal-to-noise ratio of the Y channel of a processed image.
- Y-MSIM represents the multi-scale structural similarity of the Y channel of a processed image.
- negative BD-rate values represent coding gains from the use of the CNN framework 100.
- Table 1 shows the quantitative measurements of BD-rate of using the CNN framework in a Random Access (RA) configuration, compared to VTM 11.0-NNVC.
- Table 2 shows the quantitative measurements of using the CNN framework in an All Intra (AI) configuration, compared to VTM 11.0-NNVC.
- Table 3 shows the quantitative measurements of using the CNN framework in a Low Delay P (LDP) configuration, compared to VTM 11.0-NNVC.
- the CNN framework 100 achieves average BD-rate reductions of 2.99%, 4.8%, 3.72%, and 4.5% over VTM 11.0-NNVC for the Y channel on the B, C, D, and E classes in the AI, RA, and LDP configurations, respectively.
- the CNN framework 100 produces improved performance over VTM 11.0-NNVC for compression.
- Fig. 4 is a schematic diagram of a wireless communication system 400 in accordance with one or more implementations of the present disclosure.
- the wireless communication system 400 can implement the CNN framework 100 discussed herein.
- the wireless communications system 400 can include a network device (or base station) 401.
- the network device 401 can include a base transceiver station (BTS), a NodeB (NB), an evolved NodeB (eNB or eNodeB), a Next Generation NodeB (gNB or gNodeB), a Wireless Fidelity (Wi-Fi) access point (AP), etc.
- the network device 401 can include a relay station, an access point, an in-vehicle device, a wearable device, and the like.
- the network device 401 can include wireless connection devices for communication networks such as: a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Wideband CDMA (WCDMA) network, an LTE network, a cloud radio access network (CRAN), an Institute of Electrical and Electronics Engineers (IEEE) 802.11-based network (e.g., a Wi-Fi network), an Internet of Things (IoT) network, a device-to-device (D2D) network, a next-generation network (e.g., a 5G network), a future evolved public land mobile network (PLMN), or the like.
- a 5G system or network can be referred to as a new radio (NR) system or network.
- the wireless communications system 400 also includes a terminal device 403.
- the terminal device 403 can be an end-user device configured to facilitate wireless communication.
- the terminal device 403 can be configured to wirelessly connect to the network device 401 (e.g., via a wireless channel 405) according to one or more corresponding communication protocols/standards.
- the terminal device 403 may be mobile or fixed.
- the terminal device 403 can be a user equipment (UE) , an access terminal, a user unit, a user station, a mobile site, a mobile station, a remote station, a remote terminal, a mobile device, a user terminal, a terminal, a wireless communications device, a user agent, or a user apparatus.
- Examples of the terminal device 403 include a modem, a cellular phone, a smartphone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) , a handheld device having a wireless communication function, a computing device or another processing device connected to a wireless modem, an in-vehicle device, a wearable device, an Internet-of-Things (IoT) device, a device used in a 5G network, a device used in a public land mobile network, or the like.
- Fig. 4 illustrates only one network device 401 and one terminal device 403 in the wireless communications system 400. However, in some instances, the wireless communications system 400 can include additional network devices 401 and/or terminal devices 403.
- Fig. 5 is a schematic block diagram of a terminal device 403 (e.g., which can implement the methods discussed herein) in accordance with one or more implementations of the present disclosure.
- the terminal device 403 includes a processing unit 510 (e.g., a DSP, a CPU, a GPU, etc. ) and a memory 520.
- the processing unit 510 can be configured to implement instructions that correspond to the method 800 of Fig. 6 and/or other aspects of the implementations described above.
- the processor 510 in the implementations of this technology may be an integrated circuit chip and has a signal processing capability.
- the steps in the foregoing method may be implemented by using an integrated logic circuit of hardware in the processor 510 or an instruction in the form of software.
- the processor 510 may be a general-purpose processor, a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, and a discrete hardware component.
- the processor 510 may implement or perform the methods, steps, and logical block diagrams disclosed in the implementations of this technology.
- the general-purpose processor 510 may be a microprocessor, or the processor 510 may be alternatively any conventional processor or the like.
- the steps in the methods disclosed with reference to the implementations of this technology may be directly performed or completed by a decoding processor implemented as hardware or performed or completed by using a combination of hardware and software modules in a decoding processor.
- the software module may be located at a random-access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, or another mature storage medium in this field.
- the storage medium is located at a memory 520, and the processor 510 reads information in the memory 520 and completes the steps in the foregoing methods in combination with the hardware thereof.
- the memory 520 in the implementations of this technology may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory.
- the non-volatile memory may be a read-only memory (ROM) , a programmable read-only memory (PROM) , an erasable programmable read-only memory (EPROM) , an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
- the volatile memory may be a random-access memory (RAM) and is used as an external cache.
- many forms of RAM can be used, for example, a static random-access memory (SRAM), a dynamic random-access memory (DRAM), a synchronous dynamic random-access memory (SDRAM), a double data rate synchronous dynamic random-access memory (DDR SDRAM), an enhanced synchronous dynamic random-access memory (ESDRAM), a synchronous link dynamic random-access memory (SLDRAM), and a direct Rambus random-access memory (DR RAM).
- the memories in the systems and methods described herein are intended to include, but are not limited to, these memories and memories of any other suitable type.
- the memory may be a non-transitory computer-readable storage medium.
- Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure.
- the method 800 can be implemented by a system (such as a system with the CNN framework 100) .
- the method 800 is for enhancing image qualities (particularly, for compressed images) .
- the method 800 includes, at block 601, receiving an input image 101.
- the method 800 continues by decomposing the input image 101 into a set of wavelet subbands 107.
- the decomposition of the input image is accomplished using DWT 106.
- the method 800 proceeds to input the wavelet subbands 107 to the CNN framework 100.
- the CNN framework 100 comprises two subnetworks.
- a first subnetwork (e.g., the step-like subband network 102 of Figs. 2A and 2B) is configured to restore the set of wavelet subbands 107 from high frequency to low frequency and uses the restored high frequency wavelet subbands 107B-D to restore the low frequency subband 107A.
- a second subnetwork (e.g., the mixed enhancement network 103 of Fig. 3) is configured to expand the size of the receptive field of the restored set of wavelet subbands using mixed convolution.
- the CNN framework 100 may apply IDWT 108 to the restored set of wavelet subbands between the first subnetwork and the second subnetwork to create a reconstructed signal.
- the method 800 receives an enhanced version of the input image 101 from the CNN framework 100.
- the first subnetwork may comprise one or more Res2NetGroups 105, where each Res2NetGroup 105 includes one or more Res2Net modules 201.
- each Res2NetGroup 105 may comprise a first 3x3 convolutional layer 203A at a beginning of the Res2NetGroup 105 and a second 3x3 convolutional layer 203B at an ending of the Res2NetGroup 105.
- each Res2Net module 201 may comprise one or more 3x3 convolutional layers 203 between a first 1x1 convolutional layer 205A and a second 1x1 convolutional layer 205B.
- each Res2Net module 201 may comprise a spatial attention and channel attention module 207 implemented after the second 1x1 convolutional layer 205B.
- the mixed convolution may comprise a combination of dilated convolution and standard convolution.
- the second subnetwork may comprise one or more mixed convolution groups between a first and second 3x3 convolutional layer and each mixed convolution group may comprise two or more densely connected mixed convolution blocks.
- each mixed convolution group may comprise four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
- Fig. 7 is a schematic block diagram of a video encoder.
- An input video contains one or more pictures.
- Partition unit 701 divides a picture in an input video into one or more coding tree units (CTUs) .
- Partition unit 701 divides the picture into tiles, and optionally may further divide a tile into one or more bricks, wherein a tile or a brick contains one or more integral and/or partial CTUs.
- Partition unit 701 forms one or more slices, wherein a slice may contain one or more tiles in a raster order of tiles in the picture, or one or more tiles covering a rectangular region in the picture.
- Partition unit 701 may also form one or more sub-pictures, wherein a sub-picture contains one or more slices, tiles or bricks.
- partition unit 701 passes CTUs to prediction unit 702.
- prediction unit 702 is composed of block partition unit 703, ME (motion estimation) unit 704, MC (motion compensation) unit 705 and intra prediction unit 706.
- Block partition unit 703 further divides an input CTU into smaller coding units (CUs) using quadtree split, binary split and ternary split iteratively.
- Prediction unit 702 may derive inter prediction block of a CU using ME unit 704 and MC unit 705.
- Intra prediction unit 706 may derive an intra prediction block of a CU using various intra prediction modes, including angular prediction modes, DC mode, planar mode, matrix-based intra prediction mode, etc.
- a rate-distortion optimized motion estimation method can be invoked by ME unit 704 and MC unit 705 to derive the inter prediction block.
- a rate-distortion optimized mode decision method can be invoked by intra prediction unit 706 to get the intra prediction block.
- Prediction unit 702 outputs a prediction block of a CU.
- Adder 707 calculates a difference, i.e. residual CU, between the CU in the output of partition unit 701 and the prediction block of the CU.
- Transform unit 708 reads the residual CU, and performs one or more transform operations on the residual CU to get coefficients.
- Quantization unit 709 quantizes the coefficients and outputs quantized coefficients, i.e. levels.
- Inverse quantization unit 710 performs scaling operations on the quantized coefficients to output reconstructed coefficients.
- Inverse transform unit 711 performs one or more inverse transforms corresponding to the transforms in transform unit 708 and outputs a reconstructed residual.
- Adder 712 calculates reconstructed CU by adding the reconstructed residual and the prediction block of the CU from prediction unit 702. Adder 712 also forwards its output to prediction unit 702 to be used as intra prediction reference. After all the CUs in a picture or a sub-picture have been reconstructed, filtering unit 712 performs in-loop filtering on the reconstructed picture or sub-picture.
- Filtering unit 712 contains one or more filters, for example, deblocking filter, sample adaptive offset (SAO) filter, adaptive loop filter (ALF) , luma mapping with chroma scaling (LMCS) filter and neural network based filters.
- Filtering unit 712 also contains the CNN-based filter discussed herein.
- the filters in the filtering unit 712 may be connected with each other in a cascading order.
- when filtering unit 712 determines that the CU is not used as a reference for encoding other CUs, filtering unit 712 performs in-loop filtering on one or more target pixels in the CU.
- Output of filtering unit 712 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 713.
- DPB 713 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 713 may also be employed as reference for performing inter or intra prediction by prediction unit 702.
- the decoded picture can be further processed using the CNN-based filter discussed herein to enhance a quality of the decoded picture.
- Entropy coding unit 715 converts parameters from units in encoder 700 that are necessary for deriving decoded picture as well as control parameters and supplemental information into binary representations, and writes such binary representations according to syntax structure of each data unit into a generated video bitstream.
- Encoder 700 could be a computing device with a processor and a storage medium recording an encoding program. When the processor reads and executes the encoding program, the encoder 700 reads an input video and generates corresponding video bitstream.
- Encoder 700 could be a computing device with one or more chips.
- the units on the chip, implemented as integrated circuits, have functionalities, connections, and data exchanges similar to those of the corresponding units in Fig. 7.
- video encoder 700 can also be used to encode an image with only block partition unit 703 and intra prediction unit 706 enabled in prediction unit 702.
- Fig. 8 is a schematic block diagram of a video decoder.
- Input bitstream of a decoder 800 can be a bitstream generated by the encoder 700.
- Parsing unit 801 parses the input bitstream and obtains values of syntax elements from the input bitstream. Parsing unit 801 converts binary representations of syntax elements to numerical values and forwards the numerical values to the units in the decoder 800 to derive one or more decoded pictures. Parsing unit 801 may also parse one or more syntax elements from the input bitstream for displaying the decoded pictures.
- Parsing unit 801 forwards the values of syntax elements, as well as one or more variables set or determined according to the values of syntax elements, for deriving one or more decoded pictures to the units in the decoder 800.
- Prediction unit 802 determines a prediction block of a current decoding block (e.g., a CU). When it is indicated that an inter coding mode is used to decode the current decoding block, prediction unit 802 passes relative parameters from parsing unit 801 to MC unit 803 to derive an inter prediction block. When it is indicated that an intra prediction mode is used to decode the current decoding block, prediction unit 802 passes relative parameters from parsing unit 801 to intra prediction unit 804 to derive an intra prediction block.
- Scaling unit 805 has the same function as that of inverse quantization unit 710 in the encoder 700. Scaling unit 805 performs scaling operations on quantized coefficients (i.e. Levels) from parsing unit 801 to get reconstructed coefficients.
- Transform unit 806 has the same function as that of inverse transform unit 711 in the encoder 700. Transform unit 806 performs one or more transform operations (i.e., inverse operations of the one or more transform operations performed by transform unit 708 in the encoder 700) to get a reconstructed residual.
- Adder 807 performs an addition operation on its inputs, the prediction block from prediction unit 802 and the reconstructed residual from transform unit 806, to get a reconstructed block of the current decoding block.
- the reconstructed block is also sent to prediction unit 802 to be used as reference for other blocks coded in intra prediction mode.
- filtering unit 808 After all the CUs in a picture or a sub-picture have been reconstructed, filtering unit 808 performs in-loop filtering on the reconstructed picture or sub-picture.
- Filtering unit 808 contains one or more filters, for example, deblocking filter, sample adaptive offset (SAO) filter, adaptive loop filter (ALF) , luma mapping with chroma scaling (LMCS) filter and neural network based filters.
- Filtering unit 808 also contains the CNN-based filter discussed herein.
- the filters in the filtering unit 808 may be connected with each other in a cascading order.
- when filtering unit 808 determines that the reconstructed block is not used as a reference for decoding other blocks, filtering unit 808 performs in-loop filtering on one or more target pixels in the reconstructed block.
- Output of filtering unit 808 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 809.
- DPB 809 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 809 may also be employed as reference for performing inter or intra prediction by prediction unit 802.
- the decoded pictures outputted from DPB 809 by the decoder 800 can be further processed using the CNN-based filter discussed herein to enhance a quality of the decoded pictures.
- Decoder 800 could be a computing device with a processor and a storage medium recording a decoding program.
- the decoder 800 reads an input video bitstream and generates corresponding decoded video.
- Decoder 800 could be a computing device with one or more chips.
- the units on the chip, implemented as integrated circuits, have functionalities, connections, and data exchanges similar to those of the corresponding units in Fig. 8.
- video decoder 800 can also be used to decode a bitstream of an image.
- One example implementation enables only intra prediction unit 804 in prediction unit 802 of the decoder 800.
- Instructions for executing computer-or processor-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware, or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive and/or other suitable medium.
- A and/or B may indicate the following three cases: A exists alone, both A and B exist, or B exists alone.
Abstract
Methods and systems for video processing are provided. In some embodiments, the method includes (i) receiving an input image; (ii) decomposing the input image using discrete wavelet transform; (iii) inputting a set of wavelet subbands to a neural network framework, where the neural network framework comprises a first subnetwork and a second subnetwork; and (iv) receiving, from the neural network framework, an enhanced version of the input image. The first subnetwork is configured to restore the set of wavelet subbands from high frequency to low frequency, using the restored high frequency wavelet subbands to restore the low frequency wavelet subbands. The neural network framework may apply inverse discrete wavelet transform to the restored wavelet subbands to create a reconstructed signal. The second subnetwork is configured to expand a size of a receptive field of the reconstructed signal for the enhanced version of the input image.
Description
The present disclosure relates to video compression schemes that can improve reconstruction performance. More specifically, the present disclosure is directed to systems and methods for providing a convolutional neural network filter used for post-processing video compression based on wavelet decomposition.
Common image and video compression methods include those using the Joint Photographic Experts Group (JPEG) standard (e.g., for still images) as well as the High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards (e.g., for videos). In these methods, quantization and prediction processes are performed during the coding processes, resulting in irreversible information loss and various compression artifacts in compressed images/videos, such as blocking, blurring, and banding. This drawback is especially obvious when using a high compression ratio.
To address the foregoing drawback, multiple deep-learning-based methods are used. These methods include frameworks/networks based on simple concatenated layers, deep residual blocks, dense connections, cascading connections, and feature reuse. Most of these methods do not employ advanced features, such as residual dense blocks and informative associations between different frequencies. As a result, these methods cannot further improve their learning and feature selection abilities, and thus their compression artifact removal is very limited.
SUMMARY
The present disclosure is related to systems and methods for improving image qualities of videos using a neural network for video compression (e.g., video processing) based on wavelet decomposition. Though the following systems and methods are described in relation to video processing, in some embodiments, the systems and methods may be used for other image processing systems and methods. The convolutional neural network (CNN) framework can be trained by deep learning and/or artificial intelligence schemes. The present disclosure provides a CNN framework for improving image qualities of an input image based on two subnetworks: a step-like subband network and a mixed enhancement network. The step-like subband network includes a Res2Net Group (R2NG) composed of Res2Net modules to represent multiscale features. The mixed enhancement network uses dilated convolution and standard convolution for an expanded receptive field without blind spots, unlike the use of dilated convolution alone. The CNN framework, by using the two subnetworks, has improved reconstruction performance on images compared to common reconstruction systems and methods.
In some embodiments, a method employs the CNN framework for image compression. The method receives an input image, which may be a video frame in some embodiments. The method decomposes the input image into a set of wavelet subbands using discrete wavelet transform. The method inputs the set of wavelet subbands to the CNN framework. The CNN framework comprises two subnetworks. The first subnetwork (or the step-like subband network) is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands. The second subnetwork (or the mixed enhancement network) is configured to expand a size of a receptive field of a signal of the restored set of wavelet subbands using mixed convolution. The method receives an enhanced version of the input image from the CNN framework.
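For illustration only, this flow can be sketched in a few lines of Python. The Haar wavelet choice, the pywt-based transform, and the identity placeholders below are assumptions of the sketch rather than the architecture defined in the figures.

```python
# Minimal sketch of the claimed flow: DWT -> first subnetwork -> IDWT -> second subnetwork.
# The Haar wavelet and the identity placeholders are illustrative assumptions only.
import numpy as np
import pywt

def step_like_subband_net(subbands):
    """Placeholder for the first subnetwork (would restore HH, then LH/HL, then LL)."""
    return subbands  # a real network refines each subband here

def mixed_enhancement_net(image):
    """Placeholder for the second subnetwork (would expand the receptive field)."""
    return image

def enhance(decoded_frame):
    ll, (lh, hl, hh) = pywt.dwt2(decoded_frame, "haar")        # decompose (cf. DWT 106)
    ll, lh, hl, hh = step_like_subband_net((ll, lh, hl, hh))   # restore subbands
    reconstructed = pywt.idwt2((ll, (lh, hl, hh)), "haar")     # inverse DWT (cf. IDWT 108)
    return mixed_enhancement_net(reconstructed)                # receptive-field expansion

frame = np.random.rand(128, 128)   # stand-in for a decoded video frame
print(enhance(frame).shape)        # (128, 128)
```

In the full framework, the placeholders correspond to the step-like subband network 102 and the mixed enhancement network 103 described below.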
In some embodiments, the present method can be implemented by a tangible, non-transitory, computer-readable medium having processor instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform one or more aspects/features of the method described herein. In other embodiments, the present method can be implemented by a system comprising a computer processor and a non-transitory computer-readable storage medium storing instructions that when executed by the computer processor cause the computer processor to perform one or more actions of the method described herein.
In some embodiments, the present methods can be implemented in various data processing flows, such as in an in-loop filtering module of a video or image codec. The present methods can also be implemented in a data processing flow, such as in a post-filtering module of a video or image codec.
In some embodiments, the present methods can be implemented by a system having an encoder and a decoder. In such embodiments, the encoder can encode an input video/picture based on a set of rules (e.g., a video codec) and then transmit the encoded video/picture. After receiving the encoded video, the decoder can decode the encoded video based on the set of rules and then generate decoded video as an output video. In some embodiments, there can be some “post filter” or “post-filtering” process to further process the output video.
Examples of the post filter include a CNN-based post-processing filter discussed herein. For example, the post filter can be connected to the output of the decoder, and uses the output video as the post filter’s input to further process/filter the output video.
In some embodiments, the CNN-based filter discussed herein can be an "in-loop filter" at both the encoder and the decoder. For example, in such embodiments, the filter can be within an in-loop filtering module configured to enhance the quality of reconstructed pictures or a region of a picture (e.g., coding unit, coding tree unit, sub-picture, etc.). In such embodiments, reconstructed pictures at the encoder or the decoder can serve as an input for the in-loop filtering module. In some embodiments, the output of the in-loop filtering module can be connected to a buffer (e.g., a decoded picture buffer).
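As a rough illustration of this placement (not a normative part of the codec description), an in-loop filtering module can be viewed as a cascade of filters whose output feeds the decoded picture buffer; the filter names and ordering in the sketch below are assumptions.

```python
# Illustrative only: a CNN-based filter placed inside an in-loop filtering cascade whose
# output feeds the decoded picture buffer. Filter names and ordering are assumptions.
def apply_in_loop_filters(reconstructed_picture, filters, decoded_picture_buffer):
    """Apply each filter to the previous filter's output, then store the result."""
    picture = reconstructed_picture
    for f in filters:                        # e.g., [deblock, sao, alf, cnn_filter]
        picture = f(picture)
    decoded_picture_buffer.append(picture)   # later usable as a reference picture
    return picture

# Example with trivial stand-in filters.
identity = lambda p: p
dpb = []
filtered = apply_in_loop_filters([[0.0]], [identity, identity], dpb)
```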
To describe the technical solutions in the implementations of the present disclosure more clearly, the following briefly describes the accompanying drawings. The accompanying drawings show merely some aspects or implementations of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework in accordance with one or more implementations of the present disclosure.
Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure.
Fig. 2B is a schematic diagram illustrating a Res2Net module of the CNN framework in accordance with one or more implementations of the present disclosure.
Fig. 3 is a schematic diagram of a mixed convolution group of the CNN framework in accordance with one or more implementations of the present disclosure.
Fig. 4 is a schematic diagram of a wireless communication system in accordance with one or more implementations of the present disclosure.
Fig. 5 is a schematic block diagram of a terminal device in accordance with one or more implementations of the present disclosure.
Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure.
Fig. 7 is a flowchart of a video encoder in accordance with one or more implementations of the present disclosure.
Fig. 8 is a flowchart of a video decoder in accordance with one or more implementations of the present disclosure.
Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework 100 in accordance with one or more implementations of the present disclosure. The CNN framework 100 is configured to learn, train, and/or use residual information to improve image qualities of an input image 101. As discussed in detail below, the CNN framework 100 uses two types of subnetworks, which are a step-like subband network 102 and a mixed enhancement network 103, to reconstruct the input image. In some embodiments, the CNN framework 100 may include one or more additional filters or convolutional filters not shown in Fig. 1.
The CNN framework 100 receives an input image 101. The input image 101 may be a singular image that has been compressed before transmission or may be one of a set of video frames from a video. The CNN framework may receive the input image 101 from a computing device connected to the CNN framework or a system that employs the CNN framework via a network or direct connection. The signal of the input image 101 can be represented by wavelet coefficients after the application of discrete wavelet transform (DWT) 106. The wavelet coefficients reflect similarity between the wavelet basis and the image signal and form wavelet subbands. The CNN framework 100 uses discrete wavelet transform (DWT) 106 to decompose the input image 101 into four wavelet subbands: a low frequency feature (LL) subband 107A, a vertical feature (HL) subband 107B, a horizontal feature (LH) subband 107C, and a diagonal feature (HH) subband 107D. Most content from the input image 101 is concentrated in the LL subband 107A. The HL subband 107B, LH subband 107C, and HH subband 107D are high frequency components of the input image 101 and contain information on areas of the input image 101 with sharp changes in gray value (e.g., edges and texture in the input image 101). The HH subband 107D includes the least amount of information from the input image 101. Each wavelet subband 107 represents a different direction, such that the relationships between the wavelet subbands 107 include feature location from the input image 101. The CNN framework 100 processes the wavelet subbands 107 in the order of: HH, HL, LH, and LL or HH, LH, HL, and LL. HH is processed first because it contains the least information of the wavelet subbands 107 and is easily lost during the compression process. LL is processed last because it contains most information and loses less information during the compression process.
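The concentration of content in the LL subband can be checked with a short, self-contained experiment; the synthetic frame and the Haar wavelet below are assumptions used only for illustration (pywt returns coefficients as (cA, (cH, cV, cD)), which are taken here as LL, LH, HL, and HH, respectively).

```python
# A small, self-contained check (not from the patent) of the claim that most image content
# concentrates in the LL subband after one level of DWT. The synthetic frame and the Haar
# wavelet are assumptions for illustration.
import numpy as np
import pywt

rng = np.random.default_rng(0)
y, x = np.mgrid[0:256, 0:256]
frame = 0.5 * x + 0.3 * y + rng.normal(scale=2.0, size=(256, 256))  # smooth content + noise

# pywt returns (cA, (cH, cV, cD)); they are taken here as LL, LH, HL, HH respectively.
ll, (lh, hl, hh) = pywt.dwt2(frame, "haar")
total = sum(float(np.sum(s ** 2)) for s in (ll, hl, lh, hh))
for name, s in (("LL", ll), ("HL", hl), ("LH", lh), ("HH", hh)):
    print(f"{name}: {float(np.sum(s ** 2)) / total:.4%} of signal energy")
```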
The CNN framework 100 inputs each wavelet subband 107 through 1x1 convolutional layers 113 and step-by-step to the step-like subband network 102. The step-like subband network 102 processes the high frequency subbands 107B-D before the low frequency subband 107A and uses the restored high frequency subbands 107B-D to aid in recovery of the low frequency subband 107A. In some embodiments, the step-like subband network 102 processes the wavelet subbands 107 one-by-one from highest frequency to lowest frequency and uses the already-processed wavelet subbands 107 to aid in recovery of the wavelet subbands subsequently processed. Each wavelet subband 107 corresponds to a Res2NetGroup 105 in the step-like subband network 102. The Res2NetGroups 105 are further described in relation to Fig. 2A-B.
The step-like subband network may be represented by Equations 1-4, where LL, HL, LH, and HH represent the input wavelet subbands; LL′, HL′, LH′, and HH′ represent the output of the step-like subband network for each wavelet subband 107, respectively; and R_LL, R_HL, R_LH, and R_HH represent the Res2NetGroup 105 corresponding to each wavelet subband 107, respectively.

HH′ = R_HH (HH)    (Equation 1)

LH′ = R_LH (LH + HH′)    (Equation 2)

HL′ = R_HL (HL + LH′)    (Equation 3)

LL′ = R_LL (LL + HL′)    (Equation 4)
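A compact PyTorch-style sketch of Equations 1-4 follows; the SimpleGroup used here is only a stand-in for the Res2NetGroup 105 of Figs. 2A-2B, and the channel count is an assumed value.

```python
# Sketch of Equations 1-4: each subband is restored by its own group, and each restored
# high-frequency subband is added to the next (lower-frequency) subband's input.
# SimpleGroup is a stand-in for the Res2NetGroup of Figs. 2A-2B.
import torch
import torch.nn as nn

class SimpleGroup(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # residual refinement

class StepLikeSubbandNetwork(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.r_hh = SimpleGroup(channels)
        self.r_lh = SimpleGroup(channels)
        self.r_hl = SimpleGroup(channels)
        self.r_ll = SimpleGroup(channels)

    def forward(self, ll, hl, lh, hh):
        hh_out = self.r_hh(hh)             # Equation 1
        lh_out = self.r_lh(lh + hh_out)    # Equation 2
        hl_out = self.r_hl(hl + lh_out)    # Equation 3
        ll_out = self.r_ll(ll + hl_out)    # Equation 4
        return ll_out, hl_out, lh_out, hh_out

# Example: four 64-channel subband feature maps (e.g., after the 1x1 convolutional layers 113).
subbands = [torch.randn(1, 64, 32, 32) for _ in range(4)]
net = StepLikeSubbandNetwork()
ll_out, hl_out, lh_out, hh_out = net(*subbands)
```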
The CNN framework puts each wavelet subband output from the Res2NetGroups 105 through 1x1 convolutional layers 113 and performs inverse discrete wavelet transform (IDWT) 108 on the wavelet subbands 107 to reconstruct the signal from the input image 101. The CNN framework 100 inputs the reconstructed signal to the mixed enhancement network 103 to expand the size of the receptive field of the reconstructed signal. The mixed enhancement network 103 uses a combination of dilated convolution and standard convolution in each mixed convolution group 109 and further includes two 3x3 convolutional layers 111, one at the beginning of the mixed enhancement network 103 and one at the end of the mixed enhancement network 103 (e.g., after the mixed convolution groups 109). The structure of the mixed convolution groups 109 is further described in relation to Fig. 3. The mixed enhancement network 103 outputs a signal representing an enhanced version of the input image 101 (e.g., an enhanced image 104).
Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure. As shown in Fig. 2A, the Res2NetGroup 105 includes a 3x3 convolutional layer 203A at its beginning and a 3x3 convolutional layer 203B at its end. In some embodiments, the Res2NetGroup may include additional 3x3 convolutional layers 203 at its residual connections. The Res2NetGroup 105 also includes a set of Res2Net modules 201A-B. In some embodiments, the Res2NetGroup 105 may include three Res2Net modules 201. In other embodiments, any number of Res2Net modules 201 may be included in each Res2NetGroup 105. Each Res2Net module has stronger multi-feature extraction ability compared to a traditional bottleneck block, with a similar computational load. Using Res2Net modules 201 in the Res2NetGroup improves the CNN's multi-scale representation capabilities.
Fig. 2B is a schematic diagram illustrating a Res2Net module 201. The Res2Net module 201 splits its input into segments and processes them in a multi-scale manner to extract both global and local information from the wavelet subbands 107. Each segment 209 in the Res2Net module 201 is connected to the other segments and passed through a 3x3 convolutional layer 203 to fuse information from the input image 101 on different scales. The Res2Net module 201 includes a 1x1 convolutional layer 205A at its beginning and a 1x1 convolutional layer 205B after the 3x3 convolutional layers 203 and before a spatial attention and channel attention module 207, which is embedded to adaptively enhance the channel and spatial feature responses of the Res2Net module 201.
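For readers unfamiliar with Res2Net-style processing, the following simplified sketch illustrates the idea of scale-wise segments, hierarchical 3x3 convolutions, and an attention stage; the number of scales, the channel counts, and the attention design are assumptions and not the exact module 201.

```python
# Simplified Res2Net-style module: a 1x1 conv, channel-wise split into scale segments,
# hierarchical 3x3 convs with cross-segment additions, concatenation, a 1x1 conv, and a
# lightweight channel/spatial attention stage. Details are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleRes2NetModule(nn.Module):
    def __init__(self, channels=64, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.conv_in = nn.Conv2d(channels, channels, 1)
        # one 3x3 conv per segment except the first (which is passed through unchanged)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1)
        )
        self.conv_out = nn.Conv2d(channels, channels, 1)
        # channel attention (squeeze-and-excitation style) as a stand-in for module 207
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # spatial attention over the channel-pooled map
        self.spatial_att = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv_in(x))
        segments = torch.chunk(y, self.scales, dim=1)           # split into scale segments
        outs = [segments[0]]
        prev = segments[0]
        for conv, seg in zip(self.convs, segments[1:]):
            prev = self.relu(conv(seg + prev))                  # fuse with the previous scale
            outs.append(prev)
        y = self.conv_out(torch.cat(outs, dim=1))
        y = y * self.channel_att(y)                             # channel attention
        y = y * self.spatial_att(y.mean(dim=1, keepdim=True))   # spatial attention
        return x + y                                            # residual connection

m = SimpleRes2NetModule()
out = m(torch.randn(1, 64, 32, 32))
```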
Fig. 3 is a schematic diagram of a mixed convolution group 109 of the mixed enhancement network 103 in accordance with one or more implementations of the present disclosure. The mixed enhancement network 103 expands the size of the receptive field of the reconstructed signal 301 from the input image 101 using a set of mixed convolution groups 109. In some embodiments, the mixed enhancement network 103 includes three mixed convolution groups 109 and two 3x3 convolutional layers 111. In other embodiments, the mixed enhancement network 103 includes any number of mixed convolution groups 109. Each mixed convolution group 109 uses a combination of dilated convolution and standard convolution. Dilated convolution expands the receptive field without increasing parameters of the reconstructed signal or reducing resolution of the reconstructed signal. Due to the predefined gap of dilated convolution, using only dilated convolution causes blind spots to appear in the receptive field due to the lack of contextual information between pixels of the input image 101, so the use of dilated convolution and standard convolution together results in an improved output 303.
As shown in Fig. 3, a mixed convolution group 109 receives an input 301. The mixed convolutional layer of the mixed enhancement network 103 includes N channels (e.g., 64 channels), where p×N channels are produced by dilated convolution and the remaining channels are produced by standard convolution (p is a convolution coefficient). For the first mixed convolution group 109A in the mixed enhancement network 103, the input 301 is the reconstructed signal that underwent IDWT 108. The next mixed convolution group 109B receives the output 303 of the first mixed convolution group 109A as input 301. The reconstructed signal propagates through the sequence of mixed convolution groups 109 of the mixed enhancement network 103 in a similar fashion. The output of the last mixed convolution group 109C is a signal representing the enhanced image 104.
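The channel split described above can be sketched in PyTorch as follows. N = 64 and p = 0.5 are example values (the disclosure leaves p as a convolution coefficient), and the kernel sizes are assumptions for this illustration.

```python
import torch
import torch.nn as nn

class MixedConv(nn.Module):
    """Sketch of a mixed convolutional layer: of the N output channels,
    round(p*N) are produced by dilated 3x3 convolution and the remainder
    by standard 3x3 convolution."""
    def __init__(self, channels=64, p=0.5, dilation=2):
        super().__init__()
        dilated_out = round(p * channels)
        self.dilated = nn.Conv2d(channels, dilated_out, kernel_size=3,
                                 padding=dilation, dilation=dilation)
        self.standard = nn.Conv2d(channels, channels - dilated_out,
                                  kernel_size=3, padding=1)

    def forward(self, x):
        # Both branches preserve spatial resolution; their outputs are
        # concatenated along the channel dimension to give N channels.
        return torch.cat([self.dilated(x), self.standard(x)], dim=1)
```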
The mixed convolution group 109 of Fig. 3 includes a densely connected set of mixed convolution blocks 305 between two 3x3 convolutional layers 111. The set of mixed convolution blocks 305 has sequential dilated coefficients of 1, 2, 4, and 8, respectively. The mixed convolution groups 109 may be represented by Equations 5-9, where $M_{d=n}$ represents a mixed convolution block 305 with a dilated coefficient of $n$, $X_{n-1}$ represents the input to the mixed convolution group 109, $X_1$ through $X_4$ represent the outputs of the successive mixed convolution blocks 305, and $X_n$ represents the combined output of the mixed convolution group 109. The function $f_{fusion}$ is used to combine the different outputs of the mixed convolution group 109.

$X_1 = M_{d=1}(X_{n-1})$   (Equation 5)

$X_2 = M_{d=2}(M_{d=1}(X_{n-1}))$   (Equation 6)

$X_3 = M_{d=4}(M_{d=2}(M_{d=1}(X_{n-1})))$   (Equation 7)

$X_4 = M_{d=8}(M_{d=4}(M_{d=2}(M_{d=1}(X_{n-1}))))$   (Equation 8)

$X_n = f_{fusion}([X_1, X_2, X_3, X_4])$   (Equation 9)
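Building on the MixedConv sketch above, Equations 5-9 can be realized as follows. The use of a 1x1 convolution for $f_{fusion}$, the channel count, and the omission of the bracketing 3x3 convolutional layers 111 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixedConvGroup(nn.Module):
    """Sketch of a mixed convolution group 109 following Equations 5-9:
    four mixed convolution blocks 305 with dilated coefficients 1, 2, 4, 8
    applied in sequence, with all intermediate outputs X1..X4 fused."""
    def __init__(self, channels=64, p=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MixedConv(channels, p, dilation=d) for d in (1, 2, 4, 8)]
        )
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)  # one choice for f_fusion

    def forward(self, x):
        outputs, cur = [], x
        for block in self.blocks:          # X1 = M_d=1(x), X2 = M_d=2(X1), ...
            cur = block(cur)
            outputs.append(cur)
        return self.fuse(torch.cat(outputs, dim=1))   # Equation 9
```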
One set of comparative tests demonstrates that the CNN framework recovers objects more effectively than VTM 11.0-NNVC and retains more perceptual texture details. A number of perceptual improvements were observed in regions of a test video, including the wall in the background, the man’s clothes, and the man’s outline. Another set of comparative tests shows that the CNN framework removes blocking artifacts. For example, it was observed that the CNN framework retained visible details of the necklace in a test video more effectively than VTM 11.0-NNVC.
Tables 1-3 below show quantitative measurements of the use of the CNN framework 100, compared to VTM 11.0-NNVC, as a post-processing filter on input images 101. In Tables 1-3, Y-PSNR represents the peak signal-to-noise ratio of the Y channel and Y-MSIM represents the multi-scale structural similarity of the Y channel of a processed image. Further, negative Bjøntegaard delta rate (BD-rate) values represent coding gains from the use of the CNN framework 100.
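As a reminder of how the first metric is computed, a minimal NumPy sketch of PSNR is shown below; applied to the Y (luma) channel of the reference and processed frames, it yields the Y-PSNR values on which the BD-rate figures in Tables 1-3 are based. The peak value of 255 assumes 8-bit samples.

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized frames."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Example usage: y_psnr = psnr(original_y_plane, processed_y_plane)
```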
Table 1 shows the quantitative measurements of BD-rate of using the CNN framework in a Random Access (RA) configuration, compared to VTM 11.0-NNVC.
Table 1
Class | Y-PSNR | Y-MSIM |
Class B | -2.62% | -2.34% |
Class C | -3.22% | -3.12% |
Class D | -4.27% | -2.83% |
Table 2 shows the quantitative measurements of BD-rate of using the CNN framework in an All Intra (AI) configuration, compared to VTM 11.0-NNVC.
Table 2
Class | Y-PSNR | Y-MSIM |
Class B | -3.17% | -2.93% |
Class C | -4.31% | -3.39% |
Class D | -4.58% | -3.18% |
Class E | -5.51% | -6.29% |
Table 3 shows the quantitative measurements of BD-rate of using the CNN framework in a Low Delay P (LDP) configuration, compared to VTM 11.0-NNVC.
Table 3
Class | Y-PSNR | Y-MSIM |
Class B | -3.19% | -2.63% |
Class C | -3.62% | -2.65% |
Class D | -4.15% | -2.73% |
Class E | -5.74% | -6.11% |
These results indicate that the CNN framework 100 achieves average BD-rate reductions of 2.99%, 4.8%, 3.72%, and 4.5% over VTM 11.0-NNVC for the Y channel on Class B, C, D, and E, respectively, across the AI, RA, and LDP configurations. Thus, the CNN framework 100 provides improved compression performance over VTM 11.0-NNVC.
Fig. 4 is a schematic diagram of a wireless communication system 400 in accordance with one or more implementations of the present disclosure. The wireless communication system 400 can implement the CNN framework 100 discussed herein. As shown in Fig. 4, the wireless communications system 400 can include a network device (or base station) 401. Examples of the network device 401 include a base transceiver station (Base Transceiver Station, BTS) , a NodeB (NodeB, NB) , an evolved Node B (eNB or eNodeB) , a Next Generation NodeB (gNB or gNode B) , a Wireless Fidelity (Wi-Fi) access point (AP) , etc. In some embodiments, the network device 401 can include a relay station, an access point, an in-vehicle device, a wearable device, and the like. The network device 401 can include wireless connection devices for communication networks such as: a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Wideband CDMA (WCDMA) network, an LTE network, a cloud radio access network (Cloud Radio Access Network, CRAN) , an Institute of Electrical and Electronics Engineers (IEEE) 802.11-based network (e.g., a Wi-Fi network) , an Internet of Things (IoT) network, a device-to-device (D2D) network, a next-generation network (e.g., a 5G network) , a future evolved public land mobile network (Public Land Mobile Network, PLMN) , or the like. A 5G system or network can be referred to as a new radio (New Radio, NR) system or network.
In Fig. 4, the wireless communications system 400 also includes a terminal device 403. The terminal device 403 can be an end-user device configured to facilitate wireless communication. The terminal device 403 can be configured to wirelessly connect to the network device 401 (e.g., via a wireless channel 405) according to one or more corresponding communication protocols/standards. The terminal device 403 may be mobile or fixed. The terminal device 403 can be a user equipment (UE) , an access terminal, a user unit, a user station, a mobile site, a mobile station, a remote station, a remote terminal, a mobile device, a user terminal, a terminal, a wireless communications device, a user agent, or a user apparatus. Examples of the terminal device 403 include a modem, a cellular phone, a smartphone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) , a handheld device having a wireless communication function, a computing device or another processing device connected to a wireless modem, an in-vehicle device, a wearable device, an Internet-of-Things (IoT) device, a device used in a 5G network, a device used in a public land mobile network, or the like. For illustrative purposes, Fig. 4 illustrates only one network device 401 and one terminal device 403 in the wireless communications system 400. However, in some instances, the wireless communications system 400 can include additional network devices 401 and/or terminal devices 403.
Fig. 5 is a schematic block diagram of a terminal device 403 (e.g., which can implement the methods discussed herein) in accordance with one or more implementations of the present disclosure. As shown, the terminal device 403 includes a processing unit 510 (e.g., a DSP, a CPU, a GPU, etc.) and a memory 520. The processing unit 510 can be configured to implement instructions that correspond to the method 800 of Fig. 6 and/or other aspects of the implementations described above. It should be understood that the processor 510 in the implementations of this technology may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing method may be carried out by an integrated logic circuit of hardware in the processor 510 or by instructions in the form of software. The processor 510 may be a general-purpose processor, a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 510 may implement or perform the methods, steps, and logic block diagrams disclosed in the implementations of this technology. The general-purpose processor 510 may be a microprocessor, or the processor 510 may alternatively be any conventional processor or the like. The steps of the methods disclosed with reference to the implementations of this technology may be performed directly by a decoding processor implemented as hardware, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a random-access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, or another mature storage medium in this field. The storage medium is located in the memory 520, and the processor 510 reads information from the memory 520 and completes the steps of the foregoing methods in combination with its hardware.
It may be understood that the memory 520 in the implementations of this technology may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM) , a programmable read-only memory (PROM) , an erasable programmable read-only memory (EPROM) , an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random-access memory (RAM) used as an external cache. By way of example and not limitation, many forms of RAM may be used, such as a static random-access memory (SRAM) , a dynamic random-access memory (DRAM) , a synchronous dynamic random-access memory (SDRAM) , a double data rate synchronous dynamic random-access memory (DDR SDRAM) , an enhanced synchronous dynamic random-access memory (ESDRAM) , a synchronous link dynamic random-access memory (SLDRAM) , and a direct Rambus random-access memory (DR RAM) . It should be noted that the memories in the systems and methods described herein are intended to include, but are not limited to, these memories and memories of any other suitable type. In some embodiments, the memory may be a non-transitory computer-readable storage medium that stores instructions capable of execution by a processor.
Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure. The method 800 can be implemented by a system (such as a system with the CNN framework 100) . The method 800 is for enhancing image quality (particularly for compressed images) . The method 800 includes, at block 601, receiving an input image 101. At block 603, the method 800 continues by decomposing the input image 101 into a set of wavelet subbands 107. In some embodiments, the decomposition of the input image is accomplished using DWT 106. At block 605, the method 800 proceeds to input the wavelet subbands 107 to the CNN framework 100. The CNN framework 100 comprises two subnetworks. A first subnetwork (e.g., the step-like subband network 102 of Figs. 2A and 2B) is configured to restore the set of wavelet subbands 107 from high frequency to low frequency and uses the restored high frequency wavelet subbands 107B-D to restore the low frequency subband 107A. A second subnetwork (e.g., the mixed enhancement network 103 of Fig. 3) is configured to expand the size of the receptive field of the restored set of wavelet subbands using mixed convolution. In some embodiments, the CNN framework 100 may apply IDWT 108 to the restored set of wavelet subbands between the first subnetwork and the second subnetwork to create a reconstructed signal. At block 607, the method 800 receives an enhanced version of the input image 101 from the CNN framework 100.
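A minimal sketch of the flow of method 800 is given below, assuming PyTorch modules for the two subnetworks and PyWavelets for the DWT/IDWT steps. The function name and the subnetwork interfaces (a four-channel subband tensor in and out for the first subnetwork, a single-channel image in and out for the second) are assumptions for this example, not requirements of the disclosure.

```python
import numpy as np
import pywt
import torch

def enhance(image_y, subnet1, subnet2):
    """Sketch of method 800 (blocks 601-607): DWT decomposition, first
    subnetwork over the subbands, IDWT, second subnetwork, enhanced output.
    image_y is a 2-D NumPy array holding the Y plane of input image 101;
    subnet1/subnet2 stand in for the step-like subband network 102 and the
    mixed enhancement network 103."""
    ll, (lh, hl, hh) = pywt.dwt2(image_y, "haar")                   # block 603
    subbands = torch.from_numpy(np.stack([ll, lh, hl, hh])).float().unsqueeze(0)
    restored = subnet1(subbands)                                    # block 605, first subnetwork
    ll_r, lh_r, hl_r, hh_r = restored.squeeze(0).detach().numpy()
    reconstructed = pywt.idwt2((ll_r, (lh_r, hl_r, hh_r)), "haar")  # IDWT 108
    rec = torch.from_numpy(reconstructed).float()[None, None]
    return subnet2(rec)                                             # block 607 output
```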
In some embodiments, the first subnetwork may comprise one or more Res2NetGroups 105, where each Res2NetGroup 105 includes one or more Res2Net modules 201. In further embodiments, each Res2NetGroup 105 may comprise a first 3x3 convolutional layer 203A at a beginning of the Res2NetGroup 105 and a second 3x3 convolutional layer 203B at an ending of the Res2NetGroup 105. In some embodiments, each Res2Net module 201 may comprise one or more 3x3 convolutional layers 203 between a first 1x1 convolutional layer 205A and a second 1x1 convolutional layer 205B. In further embodiments, each Res2Net module 201 may comprise a spatial attention and channel attention module 207 implemented after the second 1x1 convolutional layer 205B.
In some embodiments, the mixed convolution may comprise a combination of dilated convolution and standard convolution. In further embodiments, the second subnetwork may comprise one or more mixed convolution groups between a first and second 3x3 convolutional layer and each mixed convolution group may comprise two or more densely connected mixed convolution blocks. In an example embodiment, each mixed convolution group may comprise four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
Fig. 7 is a schematic block diagram of a video encoder. An input video contains one or more pictures. Partition unit 701 divides a picture in an input video into one or more coding tree units (CTUs) . Partition unit 701 divides the picture into tiles, and optionally may further divide a tile into one or more bricks, wherein a tile or a brick contains one or more integral and/or partial CTUs. Partition unit 701 forms one or more slices, wherein a slice may contain one or more tiles in a raster order of tiles in the picture, or one or more tiles covering a rectangular region in the picture. Partition unit 701 may also form one or more sub-pictures, wherein a sub-picture contains one or more slices, tiles, or bricks.
In the encoding process of the encoder 700, partition unit 701 passes CTUs to prediction unit 702. Generally, prediction unit 702 is composed of block partition unit 703, ME (motion estimation) unit 704, MC (motion compensation) unit 705, and intra prediction unit 706. Block partition unit 703 further divides an input CTU into smaller coding units (CUs) using quadtree split, binary split, and ternary split iteratively. Prediction unit 702 may derive an inter prediction block of a CU using ME unit 704 and MC unit 705. Intra prediction unit 706 may derive an intra prediction block of a CU using various intra prediction modes including angular prediction modes, DC mode, planar mode, matrix-based intra prediction mode, etc. In an example, a rate-distortion optimized motion estimation method can be invoked by ME unit 704 and MC unit 705 to derive the inter prediction block, and a rate-distortion optimized mode decision method can be invoked by intra prediction unit 706 to derive the intra prediction block.
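As a toy illustration of the recursive splitting performed by block partition unit 703, the sketch below applies only quadtree splits down to an assumed minimum size; binary and ternary splits and all rate-distortion decisions are omitted, and the sizes shown are illustrative.

```python
def quadtree_split(width, height, min_size=8):
    """Recursively split a block into four equal sub-blocks until the
    minimum size is reached; returns the list of leaf block sizes."""
    if width <= min_size or height <= min_size:
        return [(width, height)]
    leaves = []
    for _ in range(4):
        leaves.extend(quadtree_split(width // 2, height // 2, min_size))
    return leaves

print(len(quadtree_split(128, 128)))   # a 128x128 CTU yields 256 leaf blocks of 8x8
```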
Output of filtering unit 712 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 713. DPB 713 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 713 may also be employed as reference for performing inter or intra prediction by prediction unit 702. Optionally, the decoded picture can be further processed using the CNN-based filter discussed herein to enhance the quality of the decoded picture.
Note that the video encoder 700 can also be used to encode an image with only block partition unit 703 and intra prediction unit 706 enabled in prediction unit 702.
Fig. 8 is a schematic block diagram of a video decoder. Input bitstream of a decoder 800 can be a bitstream generated by the encoder 700. Parsing unit 801 parses the input bitstream and obtains values of syntax elements from the input bitstream. Parsing unit 801 converts binary representations of syntax elements to numerical values and forwards the numerical values to the units in the decoder 800 to derive one or more decoded pictures. Parsing unit 801 may also parse one or more syntax elements from the input bitstream for displaying the decoded pictures.
Parsing unit 801 forwards the values of syntax elements, as well as one or more variables set or determined according to the values of syntax elements, for deriving one or more decoded pictures to the units in the decoder 800. Prediction unit 802 determines a prediction block of a current decoding block (e.g., a CU) . When it is indicated that an inter coding mode is used to decode the current decoding block, prediction unit 802 passes the related parameters from parsing unit 801 to MC unit 803 to derive an inter prediction block. When it is indicated that an intra prediction mode is used to decode the current decoding block, prediction unit 802 passes the related parameters from parsing unit 801 to intra prediction unit 804 to derive an intra prediction block.
After all the CUs in a picture or a sub-picture have been reconstructed, filtering unit 808 performs in-loop filtering on the reconstructed picture or sub-picture. Filtering unit 808 contains one or more filters, for example, deblocking filter, sample adaptive offset (SAO) filter, adaptive loop filter (ALF) , luma mapping with chroma scaling (LMCS) filter and neural network based filters. Filtering unit 808 also contains the CNN-based filter discussed herein. As an example, the filters in the filtering unit 808 may be connected with each other in a cascading order. Alternatively, when filtering unit 808 determines that the reconstructed block is not used as reference for decoding other blocks, filtering unit 808 performs in-loop filtering on one or more target pixels in the reconstructed block.
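One way to read the cascading order described above is as a simple composition of filter stages, sketched below; the particular stages and their order are illustrative only and are not mandated by the disclosure.

```python
def apply_filtering_unit(reconstructed_picture, filter_stages):
    """Sketch of filtering unit 808: apply the configured filters in a
    cascading order, e.g. [deblocking, sao, alf, cnn_post_filter]."""
    picture = reconstructed_picture
    for stage in filter_stages:
        picture = stage(picture)
    return picture
```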
Output of filtering unit 808 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 809. DPB 809 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 809 may also be employed as reference for performing inter or intra prediction by prediction unit 802. Optionally, the decoded pictures output from DPB 809 by the decoder 800 can be further processed using the CNN-based filter discussed herein to enhance the quality of the decoded pictures.
Note that the video decoder 800 can also be used to decode a bitstream of an image. In one example implementation, only intra prediction unit 804 is enabled in prediction unit 802 of the decoder 800.
ADDITIONAL CONSIDERATIONS
The above Detailed Description of examples of the disclosed technology is not intended to be exhaustive or to limit the disclosed technology to the precise form disclosed above. While specific examples for the disclosed technology are described above for illustrative purposes, various equivalent modifications are possible within the scope of the described technology, as those skilled in the relevant art will recognize. For example, while processes or blocks are presented in a given order, alternative implementations may perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternative implementations or sub-combinations. Each of these processes or blocks may be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed or implemented in parallel, or may be performed at different times. Further, any specific numbers noted herein are only examples; alternative implementations may employ differing values or ranges.
In the Detailed Description, numerous specific details are set forth to provide a thorough understanding of the presently described technology. In other implementations, the techniques introduced here can be practiced without these specific details. In other instances, well-known features, such as specific functions or routines, are not described in detail in order to avoid unnecessarily obscuring the present disclosure. References in this description to “an implementation/embodiment, ” “one implementation/embodiment, ” or the like mean that a particular feature, structure, material, or characteristic being described is included in at least one implementation of the described technology. Thus, the appearances of such phrases in this specification do not necessarily all refer to the same implementation/embodiment. On the other hand, such references are not necessarily mutually exclusive either. Furthermore, the particular features, structures, materials, or characteristics can be combined in any suitable manner in one or more implementations/embodiments. It is to be understood that the various implementations shown in the figures are merely illustrative representations and are not necessarily drawn to scale.
Several details describing structures or processes that are well-known and often associated with communications systems and subsystems, but that can unnecessarily obscure some significant aspects of the disclosed techniques, are not set forth herein for purposes of clarity. Moreover, although the following disclosure sets forth several implementations of different aspects of the present disclosure, several other implementations can have different configurations or different components than those described in this section. Accordingly, the disclosed techniques can have other implementations with additional elements or without several of the elements described below.
Many implementations or aspects of the technology described herein can take the form of computer-or processor-executable instructions, including routines executed by a programmable computer or processor. Those skilled in the relevant art will appreciate that the described techniques can be practiced on computer or processor systems other than those shown and described below. The techniques described herein can be implemented in a special-purpose computer or data processor that is specifically programmed, configured, or constructed to execute one or more of the computer-executable instructions described below. Accordingly, the terms “computer” and “processor” as generally used herein refer to any data processor. Information handled by these computers and processors can be presented at any suitable display medium. Instructions for executing computer-or processor-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware, or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive and/or other suitable medium.
The term “and/or” in this specification is only an association relationship for describing the associated objects, and indicates that three relationships may exist, for example, A and/or B may indicate the following three cases: A exists separately, both A and B exist, and B exists separately.
These and other changes can be made to the disclosed technology in light of the above Detailed Description. While the Detailed Description describes certain examples of the disclosed technology, as well as the best mode contemplated, the disclosed technology can be practiced in many ways, no matter how detailed the above description appears in text. Details of the system may vary considerably in its specific implementation, while still being encompassed by the technology disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the disclosed technology should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the disclosed technology with which that terminology is associated. Accordingly, the invention is not limited, except as by the appended claims. In general, the terms used in the following claims should not be construed to limit the disclosed technology to the specific examples disclosed in the specification, unless the above Detailed Description section explicitly defines such terms.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the implementations disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
Although certain aspects of the invention are presented below in certain claim forms, the applicant contemplates the various aspects of the invention in any number of claim forms. Accordingly, the applicant reserves the right to pursue such additional claim forms after filing this application, in either this application or in a continuing application.
Claims (20)
- A method for video processing, the method comprising: receiving an input image; decomposing the input image into a set of wavelet subbands using discrete wavelet transform; inputting the set of wavelet subbands to a neural network framework, wherein the neural network framework comprises two subnetworks, wherein: the first subnetwork is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands, and the second subnetwork is configured to expand a size of a receptive field of the restored set of wavelet subbands using a mixed convolution; and receiving, from the neural network framework, an enhanced version of the input image output from the second subnetwork.
- The method of claim 1, wherein the mixed convolution comprises a combination of a dilated convolution and a standard convolution.
- The method of claim 1, wherein the first subnetwork further comprises one or more Res2NetGroups, wherein each Res2NetGroup comprises one or more Res2Net modules.
- The method of claim 3, wherein each Res2NetGroup further comprises a first 3x3 convolutional layer at a beginning of the Res2NetGroup and a second 3x3 convolutional layer at an ending of the Res2NetGroup.
- The method of claim 3, wherein each Res2Net module comprises one or more 3x3 convolutional layers between a first 1x1 convolutional layer and a second 1x1 convolutional layer.
- The method of claim 5, wherein each Res2Net module comprises a spatial attention module and/or a channel attention module after the second 1x1 convolutional layer.
- The method of claim 1, wherein the second subnetwork comprises one or more mixed convolution groups between a first 3x3 convolutional layer and a second 3x3 convolutional layer.
- The method of claim 7, wherein each mixed convolution group comprises one or more densely connected mixed convolution blocks.
- The method of claim 8, wherein each mixed convolution group comprises four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
- The method of claim 1, wherein the neural network framework applies inverse discrete wavelet transform to the restored set of wavelet subbands between the first subnetwork and the second subnetwork.
- A system for video processing, the system comprising: a processor; and a memory configured to store instructions that, when executed by the processor, cause the processor to: receive an input image; decompose the input image into a set of wavelet subbands using discrete wavelet transform; input the set of wavelet subbands to a neural network framework, wherein the neural network framework comprises two subnetworks, wherein: the first subnetwork is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands, and the second subnetwork is configured to expand a size of a receptive field of the restored set of wavelet subbands using a mixed convolution; and receive, from the neural network framework, an enhanced version of the input image output from the second subnetwork.
- The system of claim 11, wherein the mixed convolution comprises a combination of a dilated convolution and a standard convolution.
- The system of claim 11, wherein the first subnetwork further comprises one or more Res2NetGroups, wherein each Res2NetGroup comprises one or more Res2Net modules.
- The system of claim 13, wherein each Res2NetGroup further comprises a first 3x3 convolutional layer at a beginning of the Res2NetGroup and a second 3x3 convolutional layer at an ending of the Res2NetGroup.
- The system of claim 13, wherein each Res2Net module comprises one or more 3x3 convolutional layers between a first 1x1 convolutional layer and a second 1x1 convolutional layer.
- The system of claim 15, wherein each Res2Net module comprises a spatial attention and channel attention module after the second 1x1 convolutional layer.
- The system of claim 11, wherein the second subnetwork comprises one or more mixed convolution groups between a first 3x3 convolutional layer and a second 3x3 convolutional layer.
- The system of claim 11, wherein each mixed convolution group comprises one or more densely connected mixed convolution blocks.
- The system of claim 18, wherein each mixed convolution group comprises four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
- A method for video processing, the method comprising: receiving an input image; applying discrete wavelet transform to the input image to form a set of wavelet subbands; inputting the set of wavelet subbands to a first network configured to restore the set of wavelet subbands from high frequency to low frequency; applying inverse discrete wavelet transform to the restored wavelet subbands to create a reconstructed signal; inputting the reconstructed signal to a second network configured to expand a size of a receptive field using a mixed convolution; and receiving, from the second network, an enhanced version of the input image.