WO2023197219A1 - Cnn-based post-processing filter for video compression with multi-scale feature representation - Google Patents
- Publication number
- WO2023197219A1 (PCT/CN2022/086686; CN2022086686W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subnetwork
- wavelet
- convolution
- convolutional layer
- input image
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/10—Image enhancement or restoration using non-spatial domain filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/60—Image enhancement or restoration using machine learning, e.g. neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- the present disclosure relates to video compression schemes that can improve reconstruction performance. More specifically, the present disclosure is directed to systems and methods for providing a convolutional neural network filter used for post-processing video compression based on wavelet decomposition.
- Common image and video compression methods include those using the Joint Photographic Experts Group (JPEG) standard (e.g., for still images) as well as the High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards (e.g., for videos).
- quantization and prediction processes are performed during the coding processes, resulting in irreversible information loss and various compression artifacts in compressed images/videos, such as blocking, blurring, and banding. This drawback is especially obvious when using a high compression ratio.
- the present disclosure is related to systems and methods for improving image qualities of videos using a neural network for video compression (e.g., video processing) based on wavelet decomposition. Though the following systems and methods are described in relation to video processing, in some embodiments, the systems and methods may be used for other image processing systems and methods.
- the convolutional neural network (CNN) framework can be trained by deep learning and/or artificial intelligence schemes.
- the present disclosure provides a CNN framework for improving image qualities of an input image based on two subnetworks: a step-like subband network and a mixed enhancement network.
- the step-like subband network includes a Res2Net Group (R2NG) composed of Res2Net modules to represent multiscale features.
- the mixed enhancement network uses dilated convolution and standard convolution for an expanded receptive field without blind spots, unlike the use of dilated convolution alone.
- the CNN framework, by using the two subnetworks, has improved reconstruction performance on images compared to common reconstruction systems and methods.
- a method employs the CNN framework for image compression.
- the method receives an input image, which may be a video frame in some embodiments.
- the method decomposes the input image into a set of wavelet subbands using discrete wavelet transform.
- the method inputs the set of wavelet subbands to the CNN framework.
- the CNN framework comprises two subnetworks.
- the first subnetwork (or the step-like subband network) is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands.
- the second subnetwork (or the mixed enhancement network) is configured to expand a size of a receptive field of a signal of the restored set of wavelet subbands using mixed convolution.
- the method receives an enhanced version of the input image from the CNN framework.
- the present method can be implemented by a tangible, non-transitory, computer-readable medium having processor instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform one or more aspects/features of the method described herein.
- the present method can be implemented by a system comprising a computer processor and a non-transitory computer-readable storage medium storing instructions that when executed by the computer processor cause the computer processor to perform one or more actions of the method described herein.
- the present methods can be implemented in various data processing flows, such as in an in-loop filtering module of a video or image codec.
- the present methods can also be implemented in a data processing flow, such as in a post-filtering module of a video or image codec.
- the present methods can be implemented by a system having an encoder and a decoder.
- the encoder can encode an input video/picture based on a set of rules (e.g., a video codec) and then transmit the encoded video/picture.
- the decoder can decode the encoded video based on the set of rules and then generate decoded video as an output video.
- examples of the post filter include a CNN-based post-processing filter discussed herein.
- the post filter can be connected to the output of the decoder, and uses the output video as the post filter’s input to further process/filter the output video.
- the CNN-based filter discussed herein can be an "in-loop filter" at both the encoder and the decoder.
- the filter can be within an in-loop filtering module configured to enhance the quality of reconstructed pictures or a region of a picture (e.g., coding unit, coding tree unit, sub-picture, etc.).
- reconstructed pictures at the encoder or the decoder can serve as an input for the in-loop filtering module.
- the output of the in-loop filtering module can be connected to a buffer (e.g., a decoded picture buffer) .
- Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework in accordance with one or more implementations of the present disclosure.
- Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure.
- Fig. 2B is a schematic diagram illustrating a Res2Net module of the CNN framework in accordance with one or more implementations of the present disclosure.
- Fig. 3 is a schematic diagram of a mixed convolution group of the CNN framework in accordance with one or more implementations of the present disclosure.
- Fig. 4 is a schematic diagram of a wireless communication system in accordance with one or more implementations of the present disclosure.
- Fig. 5 is a schematic block diagram of a terminal device in accordance with one or more implementations of the present disclosure.
- Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure.
- Fig. 7 is a flowchart of a video encoder in accordance with one or more implementations of the present disclosure.
- Fig. 8 is a flowchart of a video decoder in accordance with one or more implementations of the present disclosure.
- Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework 100 in accordance with one or more implementations of the present disclosure.
- the CNN framework 100 is configured to learn, train, and/or use residual information to improve image qualities of an input image 101.
- the CNN framework 100 uses two types of subnetworks, which are a step-like subband network 102 and a mixed enhancement network 103, to reconstruct the input image.
- the CNN framework 100 may include one or more additional filters or convolutional filters not shown in Fig. 1.
- the CNN framework 100 receives an input image 101.
- the input image 101 may be a singular image that has been compressed before transmission or may be one of a set of video frames from a video.
- the CNN framework may receive the input image 101 from a computing device connected to the CNN framework or a system that employs the CNN framework via a network or direct connection.
- the signal of the input image 101 can be represented by wavelet coefficients after the application of discrete wavelet transform (DWT) 106.
- the wavelet coefficients reflect similarity between wavelet basis and image signal and form wavelet subbands.
- the CNN framework 100 uses discrete wavelet transform (DWT) 106 to decompose the input image 101 into four wavelet subbands: a low frequency feature (LL) subband 107A, a vertical feature (HL) subband 107B, a horizontal feature (LH) subband 107C, and a diagonal feature (HH) subband 107D. Most content from the input image 101 is concentrated in the LL subband 107A.
- the HL subband 107B, LH subband 107C, and HH subband 107D are high frequency components of the input image 101 and contain information on areas of the input image 101 with sharp changes in gray value (e.g., edges and texture in the input image 101) .
- the HH subband 107D includes the least amount of information from the input image 101.
- Each wavelet subband 107 represents a different direction, such that the relationships between the wavelet subbands 107 include feature location from the input image 101.
- the CNN framework 100 processes the wavelet subbands 107 in the order of: HH, HL, LH, and LL or HH, LH, HL, and LL.
- HH is processed first because it contains the least information of the wavelet subbands 107 and is easily lost during the compression process.
- LL is processed last because it contains most information and loses less information during the compression process.
- the CNN framework 100 inputs each wavelet subband 107 through 1x1 convolutional layers 113 and step-by-step to the step-like subband network 102.
- the step-like subband network 102 processes the high frequency subbands 107B-D before the low frequency subband 107A and uses the restored high frequency subbands 107B-D to aid in recovery of the low frequency subband 107A.
- the step-like subband network 102 processes the wavelet subbands 107 one-by-one from highest frequency to lowest frequency and uses the already-processed wavelet subbands 107 to aid in recovery of the wavelet subbands subsequently processed.
- Each wavelet subband 107 corresponds to a Res2NetGroup 105 in the step-like subband network 102.
- the Res2NetGroups 105 are further described in relation to Fig. 2A-B.
- the step-like subband network may be represented by Equations 1-4, where LL, HL, LH, HH represent the input wavelet subbands; LL', HL', LH', HH' represent the output of the step-like subband network for each wavelet subband 107, respectively; and R_LL, R_HL, R_LH, R_HH represent the Res2NetGroup 105 corresponding to each wavelet subband 107, respectively.
- the CNN framework puts each wavelet subband output from the Res2NetGroups 105 through 1x1 convolutional layers 113 and performs inverse discrete wavelet transform (IDWT) 108 on the wavelet subbands 107 to reconstruct the signal from the input image 101.
- the CNN framework 100 inputs the reconstructed signal to the mixed enhancement network 103 to expand the size of the receptive field of the reconstructed signal.
- the mixed enhancement network 103 uses a combination of dilated convolution and standard convolution in each mixed convolution group 109 and further includes two 3x3 convolutional layers 111, one at the beginning of the mixed enhancement network 103 and one at the end of the mixed enhancement network 103 (e.g., after the mixed convolution groups 109).
- the structure of the mixed convolution groups 109 is further described in relation to Fig. 3.
- the mixed enhancement network 103 outputs a signal representing an enhanced version of the input image 101 (e.g., an enhanced image 104) .
- Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure.
- the Res2NetGroup 105 includes a 3x3 convolutional layer 203A at its beginning and a 3x3 convolutional layer 203B at its end.
- the Res2NetGroup may include additional 3x3 convolutional layers 203 at its residual connections.
- the Res2NetGroup 105 also includes a set of Res2Net modules 201A-B.
- the Res2NetGroup 105 may include three Res2Net modules 201.
- any number of Res2Net modules 201 may be included in each Res2NetGroup 105.
- Each Res2Net module has stronger multi-feature extraction ability compared to a traditional bottleneck block, with a similar computational load. Using Res2Net modules 201 in the Res2NetGroup improves the CNN’s multi-scale representation capabilities.
- Fig. 2B is a schematic diagram illustrating a Res2Net module 201.
- the Res2Net module 201 splits its input into segments and processes them in a multi-scale manner to extract both global and local information from the wavelet subbands 107. Each segment 209 in the Res2Net module 201 is connected to the other segments and passed through a 3x3 convolutional layer 203 to fuse information from the input image 101 on different scales.
- the Res2Net module 201 includes a 1x1 convolutional layer 205A at its beginning and a 1x1 convolutional layer 205B after the 3x3 convolutional layers 203 and before a spatial attention and channel attention module 207, which is embedded to adaptively enhance the channel and spatial feature responses of the Res2Net module 201.
- Fig. 3 is a schematic diagram of a mixed convolution group 109 of the mixed enhancement network 103 in accordance with one or more implementations of the present disclosure.
- the mixed enhancement network 103 expands the size of the receptive field of the reconstructed signal 301 from the input image 101 using a set of mixed convolution groups 109.
- the mixed enhancement network 103 includes three mixed convolution groups 109 and two 3x3 convolutional layers 111.
- the mixed enhancement network 103 includes any number of mixed convolution groups 109.
- Each mixed convolution group 109 uses a combination of dilated convolution and standard convolution. Dilated convolution expands the receptive field without increasing parameters of the reconstructed signal or reducing resolution of the reconstructed signal.
- a mixed convolution group 109 receives an input 301.
- the mixed convolutional layer of the mixed enhancement network 103 includes N channels (e.g., 64 channels), where p×N channels are produced by dilated convolution and the remaining channels are produced by standard convolution (p is a convolution coefficient).
- the input 301 is the reconstructed signal that underwent IDWT 108.
- the next mixed convolution group 109B receives the output 303 of the first mixed convolution group 109A as input 301.
- the reconstructed signal is input to and output from the sequence of mixed convolution groups 109 in the mixed enhancement network 103 in a similar fashion.
- the output of the last mixed convolution group 109C is a signal representing the enhanced image 104.
- the mixed convolution group 109 of Fig. 3 includes a densely connected set of mixed convolution blocks 305 between two 3x3 convolutional layers 111.
- the set of mixed convolution blocks 305 has sequential dilated coefficients of 1, 2, 4, and 8, respectively.
- One set of comparative tests demonstrates that the CNN framework recovers objects more effectively than VTM 11.0-NNVC and retains more perceptual texture details. A number of perceptual improvements were observed in regions of a test video, including the wall in the background, the clothes of the man, and the man's outline. Another set of comparative tests shows that the CNN framework removes blocking artifacts. For example, it was observed that the CNN framework retained visible details of the necklace in a test video more effectively than VTM 11.0-NNVC.
- Tables 1-3 below show quantitative measurements of the use of the CNN framework 100, compared to VTM 11.0-NNVC, as a post-processing filter on input images 101.
- Y-PSNR represents the peak signal-to-noise ratio of the Y channel of a processed image.
- Y-MSIM represents the multi-scale structural similarity of the Y channel of a processed image.
- negative BD-rate values represent coding gains from the use of the CNN framework 100.
- Table 1 shows the quantitative measurements of BD-rate of using the CNN framework in a Random Access (RA) configuration, compared to VTM 11.0-NNVC.
- Table 2 shows the quantitative measurements of using the CNN framework in an All Intra (AI) configuration, compared to VTM 11.0-NNVC.
- Table 3 shows the quantitative measurements of using the CNN framework in a Low Delay P (LDP) configuration, compared to VTM 11.0-NNVC.
- the CNN framework 100 achieves average BD-rate reductions of 2.99%, 4.8%, 3.72%, and 4.5% over VTM 11.0-NNVC for the Y channel on the B, C, D, and E classes in the AI, RA, and LDP configurations, respectively.
- the CNN framework 100 produces improved performance over VTM 11.0-NNVC for compression.
- Fig. 4 is a schematic diagram of a wireless communication system 400 in accordance with one or more implementations of the present disclosure.
- the wireless communication system 400 can implement the CNN framework 100 discussed herein.
- the wireless communications system 400 can include a network device (or base station) 401.
- the network device 401 can include a base transceiver station (BTS), a NodeB (NB), an evolved NodeB (eNB or eNodeB), a Next Generation NodeB (gNB or gNodeB), a Wireless Fidelity (Wi-Fi) access point (AP), etc.
- the network device 401 can include a relay station, an access point, an in-vehicle device, a wearable device, and the like.
- the network device 401 can include wireless connection devices for communication networks such as: a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Wideband CDMA (WCDMA) network, an LTE network, a cloud radio access network (CRAN), an Institute of Electrical and Electronics Engineers (IEEE) 802.11-based network (e.g., a Wi-Fi network), an Internet of Things (IoT) network, a device-to-device (D2D) network, a next-generation network (e.g., a 5G network), a future evolved public land mobile network (PLMN), or the like.
- a 5G system or network can be referred to as a new radio (NR) system or network.
- the wireless communications system 400 also includes a terminal device 403.
- the terminal device 403 can be an end-user device configured to facilitate wireless communication.
- the terminal device 403 can be configured to wirelessly connect to the network device 401 (e.g., via a wireless channel 405) according to one or more corresponding communication protocols/standards.
- the terminal device 403 may be mobile or fixed.
- the terminal device 403 can be a user equipment (UE) , an access terminal, a user unit, a user station, a mobile site, a mobile station, a remote station, a remote terminal, a mobile device, a user terminal, a terminal, a wireless communications device, a user agent, or a user apparatus.
- Examples of the terminal device 403 include a modem, a cellular phone, a smartphone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) , a handheld device having a wireless communication function, a computing device or another processing device connected to a wireless modem, an in-vehicle device, a wearable device, an Internet-of-Things (IoT) device, a device used in a 5G network, a device used in a public land mobile network, or the like.
- Fig. 4 illustrates only one network device 401 and one terminal device 403 in the wireless communications system 400. However, in some instances, the wireless communications system 400 can include additional network devices 401 and/or terminal devices 403.
- Fig. 5 is a schematic block diagram of a terminal device 403 (e.g., which can implement the methods discussed herein) in accordance with one or more implementations of the present disclosure.
- the terminal device 403 includes a processing unit 510 (e.g., a DSP, a CPU, a GPU, etc. ) and a memory 520.
- the processing unit 510 can be configured to implement instructions that correspond to the method 800 of Fig. 6 and/or other aspects of the implementations described above.
- the processor 510 in the implementations of this technology may be an integrated circuit chip and has a signal processing capability.
- the steps in the foregoing method may be implemented by using an integrated logic circuit of hardware in the processor 510 or an instruction in the form of software.
- the processor 510 may be a general-purpose processor, a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, and a discrete hardware component.
- the processor 510 may implement or perform the methods, steps, and logical block diagrams disclosed in the implementations of this technology.
- the general-purpose processor 510 may be a microprocessor, or the processor 510 may be alternatively any conventional processor or the like.
- the steps in the methods disclosed with reference to the implementations of this technology may be directly performed or completed by a decoding processor implemented as hardware or performed or completed by using a combination of hardware and software modules in a decoding processor.
- the software module may be located at a random-access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, or another mature storage medium in this field.
- the storage medium is located at a memory 520, and the processor 510 reads information in the memory 520 and completes the steps in the foregoing methods in combination with the hardware thereof.
- the memory 520 in the implementations of this technology may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory.
- the non-volatile memory may be a read-only memory (ROM) , a programmable read-only memory (PROM) , an erasable programmable read-only memory (EPROM) , an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
- the volatile memory may be a random-access memory (RAM) and is used as an external cache.
- many forms of RAM can be used, for example, a static random-access memory (SRAM), a dynamic random-access memory (DRAM), a synchronous dynamic random-access memory (SDRAM), a double data rate synchronous dynamic random-access memory (DDR SDRAM), an enhanced synchronous dynamic random-access memory (ESDRAM), a synchronous link dynamic random-access memory (SLDRAM), and a direct Rambus random-access memory (DR RAM).
- the memories in the systems and methods described herein are intended to include, but are not limited to, these memories and memories of any other suitable type.
- the memory may be a non-transitory computer-readable storage medium.
- Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure.
- the method 800 can be implemented by a system (such as a system with the CNN framework 100) .
- the method 800 is for enhancing image qualities (particularly, for compressed images) .
- the method 800 includes, at block 601, receiving an input image 101.
- the method 800 continues by decomposing the input image 101 into a set of wavelet subbands 107.
- the decomposition of the input image is accomplished using DWT 106.
- the method 800 proceeds to input the wavelet subbands 107 to the CNN framework 100.
- the CNN framework 100 comprises two subnetworks.
- a first subnetwork (e.g., the step-like subband network 102 of Figs. 2A and 2B) is configured to restore the set of wavelet subbands 107 from high frequency to low frequency and uses the restored high frequency wavelet subbands 107B-D to restore the low frequency subband 107A.
- a second subnetwork (e.g., the mixed enhancement network 103 of Fig. 3) is configured to expand the size of the receptive field of the restored set of wavelet subbands using mixed convolution.
- the CNN framework 100 may apply IDWT 108 to the restored set of wavelet subbands between the first subnetwork and the second subnetwork to create a reconstructed signal.
- the method 800 receives an enhanced version of the input image 101 from the CNN framework 100.
- the first subnetwork may comprise one or more Res2NetGroups 105, where each Res2NetGroup 105 includes one or more Res2Net modules 201.
- each Res2NetGroup 105 may comprise a first 3x3 convolutional layer 203A at a beginning of the Res2NetGroup 105 and a second 3x3 convolutional layer 203B at an ending of the Res2NetGroup 105.
- each Res2Net module 201 may comprise one or more 3x3 convolutional layers 203 between a first 1x1 convolutional layer 205A and a second 1x1 convolutional layer 205B.
- each Res2Net module 201 may comprise a spatial attention and channel attention module 207 implemented after the second 1x1 convolutional layer 205B.
- the mixed convolution may comprise a combination of dilated convolution and standard convolution.
- the second subnetwork may comprise one or more mixed convolution groups between a first and second 3x3 convolutional layer and each mixed convolution group may comprise two or more densely connected mixed convolution blocks.
- each mixed convolution group may comprise four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
- Fig. 7 is a schematic block diagram of a video encoder.
- An input video contains one or more pictures.
- Partition unit 701 divides a picture in an input video into one or more coding tree units (CTUs) .
- Partition unit 701 divides the picture into tiles, and optionally may further divide a tile into one or more bricks, wherein a tile or a brick contains one or more integral and/or partial CTUs.
- Partition unit 701 forms one or more slices, wherein a slice may contain one or more tiles in a raster order of tiles in the picture, or one or more tiles covering a rectangular region in the picture.
- Partition unit 701 may also form one or more sub-pictures, wherein a sub-picture contains one or more slices, tiles or bricks.
- partition unit 701 passes CTUs to prediction unit 702.
- prediction unit 702 is composed of block partition unit 703, ME (motion estimation) unit 704, MC (motion compensation) unit 705 and intra prediction unit 706.
- Block partition unit 703 further divides an input CTU into smaller coding units (CUs) using quadtree split, binary split and ternary split iteratively.
- Prediction unit 702 may derive inter prediction block of a CU using ME unit 704 and MC unit 705.
- Intra prediction unit 706 may derive an intra prediction block of a CU using various intra prediction modes, including angular prediction modes, DC mode, planar mode, matrix-based intra prediction mode, etc.
- a rate-distortion optimized motion estimation method can be invoked by ME unit 704 and MC unit 705 to derive the inter prediction block.
- a rate-distortion optimized mode decision method can be invoked by intra prediction unit 706 to get the intra prediction block.
- Prediction unit 702 outputs a prediction block of a CU.
- Adder 707 calculates a difference, i.e. residual CU, between the CU in the output of partition unit 701 and the prediction block of the CU.
- Transform unit 708 reads the residual CU, and performs one or more transform operations on the residual CU to get coefficients.
- Quantization unit 709 quantizes the coefficients and outputs quantized coefficients, i.e. levels.
- Inverse quantization unit 710 performs scaling operations on the quantized coefficients to output reconstructed coefficients.
- Inverse transform unit 711 performs one or more inverse transforms corresponding to the transforms in transform unit 708 and outputs a reconstructed residual.
- Adder 712 calculates reconstructed CU by adding the reconstructed residual and the prediction block of the CU from prediction unit 702. Adder 712 also forwards its output to prediction unit 702 to be used as intra prediction reference. After all the CUs in a picture or a sub-picture have been reconstructed, filtering unit 712 performs in-loop filtering on the reconstructed picture or sub-picture.
- Filtering unit 712 contains one or more filters, for example, deblocking filter, sample adaptive offset (SAO) filter, adaptive loop filter (ALF) , luma mapping with chroma scaling (LMCS) filter and neural network based filters.
- Filtering unit 712 also contains the CNN-based filter discussed herein.
- the filters in the filtering unit 712 may be connected with each other in a cascading order.
- when filtering unit 712 determines that the CU is not used as a reference for encoding other CUs, filtering unit 712 performs in-loop filtering on one or more target pixels in the CU.
- Output of filtering unit 712 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 713.
- DPB 713 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 713 may also be employed as reference for performing inter or intra prediction by prediction unit 702.
- the decoded picture can be further processed using the CNN-based filter discussed herein to enhance a quality of the decoded picture.
- Entropy coding unit 715 converts parameters from units in encoder 700 that are necessary for deriving decoded picture as well as control parameters and supplemental information into binary representations, and writes such binary representations according to syntax structure of each data unit into a generated video bitstream.
- Encoder 700 could be a computing device with a processor and a storage medium recording an encoding program. When the processor reads and executes the encoding program, the encoder 700 reads an input video and generates corresponding video bitstream.
- Encoder 700 could be a computing device with one or more chips.
- the units on the chip, implemented as integrated circuits, have functionalities, connections, and data exchanges similar to those of the corresponding units in Fig. 7.
- video encoder 700 can also be used to encode an image with only block partition unit 703 and intra prediction unit 706 enabled in prediction unit 702.
- Fig. 8 is a schematic block diagram of a video decoder.
- Input bitstream of a decoder 800 can be a bitstream generated by the encoder 700.
- Parsing unit 801 parses the input bitstream and obtains values of syntax elements from the input bitstream. Parsing unit 801 converts binary representations of syntax elements to numerical values and forwards the numerical values to the units in the decoder 800 to derive one or more decoded pictures. Parsing unit 801 may also parse one or more syntax elements from the input bitstream for displaying the decoded pictures.
- Parsing unit 801 forwards the values of syntax elements, as well as one or more variables set or determined according to the values of syntax elements, for deriving one or more decoded pictures to the units in the decoder 800.
- Prediction unit 802 determines a prediction block of a current decoding block (e.g., a CU). When it is indicated that an inter coding mode is used to decode the current decoding block, prediction unit 802 passes relative parameters from parsing unit 801 to MC unit 803 to derive an inter prediction block. When it is indicated that an intra prediction mode is used to decode the current decoding block, prediction unit 802 passes relative parameters from parsing unit 801 to intra prediction unit 804 to derive an intra prediction block.
- Scaling unit 805 has the same function as that of inverse quantization unit 710 in the encoder 700. Scaling unit 805 performs scaling operations on quantized coefficients (i.e. Levels) from parsing unit 801 to get reconstructed coefficients.
- Transform unit 806 has the same function as that of inverse transform unit 711 in the encoder 700. Transform unit 806 performs one or more transform operations (i.e., inverse operations of the one or more transform operations performed by transform unit 708 in the encoder 700) to get a reconstructed residual.
- Adder 807 performs an addition operation on its inputs, the prediction block from prediction unit 802 and the reconstructed residual from transform unit 806, to get a reconstructed block of the current decoding block.
- the reconstructed block is also sent to prediction unit 802 to be used as reference for other blocks coded in intra prediction mode.
- filtering unit 808 After all the CUs in a picture or a sub-picture have been reconstructed, filtering unit 808 performs in-loop filtering on the reconstructed picture or sub-picture.
- Filtering unit 808 contains one or more filters, for example, deblocking filter, sample adaptive offset (SAO) filter, adaptive loop filter (ALF) , luma mapping with chroma scaling (LMCS) filter and neural network based filters.
- Filtering unit 808 also contains the CNN-based filter discussed herein.
- the filters in the filtering unit 808 may be connected with each other in a cascading order.
- when filtering unit 808 determines that the reconstructed block is not used as a reference for decoding other blocks, filtering unit 808 performs in-loop filtering on one or more target pixels in the reconstructed block.
- Output of filtering unit 808 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 809.
- DPB 809 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 809 may also be employed as reference for performing inter or intra prediction by prediction unit 802.
- the decoded pictures outputted from DPB 809 by the decoder 800 can be further processed using the CNN-based filter discussed herein to enhance a quality of the decoded pictures.
- Decoder 800 could be a computing device with a processor and a storage medium recording a decoding program.
- the decoder 800 reads an input video bitstream and generates corresponding decoded video.
- Decoder 800 could be a computing device with one or more chips.
- the units on the chip, implemented as integrated circuits, have functionalities, connections, and data exchanges similar to those of the corresponding units in Fig. 8.
- video decoder 800 can also be used to decode a bitstream of an image.
- One example implementation enables only intra prediction unit 804 in prediction unit 802 of the decoder 800.
- Instructions for executing computer-or processor-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware, or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive and/or other suitable medium.
- A and/or B may indicate the following three cases: A exists alone, both A and B exist, or B exists alone.
Abstract
Methods and systems for video processing are provided. In some embodiments, the method includes (i) receiving an input image; (ii) decomposing the input image using discrete wavelet transform; (iii) inputting a set of wavelet subbands to a neural network framework, where the neural network framework comprises a first subnetwork and a second subnetwork; and (iv) receiving, from the neural network framework, an enhanced version of the input image. The first subnetwork is configured to restore the set of wavelet subbands from high frequency to low frequency, using the restored high frequency wavelet subbands to restore the low frequency wavelet subbands. The neural network framework may apply inverse discrete wavelet transform to the restored wavelet subbands to create a reconstructed signal. The second subnetwork is configured to expand a size of a receptive field of the reconstructed signal for the enhanced version of the input image.
Description
The present disclosure relates to video compression schemes that can improve reconstruction performance. More specifically, the present disclosure is directed to systems and methods for providing a convolutional neural network filter used for post-processing video compression based on wavelet decomposition.
Common image and video compression methods include those using the Joint Photographic Experts Group (JPEG) standard (e.g., for still images) as well as the High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards (e.g., for videos). In these methods, quantization and prediction processes are performed during the coding processes, resulting in irreversible information loss and various compression artifacts in compressed images/videos, such as blocking, blurring, and banding. This drawback is especially obvious when using a high compression ratio.
To address the foregoing drawback, multiple deep-learning-based methods are used. These methods include frameworks/networks based on simple concatenated layers, deep residual blocks, dense connections, cascading connections, and feature reuse. Most of these methods do not employ advanced features, such as residual dense blocks and informative associations between different frequencies. As a result, these methods cannot further improve their learning and feature selection abilities, and thus their compression artifact removal is very limited.
SUMMARY
The present disclosure is related to systems and methods for improving image qualities of videos using a neural network for video compression (e.g., video processing) based on wavelet decomposition. Though the following systems and methods are described in relation to video processing, in some embodiments, the systems and methods may be used for other image processing systems and methods. The convolutional neural network (CNN) framework can be trained by deep learning and/or artificial intelligence schemes. The present disclosure provides a CNN framework for improving image qualities of an input image based on two subnetworks: a step-like subband network and a mixed enhancement network. The step-like subband network includes a Res2Net Group (R2NG) composed of Res2Net modules to represent multiscale features. The mixed enhancement network uses dilated convolution and standard convolution for an expanded receptive field without blind spots, unlike the use of dilated convolution alone. The CNN framework, by using the two subnetworks, has improved reconstruction performance on images compared to common reconstruction systems and methods.
In some embodiments, a method employs the CNN framework for image compression. The method receives an input image, which may be a video frame in some embodiments. The method decomposes the input image into a set of wavelet subbands using discrete wavelet transform. The method inputs the set of wavelet subbands to the CNN framework. The CNN framework comprises two subnetworks. The first subnetwork (or the step-like subband network) is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands. The second subnetwork (or the mixed enhancement network) is configured to expand a size of a receptive field of a signal of the restored set of wavelet subbands using mixed convolution. The method receives an enhanced version of the input image from the CNN framework.
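For illustration only, this flow can be sketched in a few lines of Python. The Haar wavelet choice, the pywt-based transform, and the identity placeholders below are assumptions of the sketch rather than the architecture defined in the figures.

```python
# Minimal sketch of the claimed flow: DWT -> first subnetwork -> IDWT -> second subnetwork.
# The Haar wavelet and the identity placeholders are illustrative assumptions only.
import numpy as np
import pywt

def step_like_subband_net(subbands):
    """Placeholder for the first subnetwork (would restore HH, then LH/HL, then LL)."""
    return subbands  # a real network refines each subband here

def mixed_enhancement_net(image):
    """Placeholder for the second subnetwork (would expand the receptive field)."""
    return image

def enhance(decoded_frame):
    ll, (lh, hl, hh) = pywt.dwt2(decoded_frame, "haar")        # decompose (cf. DWT 106)
    ll, lh, hl, hh = step_like_subband_net((ll, lh, hl, hh))   # restore subbands
    reconstructed = pywt.idwt2((ll, (lh, hl, hh)), "haar")     # inverse DWT (cf. IDWT 108)
    return mixed_enhancement_net(reconstructed)                # receptive-field expansion

frame = np.random.rand(128, 128)   # stand-in for a decoded video frame
print(enhance(frame).shape)        # (128, 128)
```

In the full framework, the placeholders correspond to the step-like subband network 102 and the mixed enhancement network 103 described below.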
In some embodiments, the present method can be implemented by a tangible, non-transitory, computer-readable medium having processor instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform one or more aspects/features of the method described herein. In other embodiments, the present method can be implemented by a system comprising a computer processor and a non-transitory computer-readable storage medium storing instructions that when executed by the computer processor cause the computer processor to perform one or more actions of the method described herein.
In some embodiments, the present methods can be implemented in various data processing flows, such as in an in-loop filtering module of a video or image codec. The present methods can also be implemented in a data processing flow, such as in a post-filtering module of a video or image codec.
In some embodiments, the present methods can be implemented by a system having an encoder and a decoder. In such embodiments, the encoder can encode an input video/picture based on a set of rules (e.g., a video codec) and then transmit the encoded video/picture. After receiving the encoded video, the decoder can decode the encoded video based on the set of rules and then generate decoded video as an output video. In some embodiments, there can be some “post filter” or “post-filtering” process to further process the output video.
Examples of the post filter include a CNN-based post-processing filter discussed herein. For example, the post filter can be connected to the output of the decoder, and uses the output video as the post filter’s input to further process/filter the output video.
In some embodiments, the CNN-based filter discussed herein can be an "in-loop filter" at both the encoder and the decoder. For example, in such embodiments, the filter can be within an in-loop filtering module configured to enhance the quality of reconstructed pictures or a region of a picture (e.g., coding unit, coding tree unit, sub-picture, etc.). In such embodiments, reconstructed pictures at the encoder or the decoder can serve as an input for the in-loop filtering module. In some embodiments, the output of the in-loop filtering module can be connected to a buffer (e.g., a decoded picture buffer).
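As a rough illustration of this placement (not a normative part of the codec description), an in-loop filtering module can be viewed as a cascade of filters whose output feeds the decoded picture buffer; the filter names and ordering in the sketch below are assumptions.

```python
# Illustrative only: a CNN-based filter placed inside an in-loop filtering cascade whose
# output feeds the decoded picture buffer. Filter names and ordering are assumptions.
def apply_in_loop_filters(reconstructed_picture, filters, decoded_picture_buffer):
    """Apply each filter to the previous filter's output, then store the result."""
    picture = reconstructed_picture
    for f in filters:                        # e.g., [deblock, sao, alf, cnn_filter]
        picture = f(picture)
    decoded_picture_buffer.append(picture)   # later usable as a reference picture
    return picture

# Example with trivial stand-in filters.
identity = lambda p: p
dpb = []
filtered = apply_in_loop_filters([[0.0]], [identity, identity], dpb)
```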
To describe the technical solutions in the implementations of the present disclosure more clearly, the following briefly describes the accompanying drawings. The accompanying drawings show merely some aspects or implementations of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework in accordance with one or more implementations of the present disclosure.
Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure.
Fig. 2B is a schematic diagram illustrating a Res2Net module of the CNN framework in accordance with one or more implementations of the present disclosure.
Fig. 3 is a schematic diagram of a mixed convolution group of the CNN framework in accordance with one or more implementations of the present disclosure.
Fig. 4 is a schematic diagram of a wireless communication system in accordance with one or more implementations of the present disclosure.
Fig. 5 is a schematic block diagram of a terminal device in accordance with one or more implementations of the present disclosure.
Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure.
Fig. 7 is a flowchart of a video encoder in accordance with one or more implementations of the present disclosure.
Fig. 8 is a flowchart of a video decoder in accordance with one or more implementations of the present disclosure.
Fig. 1 is a schematic diagram illustrating a convolutional neural network (CNN) framework 100 in accordance with one or more implementations of the present disclosure. The CNN framework 100 is configured to learn, train, and/or use residual information to improve image qualities of an input image 101. As discussed in detail below, the CNN framework 100 uses two types of subnetworks, which are a step-like subband network 102 and a mixed enhancement network 103, to reconstruct the input image. In some embodiments, the CNN framework 100 may include one or more additional filters or convolutional filters not shown in Fig. 1.
The CNN framework 100 receives an input image 101. The input image 101 may be a singular image that has been compressed before transmission or may be one of a set of video frames from a video. The CNN framework may receive the input image 101 from a computing device connected to the CNN framework or a system that employs the CNN framework via a network or direct connection. The signal of the input image 101 can be represented by wavelet coefficients after the application of discrete wavelet transform (DWT) 106. The wavelet coefficients reflect similarity between the wavelet basis and the image signal and form wavelet subbands. The CNN framework 100 uses discrete wavelet transform (DWT) 106 to decompose the input image 101 into four wavelet subbands: a low frequency feature (LL) subband 107A, a vertical feature (HL) subband 107B, a horizontal feature (LH) subband 107C, and a diagonal feature (HH) subband 107D. Most content from the input image 101 is concentrated in the LL subband 107A. The HL subband 107B, LH subband 107C, and HH subband 107D are high frequency components of the input image 101 and contain information on areas of the input image 101 with sharp changes in gray value (e.g., edges and texture in the input image 101). The HH subband 107D includes the least amount of information from the input image 101. Each wavelet subband 107 represents a different direction, such that the relationships between the wavelet subbands 107 include feature location from the input image 101. The CNN framework 100 processes the wavelet subbands 107 in the order of: HH, HL, LH, and LL or HH, LH, HL, and LL. HH is processed first because it contains the least information of the wavelet subbands 107 and is easily lost during the compression process. LL is processed last because it contains most information and loses less information during the compression process.
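The concentration of content in the LL subband can be checked with a short, self-contained experiment; the synthetic frame and the Haar wavelet below are assumptions used only for illustration (pywt returns coefficients as (cA, (cH, cV, cD)), which are taken here as LL, LH, HL, and HH, respectively).

```python
# A small, self-contained check (not from the patent) of the claim that most image content
# concentrates in the LL subband after one level of DWT. The synthetic frame and the Haar
# wavelet are assumptions for illustration.
import numpy as np
import pywt

rng = np.random.default_rng(0)
y, x = np.mgrid[0:256, 0:256]
frame = 0.5 * x + 0.3 * y + rng.normal(scale=2.0, size=(256, 256))  # smooth content + noise

# pywt returns (cA, (cH, cV, cD)); they are taken here as LL, LH, HL, HH respectively.
ll, (lh, hl, hh) = pywt.dwt2(frame, "haar")
total = sum(float(np.sum(s ** 2)) for s in (ll, hl, lh, hh))
for name, s in (("LL", ll), ("HL", hl), ("LH", lh), ("HH", hh)):
    print(f"{name}: {float(np.sum(s ** 2)) / total:.4%} of signal energy")
```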
The CNN framework 100 inputs each wavelet subband 107 through 1x1 convolutional layers 113 and step-by-step to the step-like subband network 102. The step-like subband network 102 processes the high frequency subbands 107B-D before the low frequency subband 107A and uses the restored high frequency subbands 107B-D to aid in recovery of the low frequency subband 107A. In some embodiments, the step-like subband network 102 processes the wavelet subbands 107 one-by-one from highest frequency to lowest frequency and uses the already-processed wavelet subbands 107 to aid in recovery of the wavelet subbands subsequently processed. Each wavelet subband 107 corresponds to a Res2NetGroup 105 in the step-like subband network 102. The Res2NetGroups 105 are further described in relation to Fig. 2A-B.
The step-like subband network may be represented by Equations 1-4, where LL, HL, LH, and HH represent the input wavelet subbands; LL′, HL′, LH′, and HH′ represent the output of the step-like subband network for each wavelet subband 107, respectively; and R_LL, R_HL, R_LH, and R_HH represent the Res2NetGroup 105 corresponding to each wavelet subband 107, respectively.

HH′ = R_HH (HH)    (Equation 1)

LH′ = R_LH (LH + HH′)    (Equation 2)

HL′ = R_HL (HL + LH′)    (Equation 3)

LL′ = R_LL (LL + HL′)    (Equation 4)
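A compact PyTorch-style sketch of Equations 1-4 follows; the SimpleGroup used here is only a stand-in for the Res2NetGroup 105 of Figs. 2A-2B, and the channel count is an assumed value.

```python
# Sketch of Equations 1-4: each subband is restored by its own group, and each restored
# high-frequency subband is added to the next (lower-frequency) subband's input.
# SimpleGroup is a stand-in for the Res2NetGroup of Figs. 2A-2B.
import torch
import torch.nn as nn

class SimpleGroup(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # residual refinement

class StepLikeSubbandNetwork(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.r_hh = SimpleGroup(channels)
        self.r_lh = SimpleGroup(channels)
        self.r_hl = SimpleGroup(channels)
        self.r_ll = SimpleGroup(channels)

    def forward(self, ll, hl, lh, hh):
        hh_out = self.r_hh(hh)             # Equation 1
        lh_out = self.r_lh(lh + hh_out)    # Equation 2
        hl_out = self.r_hl(hl + lh_out)    # Equation 3
        ll_out = self.r_ll(ll + hl_out)    # Equation 4
        return ll_out, hl_out, lh_out, hh_out

# Example: four 64-channel subband feature maps (e.g., after the 1x1 convolutional layers 113).
subbands = [torch.randn(1, 64, 32, 32) for _ in range(4)]
net = StepLikeSubbandNetwork()
ll_out, hl_out, lh_out, hh_out = net(*subbands)
```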
The CNN framework puts each wavelet subband output from the Res2NetGroups 105 through 1x1 convolutional layers 113 and performs inverse discrete wavelet transform (IDWT) 108 on the wavelet subbands 107 to reconstruct the signal from the input image 101. The CNN framework 100 inputs the reconstructed signal to the mixed enhancement network 103 to expand the size of the receptive field of the reconstructed signal. The mixed enhancement network 103 uses a combination of dilated convolution and standard convolution in each mixed convolution group 109 and further includes two 3x3 convolutional layers 111, one at the beginning of the mixed enhancement network 103 and one at the end of the mixed enhancement network 103 (e.g., after the mixed convolution groups 109). The structure of the mixed convolution groups 109 is further described in relation to Fig. 3. The mixed enhancement network 103 outputs a signal representing an enhanced version of the input image 101 (e.g., an enhanced image 104).
Fig. 2A is a schematic diagram illustrating a Res2NetGroup of the CNN framework in accordance with one or more implementations of the present disclosure. As shown in Fig. 2A, the Res2NetGroup 105 includes a 3x3 convolutional layer 203A at its beginning and a 3x3 convolutional layer 203B at its end. In some embodiments, the Res2NetGroup may include additional 3x3 convolutional layers 203 at its residual connections. The Res2NetGroup 105 also includes a set of Res2Net modules 201A-B. In some embodiments, the Res2NetGroup 105 may include three Res2Net modules 201. In other embodiments, any number of Res2Net modules 201 may be included in each Res2NetGroup 105. Each Res2Net module has stronger multi-feature extraction ability compared to a traditional bottleneck block, with a similar computational load. Using Res2Net modules 201 in the Res2NetGroup improves the CNN's multi-scale representation capabilities.
Fig. 2B is a schematic diagram illustrating a Res2Net module 201. The Res2Net module 201 splits its input into segments and processes them in a multi-scale manner to extract both global and local information from the wavelet subbands 107. Each segment 209 in the Res2Net module 201 is connected to the other segments and passed through a 3x3 convolutional layer 203 to fuse information from the input image 101 on different scales. The Res2Net module 201 includes a 1x1 convolutional layer 205A at its beginning and a 1x1 convolutional layer 205B after the 3x3 convolutional layers 203 and before a spatial attention and channel attention module 207, which is embedded to adaptively enhance the channel and spatial feature responses of the Res2Net module 201.
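For readers unfamiliar with Res2Net-style processing, the following simplified sketch illustrates the idea of scale-wise segments, hierarchical 3x3 convolutions, and an attention stage; the number of scales, the channel counts, and the attention design are assumptions and not the exact module 201.

```python
# Simplified Res2Net-style module: a 1x1 conv, channel-wise split into scale segments,
# hierarchical 3x3 convs with cross-segment additions, concatenation, a 1x1 conv, and a
# lightweight channel/spatial attention stage. Details are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleRes2NetModule(nn.Module):
    def __init__(self, channels=64, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.conv_in = nn.Conv2d(channels, channels, 1)
        # one 3x3 conv per segment except the first (which is passed through unchanged)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1)
        )
        self.conv_out = nn.Conv2d(channels, channels, 1)
        # channel attention (squeeze-and-excitation style) as a stand-in for module 207
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # spatial attention over the channel-pooled map
        self.spatial_att = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv_in(x))
        segments = torch.chunk(y, self.scales, dim=1)           # split into scale segments
        outs = [segments[0]]
        prev = segments[0]
        for conv, seg in zip(self.convs, segments[1:]):
            prev = self.relu(conv(seg + prev))                  # fuse with the previous scale
            outs.append(prev)
        y = self.conv_out(torch.cat(outs, dim=1))
        y = y * self.channel_att(y)                             # channel attention
        y = y * self.spatial_att(y.mean(dim=1, keepdim=True))   # spatial attention
        return x + y                                            # residual connection

m = SimpleRes2NetModule()
out = m(torch.randn(1, 64, 32, 32))
```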
Fig. 3 is a schematic diagram of a mixed convolution group 109 of the mixed enhancement network 103 in accordance with one or more implementations of the present disclosure. The mixed enhancement network 103 expands the size of the receptive field of the reconstructed signal 301 from the input image 101 using a set of mixed convolution groups 109. In some embodiments, the mixed enhancement network 103 includes three mixed convolution groups 109 and two 3x3 convolutional layers 111. In other embodiments, the mixed enhancement network 103 includes any number of mixed convolution groups 109. Each mixed convolution group 109 uses a combination of dilated convolution and standard convolution. Dilated convolution expands the receptive field without increasing parameters of the reconstructed signal or reducing resolution of the reconstructed signal. Due to the predefined gap of dilated convolution, using only dilated convolution causes blind spots to appear in the receptive field due to the lack of contextual information between pixels of the input image 101, so the use of dilated convolution and standard convolution together results in an improved output 303.
As shown in Fig. 3, a mixed convolution group 109 receives an input 301. The mixed convolutional layer of the mixed enhancement network 103 includes N channels (e.g., 64 channels), where p×N channels are produced by dilated convolution and the remaining channels are produced by standard convolution (p is a convolution coefficient). For the first mixed convolution group 109A in the mixed enhancement network 103, the input 301 is the reconstructed signal that underwent IDWT 108. The next mixed convolution group 109B receives the output 303 of the first mixed convolution group 109A as input 301. The reconstructed signal propagates through the sequence of mixed convolution groups 109 of the mixed enhancement network 103 in a similar fashion. The output of the last mixed convolution group 109C is a signal representing the enhanced image 104.
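The channel split described above can be sketched in PyTorch as follows. N = 64 and p = 0.5 are example values (the disclosure leaves p as a convolution coefficient), and the kernel sizes are assumptions for this illustration.

```python
import torch
import torch.nn as nn

class MixedConv(nn.Module):
    """Sketch of a mixed convolutional layer: of the N output channels,
    round(p*N) are produced by dilated 3x3 convolution and the remainder
    by standard 3x3 convolution."""
    def __init__(self, channels=64, p=0.5, dilation=2):
        super().__init__()
        dilated_out = round(p * channels)
        self.dilated = nn.Conv2d(channels, dilated_out, kernel_size=3,
                                 padding=dilation, dilation=dilation)
        self.standard = nn.Conv2d(channels, channels - dilated_out,
                                  kernel_size=3, padding=1)

    def forward(self, x):
        # Both branches preserve spatial resolution; their outputs are
        # concatenated along the channel dimension to give N channels.
        return torch.cat([self.dilated(x), self.standard(x)], dim=1)
```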
The mixed convolution group 109 of Fig. 3 includes a densely connected set of mixed convolution blocks 305 between two 3x3 convolutional layers 111. The set of mixed convolution blocks 305 has sequential dilated coefficients of 1, 2, 4, and 8, respectively. The mixed convolution groups 109 may be represented by Equations 5-9, where $M_{d=n}$ represents a mixed convolution block 305 with a dilated coefficient of $n$, $X_{n-1}$ represents the input to the mixed convolution group 109, $X_1$ through $X_4$ represent the outputs of the successive mixed convolution blocks 305, and $X_n$ represents the combined output of the mixed convolution group 109. The function $f_{fusion}$ is used to combine the different outputs of the mixed convolution group 109.

$X_1 = M_{d=1}(X_{n-1})$   (Equation 5)

$X_2 = M_{d=2}(M_{d=1}(X_{n-1}))$   (Equation 6)

$X_3 = M_{d=4}(M_{d=2}(M_{d=1}(X_{n-1})))$   (Equation 7)

$X_4 = M_{d=8}(M_{d=4}(M_{d=2}(M_{d=1}(X_{n-1}))))$   (Equation 8)

$X_n = f_{fusion}([X_1, X_2, X_3, X_4])$   (Equation 9)
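Building on the MixedConv sketch above, Equations 5-9 can be realized as follows. The use of a 1x1 convolution for $f_{fusion}$, the channel count, and the omission of the bracketing 3x3 convolutional layers 111 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixedConvGroup(nn.Module):
    """Sketch of a mixed convolution group 109 following Equations 5-9:
    four mixed convolution blocks 305 with dilated coefficients 1, 2, 4, 8
    applied in sequence, with all intermediate outputs X1..X4 fused."""
    def __init__(self, channels=64, p=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MixedConv(channels, p, dilation=d) for d in (1, 2, 4, 8)]
        )
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=1)  # one choice for f_fusion

    def forward(self, x):
        outputs, cur = [], x
        for block in self.blocks:          # X1 = M_d=1(x), X2 = M_d=2(X1), ...
            cur = block(cur)
            outputs.append(cur)
        return self.fuse(torch.cat(outputs, dim=1))   # Equation 9
```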
One set of comparative tests demonstrates that the CNN framework recovers objects more effectively than VTM 11.0-NNVC and retains more perceptual texture details. A number of perceptual improvements were observed in regions of a test video, including the wall in the background, the man’s clothes, and the man’s outline. Another set of comparative tests shows that the CNN framework removes blocking artifacts. For example, it was observed that the CNN framework retained visible details of the necklace in a test video more effectively than VTM 11.0-NNVC.
Tables 1-3 below show quantitative measurements of the use of the CNN framework 100, compared to VTM 11.0-NNVC, as a post-processing filter on input images 101. In Tables 1-3, Y-PSNR represents the peak signal-to-noise ratio of the Y channel and Y-MSIM represents the multi-scale structural similarity of the Y channel of a processed image. Further, negative Bjøntegaard delta rate (BD-rate) values represent coding gains from the use of the CNN framework 100.
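As a reminder of how the first metric is computed, a minimal NumPy sketch of PSNR is shown below; applied to the Y (luma) channel of the reference and processed frames, it yields the Y-PSNR values on which the BD-rate figures in Tables 1-3 are based. The peak value of 255 assumes 8-bit samples.

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized frames."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Example usage: y_psnr = psnr(original_y_plane, processed_y_plane)
```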
Table 1 shows the quantitative measurements of BD-rate of using the CNN framework in a Random Access (RA) configuration, compared to VTM 11.0-NNVC.
Table 1
Class | Y-PSNR | Y-MSIM |
Class B | -2.62% | -2.34% |
Class C | -3.22% | -3.12% |
Class D | -4.27% | -2.83% |
Table 2 shows the quantitative measurements of BD-rate of using the CNN framework in an All Intra (AI) configuration, compared to VTM 11.0-NNVC.
Table 2
Class | Y-PSNR | Y-MSIM |
Class B | -3.17% | -2.93% |
Class C | -4.31% | -3.39% |
Class D | -4.58% | -3.18% |
Class E | -5.51% | -6.29% |
Table 3 shows the quantitative measurements of BD-rate of using the CNN framework in a Low Delay P (LDP) configuration, compared to VTM 11.0-NNVC.
Table 3
Class | Y-PSNR | Y-MSIM |
Class B | -3.19% | -2.63% |
Class C | -3.62% | -2.65% |
Class D | -4.15% | -2.73% |
Class E | -5.74% | -6.11% |
These results indicate that the CNN framework 100 achieves average BD-rate reductions of 2.99%, 4.8%, 3.72%, and 4.5% over VTM 11.0-NNVC for the Y channel on Class B, C, D, and E, respectively, across the AI, RA, and LDP configurations. Thus, the CNN framework 100 provides improved compression performance over VTM 11.0-NNVC.
Fig. 4 is a schematic diagram of a wireless communication system 400 in accordance with one or more implementations of the present disclosure. The wireless communication system 400 can implement the CNN framework 100 discussed herein. As shown in Fig. 4, the wireless communications system 400 can include a network device (or base station) 401. Examples of the network device 401 include a base transceiver station (Base Transceiver Station, BTS) , a NodeB (NodeB, NB) , an evolved Node B (eNB or eNodeB) , a Next Generation NodeB (gNB or gNode B) , a Wireless Fidelity (Wi-Fi) access point (AP) , etc. In some embodiments, the network device 401 can include a relay station, an access point, an in-vehicle device, a wearable device, and the like. The network device 401 can include wireless connection devices for communication networks such as: a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Wideband CDMA (WCDMA) network, an LTE network, a cloud radio access network (Cloud Radio Access Network, CRAN) , an Institute of Electrical and Electronics Engineers (IEEE) 802.11-based network (e.g., a Wi-Fi network) , an Internet of Things (IoT) network, a device-to-device (D2D) network, a next-generation network (e.g., a 5G network) , a future evolved public land mobile network (Public Land Mobile Network, PLMN) , or the like. A 5G system or network can be referred to as a new radio (New Radio, NR) system or network.
In Fig. 4, the wireless communications system 400 also includes a terminal device 403. The terminal device 403 can be an end-user device configured to facilitate wireless communication. The terminal device 403 can be configured to wirelessly connect to the network device 401 (e.g., via a wireless channel 405) according to one or more corresponding communication protocols/standards. The terminal device 403 may be mobile or fixed. The terminal device 403 can be a user equipment (UE) , an access terminal, a user unit, a user station, a mobile site, a mobile station, a remote station, a remote terminal, a mobile device, a user terminal, a terminal, a wireless communications device, a user agent, or a user apparatus. Examples of the terminal device 403 include a modem, a cellular phone, a smartphone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) , a handheld device having a wireless communication function, a computing device or another processing device connected to a wireless modem, an in-vehicle device, a wearable device, an Internet-of-Things (IoT) device, a device used in a 5G network, a device used in a public land mobile network, or the like. For illustrative purposes, Fig. 4 illustrates only one network device 401 and one terminal device 403 in the wireless communications system 400. However, in some instances, the wireless communications system 400 can include additional network devices 401 and/or terminal devices 403.
Fig. 5 is a schematic block diagram of a terminal device 403 (e.g., which can implement the methods discussed herein) in accordance with one or more implementations of the present disclosure. As shown, the terminal device 403 includes a processing unit 510 (e.g., a DSP, a CPU, a GPU, etc.) and a memory 520. The processing unit 510 can be configured to implement instructions that correspond to the method 800 of Fig. 6 and/or other aspects of the implementations described above. It should be understood that the processor 510 in the implementations of this technology may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing method may be carried out by an integrated logic circuit of hardware in the processor 510 or by instructions in the form of software. The processor 510 may be a general-purpose processor, a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 510 may implement or perform the methods, steps, and logic block diagrams disclosed in the implementations of this technology. The general-purpose processor 510 may be a microprocessor, or the processor 510 may alternatively be any conventional processor or the like. The steps of the methods disclosed with reference to the implementations of this technology may be performed directly by a decoding processor implemented as hardware, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a random-access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, or another mature storage medium in this field. The storage medium is located in the memory 520, and the processor 510 reads information from the memory 520 and completes the steps of the foregoing methods in combination with its hardware.
It may be understood that the memory 520 in the implementations of this technology may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM) , a programmable read-only memory (PROM) , an erasable programmable read-only memory (EPROM) , an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random-access memory (RAM) used as an external cache. By way of example and not limitation, many forms of RAM may be used, such as a static random-access memory (SRAM) , a dynamic random-access memory (DRAM) , a synchronous dynamic random-access memory (SDRAM) , a double data rate synchronous dynamic random-access memory (DDR SDRAM) , an enhanced synchronous dynamic random-access memory (ESDRAM) , a synchronous link dynamic random-access memory (SLDRAM) , and a direct Rambus random-access memory (DR RAM) . It should be noted that the memories in the systems and methods described herein are intended to include, but are not limited to, these memories and memories of any other suitable type. In some embodiments, the memory may be a non-transitory computer-readable storage medium that stores instructions capable of execution by a processor.
Fig. 6 is a flowchart of a method in accordance with one or more implementations of the present disclosure. The method 800 can be implemented by a system (such as a system with the CNN framework 100) . The method 800 is for enhancing image quality (particularly for compressed images) . The method 800 includes, at block 601, receiving an input image 101. At block 603, the method 800 continues by decomposing the input image 101 into a set of wavelet subbands 107. In some embodiments, the decomposition of the input image is accomplished using DWT 106. At block 605, the method 800 proceeds to input the wavelet subbands 107 to the CNN framework 100. The CNN framework 100 comprises two subnetworks. A first subnetwork (e.g., the step-like subband network 102 of Figs. 2A and 2B) is configured to restore the set of wavelet subbands 107 from high frequency to low frequency and uses the restored high frequency wavelet subbands 107B-D to restore the low frequency subband 107A. A second subnetwork (e.g., the mixed enhancement network 103 of Fig. 3) is configured to expand the size of the receptive field of the restored set of wavelet subbands using mixed convolution. In some embodiments, the CNN framework 100 may apply IDWT 108 to the restored set of wavelet subbands between the first subnetwork and the second subnetwork to create a reconstructed signal. At block 607, the method 800 receives an enhanced version of the input image 101 from the CNN framework 100.
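A minimal sketch of the flow of method 800 is given below, assuming PyTorch modules for the two subnetworks and PyWavelets for the DWT/IDWT steps. The function name and the subnetwork interfaces (a four-channel subband tensor in and out for the first subnetwork, a single-channel image in and out for the second) are assumptions for this example, not requirements of the disclosure.

```python
import numpy as np
import pywt
import torch

def enhance(image_y, subnet1, subnet2):
    """Sketch of method 800 (blocks 601-607): DWT decomposition, first
    subnetwork over the subbands, IDWT, second subnetwork, enhanced output.
    image_y is a 2-D NumPy array holding the Y plane of input image 101;
    subnet1/subnet2 stand in for the step-like subband network 102 and the
    mixed enhancement network 103."""
    ll, (lh, hl, hh) = pywt.dwt2(image_y, "haar")                   # block 603
    subbands = torch.from_numpy(np.stack([ll, lh, hl, hh])).float().unsqueeze(0)
    restored = subnet1(subbands)                                    # block 605, first subnetwork
    ll_r, lh_r, hl_r, hh_r = restored.squeeze(0).detach().numpy()
    reconstructed = pywt.idwt2((ll_r, (lh_r, hl_r, hh_r)), "haar")  # IDWT 108
    rec = torch.from_numpy(reconstructed).float()[None, None]
    return subnet2(rec)                                             # block 607 output
```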
In some embodiments, the first subnetwork may comprise one or more Res2NetGroups 105, where each Res2NetGroup 105 includes one or more Res2Net modules 201. In further embodiments, each Res2NetGroup 105 may comprise a first 3x3 convolutional layer 203A at a beginning of the Res2NetGroup 105 and a second 3x3 convolutional layer 203B at an ending of the Res2NetGroup 105. In some embodiments, each Res2Net module 201 may comprise one or more 3x3 convolutional layers 203 between a first 1x1 convolutional layer 205A and a second 1x1 convolutional layer 205B. In further embodiments, each Res2Net module 201 may comprise a spatial attention and channel attention module 207 implemented after the second 1x1 convolutional layer 205B.
In some embodiments, the mixed convolution may comprise a combination of dilated convolution and standard convolution. In further embodiments, the second subnetwork may comprise one or more mixed convolution groups between a first and second 3x3 convolutional layer and each mixed convolution group may comprise two or more densely connected mixed convolution blocks. In an example embodiment, each mixed convolution group may comprise four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
Fig. 7 is a schematic block diagram of a video encoder. An input video contains one or more pictures. Partition unit 701 divides a picture in an input video into one or more coding tree units (CTUs) . Partition unit 701 divides the picture into tiles, and optionally may further divide a tile into one or more bricks, wherein a tile or a brick contains one or more integral and/or partial CTUs. Partition unit 701 forms one or more slices, wherein a slice may contain one or more tiles in a raster order of tiles in the picture, or one or more tiles covering a rectangular region in the picture. Partition unit 701 may also form one or more sub-pictures, wherein a sub-picture contains one or more slices, tiles, or bricks.
In the encoding process of the encoder 700, partition unit 701 passes CTUs to prediction unit 702. Generally, prediction unit 702 is composed of block partition unit 703, ME (motion estimation) unit 704, MC (motion compensation) unit 705, and intra prediction unit 706. Block partition unit 703 further divides an input CTU into smaller coding units (CUs) using quadtree split, binary split, and ternary split iteratively. Prediction unit 702 may derive an inter prediction block of a CU using ME unit 704 and MC unit 705. Intra prediction unit 706 may derive an intra prediction block of a CU using various intra prediction modes including angular prediction modes, DC mode, planar mode, matrix-based intra prediction mode, etc. In an example, a rate-distortion optimized motion estimation method can be invoked by ME unit 704 and MC unit 705 to derive the inter prediction block, and a rate-distortion optimized mode decision method can be invoked by intra prediction unit 706 to derive the intra prediction block.
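As a toy illustration of the recursive splitting performed by block partition unit 703, the sketch below applies only quadtree splits down to an assumed minimum size; binary and ternary splits and all rate-distortion decisions are omitted, and the sizes shown are illustrative.

```python
def quadtree_split(width, height, min_size=8):
    """Recursively split a block into four equal sub-blocks until the
    minimum size is reached; returns the list of leaf block sizes."""
    if width <= min_size or height <= min_size:
        return [(width, height)]
    leaves = []
    for _ in range(4):
        leaves.extend(quadtree_split(width // 2, height // 2, min_size))
    return leaves

print(len(quadtree_split(128, 128)))   # a 128x128 CTU yields 256 leaf blocks of 8x8
```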
Output of filtering unit 712 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 713. DPB 713 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 713 may also be employed as reference for performing inter or intra prediction by prediction unit 702. Optionally, the decoded picture can be further processed using the CNN-based filter discussed herein to enhance the quality of the decoded picture.
Note that the video encoder 700 can also be used to encode an image with only block partition unit 703 and intra prediction unit 706 enabled in prediction unit 702.
Fig. 8 is a schematic block diagram of a video decoder. Input bitstream of a decoder 800 can be a bitstream generated by the encoder 700. Parsing unit 801 parses the input bitstream and obtains values of syntax elements from the input bitstream. Parsing unit 801 converts binary representations of syntax elements to numerical values and forwards the numerical values to the units in the decoder 800 to derive one or more decoded pictures. Parsing unit 801 may also parse one or more syntax elements from the input bitstream for displaying the decoded pictures.
Parsing unit 801 forwards the values of syntax elements, as well as one or more variables set or determined according to the values of syntax elements, for deriving one or more decoded pictures to the units in the decoder 800. Prediction unit 802 determines a prediction block of a current decoding block (e.g., a CU) . When it is indicated that an inter coding mode is used to decode the current decoding block, prediction unit 802 passes the related parameters from parsing unit 801 to MC unit 803 to derive an inter prediction block. When it is indicated that an intra prediction mode is used to decode the current decoding block, prediction unit 802 passes the related parameters from parsing unit 801 to intra prediction unit 804 to derive an intra prediction block.
After all the CUs in a picture or a sub-picture have been reconstructed, filtering unit 808 performs in-loop filtering on the reconstructed picture or sub-picture. Filtering unit 808 contains one or more filters, for example, deblocking filter, sample adaptive offset (SAO) filter, adaptive loop filter (ALF) , luma mapping with chroma scaling (LMCS) filter and neural network based filters. Filtering unit 808 also contains the CNN-based filter discussed herein. As an example, the filters in the filtering unit 808 may be connected with each other in a cascading order. Alternatively, when filtering unit 808 determines that the reconstructed block is not used as reference for decoding other blocks, filtering unit 808 performs in-loop filtering on one or more target pixels in the reconstructed block.
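One way to read the cascading order described above is as a simple composition of filter stages, sketched below; the particular stages and their order are illustrative only and are not mandated by the disclosure.

```python
def apply_filtering_unit(reconstructed_picture, filter_stages):
    """Sketch of filtering unit 808: apply the configured filters in a
    cascading order, e.g. [deblocking, sao, alf, cnn_post_filter]."""
    picture = reconstructed_picture
    for stage in filter_stages:
        picture = stage(picture)
    return picture
```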
Output of filtering unit 808 is a decoded picture or sub-picture, which is forwarded to DPB (decoded picture buffer) 809. DPB 809 outputs decoded pictures according to timing and controlling information. Pictures stored in DPB 809 may also be employed as reference for performing inter or intra prediction by prediction unit 802. Optionally, the decoded pictures output from DPB 809 by the decoder 800 can be further processed using the CNN-based filter discussed herein to enhance the quality of the decoded pictures.
Note that the video decoder 800 can also be used to decode a bitstream of an image. In one example implementation, only intra prediction unit 804 is enabled in prediction unit 802 of the decoder 800.
ADDITIONAL CONSIDERATIONS
The above Detailed Description of examples of the disclosed technology is not intended to be exhaustive or to limit the disclosed technology to the precise form disclosed above. While specific examples for the disclosed technology are described above for illustrative purposes, various equivalent modifications are possible within the scope of the described technology, as those skilled in the relevant art will recognize. For example, while processes or blocks are presented in a given order, alternative implementations may perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternative implementations or sub-combinations. Each of these processes or blocks may be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed or implemented in parallel, or may be performed at different times. Further, any specific numbers noted herein are only examples; alternative implementations may employ differing values or ranges.
In the Detailed Description, numerous specific details are set forth to provide a thorough understanding of the presently described technology. In other implementations, the techniques introduced here can be practiced without these specific details. In other instances, well-known features, such as specific functions or routines, are not described in detail in order to avoid unnecessarily obscuring the present disclosure. References in this description to “an implementation/embodiment, ” “one implementation/embodiment, ” or the like mean that a particular feature, structure, material, or characteristic being described is included in at least one implementation of the described technology. Thus, the appearances of such phrases in this specification do not necessarily all refer to the same implementation/embodiment. On the other hand, such references are not necessarily mutually exclusive either. Furthermore, the particular features, structures, materials, or characteristics can be combined in any suitable manner in one or more implementations/embodiments. It is to be understood that the various implementations shown in the figures are merely illustrative representations and are not necessarily drawn to scale.
Several details describing structures or processes that are well-known and often associated with communications systems and subsystems, but that can unnecessarily obscure some significant aspects of the disclosed techniques, are not set forth herein for purposes of clarity. Moreover, although the following disclosure sets forth several implementations of different aspects of the present disclosure, several other implementations can have different configurations or different components than those described in this section. Accordingly, the disclosed techniques can have other implementations with additional elements or without several of the elements described below.
Many implementations or aspects of the technology described herein can take the form of computer-or processor-executable instructions, including routines executed by a programmable computer or processor. Those skilled in the relevant art will appreciate that the described techniques can be practiced on computer or processor systems other than those shown and described below. The techniques described herein can be implemented in a special-purpose computer or data processor that is specifically programmed, configured, or constructed to execute one or more of the computer-executable instructions described below. Accordingly, the terms “computer” and “processor” as generally used herein refer to any data processor. Information handled by these computers and processors can be presented at any suitable display medium. Instructions for executing computer-or processor-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware, or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive and/or other suitable medium.
The term “and/or” in this specification is only an association relationship for describing the associated objects, and indicates that three relationships may exist, for example, A and/or B may indicate the following three cases: A exists separately, both A and B exist, and B exists separately.
These and other changes can be made to the disclosed technology in light of the above Detailed Description. While the Detailed Description describes certain examples of the disclosed technology, as well as the best mode contemplated, the disclosed technology can be practiced in many ways, no matter how detailed the above description appears in text. Details of the system may vary considerably in its specific implementation, while still being encompassed by the technology disclosed herein. As noted above, particular terminology used when describing certain features or aspects of the disclosed technology should not be taken to imply that the terminology is being redefined herein to be restricted to any specific characteristics, features, or aspects of the disclosed technology with which that terminology is associated. Accordingly, the invention is not limited, except as by the appended claims. In general, the terms used in the following claims should not be construed to limit the disclosed technology to the specific examples disclosed in the specification, unless the above Detailed Description section explicitly defines such terms.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the implementations disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
Although certain aspects of the invention are presented below in certain claim forms, the applicant contemplates the various aspects of the invention in any number of claim forms. Accordingly, the applicant reserves the right to pursue such additional claim forms after filing this application, in either this application or in a continuing application.
Claims (20)
- A method for video processing, the method comprising: receiving an input image; decomposing the input image into a set of wavelet subbands using discrete wavelet transform; inputting the set of wavelet subbands to a neural network framework, wherein the neural network framework comprises two subnetworks, wherein: the first subnetwork is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands, and the second subnetwork is configured to expand a size of a receptive field of the restored set of wavelet subbands using a mixed convolution; and receiving, from the neural network framework, an enhanced version of the input image output from the second subnetwork.
- The method of claim 1, wherein the mixed convolution comprises a combination of a dilated convolution and a standard convolution.
- The method of claim 1, wherein the first subnetwork further comprises one or more Res2NetGroups, wherein each Res2NetGroup comprises one or more Res2Net modules.
- The method of claim 3, wherein each Res2NetGroup further comprises a first 3x3 convolutional layer at a beginning of the Res2NetGroup and a second 3x3 convolutional layer at an ending of the Res2NetGroup.
- The method of claim 3, wherein each Res2Net module comprises one or more 3x3 convolutional layers between a first 1x1 convolutional layer and a second 1x1 convolutional layer.
- The method of claim 5, wherein each Res2Net module comprises a spatial attention module and/or a channel attention module after the second 1x1 convolutional layer.
- The method of claim 1, wherein the second subnetwork comprises one or more mixed convolution groups between a first 3x3 convolutional layer and a second 3x3 convolutional layer.
- The method of claim 7, wherein each mixed convolution group comprises one or more densely connected mixed convolution blocks.
- The method of claim 8, wherein each mixed convolution group comprises four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
- The method of claim 1, wherein the neural network framework applies inverse discrete wavelet transform to the restored set of wavelet subbands between the first subnetwork and the second subnetwork.
- A system for video processing, the system comprising: a processor; and a memory configured to store instructions that, when executed by the processor, cause the processor to: receive an input image; decompose the input image into a set of wavelet subbands using discrete wavelet transform; input the set of wavelet subbands to a neural network framework, wherein the neural network framework comprises two subnetworks, wherein: the first subnetwork is configured to restore the set of wavelet subbands from high frequency to low frequency and uses the restored high frequency wavelet subbands to restore the low frequency wavelet subbands, and the second subnetwork is configured to expand a size of a receptive field of the restored set of wavelet subbands using a mixed convolution; and receive, from the neural network framework, an enhanced version of the input image output from the second subnetwork.
- The system of claim 11, wherein the mixed convolution comprises a combination of a dilated convolution and a standard convolution.
- The system of claim 11, wherein the first subnetwork further comprises one or more Res2NetGroups, wherein each Res2NetGroup comprises one or more Res2Net modules.
- The system of claim 13, wherein each Res2NetGroup further comprises a first 3x3 convolutional layer at a beginning of the Res2NetGroup and a second 3x3 convolutional layer at an ending of the Res2NetGroup.
- The system of claim 13, wherein each Res2Net module comprises one or more 3x3 convolutional layers between a first 1x1 convolutional layer and a second 1x1 convolutional layer.
- The system of claim 15, wherein each Res2Net module comprises a spatial attention and channel attention module after the second 1x1 convolutional layer.
- The system of claim 11, wherein the second subnetwork comprises one or more mixed convolution groups between a first 3x3 convolutional layer and a second 3x3 convolutional layer.
- The system of claim 11, wherein each mixed convolution group comprises one or more densely connected mixed convolution blocks.
- The system of claim 18, wherein each mixed convolution group comprises four mixed convolution blocks with dilated coefficients of 1, 2, 4, and 8, respectively.
- A method for video processing, the method comprising: receiving an input image; applying discrete wavelet transform to the input image to form a set of wavelet subbands; inputting the set of wavelet subbands to a first network configured to restore the set of wavelet subbands from high frequency to low frequency; applying inverse discrete wavelet transform to the restored wavelet subbands to create a reconstructed signal; inputting the reconstructed signal to a second network configured to expand a size of a receptive field using a mixed convolution; and receiving, from the second network, an enhanced version of the input image.