
WO2024077772A1 - Method and system for image data processing - Google Patents


Info

Publication number: WO2024077772A1
Authority: WO (WIPO PCT)
Prior art keywords: size adaptation, size, pixel information, information associated, adaptation parameter
Application number: PCT/CN2022/141377
Other languages: French (fr)
Inventors: Marek Domanski, Slawomir Mackowiak, Olgierd Stankiewicz, Slawomir ROZEK, Tomasz Grajek
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2024077772A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N 19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/136: Incoming video signal characteristics or properties
    • H04N 19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object
    • H04N 19/20: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H04N 19/29: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding involving scalability at the object level, e.g. video object layer [VOL]
    • H04N 19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/59: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution

Definitions

  • the present disclosure generally relates to image data processing, and more particularly, to methods and systems for compressing image data based on object size adaptation information.
  • Limited communication bandwidth poses a challenge to transmitting images or videos (hereinafter collectively referred to as “image data” ) at the desired quality.
  • Although various encoding techniques have been developed to compress image data before it is transmitted, the increasing demand for massive amounts of image data has put constant pressure on image/video systems, particularly mobile devices that frequently need to work in networks with limited bandwidth.
  • Embodiments of the present disclosure relate to methods of compressing image data based on object size adaptation information.
  • a computer-implemented encoding method may include: detecting a first object from an input image; determining a first size adaptation parameter associated with the first object; and compressing the input image based on the first size adaptation parameter.
  • a device for encoding image data includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: detecting a first object from an input image; determining a first size adaptation parameter associated with the first object; and compressing the input image based on the first size adaptation parameter.
  • a computer-implemented decoding method may include: receiving one or more bitstreams comprising image data representing a first object in a first size; decoding the one or more bitstreams; determining a first size adaptation parameter associated with the first object; and generating, based on the first size adaptation parameter, a reconstructed image comprising the first object in a second size different from the first size.
  • a device for decoding image data includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: receiving one or more bitstreams comprising image data representing a first object in a first size; decoding the one or more bitstreams; determining a first size adaptation parameter associated with the first object; and generating, based on the first size adaptation parameter, a reconstructed image comprising the first object in a second size different from the first size.
  • aspects of the disclosed embodiments may include non-transitory, tangible computer-readable media that store software instructions that, when executed by one or more processors, are configured for and capable of performing and executing one or more of the methods, operations, and the like consistent with the disclosed embodiments. Also, aspects of the disclosed embodiments may be performed by one or more processors that are configured as special-purpose processor (s) based on software instructions that are programmed with logic and instructions that perform, when executed, one or more operations consistent with the disclosed embodiments.
  • FIG. 1 is a block diagram illustrating an exemplary system for encoding and decoding image data, consistent with embodiments of the present disclosure.
  • FIG. 2 is a block diagram showing an exemplary encoding process, consistent with embodiments of the present disclosure.
  • FIG. 3 is a block diagram showing an exemplary decoding process, consistent with embodiments of the present disclosure.
  • FIG. 4 is a block diagram of an exemplary apparatus for encoding or decoding image data, consistent with embodiments of the present disclosure.
  • FIG. 5 is a block diagram illustrating an exemplary process for using object size adaptation information to control encoding of image data, consistent with embodiments of the present disclosure.
  • FIG. 6 is a flowchart of an exemplary method for encoding image data, consistent with embodiments of the present disclosure.
  • FIG. 7 is a schematic diagram illustrating an exemplary process for compressing image data, consistent with embodiments of the present disclosure.
  • FIG. 8 is a block diagram illustrating an exemplary implementation of the method in FIG. 6, consistent with embodiments of the present disclosure.
  • FIG. 9 is a schematic diagram illustrating an atlas image including a collection of rescaled object images, consistent with embodiments of the present disclosure.
  • FIG. 10 is a flowchart of an exemplary method for decoding image data, consistent with embodiments of the present disclosure.
  • the present disclosure is directed to image data compression methods for removing data redundancy.
  • the redundancy may relate to sizes of objects shown in the source images or videos.
  • storing or transmitting image data representing a large object requires a substantial amount of bits.
  • such large object size may not be necessary for performing certain machine vision tasks, such as object recognition and tracking.
  • the disclosed compression methods reduce the size of (i.e., scale down) an object represented by the image data to such an extent that the performance (e.g., precision) of the machine vision tasks is not deteriorated or is subject to only small deterioration.
  • FIG. 1 is a block diagram illustrating a system 100 for encoding and decoding image data, according to some disclosed embodiments.
  • the image data may include an image (also called a “picture” or “frame” ) , multiple images, or a video.
  • An image is a static picture. Multiple images may be related or unrelated, either spatially or temporally.
  • a video includes a set of images arranged in a temporal sequence.
  • system 100 includes a source device 120 that provides encoded image data to be decoded by a destination device 140.
  • each of source device 120 and destination device 140 may include any of a wide range of devices, including a desktop computer, a notebook (e.g., laptop) computer, a server, a tablet computer, a set-top box, a mobile phone, a vehicle, a camera, an image sensor, a robot, a television, a wearable device (e.g., a smart watch or a wearable camera) , a display device, a digital media player, a video gaming console, a video streaming device, or the like.
  • Source device 120 and destination device 140 may be equipped for wireless or wired communication.
  • source device 120 may include an image/video source 122, an image/video encoder 124, and an output interface 126.
  • Destination device 140 may include an input interface 142, an image/video decoder 144, and one or more image/video applications 146.
  • Image/video source 122 of source device 120 may include an image/video capture device, such as a camera, an image/video archive containing previously captured video, or an image/video feed interface to receive image/video data from a content provider.
  • image/video source 122 may generate computer graphics-based data as the source image/video, or a combination of live image/video, archived image/video, and computer-generated image/video.
  • the captured, pre-captured, or computer-generated image data may be encoded by image/video encoder 124.
  • the encoded image data may then be output by output interface 126 onto a communication medium 160.
  • Image/video encoder 124 encodes the input image data and outputs an encoded bitstream 162 via output interface 126.
  • Encoded bitstream 162 is transmitted through a communication medium 160, and received by input interface 142.
  • Image/video decoder 144 then decodes encoded bitstream 162 to generate decoded data, which can be utilized by image/video applications 146.
  • Image/video encoder 124 and image/video decoder 144 each may be implemented as any of a variety of suitable encoder or decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs) , application specific integrated circuits (ASICs) , field programmable gate arrays (FPGAs) , discrete logic, software, hardware, firmware, or any combinations thereof.
  • image/video encoder 124 or image/video decoder 144 may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques consistent with this disclosure.
  • Each of image/video encoder 124 or image/video decoder 144 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.
  • Image/video encoder 124 and image/video decoder 144 may operate according to any video coding standard, such as Advanced Video Coding (AVC) , High Efficiency Video Coding (HEVC) , Versatile Video Coding (VVC) , AOMedia Video 1 (AV1) , Joint Photographic Experts Group (JPEG) , Moving Picture Experts Group (MPEG) , etc.
  • image/video encoder 124 and image/video decoder 144 may be customized devices that do not comply with the existing standards.
  • image/video encoder 124 and image/video decoder 144 may each be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of both audio and video in a common data stream or separate data streams.
  • Output interface 126 may include any type of medium or device capable of transmitting encoded bitstream 162 from source device 120 to destination device 140.
  • output interface 126 may include a transmitter or a transceiver configured to transmit encoded bitstream 162 from source device 120 directly to destination device 140 in real-time.
  • Encoded bitstream 162 may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 140.
  • Communication medium 160 may include transient media, such as a wireless broadcast or wired network transmission.
  • communication medium 160 may include a radio frequency (RF) spectrum or one or more physical transmission lines (e.g., a cable) .
  • Communication medium 160 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet.
  • communication medium 160 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 120 to destination device 140.
  • a network server (not shown) may receive encoded bitstream 162 from source device 120 and provide encoded bitstream 162 to destination device 140, e.g., via network transmission.
  • Communication medium 160 may also be in the form of a storage media (e.g., non-transitory storage media) , such as a hard disk, flash drive, compact disc, digital video disc, Blu-ray disc, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded image data.
  • a computing device of a medium production facility, such as a disc stamping facility, may receive encoded image data from source device 120 and produce a disc containing the encoded image data.
  • Input interface 142 may include any type of medium or device capable of receiving information from communication medium 160.
  • the received information includes encoded bitstream 162.
  • input interface 142 may include a receiver or a transceiver configured to receive encoded bitstream 162 in real-time.
  • Image/video applications 146 include various hardware and/or software for utilizing the decoded image data generated by image/video decoder 144.
  • image/video applications 146 may include a display device that displays the decoded image data to a user and may include any of a variety of display devices such as a cathode ray tube (CRT) , a liquid crystal display (LCD) , a plasma display, an organic light emitting diode (OLED) display, or another type of display device.
  • image/video applications 146 may include one or more processors configured to use the decoded image data to perform various machine-vision applications, such as object recognition and tracking, face recognition, images matching, image/video search, augmented reality, robot vision and navigation, autonomous driving, 3-dimension structure construction, stereo correspondence, motion tracking, etc.
  • image/video encoder 124 and image/video decoder 144 may or may not operate according to a video coding standard.
  • an exemplary encoding process performable by image/video encoder 124 and an exemplary decoding process performable by image/video decoder 144 are described below in connection with FIG. 2 and FIG. 3, respectively.
  • FIG. 2 is a block diagram showing an exemplary encoding process 200, according to some embodiments of the present disclosure.
  • Encoding process 200 can be performed by an encoder, such as image/video encoder 124 (FIG. 1) .
  • encoding process 200 includes a picture partitioning stage 210, an inter prediction stage 220, an intra prediction stage 225, a transform stage 230, a quantization stage 235, a rearrangement stage 260, an entropy coding stage 265, an inverse quantization stage 240, an inverse transform stage 245, a filter stage 250, and a buffer stage 255.
  • Picture partitioning stage 210 partitions an input picture 205 into at least one processing unit.
  • the processing unit may be a prediction unit (PU) , a transform unit (TU) , or a coding unit (CU) .
  • Picture partitioning stage 210 partitions a picture into a combination of a plurality of coding units, prediction units, and transform units.
  • a picture may be encoded by selecting a combination of a coding unit, a prediction unit, and a transform unit based on a predetermined criterion (e.g., a cost function) .
  • one picture may be partitioned into a plurality of coding units.
  • a recursive tree structure such as a quad tree structure may be used.
  • a picture serves as a root node and is recursively partitioned into smaller regions, i.e., child nodes.
  • a coding unit that is not partitioned any more according to a predetermined restriction becomes a leaf node. For example, when it is assumed that only square partitioning is possible for a coding unit, the coding unit is partitioned into up to four different coding units.
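  • To make the recursive partitioning above concrete, the following is a minimal Python sketch that splits a square picture region into four children until a cost test or a minimum size stops the recursion; the too_costly predicate and the 8-pixel minimum are hypothetical stand-ins for the predetermined criterion and restriction mentioned above.

```python
from typing import List, Tuple

Region = Tuple[int, int, int]  # (x, y, size) of a square coding unit

def quadtree_partition(x: int, y: int, size: int,
                       too_costly, min_size: int = 8) -> List[Region]:
    """Recursively split a square region into 4 children while the cost test asks for it."""
    if size <= min_size or not too_costly(x, y, size):
        return [(x, y, size)]          # leaf node: this coding unit is not partitioned further
    half = size // 2
    leaves: List[Region] = []
    for dx in (0, half):
        for dy in (0, half):
            leaves += quadtree_partition(x + dx, y + dy, half, too_costly, min_size)
    return leaves

# Example: split regions larger than 32x32 that sit in the top-left quadrant of a 64x64 picture.
cus = quadtree_partition(0, 0, 64, lambda x, y, s: s > 32 and x < 32 and y < 32)
print(cus)
```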
  • a processing unit for performing prediction may be different from a processing unit for determining a prediction method and specific content.
  • a prediction method and a prediction mode may be determined in a prediction unit, and prediction may be performed in a transform unit.
  • a residual coefficient (residual block) between the reconstructed prediction block and the original block may be input into transform stage 230.
  • prediction mode information, motion vector information and the like used for prediction may be encoded by the entropy coding stage 265 together with the residual coefficient and transferred to a decoder.
  • an original block may be encoded and transmitted to a decoder without generating a prediction block through inter prediction stage 220 and intra prediction stage 225.
  • Inter prediction stage 220 predicts a prediction unit based on information in at least one picture before or after the current picture. In some embodiments, inter prediction stage 220 predicts a prediction unit based on information on a partial area that has been encoded in the current picture. Inter prediction stage 220 further includes a reference picture interpolation stage, a motion prediction stage, and a motion compensation stage (not shown) .
  • the reference picture interpolation stage receives reference picture information from buffer stage 255 and generates pixel information at fractional (sub-integer) pixel positions from the reference picture.
  • a DCT-based 8-tap interpolation filter with varying filter coefficients is used to generate fractional pixel information in units of 1/4 pixel.
  • a DCT-based 4-tap interpolation filter with varying filter coefficients is used to generate fractional pixel information in units of 1/8 pixel.
  • the motion prediction stage performs motion prediction based on the reference picture interpolated by the reference picture interpolation stage.
  • Various methods such as a full search-based block matching algorithm (FBMA) , a three-step search (TSS) , and a new three-step search algorithm (NTS) may be used as a method of calculating a motion vector.
  • the motion vector may have a motion vector value of a unit of 1/2 or 1/4 pixels based on interpolated pixels.
  • the motion prediction stage predicts a current prediction unit based on a motion prediction mode.
  • Various methods such as a skip mode, a merge mode, an advanced motion vector prediction (AMVP) mode, an intra-block copy mode and the like may be used as the motion prediction mode.
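  • As an illustration of block-matching motion estimation in the spirit of the full search (FBMA) mentioned above, the sketch below finds an integer-pel motion vector by minimizing the sum of absolute differences within a search window; the window size is arbitrary and fractional-pel refinement on interpolated pixels is omitted.

```python
import numpy as np

def full_search_mv(cur_block: np.ndarray, ref: np.ndarray,
                   top_left: tuple, search_range: int = 4) -> tuple:
    """Return the integer motion vector (dy, dx) minimizing SAD within +/- search_range."""
    h, w = cur_block.shape
    y0, x0 = top_left
    best = (0, 0)
    best_sad = np.inf
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue                      # candidate falls outside the reference picture
            sad = np.abs(cur_block.astype(np.int32)
                         - ref[y:y + h, x:x + w].astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (32, 32), dtype=np.uint8)
cur = ref[10:18, 12:20].copy()                       # the current block is a shifted copy
print(full_search_mv(cur, ref, top_left=(8, 10)))    # expect (2, 2)
```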
  • Intra prediction stage 225 generates a prediction unit based on the information on reference pixels in the neighborhood of the current block, which is pixel information in the current picture.
  • the reference pixel included in the block on which inter prediction has been performed may be used in place of reference pixel information of a block in the neighborhood on which intra prediction has been performed. That is, when a reference pixel is unavailable, at least one reference pixel among available reference pixels may be used in place of unavailable reference pixel information.
  • the prediction mode may have an angular prediction mode that uses reference pixel information according to a prediction direction, and a non-angular prediction mode that does not use directional information when performing prediction.
  • a mode for predicting luminance information may be different from a mode for predicting color difference information, and intra prediction mode information used to predict luminance information or predicted luminance signal information may be used to predict the color difference information.
  • the intra prediction may be performed for the prediction unit based on a pixel on the left side, a pixel on the top-left side, and a pixel on the top of the prediction unit. If the size of the prediction unit is different from the size of the transform unit when the intra prediction is performed, the intra prediction may be performed using a reference pixel based on the transform unit.
  • the intra prediction method generates a prediction block after applying an Adaptive Intra Smoothing (AIS) filter to the reference pixel according to a prediction mode.
  • the type of the AIS filter applied to the reference pixel may vary.
  • the intra prediction mode of the current prediction unit may be predicted from the intra prediction mode of the prediction unit existing in the neighborhood of the current prediction unit.
  • When the intra prediction mode of the current prediction unit is predicted using the mode information of the neighboring prediction unit, the following method may be performed. If the intra prediction mode of the current prediction unit is the same as that of the neighboring prediction unit, information indicating that the two prediction modes are the same may be transmitted using predetermined flag information. If the prediction modes of the current prediction unit and the neighboring prediction unit are different from each other, prediction mode information of the current block may be encoded by performing entropy coding.
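  • The flag-based mode signaling described above can be sketched as follows; the single-neighbor rule and the 5-bit fallback code are illustrative simplifications rather than the syntax of any particular standard.

```python
def encode_intra_mode(current_mode: int, neighbor_mode: int) -> str:
    """Signal a flag when the current mode equals the neighbor's mode; otherwise code the mode."""
    if current_mode == neighbor_mode:
        return "1"                                  # flag: same as neighboring prediction unit
    return "0" + format(current_mode, "05b")        # flag + fixed-length mode code (entropy-coded in practice)

def decode_intra_mode(bits: str, neighbor_mode: int) -> int:
    if bits[0] == "1":
        return neighbor_mode
    return int(bits[1:6], 2)

assert decode_intra_mode(encode_intra_mode(17, 17), 17) == 17
assert decode_intra_mode(encode_intra_mode(5, 17), 17) == 5
```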
  • a residual block, which is the difference between the prediction unit and the original block, is generated.
  • the generated residual block is input into transform stage 230.
  • Transform stage 230 transforms the residual block using a transform method such as Discrete Cosine Transform (DCT) or Discrete Sine Transform (DST) .
  • the DCT transform core includes at least one among DCT2 and DCT8, and the DST transform core includes DST7.
  • Whether or not to apply DCT or DST to transform the residual block may be determined based on intra prediction mode information of a prediction unit used to generate the residual block.
  • the transform on the residual block may be skipped.
  • a flag indicating whether or not to skip the transform on the residual block may be encoded.
  • the transform skip may be allowed for a residual block having a size smaller than or equal to a threshold, a luma component, or a chroma component under the 4:4:4 format.
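  • A minimal NumPy sketch of the forward and inverse 2-D DCT-II applied to a residual block is shown below; actual codecs use integer approximations of DCT2/DCT8/DST7 with normative scaling, which this floating-point example does not reproduce.

```python
import numpy as np

def dct2_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] *= 1.0 / np.sqrt(2.0)
    return m

def transform_residual(residual: np.ndarray) -> np.ndarray:
    """Separable 2-D DCT-II applied to the rows and columns of the residual block."""
    d = dct2_matrix(residual.shape[0])
    return d @ residual @ d.T

rng = np.random.default_rng(1)
res = rng.integers(-64, 64, (8, 8)).astype(float)    # residual block (prediction error)
coeff = transform_residual(res)
d = dct2_matrix(8)
recon = d.T @ coeff @ d                              # inverse transform (orthonormal basis)
assert np.allclose(recon, res)
```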
  • Quantization stage 235 quantizes values transformed into the frequency domain by transform stage 230. Quantization coefficients may vary according to the block or the importance of a video. A value calculated by the quantization stage 235 is provided to inverse quantization stage 240 and the rearrangement stage 260.
  • Rearrangement stage 260 rearranges values of the quantized residual coefficients.
  • Rearrangement stage 260 changes coefficients of a two-dimensional block shape into a one-dimensional vector shape through a coefficient scanning method.
  • rearrangement stage 260 may scan DC coefficients up to high-frequency domain coefficients using a zig-zag scan method, and change the coefficients into a one-dimensional vector shape.
  • a vertical scan of scanning the coefficients of a two-dimensional block shape in the column direction and a horizontal scan of scanning the coefficients of a two-dimensional block shape in the row direction may be used instead of the zig-zag scan.
  • a scan method that will be used may be determined among the zig-zag scan, the vertical direction scan, and the horizontal direction scan.
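  • The zig-zag scan can be illustrated with a few lines of NumPy; the anti-diagonal ordering below is one common zig-zag variant and is shown only for illustration.

```python
import numpy as np

def zigzag_scan(block: np.ndarray) -> np.ndarray:
    """Scan a square block along anti-diagonals, alternating direction (zig-zag)."""
    n = block.shape[0]
    order = []
    for s in range(2 * n - 1):                      # s = i + j indexes each anti-diagonal
        idx = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            idx.reverse()                           # walk every other diagonal in the opposite direction
        order.extend(idx)
    return np.array([block[i, j] for i, j in order])

block = np.arange(16).reshape(4, 4)
print(zigzag_scan(block))
# [ 0  1  4  8  5  2  3  6  9 12 13 10  7 11 14 15]
```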
  • Entropy coding stage 265 performs entropy coding based on values determined by rearrangement stage 260.
  • Entropy coding may use various encoding methods such as Exponential Golomb, Context-Adaptive Variable Length Coding (CAVLC) , Context-Adaptive Binary Arithmetic Coding (CABAC) , and the like.
  • Entropy coding stage 265 encodes various information such as residual coefficient information and block type information of a coding unit, prediction mode information, partitioning unit information, prediction unit information and transmission unit information, motion vector information, reference frame information, block interpolation information, and filtering information input from rearrangement stage 260, inter prediction stage 220, and intra prediction stage 225. Entropy coding stage 265 may also entropy-encode the coefficient value of a coding unit input from rearrangement stage 260. The output of entropy coding stage 265 forms an encoded bitstream 270, which may be transmitted to a decoding device.
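  • The Exponential Golomb code mentioned above is simple enough to sketch directly; this example covers order-0 unsigned Exp-Golomb only and leaves out CAVLC and CABAC.

```python
def exp_golomb_encode(value: int) -> str:
    """Order-0 unsigned Exp-Golomb: leading zeros, then the binary code of value + 1."""
    code = bin(value + 1)[2:]
    return "0" * (len(code) - 1) + code

def exp_golomb_decode(bits: str) -> int:
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:2 * zeros + 1], 2) - 1

for v in range(10):
    assert exp_golomb_decode(exp_golomb_encode(v)) == v
print([exp_golomb_encode(v) for v in range(5)])   # ['1', '010', '011', '00100', '00101']
```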
  • Inverse quantization stage 240 and inverse transform stage 245 inverse-quantize the values quantized by quantization stage 235 and inverse-transform the values transformed by transform stage 230.
  • the residual coefficient generated by inverse quantization stage 240 and inverse transform stage 245 may be combined with the prediction unit predicted through motion estimation, motion compensation, and/or intra prediction to generate a reconstructed block.
  • Filter stage 250 includes at least one among a deblocking filter, an offset correction unit, and an adaptive loop filter (ALF) .
  • the deblocking filter removes block distortion generated by the boundary between blocks in the reconstructed picture.
  • in order to determine whether to apply the deblocking filter, information on the pixels included in several columns or rows of the block may be used.
  • a strong filter or a weak filter may be applied according to the deblocking filtering strength needed when the deblocking filter is applied to a block.
  • when vertical direction filtering and horizontal direction filtering are performed in applying the deblocking filter, the horizontal direction filtering and the vertical direction filtering may be processed in parallel.
  • the offset correction unit corrects an offset to the original video by the unit of pixel for a video on which the deblocking has been performed. Offset correction for a specific picture may be performed by dividing pixels included in the video into a certain number of areas, determining an area to perform offset, and applying the offset to the area. Alternatively, the offset correction may be performed by applying an offset considering edge information of each pixel.
  • Adaptive Loop Filtering is performed based on a value obtained by comparing the reconstructed and filtered video with the original video. After dividing the pixels included in the video into predetermined groups, one filter to be applied to a corresponding group is determined, and filtering may be performed differently for each group.
  • information related to whether or not to apply ALF may be transmitted for each coding unit (CU) of a luminance signal, and the shape and filter coefficients of the ALF filter to be applied may vary according to each block. In some implementations, an ALF filter of the same type (fixed type) is applied regardless of the characteristics of the block to which it is applied.
  • Buffer stage 255 temporarily stores the reconstructed blocks or pictures calculated through filter stage 250, and provides them to inter prediction stage 220 when inter prediction is performed.
  • FIG. 3 is a block diagram showing an exemplary decoding process 300, according to some embodiments of the present disclosure.
  • Decoding process 300 can be performed by a decoder, such as image/video decoder 144 (FIG. 1) .
  • decoding process 300 includes an entropy decoding stage 310, a rearrangement stage 315, an inverse quantization stage 320, an inverse transform stage 325, an inter prediction stage 330, an intra prediction stage 335, a filter stage 340, and a buffer stage 345.
  • an input bitstream 302 is decoded in a procedure opposite to that of an encoding process, e.g., encoding process 200 (FIG. 2) .
  • entropy decoding stage 310 may perform entropy decoding in a procedure opposite to that performed in an entropy coding stage.
  • entropy decoding stage 310 may use various methods corresponding to those performed in the entropy coding stage, such as Exponential Golomb, Context-Adaptive Variable Length Coding (CAVLC) , and Context-Adaptive Binary Arithmetic Coding (CABAC) .
  • Entropy decoding stage 310 decodes information related to intra prediction and inter prediction.
  • Rearrangement stage 315 performs rearrangement on the output of entropy decoding stage 310 based on the rearrangement method used in the corresponding encoding process.
  • the coefficients expressed in a one-dimensional vector shape are reconstructed and rearranged as coefficients of two-dimensional block shape.
  • Rearrangement stage 315 receives information related to coefficient scanning performed in the corresponding encoding process and performs reconstruction through a method of inverse-scanning based on the scanning order performed in the corresponding encoding process.
  • Inverse quantization stage 320 performs inverse quantization based on a quantization parameter provided by the encoder and a coefficient value of the rearranged block.
  • Inverse transform stage 325 performs inverse DCT or inverse DST.
  • the DCT transform core may include at least one among DCT2 and DCT8, and the DST transform core may include DST7.
  • inverse transform stage 325 may be skipped.
  • the inverse transform may be performed based on a transmission unit determined in the corresponding encoding process.
  • Inverse transform stage 325 may selectively perform a transform technique (e.g., DCT or DST) according to a plurality of pieces of information such as a prediction method, a size of a current block, a prediction direction and the like.
  • Inter prediction stage 330 and/or intra prediction stage 335 generate a prediction block based on information related to generation of a prediction block (provided by entropy decoding stage 310) and information of a previously decoded block or picture (provided by buffer stage 345) .
  • the information related to generation of a prediction block may include intra prediction mode, motion vector, reference picture, etc.
  • a prediction type selection process may be used to determine whether to use one or both of inter prediction stage 330 and intra prediction stage 335.
  • the prediction type selection process receives various information such as prediction unit information input from the entropy decoding stage 310, prediction mode information of the intra prediction method, information related to motion prediction of an inter prediction method, and the like.
  • the prediction type selection process identifies the prediction unit from the current coding unit and determines whether the prediction unit performs inter prediction or intra prediction.
  • Inter prediction stage 330 performs inter prediction on the current prediction unit based on information included in at least one picture before or after the current picture including the current prediction unit by using information necessary for inter prediction of the current prediction unit.
  • the inter prediction stage 330 may perform inter prediction based on information on a partial area previously reconstructed in the current picture including the current prediction unit. In order to perform inter prediction, it is determined, based on the coding unit, whether the motion prediction method of the prediction unit included in a corresponding coding unit is a skip mode, a merge mode, a motion vector prediction mode (AMVP mode) , or an intra-block copy mode.
  • Intra prediction stage 335 generates a prediction block based on the information on the pixel in the current picture.
  • the intra prediction may be performed based on intra prediction mode information of the prediction unit.
  • Intra prediction stage 335 may include an Adaptive Intra Smoothing (AIS) filter, a reference pixel interpolation stage, and a DC filter.
  • the AIS filter performs filtering on the reference pixel of the current block, and may determine whether or not to apply the filter according to the prediction mode of the current prediction unit.
  • AIS filtering may be performed on the reference pixel of the current block by using the prediction mode and AIS filter information of the prediction unit used in the corresponding encoding process.
  • when the prediction mode of the current block is a mode that does not perform AIS filtering, the AIS filter may not be applied.
  • when the prediction mode of the prediction unit is intra prediction based on a pixel value obtained by interpolating the reference pixel, the reference pixel interpolation stage generates a reference pixel at a fractional (sub-integer) pixel position by interpolating the reference pixel.
  • when the prediction mode of the current prediction unit is a prediction mode that generates a prediction block without interpolating the reference pixel, the reference pixel is not interpolated.
  • the DC filter generates a prediction block through filtering when the prediction mode of the current block is the DC mode.
  • Filter stage 340 may include a deblocking filter, an offset correction unit, and an ALF.
  • Information on whether a deblocking filter is applied to a corresponding block or picture and information on whether a strong filter or a weak filter is applied when a deblocking filter is applied may be provided by the corresponding encoding process.
  • the deblocking filter at filter stage 340 may be provided with information related to the deblocking filter used in the corresponding encoding process.
  • the offset correction unit at filter stage 340 may perform offset correction on the reconstructed video based on the offset correction type and offset value information, which may be provided by the encoder.
  • the ALF at filter stage 340 is applied to a coding unit based on information on whether or not to apply the ALF and information on ALF coefficients, which are provided by the encoder.
  • the ALF information may be provided in a specific parameter set.
  • Buffer stage 345 stores the reconstructed picture or block as a reference picture or a reference block, and outputs it to a downstream application, e.g., image/video applications 146 in FIG. 1.
  • encoding process 200 (FIG. 2) and decoding process 300 (FIG. 3) are provided as examples to illustrate the possible encoding and decoding processes consistent with the present disclosure, and are not meant to limit it.
  • some embodiments encode and decode image data using wavelet transform, rather than directly coding the pixels in blocks.
  • FIG. 4 shows a block diagram of an exemplary apparatus 400 for encoding or decoding image data, according to some embodiments of the present disclosure.
  • apparatus 400 can include processor 402. When processor 402 executes instructions described herein, apparatus 400 can become a specialized machine for processing the image data.
  • Processor 402 can be any type of circuitry capable of manipulating or processing information.
  • processor 402 can include any combination of any number of a central processing unit (or “CPU” ) , a graphics processing unit (or “GPU” ) , a neural processing unit ( “NPU” ) , a microcontroller unit ( “MCU” ) , an optical processor, a programmable logic controller, a microcontroller, a microprocessor, a digital signal processor, an intellectual property (IP) core, a Programmable Logic Array (PLA) , a Programmable Array Logic (PAL) , a Generic Array Logic (GAL) , a Complex Programmable Logic Device (CPLD) , a Field-Programmable Gate Array (FPGA) , a System On Chip (SoC) , an Application-Specific Integrated Circuit (ASIC) , or the like.
  • processor 402 can also be a set of processors grouped as a single logical component.
  • processor 402 can include multiple processors, including processor 402a, processor 402b, ..., and processor 402n.
  • Apparatus 400 can also include memory 404 configured to store data (e.g., a set of instructions, computer codes, intermediate data, or the like) .
  • the stored data can include program instructions (e.g., program instructions for implementing the processing of image data) and image data to be processed.
  • Processor 402 can access the program instructions and data for processing (e.g., via bus 410) , and execute the program instructions to perform an operation or manipulation on the data.
  • Memory 404 can include a high-speed random-access storage device or a non-volatile storage device.
  • memory 404 can include any combination of any number of a random-access memory (RAM) , a read-only memory (ROM) , an optical disc, a magnetic disk, a hard drive, a solid-state drive, a flash drive, a secure digital (SD) card, a memory stick, a compact flash (CF) card, or the like.
  • Memory 404 can also be a group of memories (not shown in FIG. 4) grouped as a single logical component.
  • Bus 410 can be a communication device that transfers data between components inside apparatus 400, such as an internal bus (e.g., a CPU-memory bus) , an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port) , or the like.
  • processor 402 and other data processing circuits are collectively referred to as a “data processing circuit” in this disclosure.
  • the data processing circuit can be implemented entirely as hardware, or as a combination of software, hardware, or firmware.
  • the data processing circuit can be a single independent module or can be combined entirely or partially into any other component of apparatus 400.
  • Apparatus 400 can further include network interface 406 to provide wired or wireless communication with a network (e.g., the Internet, an intranet, a local area network, a mobile communications network, or the like) .
  • network interface 406 can include any combination of any number of a network interface controller (NIC) , a radio frequency (RF) module, a transponder, a transceiver, a modem, a router, a gateway, a wired network adapter, a wireless network adapter, a Bluetooth adapter, an infrared adapter, a near-field communication ( “NFC” ) adapter, a cellular network chip, or the like.
  • apparatus 400 can further include peripheral interface 408 to provide a connection to one or more peripheral devices.
  • the peripheral device can include, but is not limited to, a cursor control device (e.g., a mouse, a touchpad, or a touchscreen) , a keyboard, a display (e.g., a cathode-ray tube display, a liquid crystal display, or a light-emitting diode display) , an image/video input device (e.g., a camera or an input interface coupled to an image/video archive) , or the like.
  • the encoding and decoding processes can be optimized to further remove the redundancy in the image data.
  • redundancy may be present in information not required by the machine vision tasks downstream of the decoder, e.g., image/video applications 146 (FIG. 1) .
  • Such redundancy can be skipped from encoding and transmission, without deteriorating the performance of the machine vision tasks. This is useful in, for example, low-bitrate situations where the amount of image data that can be transmitted from an encoder to a decoder per time unit is limited.
  • the redundancy relates to the sizes of the input images, including the sizes of the objects shown in the input images.
  • “Object” as used herein refers to a foreground depicted in an image and is also called an “object representation” or “object image, ” to distinguish it from a physical object existing in the real world.
  • “Object size” as used herein refers to the image dimensions (i.e., pixel dimensions) of a foreground as shown in an image, rather than the actual size of a physical object.
  • the performance of a machine vision task (e.g., object recognition and tracking) may depend on the sizes of the objects represented in the image data.
  • a convolutional neural network can be trained to detect, from images, objects of a particular type (e.g., face) in a given size, e.g., 64 × 128 pixels.
  • a convolutional neural network can be trained to achieve the highest precision in detecting objects in the same or similar sizes. Larger object sizes do not necessarily improve the precision of object detection, but rather may cause it to degrade. Accordingly, the extra bits required for representing object sizes larger than 64 × 128 pixels are not necessary for the machine vision task and can be removed from encoding.
  • FIG. 5 is a block diagram illustrating a process 500 for using object size adaptation information to control encoding of image data, according to some embodiments consistent with the present disclosure.
  • Process 500 may be performed by an encoder, such as image/video encoder 124 (FIG. 1) . As shown in FIG. 5, process 500 starts with the encoder receiving input image data 510 which includes one or more image frames. Prior to encoding, the encoder performs analysis 520 on input image data 510 to generate control information 530 for an encoding process 540.
  • Analysis 520 may include, for example, detecting an object from an input frame and determining an optimal size for the object with respect to a given machine vision task (not shown) performed downstream of the corresponding decoder (not shown) .
  • the machine vision task may analyze image data associated with the object and perform certain operations related to the object, such as detecting, tracking, reshaping, etc.
  • the optimal size for the object is the size that allows the machine vision task to analyze and operate on the object with the maximum precision.
  • the encoder Based on the result of analysis 520, the encoder generates control information 530 to adapt the size of the object for encoding process 540. Specifically, the size of the object may be adapted to reduce the number of bits for representing the object to the extent that the performance of the machine vision task does not deteriorate.
  • control information 530 may instruct encoding process 540 to down-sample (i.e., down scale) the pixel information associated with the object.
  • control information 530 may instruct encoding process 540 to encode the pixel information associated with the object in its initial size; or alternatively control information 530 may instruct encoding process 540 to up-sample (i.e., up scale) the pixel information associated with the object.
  • Encoding process 540 may be any suitable encoding process consistent with the present disclosure.
  • encoding process 540 may be implemented as encoding process 200 (FIG. 2) , and output a bitstream 550 including encoded pixel information associated with the rescaled object.
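  • A minimal NumPy sketch of the control flow of process 500 follows, assuming the control information reduces to a per-object scale factor applied to the object's bounding box before a conventional encoder is invoked; nearest-neighbor resampling and the hard-coded regions are illustrative only.

```python
import numpy as np

def resample_nn(region: np.ndarray, ratio: float) -> np.ndarray:
    """Nearest-neighbor resampling of a 2-D pixel region by a scale ratio."""
    h, w = region.shape
    rows = (np.arange(max(1, round(h * ratio))) / ratio).astype(int).clip(0, h - 1)
    cols = (np.arange(max(1, round(w * ratio))) / ratio).astype(int).clip(0, w - 1)
    return region[np.ix_(rows, cols)]

def apply_control_info(frame: np.ndarray, control_info):
    """control_info: list of (y, x, h, w, ratio) per object; returns rescaled patches for encoding."""
    patches = []
    for (y, x, h, w, ratio) in control_info:
        obj = frame[y:y + h, x:x + w]
        patches.append(resample_nn(obj, ratio) if ratio != 1.0 else obj)
    return patches

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, (64, 64), dtype=np.uint8)
control_info = [(0, 0, 32, 32, 0.5),    # analysis decided to down-sample this object by 2x
                (32, 32, 16, 16, 1.0)]  # this object is encoded at its initial size
print([p.shape for p in apply_control_info(frame, control_info)])   # [(16, 16), (16, 16)]
```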
  • the up-sampling may be used in situations where an object’s size becomes too small after image compression and cannot be effectively recognized by the downstream machine vision application. To solve this problem, the object can be enlarged before the image compression.
  • both down-sampling and up-sampling may be used to pre-process the same input image before it is compressed.
  • some objects in the input image may be down-sampled, while other objects in the input image may be up-sampled.
  • the down-sampling and/or up-sampling can be transparent to (i.e., performed without the knowledge of) the codec.
  • the control of standard codecs like AVC, HEVC, and VVC may be unaware of logical content like the objects and background, because there is no need to transmit additional information to control the absolute values of the quantization step.
  • FIG. 6 is a flowchart of an exemplary method 600 for encoding image data, according to some embodiments consistent with the present disclosure.
  • Method 600 may be used to compress image data based on object size adaptation information.
  • method 600 may be performed by an image data encoder, such as image/video encoder 124 in FIG. 1.
  • method 600 includes the following steps 602-606.
  • the encoder detects one or more objects from an input image.
  • FIG. 7 schematically shows an input image 710, which shows two objects 712 and 714.
  • the encoder may detect each of objects 712 and 714 from input image 710, and determine the rest of input image 710 to be a background 716.
  • the detection of object (s) from an input image includes three stages: image segmentation (i.e., object isolation or recognition) , feature extraction, and object classification.
  • in the image segmentation stage, the encoder may execute an image segmentation algorithm to partition the input image into multiple segments, e.g., sets of pixels, each of which represents a portion of the input image.
  • the image segmentation algorithm may assign a label (i.e., category) to each set of pixels, such that pixels with the same label share certain common characteristics.
  • the encoder may then group the pixels according to their labels and designate each group as a distinct object.
  • the encoder may extract, from the input image, features which are characteristic of the object.
  • the features may be based on the size of the object or on its shape, such as the area of the object, the length and width of the object, the distance around the perimeter of the object, the rectangularity of the object shape, or the circularity of the object shape.
  • the features associated with each object can be combined to form a feature vector.
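  • A small NumPy sketch of forming such a feature vector from a binary object mask is given below; the boundary-pixel perimeter estimate and the particular rectangularity and circularity formulas are common choices used here for illustration, not prescribed by the disclosure.

```python
import numpy as np

def feature_vector(mask: np.ndarray) -> np.ndarray:
    """Shape/size features of a binary mask: area, height, width, perimeter,
    rectangularity (area / bounding-box area), circularity (4*pi*area / perimeter^2)."""
    ys, xs = np.nonzero(mask)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    area = float(mask.sum())
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])      # pixels whose 4 neighbors are all foreground
    perimeter = float((mask & ~interior.astype(bool)).sum())  # boundary pixel count
    rectangularity = area / (height * width)
    circularity = 4 * np.pi * area / (perimeter ** 2)
    return np.array([area, height, width, perimeter, rectangularity, circularity])

mask = np.zeros((20, 20), dtype=np.uint8)
mask[5:15, 4:16] = 1                      # a 10 x 12 rectangular object
print(feature_vector(mask))
```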
  • the encoder may classify each recognized object on the basis of the feature vector associated with the respective object.
  • the feature values can be viewed as coordinates of a point in an N-dimensional space (one feature value implies a one-dimensional space, two features imply a two-dimensional space, and so on).
  • the classification is a process for determining the sub-space of the feature space to which the feature vector belongs. Since each sub-space corresponds to a distinct object, the classification accomplishes the task of object identification.
  • the encoder may classify a recognized object into one of multiple predetermined classes, such as human, face, animal, automobile or vegetation.
  • the encoder may perform the classification based on a model.
  • the model may be trained using pre-labeled training data sets that are associated with known objects.
  • the encoder may perform the classification using a machine learning technology, such as convolutional neural network (CNN) , deep neural network (DNN) , etc.
  • the encoder determines a size adaptation parameter for each of the one or more objects.
  • object sizes may impact machine vision tasks’ precision.
  • for example, if a neural network is trained for recognizing human faces with a dimension of 224 × 224 pixels, performing inference on objects in the trained size (i.e., 224 × 224 pixels) yields the highest precision.
  • each object class has a corresponding target object size that can maximize the machine vision tasks’ precision in analyzing objects belonging to the particular class.
  • Tables 1-3 show the possible bitrate gains achieved by down scaling various object classes. The bitrate gain in Table 1 is measured by the quality of detection with respect to the score parameter from a neural network used for the machine vision task.
  • the bitrate gain in Table 2 is measured by the quality of detection with respect to the best attained Intersection over Union (IoU) .
  • the values ranging from 0 to 1 indicate the ratio of a down-scaled object size versus its original size. As shown in Tables 1-3, each object class has certain size(s) that can maximize the bitrate gain.
  • Table 1 Quality of detection versus object size, with respect to score parameter from neural network
  • Table 2 Quality of detection versus object size, with respect to best attained IoU
  • the encoder may determine a target object size for each of the one or more objects detected from the input image.
  • the encoder may further determine a size adaptation parameter for each detected object to resize it to the corresponding target size.
  • if a first object’s initial size in the input image is larger than its target size, the size adaptation parameter associated with the first object may specify a ratio for down scaling the first object to its target size; if a second object’s initial size in the input image is smaller than its target size, the size adaptation parameter associated with the second object may specify a ratio for up scaling the second object to its target size; and if a third object’s initial size in the input image is the same as its target size, the size adaptation parameter associated with the third object may specify a 1:1 ratio for keeping the third object in its initial size.
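  • A minimal sketch of deriving the size adaptation parameter from a class-to-target-size lookup follows; the class names, target heights, and the use of a single scalar ratio are illustrative assumptions.

```python
# Hypothetical lookup of target object heights (in pixels) per object class.
TARGET_HEIGHT = {"face": 224, "pedestrian": 128, "vehicle": 256}

def size_adaptation_parameter(obj_class: str, initial_height: int) -> float:
    """Ratio applied to the object: <1 down-scales, >1 up-scales, 1.0 keeps the initial size."""
    target = TARGET_HEIGHT.get(obj_class, initial_height)
    return target / initial_height

print(size_adaptation_parameter("face", 448))        # 0.5  -> down-scale
print(size_adaptation_parameter("pedestrian", 64))   # 2.0  -> up-scale
print(size_adaptation_parameter("vehicle", 256))     # 1.0  -> keep initial size
```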
  • the encoder compresses the input image based on the determined size adaptation parameters. Specifically, prior to encoding, the encoder may modify the image by rescaling each of the one or more detected objects based on their respective size adaptation parameters. In some embodiments, the encoder may perform the rescaling by resampling pixel information associated with each of the one or more detected objects. In particular, in the case that an object’s target size is smaller than its initial size, the encoder may down-sample pixel information associated with the object.
  • the size adaptation parameter associated with the object may be a signal decimation parameter specifying a pattern or frequency for removing the pixels associated with the object, e.g., removing every tenth pixel.
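  • For the decimation pattern mentioned above (removing every tenth pixel), a short NumPy illustration follows; a practical resampler would typically low-pass filter before decimating.

```python
import numpy as np

pixels = np.arange(100)                                        # pixel samples of an object row
decimated = np.delete(pixels, np.arange(9, pixels.size, 10))   # drop every tenth pixel
print(pixels.size, decimated.size)                             # 100 90
```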
  • the modified input image with the down-sampled object is encoded in a bitstream, which is output by the encoder.
  • the size adaptation parameters associated with the objects may be included in the encoded bitstreams, so that a decoder receiving the bitstream can directly use the size adaptation parameters to scale the objects back to their original sizes.
  • FIG. 7 shows an example of compressing the input image based on the determined size adaptation parameters.
  • the encoder detects object 712, object 714, and background 716 from input image 710.
  • Object 712 and object 714 may have the same or different size adaptation parameters.
  • the encoder may down scale object 712 to object 722 according to the size adaptation parameter associated with object 712, and down scale object 714 to object 724 according to the size adaptation parameter associated with object 714.
  • the encoder may down scale background 716 to background 726, such that object 722, object 724, and background 726 collectively form an image 720 that is proportionally down scaled from input image 710.
  • Image 720 is then encoded by the encoder to generate an encoded bitstream.
  • FIG. 8 is a block diagram illustrating an exemplary implementation of method 600, consistent with embodiments of the present disclosure.
  • input image data 802 includes one or more input images.
  • the encoder For each input image, the encoder performs object detection 810 on the input image to detect one or more initial object images 811a, 811b, ..., 811n.
  • the encoder determines size adaptation information 812 associated with each initial object image 811.
  • Size adaptation information 812 may include a size adaptation parameter determined based on the object class associated with the respective initial object image 811.
  • the encoder modifies the input image by performing size adaptation 813 (e.g., down scaling) on the respective initial object image 811 to generate a respective rescaled object image 814, which is encoded (step 815) to generate an object bitstream 816.
  • multiple rescaled object images 814a-814n derived from a same input image may be encoded separately (815a-815n) to generate multiple bitstreams 816a-816n, respectively.
  • the modified input image may be encoded to generate a single bitstream (not shown in FIG. 8) .
  • the encoder also performs background extraction 820 on each input image to extract an initial background image 821.
  • Initial background image 821 may be extracted by subtracting initial object images 811a, 811b, ..., 811n from the input image.
  • the encoder may determine size adaptation information 822 associated with initial background image 821. Size adaptation information 822 may be determined based on size adaptation information 812 associated with initial object images 811. For example, size adaptation information 822 may include a size adaptation parameter set to be equal to the size adaptation parameter associated with one of initial object images 811a, 811b, ..., 811n detected from the same input image.
  • the encoder uses size adaptation information 822 to perform size adaptation 823 (e.g., down scaling) on initial background image 821 to generate rescaled background 824, which is then encoded (step 825) to generate a background bitstream 826.
  • rescaled background image 824 may be encoded in parallel to encoding 815 of rescaled object images 814, to generate separate bitstreams.
  • rescaled background image 824 and one or more rescaled object images 814 may be encoded to generate a single bitstream (not shown in FIG. 8) .
  • size adaptation information 812, 822 may be encoded into object bitstreams 816 and background bitstream 826, so that size adaptation information 812, 822 can be signaled to the decoder and used by the decoder to reconstruct original object images 811 and original background 821.
  • the encoder may encode rescaled object images 814 and rescaled background image 824 in a single atlas image.
  • FIG. 9 schematically illustrates an atlas image 900, which includes a collection of rescaled object images 901, 904, ..., 912, and a rescaled background image 920.
  • the rescaled object images and background image combined in an atlas are not necessarily from the same input image, but rather may be sourced from multiple different input images.
  • FIG. 10 is a flowchart of an exemplary method 1000 for decoding a bitstream, according to some embodiments consistent with the present disclosure.
  • method 1000 may be performed by an image data decoder, such as image/video decoder 144 in FIG. 1. As shown in FIG. 10, method 1000 includes the following steps 1002-1008.
  • the decoder receives one or more bitstreams including image data representing one or more rescaled objects and a rescaled background.
  • the decoder decodes the one or more bitstreams.
  • the decoder may decode the one or more bitstreams using entropy decoding 310 shown in FIG. 3.
  • the decoder determines size adaptation information associated with the one or more rescaled objects and the rescaled background, respectively.
  • the size adaptation information associated with each object or background may include a size adaptation parameter specifying the rescaling ratio.
  • in some embodiments, the size adaptation information is signaled in the one or more bitstreams.
  • in that case, the decoder can extract the size adaptation information from the decoded bitstreams.
  • in other embodiments, the size adaptation information is not signaled in the one or more bitstreams.
  • the decoder may determine the object class associated with each rescaled object, and determine the respective size adaptation information based on: the object class associated with the rescaled object, and a target size associated with the object class.
  • the target size may be obtained from a lookup table specifying a relationship between object classes and object sizes. The relationship may be signaled in the one or more bitstreams, or may be predefined and stored in the decoder.
  • the decoder generates, based on the size adaptation information, a reconstructed image including the one or more objects and the background in their respective original sizes. Specifically, the decoder restores the rescaled objects and rescaled background to their original sizes by resampling (e.g., up-sampling) , based on the size adaptation information, pixel information associated with the one or more objects and the background. For example, if the size adaptation information indicates that the rescaled objects and rescaled background are down scaled from their original sizes, the decoder up scales the rescaled objects and rescaled background to their original sizes.
  • the size adaptation information associated with a rescaled object or background may include a signal decimation parameter used for generating the rescaled object or background, and the decoder may interpolate or pad, based on the signal decimation parameter, pixel information associated with the rescaled object or background. For example, if the signal decimation parameter indicates that a rescaled object or a rescaled background is generated by removing every tenth pixel, the decoder may perform an interpolation or padding process to add new pixels for the object or background.
  • the decoder may combine the resampled pixel information associated with the one or more objects and the background, to generate a composite image (a simplified code sketch of this restoration appears at the end of this list).
  • the decoder may generate the composite image by pasting the restored object (s) on the restored background.
  • the decoder may overlay one restored object over the other restored object, or add the pixel values for each pixel in the overlapping area.
  • steps 1006 and 1008 may be omitted by the decoder.
  • the decoder may not need to rescale the objects and background back to their original size.
  • the decoder may reconstruct the image using the scaled objects and background, and the encoder may omit the size adaptation information from encoding and transmission to the decoder.
  • the encoder may scale down the objects and background to improve performance of the machine vision task, and in this circumstance, the decoder may not need to rescale the objects and background to their original size for the machine vision tasks.
  • the image data compression techniques described in the present disclosure can be used in low-bitrate situations. By performing size adaptation of the input images, the total bitrate for transmitting the image data can be reduced.
  • a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by a device (such as the disclosed encoder and decoder) , for performing the above-described methods.
  • Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same.
  • the device may include one or more processors (CPUs) , an input/output interface, a network interface, and/or a memory.
  • the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
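As an illustrative, non-limiting sketch of the decoder-side restoration outlined in the items above, the following Python fragment up-samples rescaled patches with a simple nearest-neighbor rule and pastes the restored objects onto the restored background. The array layout, the use of a plain scale ratio as the size adaptation parameter, and the availability of object positions to the decoder are assumptions made purely for illustration; this is not the normative decoding process.

```python
# Illustrative decoder-side restoration (not the normative process).
# Assumptions: pixel data are NumPy arrays; the size adaptation parameter
# is the ratio of the rescaled size to the original size (0 < ratio <= 1);
# object positions in the original image are known to the decoder.
import numpy as np

def upscale_nearest(patch, ratio):
    """Restore a down-scaled patch to its original size by nearest-neighbor up-sampling."""
    h, w = patch.shape[:2]
    out_h, out_w = int(round(h / ratio)), int(round(w / ratio))
    rows = (np.arange(out_h) * ratio).astype(int).clip(0, h - 1)
    cols = (np.arange(out_w) * ratio).astype(int).clip(0, w - 1)
    return patch[rows][:, cols]

def reconstruct(background, objects):
    """background / objects: dicts with 'pixels', 'ratio' and (for objects) 'position'."""
    canvas = upscale_nearest(background["pixels"], background["ratio"])
    for obj in objects:
        restored = upscale_nearest(obj["pixels"], obj["ratio"])
        y, x = obj["position"]                 # top-left corner in the original image
        h, w = restored.shape[:2]
        canvas[y:y + h, x:x + w] = restored    # paste object; bounds checks omitted
    return canvas
```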

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

Methods and systems for compressing image data based on object size adaptation information. According to certain embodiments, an exemplary encoding method may include: detecting a first object from an input image; determining a first size adaptation parameter associated with the first object; and compressing the input image based on the first size adaptation parameter.

Description

METHOD AND SYSTEM FOR IMAGE DATA PROCESSING TECHNICAL FIELD
The present disclosure generally relates to image data processing, and more particularly, to methods and systems for compressing image data based on object size adaptation information.
BACKGROUND
Limited communication bandwidth poses a challenge to transmitting images or videos (hereinafter collectively referred to as “image data” ) at desired quality. Although various encoding techniques have been developed to compress the image data before it is transmitted, the increasing demand for massive amounts of image data has put constant pressure on image/video systems, particularly mobile devices that frequently need to work in networks with limited bandwidth.
SUMMARY
Embodiments of the present disclosure relate to methods of compressing image data based on object size adaptation information. In some embodiments, a computer-implemented encoding method is provided. The method may include: detecting a first object from an input image; determining a first size adaptation parameter associated with the first object; and compressing the input image based on the first size adaptation parameter.
In some embodiments, a device for encoding image data is provided. The device includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: detecting a first object from an input image; determining a first size adaptation parameter associated with the first object; and compressing the input image based on the first size adaptation parameter.
In some embodiments, a computer-implemented decoding method is provided. The method may include: receiving one or more bitstreams comprising image data representing a first object in a first size; decoding the one or more bitstreams; determining a first size adaptation parameter associated with the first object; and generating, based on the first size adaptation parameter, a reconstructed image comprising the first object in a second size different from the first size.
In some embodiments, a device for decoding image data is provided. The device includes a memory storing instructions; and one or more processors configured to execute the instructions to cause the device to perform operations including: receiving one or more bitstreams comprising image data representing a first object in a first size; decoding the one or more bitstreams; determining a first size adaptation parameter associated with the first object; and generating, based on the first size adaptation parameter, a reconstructed image comprising the first object in a second size different from the first size.
Aspects of the disclosed embodiments may include non-transitory, tangible computer-readable media that store software instructions that, when executed by one or more processors, are configured for and capable of performing and executing one or more of the methods, operations, and the like consistent with the disclosed embodiments. Also, aspects of the disclosed embodiments may be performed by one or more processors that are configured as special-purpose processor (s) based on software instructions that are programmed with logic and instructions that perform, when executed, one or more operations consistent with the disclosed embodiments.
Additional objects and advantages of the disclosed embodiments will be set forth in part in the following description, and in part will be apparent from the description, or may be learned by practice of the embodiments. The objects and advantages of the disclosed embodiments may be realized and attained by the elements and combinations set forth in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating an exemplary system for encoding and decoding image data, consistent with embodiments of the present disclosure.
FIG. 2 is a block diagram showing an exemplary encoding process, consistent with embodiments of the present disclosure.
FIG. 3 is a block diagram showing an exemplary decoding process, consistent with embodiments of the present disclosure.
FIG. 4 is a block diagram of an exemplary apparatus for encoding or decoding image data, consistent with embodiments of the present disclosure.
FIG. 5 is a block diagram illustrating an exemplary process for using object size adaptation information to control encoding of image data, consistent with embodiments of the present disclosure.
FIG. 6 is a flowchart of an exemplary method for encoding image data, consistent with embodiments of the present disclosure.
FIG. 7 is a schematic diagram illustrating an exemplary process for compressing image data, consistent with embodiments of the present disclosure.
FIG. 8 is a block diagram illustrating an exemplary implementation of the method in FIG. 6, consistent with embodiments of the present disclosure.
FIG. 9 is a schematic diagram illustrating an atlas image including a collection of rescaled object images, consistent with embodiments of the present disclosure.
FIG. 10 is a flowchart of an exemplary method for decoding image data, consistent with embodiments of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the invention as recited in the appended claims.
The present disclosure is directed to image data compression methods for removing data redundancy. In exemplary embodiments, the redundancy may relate to sizes of objects shown in the source images or videos. On one hand, storing or transmitting image data representing a large object requires a substantial amount of bits. On the other hand, such a large object size may not be necessary for performing certain machine vision tasks, such as object recognition and tracking. To exploit this redundancy, the disclosed compression methods reduce the size of (i.e., scale down) an object represented by the image data to such an extent that the performance (e.g., precision) of the machine vision tasks is not deteriorated, or is subject to only small deterioration.
FIG. 1 is a block diagram illustrating a system 100 for encoding and decoding image data, according to some disclosed embodiments. The image data may include an image (also called a “picture” or “frame” ) , multiple images, or a video. An image is a static picture. Multiple images may be related or unrelated, either spatially or temporally. A video includes a set of images arranged in a temporal sequence.
As shown in FIG. 1, system 100 includes a source device 120 that provides encoded image data to be decoded by a destination device 140. Consistent with the disclosed embodiments, each of source device 120 and destination device 140 may include any of a wide range of devices,  including a desktop computer, a notebook (e.g., laptop) computer, a server, a tablet computer, a set-top box, a mobile phone, a vehicle, a camera, an image sensor, a robot, a television, a camera, a wearable device (e.g., a smart watch or a wearable camera) , a display device, a digital media player, a video gaming console, a video streaming device, or the like. Source device 120 and destination device 140 may be equipped for wireless or wired communication.
Referring to FIG. 1, source device 120 may include an image/video source 122, an image/video encoder 124, and an output interface 126. Destination device 140 may include an input interface 142, an image/video decoder 144, and one or more image/video applications 146. Image/video source 122 of source device 120 may include an image/video capture device, such as a camera, an image/video archive containing previously captured video, or an image/video feed interface to receive image/video data from a content provider. As a further alternative, image/video source 122 may generate computer graphics-based data as the source image/video, or a combination of live image/video, archived image/video, and computer-generated image/video. The captured, pre-captured, or computer-generated image data may be encoded by image/video encoder 124. The encoded image data may then be output by output interface 126 onto a communication medium 160. Image/video encoder 124 encodes the input image data and outputs an encoded bitstream 162 via output interface 126. Encoded bitstream 162 is transmitted through a communication medium 160, and received by input interface 142. Image/video decoder 144 then decodes encoded bitstream 162 to generate decoded data, which can be utilized by image/video applications 146.
Image/video encoder 124 and image/video decoder 144 each may be implemented as any of a variety of suitable encoder or decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs) , application specific integrated circuits (ASICs) , field programmable gate arrays (FPGAs) , discrete logic, software, hardware, firmware, or any combinations thereof. When the encoding or decoding is implemented partially in software, image/video encoder 124 or image/video decoder 144 may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware using one or more processors to perform the techniques consistent with this disclosure. Each of image/video encoder 124 or image/video decoder 144 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device.
Image/video encoder 124 and image/video decoder 144 may operate according to any video coding standard, such as Advanced Video Coding (AVC) , High Efficiency Video Coding (HEVC) , Versatile Video Coding (VVC) , AOMedia Video 1 (AV1) , Joint Photographic Experts Group (JPEG) , Moving Picture Experts Group (MPEG) , etc. Alternatively, image/video encoder 124 and image/video decoder 144 may be customized devices that do not comply with the existing standards. Although not shown in FIG. 1, in some embodiments, image/video encoder 124 and image/video decoder 144 may each be integrated with an audio encoder and decoder, and may include appropriate MUX-DEMUX units, or other hardware and software, to handle encoding of both audio and video in a common data stream or separate data streams.
Output interface 126 may include any type of medium or device capable of transmitting encoded bitstream 162 from source device 120 to destination device 140. For example, output interface 126 may include a transmitter or a transceiver configured to transmit encoded bitstream 162 from source device 120 directly to destination device 140 in real-time. Encoded bitstream 162 may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to destination device 140.
Communication medium 160 may include transient media, such as a wireless broadcast or wired network transmission. For example, communication medium 160 may include a radio frequency (RF) spectrum or one or more physical transmission lines (e.g., a cable) . Communication medium 160 may form part of a packet-based network, such as a local area network, a wide-area network, or a global network such as the Internet. In some embodiments,  communication medium 160 may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 120 to destination device 140. For example, a network server (not shown) may receive encoded bitstream 162 from source device 120 and provide encoded bitstream 162 to destination device 140, e.g., via network transmission.
Communication medium 160 may also be in the form of a storage medium (e.g., a non-transitory storage medium) , such as a hard disk, flash drive, compact disc, digital video disc, Blu-ray disc, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded image data. In some embodiments, a computing device of a medium production facility, such as a disc stamping facility, may receive encoded image data from source device 120 and produce a disc containing the encoded image data.
Input interface 142 may include any type of medium or device capable of receiving information from communication medium 160. The received information includes encoded bitstream 162. For example, input interface 142 may include a receiver or a transceiver configured to receive encoded bitstream 162 in real-time.
Image/video applications 146 include various hardware and/or software for utilizing the decoded image data generated by image/video decoder 144. For example, image/video applications 146 may include a display device that displays the decoded image data to a user and may include any of a variety of display devices such as a cathode ray tube (CRT) , a liquid crystal display (LCD) , a plasma display, an organic light emitting diode (OLED) display, or another type of display device. As another example, image/video applications 146 may include one or more processors configured to use the decoded image data to perform various machine-vision applications, such as object recognition and tracking, face recognition, image matching, image/video search, augmented reality, robot vision and navigation, autonomous driving, 3-dimensional structure construction, stereo correspondence, motion tracking, etc.
As described above, image/video encoder 124 and image/video decoder 144 may or may not operate according to a video coding standard. For illustrative purpose only, an exemplary encoding process performable by image/video encoder 124 and an exemplary decoding process performable by image/video decoder 144 are described below in connection with FIG. 2 and FIG. 3, respectively.
FIG. 2 is a block diagram showing an exemplary encoding process 200, according to some embodiments of the present disclosure. Encoding process 200 can be performed by an encoder, such as image/video encoder 124 (FIG. 1) . Referring to FIG. 2, encoding process 200 includes a picture partitioning stage 210, an inter prediction stage 220, an intra prediction stage 225, a transform stage 230, a quantization stage 235, a rearrangement stage 260, an entropy coding stage 265, an inverse quantization stage 240, an inverse transform stage 245, a filter stage 250, and a buffer stage 255.
Picture partitioning stage 210 partitions an input picture 205 into at least one processing unit. The processing unit may be a prediction unit (PU) , a transform unit (TU) , or a coding unit (CU) . Picture partitioning stage 210 partitions a picture into a combination of a plurality of coding units, prediction units, and transform units. A picture may be encoded by selecting a combination of a coding unit, a prediction unit, and a transform unit based on a predetermined criterion (e.g., a cost function) .
For example, one picture may be partitioned into a plurality of coding units. In order to partition the coding units in a picture, a recursive tree structure such as a quad tree structure may be used. Specifically, with the recursive tree structure, a picture serves as the root node and is recursively partitioned into smaller regions, i.e., child nodes. A coding unit that is not partitioned any more according to a predetermined restriction becomes a leaf node. For example, when it is assumed that only square partitioning is possible for a coding unit, the coding unit is partitioned into up to four different coding units.
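As an illustrative, non-limiting sketch of the recursive quad-tree partitioning described above, the following fragment splits a square region into four children until a stopping criterion is met. The variance-based cost function, the threshold, and the minimum block size are placeholder assumptions, not values mandated by any coding standard.

```python
# Illustrative quad-tree partitioning of a square picture region.
# The variance-based stopping rule, threshold, and minimum size are
# hypothetical placeholders for the cost function mentioned above.
import numpy as np

def quadtree_partition(block, x=0, y=0, min_size=8, var_thresh=100.0):
    """Return (x, y, size) tuples for the leaf coding units of a square block."""
    size = block.shape[0]
    if size <= min_size or float(np.var(block)) < var_thresh:
        return [(x, y, size)]                  # leaf node: not partitioned further
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            child = block[dy:dy + half, dx:dx + half]
            leaves += quadtree_partition(child, x + dx, y + dy, min_size, var_thresh)
    return leaves

# Example: partition a synthetic 64x64 block of sample values
leaves = quadtree_partition(np.random.randint(0, 256, (64, 64)).astype(float))
```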
It may be determined whether to use inter prediction or intra prediction for a prediction unit, and specific information (e.g., intra prediction mode, motion vector, reference picture, etc. ) may be determined according to the prediction method. A processing unit for performing prediction may be different from a processing unit for determining a prediction method and specific content. For example, a prediction method and a prediction mode may be determined in a prediction unit, and prediction may be performed in a transform unit. A residual coefficient (residual block) between the reconstructed prediction block and the original block may be input into transform stage 230. In addition, prediction mode information, motion vector information and the like used for prediction may be encoded by the entropy coding stage 265 together with the residual coefficient and transferred to a decoder. In some implementations, an original block may be encoded and transmitted to a decoder without generating a prediction block through inter prediction stage 220 and intra prediction stage 225.
Inter prediction stage 220 predicts a prediction unit based on information in at least one picture before or after the current picture. In some embodiments, inter prediction stage 220 predicts a prediction unit based on information on a partial area that has been encoded in the current picture. Inter prediction stage 220 further includes a reference picture interpolation stage, a motion prediction stage, and a motion compensation stage (not shown) .
The reference picture interpolation stage receives reference picture information from buffer stage 255 and generates pixel information of an integer number of pixels or less from the reference picture. In the case of a luminance pixel, a DCT-based 8-tap interpolation filter with a varying filter coefficient is used to generate pixel information of an integer number of pixels or less by the unit of 1/4 pixels. In the case of a color difference signal, a DCT-based 4-tap interpolation filter with a varying filter coefficient is used to generate pixel information of an integer number of pixels or less by the unit of 1/8 pixels.
The motion prediction stage performs motion prediction based on the reference picture interpolated by the reference picture interpolation stage. Various methods such as a full search-based block matching algorithm (FBMA) , a three-step search (TSS) , and a new three-step search algorithm (NTS) may be used as a method of calculating a motion vector. The motion vector may have a motion vector value of a unit of 1/2 or 1/4 pixels based on interpolated pixels. The motion prediction stage predicts a current prediction unit based on a motion prediction mode. Various methods such as a skip mode, a merge mode, an advanced motion vector prediction (AMVP) mode, an intra-block copy mode and the like may be used as the motion prediction mode.
Intra prediction stage 225 generates a prediction unit based on the information on reference pixels in the neighborhood of the current block, which is pixel information in the current picture. When a block in the neighborhood of the current prediction unit is a block on which inter prediction has been performed and thus the reference pixel is a pixel on which inter prediction has been performed, the reference pixel included in the block on which inter prediction has been performed may be used in place of reference pixel information of a block in the neighborhood on which intra prediction has been performed. That is, when a reference pixel is unavailable, at least one reference pixel among available reference pixels may be used in place of unavailable reference pixel information.
In the intra prediction, the prediction mode may have an angular prediction mode that uses reference pixel information according to a prediction direction, and a non-angular prediction mode that does not use directional information when performing prediction. A mode for predicting luminance information may be different from a mode for predicting color difference information, and intra prediction mode information used to predict luminance information or predicted luminance signal information may be used to predict the color difference information.
If the size of the prediction unit is the same as the size of the transform unit when intra prediction is performed, the intra prediction may be performed for the prediction unit based on a pixel on the left side, a pixel on the top-left side, and a pixel on the top of the prediction unit. If  the size of the prediction unit is different from the size of the transform unit when the intra prediction is performed, the intra prediction may be performed using a reference pixel based on the transform unit.
The intra prediction method generates a prediction block after applying an Adaptive Intra Smoothing (AIS) filter to the reference pixel according to a prediction mode. The type of the AIS filter applied to the reference pixel may vary. In order to perform the intra prediction method, the intra prediction mode of the current prediction unit may be predicted from the intra prediction mode of the prediction unit existing in the neighborhood of the current prediction unit. When a prediction mode of the current prediction unit is predicted using the mode information predicted from the neighboring prediction unit, the following method may be performed. If the intra prediction mode of the current prediction unit is the same as that of the neighboring prediction unit, information indicating that the two prediction modes are the same may be transmitted using predetermined flag information. If the prediction modes of the current prediction unit and the neighboring prediction unit are different from each other, prediction mode information of the current block may be encoded by performing entropy coding.
In addition, a residual block, which is a difference value of the prediction unit with the original block, is generated. The generated residual block is input into transform stage 230.
Transform stage 230 transforms the residual block using a transform method such as Discrete Cosine Transform (DCT) or Discrete Sine Transform (DST) . The DCT transform core includes at least one among DCT2 and DCT8, and the DST transform core includes DST7. Whether or not to apply DCT or DST to transform the residual block may be determined based on intra prediction mode information of a prediction unit used to generate the residual block. The transform on the residual block may be skipped. A flag indicating whether or not to skip the transform on the residual block may be encoded. The transform skip may be allowed for a residual block having a size smaller than or equal to a threshold, a luma component, or a chroma component under the 4: 4: 4 format.
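The forward and inverse transforms described above can be sketched with an off-the-shelf separable DCT, for example as follows. The 8×8 block size and orthonormal normalization are illustrative choices only, not requirements of the disclosure.

```python
# Illustrative transform of an 8x8 residual block with a 2-D DCT-II and
# reconstruction with the inverse transform.
import numpy as np
from scipy.fft import dctn, idctn

residual = np.random.randn(8, 8)                     # stand-in residual block
coefficients = dctn(residual, type=2, norm="ortho")  # forward 2-D DCT
reconstructed = idctn(coefficients, type=2, norm="ortho")
assert np.allclose(residual, reconstructed)
```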
Quantization stage 235 quantizes values transformed into the frequency domain by transform stage 230. Quantization coefficients may vary according to the block or the importance of a video. A value calculated by the quantization stage 235 is provided to inverse quantization stage 240 and the rearrangement stage 260.
Rearrangement stage 260 rearranges values of the quantized residual coefficients. Rearrangement stage 260 changes coefficients of a two-dimensional block shape into a one-dimensional vector shape through a coefficient scanning method. For example, rearrangement stage 260 may scan DC coefficients up to high-frequency domain coefficients using a zig-zag scan method, and change the coefficients into a one-dimensional vector shape. According to the size of the transform unit and the intra prediction mode, a vertical scan of scanning the coefficients of a two-dimensional block shape in the column direction and a horizontal scan of scanning the coefficients of a two-dimensional block shape in the row direction may be used instead of the zig-zag scan. In some embodiments, according to the size of the transform unit and the intra prediction mode, a scan method that will be used may be determined among the zig-zag scan, the vertical direction scan, and the horizontal direction scan.
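A minimal, non-normative sketch of the zig-zag coefficient scan described above is given below; vertical and horizontal scans could be generated analogously by iterating over columns or rows instead of anti-diagonals.

```python
# Illustrative zig-zag scan: reorder the quantized coefficients of an NxN
# block into a one-dimensional vector, DC / low-frequency terms first.
import numpy as np

def zigzag_order(n):
    order = []
    for s in range(2 * n - 1):                 # anti-diagonals where i + j == s
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()                     # alternate traversal direction
        order.extend(diag)
    return order

def zigzag_scan(block):
    return np.array([block[i, j] for i, j in zigzag_order(block.shape[0])])

vector = zigzag_scan(np.arange(64).reshape(8, 8))  # 64-element 1-D vector
```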
Entropy coding stage 265 performs entropy coding based on values determined by rearrangement stage 260. Entropy coding may use various encoding methods such as Exponential Golomb, Context-Adaptive Variable Length Coding (CAVLC) , Context-Adaptive Binary Arithmetic Coding (CABAC) , and the like.
Entropy coding stage 265 encodes various information such as residual coefficient information and block type information of a coding unit, prediction mode information, partitioning unit information, prediction unit information and transmission unit information, motion vector information, reference frame information, block interpolation information, and filtering information input from rearrangement stage 260, inter prediction stage 220, and intra  prediction stage 225. Entropy coding stage 265 may also entropy-encode the coefficient value of a coding unit input from rearrangement stage 260. The output of entropy coding stage 265 forms an encoded bitstream 270, which may be transmitted to a decoding device.
Inverse quantization stage 240 and inverse transform stage 245 inverse-quantize the values quantized by quantization stage 235 and inverse-transform the values transformed by transform stage 230. The residual coefficient generated by inverse quantization stage 240 and inverse transform stage 245 may be combined with the prediction unit predicted through motion estimation, motion compensation, and/or intra prediction to generate a reconstructed block.
Filter stage 250 includes at least one among a deblocking filter, an offset correction unit, and an adaptive loop filter (ALF) .
The deblocking filter removes block distortion generated by the boundary between blocks in the reconstructed picture. In order to determine whether or not to perform deblocking, information of the pixels included in several columns or rows included in the block may be used. A strong filter or a weak filter may be applied according to the deblocking filtering strength needed when the deblocking filter is applied to a block. In addition, when vertical direction filtering and horizontal direction filtering are performed in applying the deblocking filter, horizontal direction filtering and vertical direction filtering may be processed in parallel.
The offset correction unit corrects an offset to the original video by the unit of pixel for a video on which the deblocking has been performed. Offset correction for a specific picture may be performed by dividing pixels included in the video into a certain number of areas, determining an area to perform offset, and applying the offset to the area. Alternatively, the offset correction may be performed by applying an offset considering edge information of each pixel.
Adaptive Loop Filtering (ALF) is performed based on a value obtained by comparing the reconstructed and filtered video with the original video. After dividing the pixels included in the video into predetermined groups, one filter to be applied to a corresponding group is determined, and filtering may be performed differently for each group. Information related to whether or not to apply ALF to the luminance signal may be transmitted for each coding unit (CU) , and the shape and filter coefficients of the ALF filter to be applied may vary according to each block. In some implementations, an ALF filter of the same type (fixed type) is applied regardless of the characteristics of the block to which it is applied.
Buffer stage 255 temporarily stores the reconstructed blocks or pictures calculated through filter stage 250, and provides them to inter prediction stage 220 when inter prediction is performed.
FIG. 3 is a block diagram showing an exemplary decoding process 300, according to some embodiments of the present disclosure. Decoding process 300 can be performed by a decoder, such as image/video decoder 144 (FIG. 1) . Referring to FIG. 3, decoding process 300 includes an entropy decoding stage 310, a rearrangement stage 315, an inverse quantization stage 320, an inverse transform stage 325, an inter prediction stage 330, an intra prediction stage 335, a filter stage 340, and a buffer stage 345.
Specifically, an input bitstream 302 is decoded in a procedure opposite to that of an encoding process, e.g., encoding process 200 (FIG. 2) . Specifically, as shown in FIG. 3, entropy decoding stage 310 may perform entropy decoding in a procedure opposite to that performed in an entropy coding stage. For example, entropy decoding stage 310 may use various methods corresponding to those performed in the entropy coding stage, such as Exponential Golomb, Context-Adaptive Variable Length Coding (CAVLC) , and Context-Adaptive Binary Arithmetic Coding (CABAC) . Entropy decoding stage 310 decodes information related to intra prediction and inter prediction.
Rearrangement stage 315 performs rearrangement on the output of entropy decoding stage 310 based on the rearrangement method used in the corresponding encoding process. The coefficients expressed in a one-dimensional vector shape are reconstructed and rearranged as  coefficients of two-dimensional block shape. Rearrangement stage 315 receives information related to coefficient scanning performed in the corresponding encoding process and performs reconstruction through a method of inverse-scanning based on the scanning order performed in the corresponding encoding process.
Inverse quantization stage 320 performs inverse quantization based on a quantization parameter provided by the encoder and a coefficient value of the rearranged block.
Inverse transform stage 325 performs inverse DCT or inverse DST. The DCT transform core may include at least one among DCT2 and DCT8, and the DST transform core may include DST7. Alternatively, when the transform is skipped in the corresponding encoding process, inverse transform stage 325 may be skipped. The inverse transform may be performed based on a transmission unit determined in the corresponding encoding process. Inverse transform stage 325 may selectively perform a transform technique (e.g., DCT or DST) according to a plurality of pieces of information such as a prediction method, a size of a current block, a prediction direction and the like.
Inter prediction stage 330 and/or intra prediction stage 335 generate a prediction block based on information related to generation of a prediction block (provided by entropy decoding stage 310) and information of a previously decoded block or picture (provided by buffer stage 345) . The information related to generation of a prediction block may include intra prediction mode, motion vector, reference picture, etc.
A prediction type selection process may be used to determine whether to use one or both of inter prediction stage 330 and intra prediction stage 335. The prediction type selection process receives various information such as prediction unit information input from the entropy decoding stage 310, prediction mode information of the intra prediction method, information related to motion prediction of an inter prediction method, and the like. The prediction type selection process identifies the prediction unit from the current coding unit and determines whether the prediction unit performs inter prediction or intra prediction. Inter prediction stage 330 performs inter prediction on the current prediction unit based on information included in at least one picture before or after the current picture including the current prediction unit by using information necessary for inter prediction of the current prediction unit. In some embodiments, the inter prediction stage 330 may perform inter prediction based on information on a partial area previously reconstructed in the current picture including the current prediction unit. In order to perform inter prediction, it is determined, based on the coding unit, whether the motion prediction method of the prediction unit included in a corresponding coding unit is a skip mode, a merge mode, a motion vector prediction mode (AMVP mode) , or an intra-block copy mode.
Intra prediction stage 335 generates a prediction block based on the information on the pixel in the current picture. When the prediction unit is a prediction unit that has performed intra prediction, the intra prediction may be performed based on intra prediction mode information of the prediction unit. Intra prediction stage 335 may include an Adaptive Intra Smoothing (AIS) filter, a reference pixel interpolation stage, and a DC filter. The AIS filter is a stage that performs filtering on the reference pixel of the current block, and may determine whether or not to apply the filter according to the prediction mode of the current prediction unit and apply the filter. AIS filtering may be performed on the reference pixel of the current block by using the prediction mode and AIS filter information of the prediction unit used in the corresponding encoding process. When the prediction mode of the current block is a mode that does not perform AIS filtering, the AIS filter may not be applied.
When the prediction mode of the prediction unit is intra prediction based on a pixel value obtained by interpolating the reference pixel, the reference pixel interpolation stage generates a reference pixel of a pixel unit having an integer value or less by interpolating the reference pixel. When the prediction mode of the current prediction unit is a prediction mode that generates a prediction block without interpolating the reference pixel, the reference pixel is not interpolated.  The DC filter generates a prediction block through filtering when the prediction mode of the current block is the DC mode.
The reconstructed block or picture is provided to filter stage 340. Filter stage 340 may include a deblocking filter, an offset correction unit, and an ALF.
Information on whether a deblocking filter is applied to a corresponding block or picture and information on whether a strong filter or a weak filter is applied when a deblocking filter is applied may be provided by the corresponding encoding process. The deblocking filter at filter stage 340 may be provided with information related to the deblocking filter used in the corresponding encoding process.
The offset correction unit at filter stage 340 may perform offset correction on the reconstructed video based on the offset correction type and offset value information, which may be provided by the encoder.
The ALF at filter stage 340 is applied to a coding unit based on information on whether or not to apply the ALF and information on ALF coefficients, which are provided by the encoder. The ALF information may be provided in a specific parameter set.
Buffer stage 345 stores the reconstructed picture or block as a reference picture or a reference block, and outputs it to a downstream application, e.g., image/video applications 146 in FIG. 1.
The above-described encoding process 200 (FIG. 2) and decoding process 300 (FIG. 3) are provided as examples to illustrate the possible encoding and decoding processes consistent with the present disclosure, and are not meant to limit it. For example, some embodiments encode and decode image data using wavelet transform, rather than directly coding the pixels in blocks.
FIG. 4 shows a block diagram of an exemplary apparatus 400 for encoding or decoding image data, according to some embodiments of the present disclosure. As shown in FIG. 4, apparatus 400 can include processor 402. When processor 402 executes instructions described herein, apparatus 400 can become a specialized machine for processing the image data. Processor 402 can be any type of circuitry capable of manipulating or processing information. For example, processor 402 can include any combination of any number of a central processing unit (or “CPU” ) , a graphics processing unit (or “GPU” ) , a neural processing unit ( “NPU” ) , a microcontroller unit ( “MCU” ) , an optical processor, a programmable logic controller, a microcontroller, a microprocessor, a digital signal processor, an intellectual property (IP) core, a Programmable Logic Array (PLA) , a Programmable Array Logic (PAL) , a Generic Array Logic (GAL) , a Complex Programmable Logic Device (CPLD) , a Field-Programmable Gate Array (FPGA) , a System On Chip (SoC) , an Application-Specific Integrated Circuit (ASIC) , or the like. In some embodiments, processor 402 can also be a set of processors grouped as a single logical component. For example, as shown in FIG. 4, processor 402 can include multiple processors, including processor 402a, processor 402b, ..., and processor 402n.
Apparatus 400 can also include memory 404 configured to store data (e.g., a set of instructions, computer codes, intermediate data, or the like) . For example, as shown in FIG. 4, the stored data can include program instructions (e.g., program instructions for implementing the processing of image data) and image data to be processed. Processor 402 can access the program instructions and data for processing (e.g., via bus 410) , and execute the program instructions to perform an operation or manipulation on the data. Memory 404 can include a high-speed random-access storage device or a non-volatile storage device. In some embodiments, memory 404 can include any combination of any number of a random-access memory (RAM) , a read-only memory (ROM) , an optical disc, a magnetic disk, a hard drive, a solid-state drive, a flash drive, a secure digital (SD) card, a memory stick, a compact flash (CF) card, or the like. Memory 404 can also be a group of memories (not shown in FIG. 4) grouped as a single logical component.
Bus 410 can be a communication device that transfers data between components inside apparatus 400, such as an internal bus (e.g., a CPU-memory bus) , an external bus (e.g., a universal serial bus port, a peripheral component interconnect express port) , or the like.
For ease of explanation without causing ambiguity, processor 402 and other data processing circuits (e.g., analog/digital converters, application-specific circuits, digital signal processors, interface circuits, not shown in FIG. 4) are collectively referred to as a “data processing circuit” in this disclosure. The data processing circuit can be implemented entirely as hardware, or as a combination of software, hardware, or firmware. In addition, the data processing circuit can be a single independent module or can be combined entirely or partially into any other component of apparatus 400.
Apparatus 400 can further include network interface 406 to provide wired or wireless communication with a network (e.g., the Internet, an intranet, a local area network, a mobile communications network, or the like) . In some embodiments, network interface 406 can include any combination of any number of a network interface controller (NIC) , a radio frequency (RF) module, a transponder, a transceiver, a modem, a router, a gateway, a wired network adapter, a wireless network adapter, a Bluetooth adapter, an infrared adapter, a near-field communication ( “NFC” ) adapter, a cellular network chip, or the like.
In some embodiments, optionally, apparatus 400 can further include peripheral interface 408 to provide a connection to one or more peripheral devices. As shown in FIG. 4, the peripheral device can include, but is not limited to, a cursor control device (e.g., a mouse, a touchpad, or a touchscreen) , a keyboard, a display (e.g., a cathode-ray tube display, a liquid crystal display, or a light-emitting diode display) , an image/video input device (e.g., a camera or an input interface coupled to an image/video archive) , or the like.
Consistent with the disclosed embodiments, the encoding and decoding processes can be optimized to further remove the redundancy in the image data. For example, redundancy may be present in information not required by the machine vision tasks downstream of the decoder, e.g., image/video applications 146 (FIG. 1) . Such redundancy can be excluded from encoding and transmission, without deteriorating the performance of the machine vision tasks. This is useful in, for example, a low-bitrate situation where the amount of image data that can be transmitted from an encoder to a decoder per time unit is limited.
In some embodiments, the redundancy relates to the sizes of the input images, including the sizes of the objects shown in the input images. “Object” used herein refers to a foreground depicted in an image and is also called an “object representation” or “object image, ” to distinguish it from a physical object existing in the real world. Moreover, “object size” used herein refers to image dimensions (i.e., pixel dimensions) of a foreground as shown in an image, rather than the actual size of a physical object. Consistent with the present disclosure, the performance of a machine vision task (e.g., object recognition and tracking) may depend on the size and/or resolution (defined as the number of pixels displayed per inch of an image) of an object representation analyzed by the machine vision task. For example, a convolutional neural network (CNN) can be trained to detect, from images, objects of a particular type (e.g., face) in a given size, e.g., 64×128 pixels. Thus, such trained CNN may be able to achieve the highest precision in detecting objects in the same or similar sizes. Larger object sizes do not necessarily improve the precision of object detection, but rather may cause it to degrade. Accordingly, the extra bits required for representing object sizes larger than 64×128 pixels are not necessary for the machine vision task and can be removed from encoding.
To exploit the above-described redundancy associated with the object sizes, an object representation may be down-scaled before it is encoded, so as to reduce the number of bits that need to be encoded and transmitted. FIG. 5 is a block diagram illustrating a process 500 for using object size adaptation information to control encoding of image data, according to some embodiments consistent with the present disclosure. Process 500 may be performed by an encoder, such as image/video encoder 124 (FIG. 1) . As shown in FIG. 5, process 500 starts with the encoder receiving input image data 510 which includes one or more image frames. Prior to encoding, the encoder performs analysis 520 on input image data 510 to generate control information 530 for an encoding process 540. Analysis 520 may include, for example, detecting an object from an input frame and determining an optimal size for the object with respect to a given machine vision task (not shown) performed downstream of the corresponding decoder (not shown) . The machine vision task may analyze image data associated with the object and perform certain operations related to the object, such as detecting, tracking, reshaping, etc. The optimal size for the object is the size that allows the machine vision task to analyze and operate on the object with the maximum precision. Based on the result of analysis 520, the encoder generates control information 530 to adapt the size of the object for encoding process 540. Specifically, the size of the object may be adapted to reduce the number of bits for representing the object to the extent that the performance of the machine vision task does not deteriorate. For example, when the object has an initial size larger than the determined optimal size, control information 530 may instruct encoding process 540 to down-sample (i.e., down scale) the pixel information associated with the object. As another example, when the object has an initial size smaller than the determined optimal size, control information 530 may instruct encoding process 540 to encode the pixel information associated with the object in its initial size; or alternatively control information 530 may instruct encoding process 540 to up-sample (i.e., up scale) the pixel information associated with the object.
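As an illustrative, non-limiting sketch of analysis 520, the following fragment compares an object's initial size with a per-class optimal size and emits control information indicating whether the object should be down-sampled, up-sampled, or left unchanged. The class-to-size table and the dictionary format of the control information are assumptions made for illustration only.

```python
# Illustrative generation of size adaptation control information.
# OPTIMAL_SIZE is a hypothetical class-to-size table; the dictionary
# returned as "control information" is likewise only an example format.
OPTIMAL_SIZE = {"face": (224, 224), "person": (64, 128)}   # (width, height) in pixels

def size_control_info(obj_class, width, height):
    target_w, target_h = OPTIMAL_SIZE.get(obj_class, (width, height))
    ratio = min(target_w / width, target_h / height)  # scale factor toward the target
    if ratio < 1.0:
        action = "down-sample"   # object is larger than the machine vision task needs
    elif ratio > 1.0:
        action = "up-sample"     # object is smaller than the optimal size
    else:
        action = "keep"
    return {"class": obj_class, "ratio": ratio, "action": action}

print(size_control_info("face", 448, 448))   # -> ratio 0.5, action "down-sample"
```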
In the case that the object is down scaled, the number of bits to be encoded by encoding process 540 is reduced and therefore a higher degree of compression of the image data can be achieved. Encoding process 540 may be any suitable encoding process consistent with the present disclosure. For example, encoding process 540 may be implemented as encoding process 200 (FIG. 2) , and output a bitstream 550 including encoded pixel information associated with the rescaled object.
The up-sampling may be used in situations where an object’s size becomes too small after image compression and cannot be effectively recognized by the downstream machine vision application. To solve this problem, the object can be enlarged before the image compression.
Consistent with the disclosed embodiments, both down-sampling and up-sampling may be used to pre-process the same input image before it is compressed. Specifically, some objects in the input image may be down-sampled, while other objects in the input image may be up-sampled. The down-sampling and/or up-sampling can be transparent to the codec. For example, the control of the standard codecs like AVC, HEVC, and VVC may be unaware of the logical content like the objects and background, because there is no need to transmit additional information to control the absolute values of the quantization step.
Next, exemplary encoding and decoding methods consistent with the process 500 are described in detail.
FIG. 6 is a flowchart of an exemplary method 600 for encoding image data, according to some embodiments consistent with the present disclosure. Method 600 may be used to compress image data based on object size adaptation information. For example, method 600 may be performed by an image data encoder, such as image/video encoder 124 in FIG. 1. As shown in FIG. 6, method 600 includes the following steps 602-606.
At step 602, the encoder detects one or more objects from an input image. For example, FIG. 7 schematically shows an input image 710, which shows two objects 712 and 714. The encoder may detect each of objects 712 and 714 from input image 710, and determine the rest of input image 710 to be a background 716.
In some embodiments, the detection of object (s) from an input image includes three stages: image segmentation (i.e., object isolation or recognition) , feature extraction, and object classification. At the image segmentation stage, the encoder may execute an image segmentation algorithm to partition the input image into multiple segments, e.g., sets of pixels each of which represents a portion of the input image. The image segmentation algorithm may assign a label (i.e., category) to each set of pixels, such that pixels with the same label share certain common characteristics. The encoder may then group the pixels according to their labels and designate each group as a distinct object.
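A toy, non-limiting sketch of the image segmentation stage is shown below: a simple threshold assigns a foreground/background label to each pixel, and connected foreground pixels are grouped into distinct objects. The thresholding rule is merely a placeholder for whatever segmentation algorithm the encoder actually uses.

```python
# Toy segmentation sketch: threshold a grayscale image into foreground and
# background labels, then group connected foreground pixels into objects.
import numpy as np
from scipy import ndimage

def segment_objects(gray, thresh=128.0):
    mask = gray > thresh                       # per-pixel label: foreground vs. background
    labels, num_objects = ndimage.label(mask)  # group pixels sharing the same label
    return [np.argwhere(labels == k) for k in range(1, num_objects + 1)]
```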
After the input image is segmented, i.e., after at least one object is recognized, the encoder may extract, from the input image, features which are characteristic of the object. The features may be based on the size of the object or on its shape, such as the area of the object, the length and the width of the object, the distance around the perimeter of the object, the rectangularity of the object shape, and the circularity of the object shape. The features associated with each object can be combined to form a feature vector.
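The shape features listed above can be computed from a binary object mask, for example as in the following non-limiting sketch. The boundary-pixel perimeter estimate and the circularity formula 4πA/P² are common conventions adopted here purely for illustration.

```python
# Illustrative shape features computed from a binary object mask.
import numpy as np
from scipy import ndimage

def shape_features(mask):
    mask = mask.astype(bool)
    area = float(mask.sum())
    ys, xs = np.nonzero(mask)
    height = int(ys.max() - ys.min() + 1)          # bounding-box length
    width = int(xs.max() - xs.min() + 1)           # bounding-box width
    boundary = mask & ~ndimage.binary_erosion(mask)
    perimeter = float(boundary.sum())              # boundary pixel count
    rectangularity = area / float(height * width)  # fill ratio of the bounding box
    circularity = 4.0 * np.pi * area / perimeter ** 2 if perimeter else 0.0
    return np.array([area, height, width, perimeter, rectangularity, circularity])
```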
At the classification stage, the encoder may classify each recognized object on the basis of the feature vector associated with the respective object. The feature values can be viewed as coordinates of a point in an N-dimensional space (one feature value implies a one-dimensional space, two features imply a two-dimensional space, and so on) , and the classification is a process for determining the sub-space of the feature space to which the feature vector belongs. Since each sub-space corresponds to a distinct object, the classification accomplishes the task of object identification. For example, the encoder may classify a recognized object into one of multiple predetermined classes, such as human, face, animal, automobile or vegetation. In some embodiments, the encoder may perform the classification based on a model. The model may be trained using pre-labeled training data sets that are associated with known objects. In some embodiments, the encoder may perform the classification using a machine learning technology, such as convolutional neural network (CNN) , deep neural network (DNN) , etc.
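A minimal, non-limiting sketch of the classification stage follows, using a nearest-centroid rule over the feature vectors to stand in for the sub-space assignment described above. The centroid values are hypothetical; as noted in the text, a trained CNN or DNN classifier could be used instead.

```python
# Minimal nearest-centroid classifier over feature vectors (illustrative only).
import numpy as np

CLASS_CENTROIDS = {                     # hypothetical per-class mean feature vectors
    "person": np.array([5000.0, 120.0, 60.0, 400.0, 0.7, 0.4]),
    "car":    np.array([9000.0, 80.0, 160.0, 500.0, 0.8, 0.5]),
}

def classify(feature_vector):
    distances = {c: float(np.linalg.norm(feature_vector - mu))
                 for c, mu in CLASS_CENTROIDS.items()}
    return min(distances, key=distances.get)   # class of the nearest sub-space
```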
Still referring to FIG. 6, at step 604, the encoder determines a size adaptation parameter for each of the one or more objects. Specifically, object sizes may impact the precision of machine vision tasks. For example, when a neural network is trained for recognizing human faces with a dimension of 224×224 pixels, performing inference for objects in the trained size (i.e., 224×224 pixels) usually results in the highest precision. Thus, each object class has a corresponding target object size that can maximize the precision of machine vision tasks in analyzing objects belonging to the particular class. For example, the following Tables 1-3 show the possible bitrate gains achieved by down scaling various object classes. The bitrate gain in Table 1 is measured by the quality of detection with respect to the score parameter from a neural network used for a machine vision task. The bitrate gain in Table 2 is measured by the quality of detection with respect to the best attained Intersection over Union (IoU). The bitrate gain in Table 3 is measured by the quality of detection with respect to matching categorization with a threshold of IoU=0.5. In Tables 1-3, the values ranging from 0 to 1 indicate the ratio of a down-scaled object size versus its original size. As shown in Tables 1-3, each object class has certain size(s) that can maximize the bitrate gain.
Table 1: Quality of detection versus object size, with respect to score parameter from neural network
Table 2: Quality of detection versus object size, with respect to best attained IoU
Table 3: Quality of detection versus object size, with respect to matching categorization with threshold of IoU=0.5
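For reference, the IoU measure used in Tables 2 and 3 can be computed for a pair of axis-aligned bounding boxes as in the short sketch below; the (x0, y0, x1, y1) box layout is an assumption of this example.

def iou(box_a, box_b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes; a detection is
    counted as a match in Table 3 when iou(...) >= 0.5."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0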
Based on the relationship between object classes and their corresponding target object sizes, the encoder may determine a target object size for each of the one or more objects detected from the input image. The encoder may further determine a size adaptation parameter for each detected object to resize it to the corresponding target size. For example, if a first object’s initial size in the input image is larger than its target size, the size adaptation parameter associated with the first object may specify a ratio for down scaling the first object to its target size; if a second object’s initial size in the input image is smaller than its target size, the size adaptation parameter associated with the second object may specify a ratio for up scaling the second object to its target size; and if a third object’s initial size in the input image is the same as its target size, the size adaptation parameter associated with the third object may specify a 1:1 ratio for keeping the third object in its initial size.
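A minimal sketch of this determination, assuming the class-to-target-size relationship is held in a lookup table; the class names and pixel sizes below are purely illustrative.

# Hypothetical lookup table: object class -> target size (in pixels) that maximizes
# the precision of the machine vision task, e.g. the size the network was trained on.
TARGET_SIZE = {"face": 224, "person": 320, "automobile": 160}

def size_adaptation_parameter(object_class: str, initial_size: int) -> float:
    """Return the rescaling ratio for one detected object: a value below 1.0 means
    down scaling, above 1.0 means up scaling, and 1.0 keeps the initial size."""
    target = TARGET_SIZE.get(object_class, initial_size)
    return target / initial_size

# Example: a 448-pixel face would get a 0.5 ratio (down scaling to 224 pixels).
ratio = size_adaptation_parameter("face", 448)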
At step 606, the encoder compresses the input image based on the determined size adaptation parameters. Specifically, prior to encoding, the encoder may modify the image by rescaling each of the one or more detected objects based on their respective size adaptation parameters. In some embodiments, the encoder may perform the rescaling by resampling pixel information associated with each of the one or more detected objects. In particular, in the case that an object’s target size is smaller than its initial size, the encoder may down-sample pixel information associated with the object. For example, the size adaptation parameter associated with the object may be a signal decimation parameter specifying a pattern or frequency for removing the pixels associated with the object, e.g., removing every tenth pixel. The modified input image with the down-sampled object is encoded in a bitstream, which is output by the encoder. In some embodiments, the size adaptation parameters associated with the objects may be included in the encoded bitstreams, so that a decoder receiving the bitstream can directly use the size adaptation parameters to scale the objects back to their original sizes.
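As one possible realization of such a signal decimation parameter, the sketch below removes every n-th row and column from an object's pixel patch; with n = 10 it reproduces the "removing every tenth pixel" example above. The function name is hypothetical.

import numpy as np

def decimate(patch: np.ndarray, n: int = 10) -> np.ndarray:
    """Down-sample a pixel patch by removing every n-th row and every n-th column."""
    rows = [i for i in range(patch.shape[0]) if (i + 1) % n != 0]
    cols = [j for j in range(patch.shape[1]) if (j + 1) % n != 0]
    return patch[np.ix_(rows, cols)]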
FIG. 7 shows an example of compressing the input image based on the determined size adaptation parameters. As described above, the encoder detects object 712, object 714, and background 716 from input image 710. Object 712 and object 714 may have the same or different size adaptation parameters. The encoder may down scale object 712 to object 722 according to the size adaptation parameter associated with object 712, and down scale object 714 to object 724 according to the size adaptation parameter associated with object 714. Moreover, the encoder may down scale background 716 to background 726, such that object 722, object 724, and background 726 collectively form an image 720 that is proportionally down scaled from input image 710. Image 720 is then encoded by the encoder to generate an encoded bitstream.
FIG. 8 is a block diagram illustrating an exemplary implementation of method 600, consistent with embodiments of the present disclosure. As shown in FIG. 8, input image data 802 includes one or more input images. For each input image, the encoder performs object detection 810 on the input image to detect one or more initial object images 811a, 811b, ..., 811n. The encoder then determines size adaptation information 812 associated with each initial object image 811. Size adaptation information 812 may include a size adaptation parameter determined based on the object class associated with the respective initial object image 811. Based on size adaptation information 812, the encoder modifies the input image by performing size adaptation 813 (e.g., down scaling) on the respective initial object image 811 to generate a respective rescaled object image 814, which is encoded (step 815) to generate an object bitstream 816. In some embodiments, multiple rescaled object images 814a-814n derived from a same input image may be encoded separately (815a-815n) to generate multiple bitstreams 816a-816n, respectively. In some embodiments, rather than encoding the multiple rescaled object images 814a-814n separately, the modified input image may be encoded to generate a single bitstream (not shown in FIG. 8) .
Still referring to FIG. 8, the encoder also performs background extraction 820 on each input image to extract an initial background image 821. Initial background image 821 may be extracted by subtracting initial object images 811a, 811b, ..., 811n from the input image. The encoder may determine size adaptation information 822 associated with initial background image 821. Size adaptation information 822 may be determined based on size adaptation information 812 associated with initial object images 811. For example, size adaptation information 822 may include a size adaptation parameter set to be equal to the size adaptation parameter associated with one of initial object images 811a, 811b, ..., 811n detected from the same input image. The encoder uses size adaptation information 822 to perform size adaptation 823 (e.g., down scaling) on initial background image 821 to generate rescaled background image 824, which is then encoded (step 825) to generate a background bitstream 826. As shown in FIG. 8, rescaled background image 824 may be encoded in parallel to encoding 815 of rescaled object images 814, to generate separate bitstreams. Alternatively, rescaled background image 824 and one or more rescaled object images 814 may be encoded to generate a single bitstream (not shown in FIG. 8).
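A simplified sketch of background extraction 820, under the assumption that each detected object is available as a bounding box; the helper name and the (top, bottom, left, right) box layout are assumptions of this example.

import numpy as np

def extract_background(image: np.ndarray, object_boxes: list) -> np.ndarray:
    """Subtract the detected object regions from the input image by blanking them,
    leaving the initial background image."""
    background = image.copy()
    for top, bottom, left, right in object_boxes:
        background[top:bottom, left:right] = 0  # remove object pixels
    return background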
In some embodiments, size adaptation information 812, 822 may be encoded into object bitstreams 816 and background bitstream 826, so that size adaptation information 812, 822 can be signaled to the decoder and used by the decoder to reconstruct initial object images 811 and initial background image 821.
In some embodiments, the encoder may encode rescaled object images 814 and rescaled background image 824 in a single atlas image. For example, FIG. 9 schematically illustrates an atlas image 900, which includes a collection of rescaled object images 901, 904, ..., 912, and a rescaled background image 920. Consistent with the disclosed embodiments, the rescaled object images and background image combined in an atlas are not necessarily from the same input image, but rather may be sourced from multiple different input images.
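One possible way to combine rescaled object and background patches into a single atlas image such as atlas image 900 is a greedy row packing, sketched below for single-channel patches. The packing strategy and helper name are assumptions of this example; the returned patch positions would also have to be conveyed to the decoder so the atlas can be unpacked.

import numpy as np

def pack_atlas(patches: list, atlas_shape: tuple) -> tuple:
    """Place patches left to right, starting a new row when the current row is full.
    Returns the atlas image and the (top, left) position of each patch.
    Assumes the atlas is large enough to hold all patches."""
    atlas = np.zeros(atlas_shape, dtype=patches[0].dtype)
    positions, x, y, row_height = [], 0, 0, 0
    for patch in patches:
        h, w = patch.shape[:2]
        if x + w > atlas_shape[1]:          # current row is full; start a new one
            x, y, row_height = 0, y + row_height, 0
        atlas[y:y + h, x:x + w] = patch
        positions.append((y, x))
        x, row_height = x + w, max(row_height, h)
    return atlas, positions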
On the decoder side, the bitstreams output from the encoder can be decoded and used to restore the objects and backgrounds in their original sizes. FIG. 10 is a flowchart of an exemplary method 1000 for decoding a bitstream, according to some embodiments consistent with the present disclosure. For example, method 1000 may be performed by an image data decoder, such as image/video decoder 144 in FIG. 1. As shown in FIG. 10, method 1000 includes the following steps 1002-1008.
At step 1002, the decoder receives one or more bitstreams including image data representing one or more rescaled objects and a rescaled background.
At step 1004, the decoder decodes the one or more bitstreams. For example, the decoder may decode the one or more bitstreams using entropy decoding 310 shown in FIG. 3.
Still referring to FIG. 10, at step 1006, the decoder determines size adaptation information associated with the one or more rescaled objects and the rescaled background, respectively. In some embodiments, the size adaptation information associated with each object or background may include a size adaptation parameter specifying the rescaling ratio. In some embodiments, the size adaptation information is signaled in the one or more bitstreams. Thus, the decoder can extract the size adaptation information from the decoded bitstreams. In some embodiments, the size adaptation information is not signaled in the one or more bitstreams. The decoder may determine the object class associated with each rescaled object, and determine the respective size adaptation information based on: the object class associated with the rescaled object, and a target size associated with the object class. The target size may be obtained from a lookup table specifying a relationship between object classes and object sizes. The relationship may be signaled in the one or more bitstreams, or may be predefined and stored in the decoder.
At step 1008, the decoder generates, based on the size adaptation information, a reconstructed image including the one or more objects and the background in their respective original sizes. Specifically, the decoder restores the rescaled objects and rescaled background to their original sizes by resampling (e.g., up-sampling) , based on the size adaptation information, pixel information associated with the one or more objects and the background. For example, if the size adaptation information indicates that the rescaled objects and rescaled background are down scaled from their original sizes, the decoder up scales the rescaled objects and rescaled background to their original sizes. In some embodiments, the size adaptation information associated with a rescaled object or background may include a signal decimation parameter used for generating the rescaled object or background, and the decoder may interpolate or pad, based on the signal decimation parameter, pixel information associated with the rescaled object or background. For example, if the signal decimation parameter indicates that a rescaled object or a rescaled background is generated by removing every tenth pixel, the decoder may perform an interpolation or padding process to add new pixels for the object or background.
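A minimal sketch of this restoration step, assuming the size adaptation parameter is a scalar down-scaling ratio and that separable linear interpolation is used to rebuild the missing pixels; the function name is an assumption of this example.

import numpy as np

def upscale(patch: np.ndarray, ratio: float) -> np.ndarray:
    """Restore a down-scaled patch to approximately its original size using separable
    linear interpolation; `ratio` is the encoder's down-scaling factor (e.g. 0.5)."""
    out_h = int(round(patch.shape[0] / ratio))
    out_w = int(round(patch.shape[1] / ratio))
    ys = np.linspace(0.0, patch.shape[0] - 1, out_h)
    xs = np.linspace(0.0, patch.shape[1] - 1, out_w)
    # Interpolate along the vertical axis first (per column), then horizontally (per row).
    cols = np.stack([np.interp(ys, np.arange(patch.shape[0]), patch[:, j])
                     for j in range(patch.shape[1])], axis=1)    # shape (out_h, W)
    return np.stack([np.interp(xs, np.arange(patch.shape[1]), cols[i, :])
                     for i in range(out_h)], axis=0)             # shape (out_h, out_w)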
In some embodiments, the decoder may combine the resampled pixel information associated with the one or more objects and the background, to generate a composite image. For example, the decoder may generate the composite image by pasting the restored object(s) on the restored background. When two restored objects overlap with each other, the decoder may overlay one restored object over the other restored object, or add the pixel values for each pixel in the overlapping area.
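A minimal sketch of such compositing, assuming each restored object carries its original (top, left) position in the image; overlapping objects are simply overlaid in the order given.

import numpy as np

def composite(background: np.ndarray, objects: list) -> np.ndarray:
    """Paste restored object patches onto the restored background; each entry of
    `objects` is a (patch, (top, left)) pair. Later patches overlay earlier ones."""
    out = background.copy()
    for patch, (top, left) in objects:
        h, w = patch.shape[:2]
        out[top:top + h, left:left + w] = patch
    return out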
In some embodiments, steps 1006 and 1008 may be omitted by the decoder. For example, when the downstream application is to perform machine vision tasks, the decoder may not need to rescale the objects and background back to their original sizes. The decoder may reconstruct the image using the rescaled objects and background, and the encoder may omit the size adaptation information from encoding and transmission to the decoder. As described above, the encoder may scale down the objects and background to improve performance of the machine vision task, and in this circumstance, the decoder may not need to rescale the objects and background to their original sizes for the machine vision tasks.
The image data compression techniques described in the present disclosure can be used in low-bitrate situations. By performing size adaptation of the input images, the total bitrate for transmitting the image data can be reduced.
In some embodiments, a non-transitory computer-readable storage medium including instructions is also provided, and the instructions may be executed by a device (such as the disclosed encoder and decoder) for performing the above-described methods. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM or any other flash memory, NVRAM, a cache, a register, any other memory chip or cartridge, and networked versions of the same. The device may include one or more processors (CPUs), an input/output interface, a network interface, and/or a memory.
It should be noted that relational terms herein such as “first” and “second” are used only to differentiate an entity or operation from another entity or operation, and do not require or imply any actual relationship or sequence between these entities or operations. Moreover, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open-ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
It will be appreciated that the present invention is not limited to the exact construction that has been described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from the scope thereof. It is intended that the scope of the invention should only be limited by the appended claims.

Claims (57)

  1. A computer-implemented encoding method, comprising:
    detecting a first object from an input image;
    determining a first size adaptation parameter associated with the first object; and
    compressing the input image based on the first size adaptation parameter.
  2. The method of claim 1, wherein:
    detecting the first object from the input image comprises determining an object class associated with the first object; and
    determining the first size adaptation parameter comprises determining the first size adaptation parameter based on:
    the object class associated with the first object, and
    a target size associated with the object class.
  3. The method of claim 1 or 2, further comprising:
    outputting a bitstream comprising information associated with the first size adaptation parameter.
  4. The method of any of claims 1-3, wherein compressing the input image based on the first size adaptation parameter comprises:
    modifying the input image by resampling, based on the first size adaptation parameter, pixel information associated with the first object; and
    compressing the modified input image.
  5. The method of claim 4, wherein the first size adaptation parameter comprises a signal decimation parameter, and resampling the pixel information associated with the first object comprises:
    decimating, based on the signal decimation parameter, the pixel information associated with the first object.
  6. The method of any of claims 1-5, further comprising:
    extracting a background of the input image; and
    before compressing the input image, resampling pixel information associated with the background.
  7. The method of claim 6, further comprising outputting:
    a first bitstream comprising compressed pixel information associated with the first object, and
    a second bitstream comprising compressed pixel information associated with the background.
  8. The method of any of claims 1-7, further comprising:
    detecting a second object from the input image;
    determining a second size adaptation parameter associated with the second object; and
    before compressing the input image, resampling, based on the second size adaptation parameter, pixel information associated with the second object.
  9. The method of claim 8, further comprising outputting:
    a first bitstream comprising compressed pixel information associated with the first object, and
    a second bitstream comprising compressed pixel information associated with the second object.
  10. The method of any of claims 1-9, wherein compressing the input image comprises:
    detecting one or more objects from one or more input images;
    extracting a background of at least one of the one or more input images;
    determining size adaptation parameters associated with the one or more objects, respectively;
    resampling, based on the size adaptation parameters, pixel information associated with the one or more objects and the background; and
    generating an encoded image representing the resampled one or more objects and the resampled background.
  11. A device comprising:
    a memory storing instructions; and
    one or more processors configured to execute the instructions to cause the device to perform operations comprising:
    detecting a first object from an input image;
    determining a first size adaptation parameter associated with the first object; and
    compressing the input image based on the first size adaptation parameter.
  12. The device of claim 11, wherein:
    detecting the first object from the input image comprises determining an object class associated with the first object; and
    determining the first size adaptation parameter comprises determining the first size adaptation parameter based on:
    the object class associated with the first object, and
    a target size associated with the object class.
  13. The device of claim 11 or 12, wherein the operations further comprise:
    outputting a bitstream comprising information associated with the first size adaptation parameter.
  14. The device of any of claims 11-13, wherein compressing the input image based on the first size adaptation parameter comprises:
    modifying the input image by resampling, based on the first size adaptation parameter, pixel information associated with the first object; and
    compressing the modified input image.
  15. The device of claim 14, wherein the first size adaptation parameter comprises a signal decimation parameter, and resampling the pixel information associated with the first object comprises:
    decimating, based on the signal decimation parameter, the pixel information associated with the first object.
  16. The device of any of claims 11-15, wherein the operations further comprise:
    extracting a background of the input image; and
    before compressing the input image, resampling pixel information associated with the background.
  17. The device of claim 16, wherein the operations further comprise outputting:
    a first bitstream comprising compressed pixel information associated with the first object, and
    a second bitstream comprising compressed pixel information associated with the background.
  18. The device of any of claims 11-17, wherein the operations further comprise:
    detecting a second object from the input image;
    determining a second size adaptation parameter associated with the second object; and
    before compressing the input image, resampling, based on the second size adaptation parameter, pixel information associated with the second object.
  19. The device of claim 18, wherein the operations further comprise outputting:
    a first bitstream comprising compressed pixel information associated with the first object, and
    a second bitstream comprising compressed pixel information associated with the second object.
  20. The device of any of claims 11-19, wherein compressing the input image comprises:
    detecting one or more objects from one or more input images;
    extracting a background of at least one of the one or more input images;
    determining size adaptation parameters associated with the one or more objects, respectively;
    resampling, based on the size adaptation parameters, pixel information associated with the one or more objects and the background; and
    generating an encoded image representing the resampled one or more objects and the resampled background.
  21. A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:
    detecting a first object from an input image;
    determining a first size adaptation parameter associated with the first object; and
    compressing the input image based on the first size adaptation parameter.
  22. The non-transitory computer-readable medium of claim 21, wherein:
    detecting the first object from the input image comprises determining an object class associated with the first object; and
    determining the first size adaptation parameter comprises determining the first size adaptation parameter based on:
    the object class associated with the first object, and
    a target size associated with the object class.
  23. The non-transitory computer-readable medium of claim 21 or 22, wherein the method further comprises:
    outputting a bitstream comprising information associated with the first size adaptation parameter.
  24. The non-transitory computer-readable medium of any of claims 21-23, wherein compressing the input image based on the first size adaptation parameter comprises:
    modifying the input image by resampling, based on the first size adaptation parameter, pixel information associated with the first object; and
    compressing the modified input image.
  25. The non-transitory computer-readable medium of claim 24, wherein the first size adaptation parameter comprises a signal decimation parameter, and resampling the pixel information associated with the first object comprises:
    decimating, based on the signal decimation parameter, the pixel information associated with the first object.
  26. The non-transitory computer-readable medium of any of claims 21-25, wherein the method further comprises:
    extracting a background of the input image; and
    before compressing the input image, resampling pixel information associated with the background.
  27. The non-transitory computer-readable medium of claim 26, wherein the method further comprises outputting:
    a first bitstream comprising compressed pixel information associated with the first object, and
    a second bitstream comprising compressed pixel information associated with the background.
  28. The non-transitory computer-readable medium of any of claims 21-27, wherein the method further comprises:
    detecting a second object from the input image;
    determining a second size adaptation parameter associated with the second object; and
    before compressing the input image, resampling, based on the second size adaptation parameter, pixel information associated with the second object.
  29. The non-transitory computer-readable medium of claim 28, wherein the method further comprises outputting:
    a first bitstream comprising compressed pixel information associated with the first object, and
    a second bitstream comprising compressed pixel information associated with the second object.
  30. The non-transitory computer-readable medium of any of claims 21-29, wherein compressing the input image comprises:
    detecting one or more objects from one or more input images;
    extracting a background of at least one of the one or more input images;
    determining size adaptation parameters associated with the one or more objects, respectively;
    resampling, based on the size adaptation parameters, pixel information associated with the one or more objects and the background; and
    generating an encoded image representing the resampled one or more objects and the resampled background.
  31. A computer-implemented decoding method, comprising:
    receiving one or more bitstreams comprising image data representing a first object in a first size;
    decoding the one or more bitstreams;
    determining a first size adaptation parameter associated with the first object; and
    generating, based on the first size adaptation parameter, a reconstructed image comprising the first object in a second size different from the first size.
  32. The method of claim 31, wherein determining the first size adaptation parameter associated with the first object comprises:
    determining an object class associated with the first object; and
    determining the first size adaptation parameter based on:
    the object class associated with the first object, and
    a target size associated with the object class.
  33. The method of claim 32, wherein the one or more bitstreams comprise information signaling a relationship between object classes and object sizes, and the method further comprises:
    determining the first size adaptation parameter based on the signaled relationship between object classes and object sizes.
  34. The method of claim 32, further comprising:
    determining the first size adaptation parameter based on a predetermined relationship between object classes and object sizes.
  35. The method of claim 31, wherein the one or more bitstreams comprise information signaling the first size adaptation parameter.
  36. The method of any of claims 31-35, wherein generating the reconstructed image comprising the first object in the second size comprises:
    resampling, based on the first size adaptation parameter, pixel information associated with the first object.
  37. The method of claim 36, wherein the first size adaptation parameter comprises a signal decimation parameter, and generating the reconstructed image comprising the first object in the second size comprises:
    interpolating, based on the signal decimation parameter, the pixel information associated with the first object.
  38. The method of any of claims 31-37, further comprising:
    extracting, from the decoded one or more bitstreams, pixel information associated with a background;
    resampling, based on the first size adaptation parameter, the pixel information associated with the background; and
    generating the reconstructed image based on the resampled pixel information associated with the background.
  39. The method of any of claims 31-38, further comprising:
    extracting, from the decoded one or more bitstreams, pixel information associated with a plurality of objects;
    determining a plurality of size adaptation parameters associated with the plurality of objects, respectively;
    resampling, based on the plurality of size adaptation parameters, the pixel information associated with the plurality of objects; and
    combining the resampled pixel information to generate a composite image.
  40. A device comprising:
    a memory storing instructions; and
    one or more processors configured to execute the instructions to cause the device to perform operations comprising:
    receiving one or more bitstreams comprising image data representing a first object in a first size;
    decoding the one or more bitstreams;
    determining a first size adaptation parameter associated with the first object; and
    generating a reconstructed image comprising the first object in a second size different from the first size, the generating being based on the first size adaptation parameter.
  41. The device of claim 40, wherein determining the first size adaptation parameter associated with the first object comprises:
    determining an object class associated with the first object; and
    determining the first size adaptation parameter based on:
    the object class associated with the first object, and
    a target size associated with the object class.
  42. The device of claim 41, wherein the one or more bitstreams comprise information signaling a relationship between object classes and object sizes, and the operations further comprise:
    determining the first size adaptation parameter based on the signaled relationship between object classes and object sizes.
  43. The device of claim 41, wherein the operations further comprise:
    determining the first size adaptation parameter based on a predetermined relationship between object classes and object sizes.
  44. The device of claim 40, wherein the one or more bitstreams comprise information signaling the first size adaptation parameter.
  45. The device of any of claims 40-44, wherein generating the reconstructed image comprising the first object in the second size comprises:
    resampling, based on the first size adaptation parameter, pixel information associated with the first object.
  46. The device of claim 45, wherein the first size adaptation parameter comprises a signal decimation parameter, and generating the reconstructed image comprising the first object in the second size comprises:
    interpolating, based on the signal decimation parameter, the pixel information associated with the first object.
  47. The device of any of claims 40-46, wherein the operations further comprise:
    extracting, from the decoded one or more bitstreams, pixel information associated with a background;
    resampling, based on the first size adaptation parameter, the pixel information associated with the background; and
    generating the reconstructed image based on the resampled pixel information associated with the background.
  48. The device of any of claims 40-47, wherein the operations further comprise:
    extracting, from the decoded one or more bitstreams, pixel information associated with a plurality of objects;
    determining a plurality of size adaptation parameters associated with the plurality of objects, respectively;
    resampling, based on the plurality of size adaptation parameters, the pixel information associated with the plurality of objects; and
    combining the resampled pixel information to generate a composite image.
  49. A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of a device to cause the device to perform a method comprising:
    receiving one or more bitstreams comprising image data representing a first object in a first size;
    decoding the one or more bitstreams;
    determining a first size adaptation parameter associated with the first object; and
    generating a reconstructed image comprising the first object in a second size different from the first size, the generating being based on the first size adaptation parameter.
  50. The non-transitory computer-readable medium of claim 49, wherein determining the first size adaptation parameter associated with the first object comprises:
    determining an object class associated with the first object; and
    determining the first size adaptation parameter based on:
    the object class associated with the first object, and
    a target size associated with the object class.
  51. The non-transitory computer-readable medium of claim 50, wherein the one or more bitstreams comprise information signaling a relationship between object classes and object sizes, and the method further comprises:
    determining the first size adaptation parameter based on the signaled relationship between object classes and object sizes.
  52. The non-transitory computer-readable medium of claim 50, wherein the method further comprises:
    determining the first size adaptation parameter based on a predetermined relationship between object classes and object sizes.
  53. The non-transitory computer-readable medium of claim 49, wherein the one or more bitstreams comprise information signaling the first size adaptation parameter.
  54. The non-transitory computer-readable medium of any of claims 49-53, wherein generating the reconstructed image comprising the first object in the second size comprises:
    resampling, based on the first size adaptation parameter, pixel information associated with the first object.
  55. The non-transitory computer-readable medium of claim 54, wherein the first size adaptation parameter comprises a signal decimation parameter, and generating the reconstructed image comprising the first object in the second size comprises:
    interpolating, based on the signal decimation parameter, the pixel information associated with the first object.
  56. The non-transitory computer-readable medium of any of claims 49-55, wherein the method further comprises:
    extracting, from the decoded one or more bitstreams, pixel information associated with a background;
    resampling, based on the first size adaptation parameter, the pixel information associated with the background; and
    generating the reconstructed image based on the resampled pixel information associated with the background.
  57. The non-transitory computer-readable medium of any of claims 49-56, wherein the method further comprises:
    extracting, from the decoded one or more bitstreams, pixel information associated with a plurality of objects;
    determining a plurality of size adaptation parameters associated with the plurality of objects, respectively;
    resampling, based on the plurality of size adaptation parameters, the pixel information associated with the plurality of objects; and
    combining the resampled pixel information to generate a composite image.