WO2023005740A1 - Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device - Google Patents
- Publication number
- WO2023005740A1 · PCT/CN2022/106507 (CN2022106507W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- level
- semantic information
- features
- low
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/132—Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
Definitions
- the embodiments of the present application relate to the field of image technology, and specifically relate to image encoding, decoding, reconstruction, and analysis methods, a system, and an electronic device.
- images can be divided into still images and the image frames of dynamic videos.
- an image may be encoded using image coding technology.
- Image coding, also known as image compression, refers to the technology of representing the information contained in an image with fewer bits while satisfying a certain image quality.
- the receiving end can decode the image to realize image reconstruction. Based on the reconstructed image, the user can view the image at the receiving end, meeting the user's image viewing needs; typical application scenarios include image display, video playback, etc.
- the embodiments of the present application provide an image encoding, decoding, reconstruction, analysis method, system, and electronic device to integrate image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision.
- the embodiment of the present application provides an image coding method, including:
- an embodiment of the present application provides an image coding system, including:
- a semantic extractor is used to extract high-level semantic information of the original image, and the high-level semantic information is used for image analysis tasks of the original image;
- a feature extractor configured to extract low-level features of the original image, and the low-level features and the high-level semantic information are used for image reconstruction tasks of the original image;
- a first encoder configured to encode the high-level semantic information to generate a first code stream;
- a second encoder configured to encode the low-level features to generate a second code stream.
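The four components above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the patent's implementation: the two extractors are trivial placeholders for CNN-based extractors, and zlib stands in for a FLIF-style lossless encoder.

```python
import zlib

# Hypothetical placeholders: a real system would use CNN-based semantic and
# feature extractors and a FLIF-style lossless encoder.
def extract_high_level_semantics(image_bytes: bytes) -> bytes:
    # pretend a short digest stands in for conceptual-level semantics
    return image_bytes[:4]

def extract_low_level_features(image_bytes: bytes) -> bytes:
    # pretend the raw pixels stand in for visual-layer detail features
    return image_bytes

def encode_image(image_bytes: bytes):
    semantics = extract_high_level_semantics(image_bytes)
    features = extract_low_level_features(image_bytes)
    first_stream = zlib.compress(semantics)   # first code stream (semantics)
    second_stream = zlib.compress(features)   # second code stream (features)
    return first_stream, second_stream

first, second = encode_image(b"example-image-pixels")
```

Either stream can then be transmitted and decoded independently, which is what lets the receiving end run an analysis task from the first stream alone.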
- the embodiment of the present application provides an image decoding method, including:
- the high-level semantic information is used to perform an image analysis task of the original image
- an image decoding system including:
- the first decoder is configured to decode the first code stream corresponding to the high-level semantic information of the original image to obtain the high-level semantic information; the high-level semantic information is used to perform an image analysis task of the original image;
- the second decoder is configured to decode the second code stream corresponding to the low-level features of the original image to obtain the low-level features
- the predictor performs image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image.
- the embodiment of the present application provides an image reconstruction method, including:
- the embodiment of the present application provides an image analysis method, including:
- the target code stream is obtained from multiple code streams; the multiple code streams include at least one first code stream, each corresponding to one piece of high-level semantic information of the original image, and a second code stream corresponding to the low-level features of the original image; the target code stream is the first code stream, among the at least one first code stream, whose high-level semantic information is suitable for the image analysis task;
- the image analysis task is performed according to the decoded high-level semantic information.
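The target-stream selection described above can be sketched as a lookup over the available first code streams. The task names and stream payloads below are hypothetical dummies, not values from the patent.

```python
# Hypothetical sketch: each first code stream carries one piece of high-level
# semantic information; the target code stream is the one whose semantics
# suit the requested image analysis task.
first_streams = {
    "classification": b"\x01class-semantics",
    "object_detection": b"\x02detection-semantics",
    "instance_segmentation": b"\x03segmentation-semantics",
}

def select_target_stream(task: str, streams: dict) -> bytes:
    if task not in streams:
        raise KeyError(f"no first code stream carries semantics for {task!r}")
    return streams[task]

target = select_target_stream("object_detection", first_streams)
```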
- an embodiment of the present application provides an electronic device, including at least one memory and at least one processor, where the memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to execute the image coding method described in the first aspect above, the image decoding method described in the third aspect above, the image reconstruction method described in the fifth aspect above, or the image analysis method described in the sixth aspect above.
- the embodiment of the present application provides a storage medium, the storage medium stores one or more computer-executable instructions, and when the one or more computer-executable instructions are executed, the methods described in the above aspects are implemented.
- the image coding method provided in the embodiment of the present application can extract high-level semantic information and low-level features from the original image respectively. Since the high-level semantic information can express the semantics of the original image at the conceptual level, it can be used for the image analysis task of the original image, while the high-level semantic information and low-level features can be combined for the image reconstruction task of the original image. Furthermore, in the embodiment of the present application, the high-level semantic information and low-level features are respectively encoded to generate the first code stream and the second code stream, so that they can be transmitted to the receiving end in the form of code streams.
- the embodiment of the present application realizes the use of a set of coding schemes to integrate image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision.
- FIG. 1 is a block diagram of an image transmission system provided by an embodiment of the present application.
- Fig. 2A is a flow chart of the image encoding and decoding method provided by the embodiment of the present application.
- FIG. 2B is a schematic structural diagram of an image coding system provided by an embodiment of the present application.
- FIG. 2C is a schematic structural diagram of an image encoding and decoding system provided by an embodiment of the present application.
- FIG. 3A is another flow chart of the image encoding and decoding method provided by the embodiment of the present application.
- FIG. 3B is another schematic structural diagram of an image coding system provided by an embodiment of the present application.
- FIG. 3C is another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.
- FIG. 4A is a flowchart of a method for obtaining a predicted image according to an embodiment of the present application.
- FIG. 4B is a schematic structural diagram of the predictor provided by the embodiment of the present application.
- FIG. 4C is a schematic structural diagram of a convolutional network provided by an embodiment of the present application.
- FIG. 4D is another structural schematic diagram of the predictor provided by the embodiment of the present application.
- FIG. 5A is a schematic structural diagram of a feature extractor provided in an embodiment of the present application.
- FIG. 5B is another schematic structural diagram of the feature extractor provided by the embodiment of the present application.
- FIG. 6 is another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.
- FIG. 7 is another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.
- FIG. 8 is a flow chart of the image analysis method provided by the embodiment of the present application.
- FIG. 9A is a schematic diagram of effect comparison provided by the embodiment of the present application.
- FIG. 9B is a schematic diagram of another effect comparison provided by the embodiment of the present application.
- FIG. 9C is a schematic diagram of yet another effect comparison provided by the embodiment of the present application.
- FIG. 9D is a schematic diagram of another effect comparison provided by the embodiment of the present application.
- FIG. 10 is a diagram of an application example provided by the embodiment of the present application.
- FIG. 11 is a block diagram of an electronic device provided by an embodiment of the present application.
- FIG. 1 exemplarily shows a block diagram of an image transmission system 100 provided by an embodiment of the present application.
- the image transmission system 100 may include: an image sending end 110 and an image receiving end 120 .
- the sending end 110 can set the image coding system 111 to realize image coding by using the image coding scheme provided by the embodiment of the present application;
- the receiving end 120 can set the image decoding system 121 to use the image decoding scheme provided by the embodiment of the present application, Implement image decoding.
- the image transmission system 100 shown in FIG. 1 can be applied to image transmission between any devices, including but not limited to: image transmission between terminals, image transmission between terminals and servers, and image transmission between servers,
- the terminal may include smart hardware (for example, smart hardware with image acquisition capabilities such as a smart camera), user equipment such as a mobile phone, and a computer.
- a device is not fixed as the sending end or the receiving end, but changes roles according to whether it is sending or receiving an image. For example, a device becomes the sending end when sending an image, but becomes the receiving end when receiving one.
- either the terminal or the server becomes the sending end when sending an image, so as to use the image coding scheme provided by the embodiment of the present application to realize image encoding; when either of them receives an image, it becomes the receiving end and uses the image decoding scheme provided in the embodiment of the present application to realize image decoding.
- when encoding the image, the sending end 110 needs to support both the user vision-oriented image reconstruction task and the machine vision-oriented image analysis task of the receiving end 120.
- the user vision-oriented image reconstruction task can be understood as implementing image reconstruction at the receiving end 120 so that the user can watch the image at the receiving end 120 .
- the machine vision-oriented image analysis task can be understood as analyzing and processing the image from the computer's perspective at the receiving end; for example, the receiving end implements image classification, object detection, instance segmentation, etc.
- the application scenarios of machine vision-oriented image analysis tasks include, but are not limited to: license plate recognition and road planning in intelligent transportation systems, object detection and road tracking in automatic driving systems, face recognition and facial expression detection and analysis in smart medical systems, abnormal behavior detection, etc.
- the sending end 110 can use the image coding system 111 to implement image encoding, supporting the user vision-oriented image reconstruction task and the machine vision-oriented image analysis task; the receiving end 120 can use the image decoding system 121 to implement image decoding, specifically performing the user vision-oriented image reconstruction task and providing processing data for the receiving end 120 to perform the image analysis task.
- FIG. 2A exemplarily shows a flow chart of the image encoding and decoding method provided by the embodiment of the present application.
- the image encoding system 111 may perform an image encoding process
- the image decoding system 121 may perform an image decoding process.
- the process flow of the method may include the following steps.
- step S20 the image coding system acquires the original image.
- the original image can be regarded as an image that needs to be encoded, for example, an image collected by the sending end 110 using a camera, or an image created by drawing or programming.
- the original image can be input into the image encoding system 111, and the original image is encoded by the image encoding system 111.
- step S21 the image coding system extracts high-level semantic information of the original image.
- step S22 the image coding system extracts low-level features of the original image.
- the high-level semantic information of the image may be the semantics expressed by the image, for example, the objects expressed by the image (such as people, plants, animals, manufactured objects, etc.), the types of objects, the relationship between objects, and so on.
- the low-level features of an image can be understood as detailed feature information such as the color and texture of the image.
- the low-level features of an image can express the rich visual detail information of the image, but carry little conceptual-level semantic information; the high-level semantic information, by contrast, expresses the image's conceptual-level semantics.
- that is, the low-level features may be the features of the visual layer of the image, and the high-level semantic information may be the information of the conceptual layer of the image.
- the high-level semantic information can express the semantics of the original image at the conceptual level, so the high-level semantic information can be used for the image analysis task of the original image. Since the low-level features express the rich visual detail information of the original image, and the high-level semantic information expresses the semantics of the original image at the conceptual level, the low-level features and high-level semantic information can be combined for the image reconstruction task of the original image.
- step S23 the image encoding system encodes the high-level semantic information to generate a first code stream.
- step S24 the image encoding system encodes the low-level features to generate a second code stream.
- the image analysis task and the image reconstruction task of the original image can be performed by the receiving end 120; therefore, after the image coding system obtains the high-level semantic information and low-level features, it can encode them respectively, in order to obtain the first code stream corresponding to the high-level semantic information and the second code stream corresponding to the low-level features.
- the first code stream and the second code stream can be transmitted from the sending end 110 to the receiving end 120 .
- high-level semantic information and low-level features can be encoded using the same encoding method, for example, high-level semantic information and low-level features can be encoded using the same encoder in an image encoding system.
- the same encoder may be a lossless encoder to achieve lossless encoding of high-level semantic information and lossless encoding of low-level features.
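The defining property of the shared lossless encoder can be illustrated with a round trip. zlib is used here as an assumed stand-in, since the embodiment names FLIF, but any lossless codec exhibits the same bit-exact recovery for both payloads.

```python
import zlib

# One shared lossless codec encodes both payloads; decoding recovers each
# bit-exactly, which is what "lossless encoding" means for both streams.
semantics = b"high-level semantic information"
features = b"low-level feature information"

first_stream = zlib.compress(semantics)   # first code stream
second_stream = zlib.compress(features)   # second code stream

restored_semantics = zlib.decompress(first_stream)
restored_features = zlib.decompress(second_stream)
```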
- step S30 the image decoding system decodes the first code stream to obtain high-level semantic information, and the high-level semantic information is used to perform the image analysis task of the original image.
- after the image decoding system acquires the first code stream transmitted by the sending end 110, it may decode the first code stream to obtain the high-level semantic information.
- the high-level semantic information obtained by the image decoding system may provide processing data for the receiver 120 to perform image analysis tasks.
- the receiving end 120 can set image analysis logic (such as an image analysis model) for performing image analysis tasks, and the image decoding system can import the decoded high-level semantic information into the image analysis logic, which performs specific image analysis tasks based on the high-level semantic information, thereby meeting the machine vision-oriented image analysis needs for the original image.
- image analysis logic can perform image analysis tasks such as image classification, object detection, and instance segmentation.
- the image analysis logic may also be set in an external device communicatively connected with the receiving end 120 .
- step S31 the image decoding system decodes the second code stream to obtain low-level features.
- step S32 the image decoding system performs image reconstruction on the original image according to the low-level features and high-level semantic information to obtain a predicted image.
- the low-level features and high-level semantic information can be combined for the image reconstruction task of the original image. Based on this, after decoding the first code stream and the second code stream, the image decoding system can perform image reconstruction on the original image based on the decoded high-level semantic information and low-level features, so as to realize the image reconstruction task of the original image.
- the image decoding system uses high-level semantic information as guidance information for image reconstruction, and reconstructs specific image details expressed by low-level features to obtain a predicted image.
- high-level semantic information and low-level features of the original image are combined.
- the embodiment of the present application can ensure that the basic structure of the predicted image is similar to, or even consistent with, the original image; at the same time, in the reconstruction process, by combining the rich image details expressed by the low-level features, the specific details of the original image (such as the color and texture of objects in it) can be accurately reconstructed, thereby ensuring the accuracy of the local details of the predicted image. Therefore, the embodiments of the present application combine the high-level semantic information and low-level features of the original image to reconstruct it, so that the reconstructed predicted image can have higher accuracy.
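A minimal sketch of this combination (hypothetical, with a fixed blend in place of the learned convolutional layers a real predictor would use): the coarse semantic map is upsampled to the resolution of the low-level feature map and blended with it, semantics guiding the global structure while low-level features supply detail.

```python
import numpy as np

# Hypothetical fusion sketch: semantics act as coarse guidance, low-level
# features carry detail; `alpha` weighting stands in for a learned fusion.
def upsample_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    # nearest-neighbor upsampling of the coarse semantic map
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def predict_image(low_level: np.ndarray, semantics: np.ndarray,
                  alpha: float = 0.8) -> np.ndarray:
    factor = low_level.shape[0] // semantics.shape[0]
    guide = upsample_nearest(semantics, factor)
    # semantics guide the global structure; low-level features fill in detail
    return alpha * low_level + (1 - alpha) * guide

low = np.ones((4, 4))            # dummy low-level feature map
sem = np.full((2, 2), 3.0)       # dummy coarse semantic map
pred = predict_image(low, sem)   # predicted image at feature resolution
```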
- the embodiment of the present application can extract high-level semantic information and low-level features from the original image respectively. Since the high-level semantic information can express the semantics of the original image at the conceptual level, it can be used for the image analysis task of the original image, while the high-level semantic information and low-level features can be combined for the image reconstruction task of the original image. Furthermore, in the embodiment of the present application, the high-level semantic information and low-level features are respectively encoded to generate the first code stream and the second code stream, so that they can be transmitted to the receiving end in the form of code streams.
- the embodiment of the present application realizes the use of a set of coding schemes to integrate image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision.
- FIG. 2B exemplarily shows a schematic structural diagram of the image coding system 111 provided by the embodiment of the present application.
- the image encoding system 111 may include: a semantic extractor 210 , a feature extractor 211 , a first encoder 212 and a second encoder 213 .
- the semantic extractor 210 is used to extract high-level semantic information of the original image
- the feature extractor 211 is used to extract low-level features of the original image.
- the semantic extractor 210 and the feature extractor 211 may be different network layers in the convolutional neural network, and the layer of the semantic extractor 210 in the convolutional neural network is higher than that of the feature extractor 211 .
- the higher the level of a network layer in the convolutional neural network, the more its processing result tends toward the semantic information of the image; conversely, the lower the level, the more its processing result tends toward the low-level detail features of the image.
- the convolutional neural network can include a backbone network, where the backbone network includes standard convolutional layers at the low levels and a high-level semantic information extraction layer at the high levels; when an image is input to the convolutional neural network for processing:
- the processing result of the standard convolution layer can be used as the low-level feature of the image
- the processing result of the high-level semantic information extraction layer can be used as the high-level semantic information of the image.
- the structure of the convolutional neural network described here is only one option; the embodiment of the present application can also use convolutional neural networks with other structures, taking the output of a low-level network layer as the low-level features and the output of a high-level network layer as the high-level semantic information.
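The level hierarchy can be illustrated with a toy stand-in for a backbone (not the patent's actual network): each stage halves the spatial resolution, so a shallow-stage output keeps spatial detail, as low-level features do, while the deepest output is a compact abstraction standing in for high-level semantic information.

```python
import numpy as np

# Toy "backbone": each stage averages 2x2 blocks, mimicking a strided
# convolution. Early outputs retain spatial detail; deep outputs abstract it.
def downsample_stage(x: np.ndarray) -> np.ndarray:
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

image = np.arange(64.0).reshape(8, 8)          # dummy 8x8 "image"
low_level = downsample_stage(image)            # shallow layer: 4x4, detailed
high_level = downsample_stage(downsample_stage(low_level))  # deep: 1x1 summary
```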
- the high-level semantic information can express the semantics of the original image at the conceptual level, so it can be used for the image analysis task of the original image. That is to say, after the receiving end 120 obtains the high-level semantic information of the original image, it can analyze and process the original image according to that information, so as to realize image analysis tasks such as image classification, object detection, and instance segmentation. Since neither the low-level features nor the high-level semantic information alone can fully express the information of the original image, in the embodiment of the present application the two can be combined for the image reconstruction task of the original image.
- the embodiments of the present application can perform image reconstruction on the original image based on its low-level features and high-level semantic information, so as to realize the image reconstruction task of the original image. For example, after the receiving end 120 obtains the low-level features and high-level semantic information of the original image, it may reconstruct the original image accordingly.
- the first encoder 212 may encode the high-level semantic information to generate a first code stream.
- the first code stream can be transmitted to the receiving end 120 .
- the second encoder 213 may encode the low-level features to generate a second code stream.
- the second code stream can be transmitted to the receiving end 120 .
- the first encoder 212 and the second encoder 213 may be the same encoder, that is, the embodiment of the present application may use the same encoder to encode high-level semantic information and low-level features respectively.
- the first encoder 212 and the second encoder 213 may be the same lossless encoder.
- a lossless encoder such as a FLIF (Free Lossless Image Format, free lossless image format) encoder.
- the lossless encoder is only one optional form in which the first encoder 212 and the second encoder 213 can be the same encoder; the embodiment of the present application can also support the first encoder 212 and the second encoder 213 being the same encoder in other forms.
- the embodiment of the present application may also support that the first encoder 212 and the second encoder 213 are different encoders.
- the sending end 110 may transmit the first code stream and the second code stream to the receiving end 120 . Therefore, the receiving end 120 can decode the first code stream and the second code stream, use the high-level semantic information to realize the image analysis task, and use the high-level semantic information and low-level features to realize the image reconstruction task.
- FIG. 2C further shows a schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application on the basis of FIG. 2B .
- the decoding process of the image decoding system 121 shown in FIG. 2C may be an inverse process of the encoding process of the image encoding system 111 .
- the image decoding system 121 may include: a first decoder 220 , a second decoder 221 and a predictor 222 .
- the first decoder 220 is used to decode the first code stream to obtain high-level semantic information of the original image.
- the high-level semantic information obtained by the first decoder 220 can be used for the image analysis task of the original image.
- the high-level semantic information output by the first decoder 220 can be imported into image analysis logic (such as an image analysis model) for performing image analysis tasks, and the image analysis logic performs specific image analysis tasks based on the high-level semantic information.
- the image analysis logic can be set at the receiving end 120 , or can be set at an external device communicatively connected with the receiving end 120 .
- the second decoder 221 is used to decode the second code stream to obtain low-level features of the original image.
- the first decoder 220 and the second decoder 221 may be the same decoder, for example, the same lossless decoder, such as a FLIF decoder.
- the high-level semantic information obtained by the first decoder 220 and the low-level features obtained by the second decoder 221 can be imported into the predictor 222 .
- the predictor 222 is used to perform image reconstruction on the original image according to low-level features and high-level semantic information to obtain a predicted image.
- the predicted image obtained by the predictor 222 can be displayed to the user at the receiving end 120, so as to meet the viewing requirement of the original image oriented to the user's vision.
- the image encoding and decoding system provided by the embodiment of the present application can use a set of corresponding encoding framework and decoding framework to realize the fusion of user vision-oriented image reconstruction tasks and machine vision-oriented image analysis tasks.
- in the embodiment of the present application, the image coding system 111 may first use the high-level semantic information and low-level features to perform image reconstruction to obtain a predicted image of the original image, then determine the difference information between the predicted image and the original image, and further transmit the difference information to the image decoding system 121. Accordingly, on the basis of combining high-level semantic information and low-level features for image reconstruction, the image decoding system 121 can further use the difference information to perform image enhancement on the reconstructed image, so as to make the final reconstructed image more accurate.
- FIG. 3A exemplarily shows another flowchart of the image encoding and decoding method provided by the embodiment of the present application. As shown in FIG. 2A and FIG. 3A , the method flow shown in FIG. 3A further includes the following steps on the basis of the method flow shown in FIG. 2A .
- step S25 the image coding system performs image reconstruction on the original image according to the low-level features and high-level semantic information to obtain a predicted image.
- step S26 the image coding system determines difference information between the predicted image and the original image, and the difference information is used to enhance the predicted image in the image reconstruction task.
- step S27 the image encoding system encodes the difference information to generate a third code stream.
- in addition to performing steps S20 to S24, the image coding system further performs steps S25 to S27.
- the image coding system can use high-level semantic information as guidance information for image reconstruction to reconstruct specific image details expressed by low-level features, so as to obtain a predicted image.
- the image coding system can compare the predicted image with the original image to determine the difference information between the predicted image and the original image.
- the difference information can be used to perform image enhancement on the predicted image reconstructed at the receiving end in the image reconstruction task, so as to make the enhanced image closer to the original image.
- the image coding system may encode the difference information to generate a third code stream.
- the third code stream may be transmitted from the sending end to the receiving end.
- the above difference information may be residual information between the predicted image and the original image.
- the image coding system may perform lossy coding on the difference information to generate the third code stream.
- Lossy coding is, for example, VVC (Versatile Video Coding).
- the embodiment of the present application can determine the coding loss of the difference information based on the negative correlation between the image quality requirement (QP) of the reconstructed image and the coding loss of the difference information, so that a third code stream suited to the image quality requirement is obtained.
- the encoding loss of the difference information may be controlled based on the image quality requirements of the reconstructed image. For example, if the image quality requirement of the reconstructed image is higher, the embodiment of the present application can control the coding loss of the difference information to be lower; if the image quality requirement is lower, the embodiment of the present application can control the coding loss of the difference information to be higher.
- the embodiment of the present application can set the image quality requirement based on the network bandwidth; for example, the network bandwidth and the image quality requirement are positively correlated, that is, the higher the network bandwidth, the higher the image quality requirement. Thus, third code streams with different data sizes can be obtained by encoding, so as to adapt to different network bandwidth conditions.
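- The two correlations described above (network bandwidth positively correlated with the image quality requirement, and the image quality requirement negatively correlated with the coding loss) can be sketched as follows; the bandwidth range and the QP endpoints 22 and 42 are illustrative assumptions, not values from this application:

```python
def quality_requirement(bandwidth_mbps, lo_mbps=2.0, hi_mbps=20.0):
    # Positive correlation: higher network bandwidth -> higher image
    # quality requirement. Output is normalized to [0, 1].
    t = (bandwidth_mbps - lo_mbps) / (hi_mbps - lo_mbps)
    return min(max(t, 0.0), 1.0)

def residual_qp(quality, qp_min=22, qp_max=42):
    # Negative correlation: a higher quality requirement maps to a lower
    # quantization parameter, i.e. lower coding loss for the difference
    # information (the QP endpoints here are illustrative assumptions).
    return round(qp_max - quality * (qp_max - qp_min))
```

A high-bandwidth link thus yields a low QP (low coding loss) for the third code stream, and a low-bandwidth link a high QP.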
- step S33 the image decoding system decodes the third code stream to obtain difference information.
- step S34 the image decoding system performs image enhancement processing on the predicted image according to the difference information, so as to obtain an enhanced image for display.
- the image decoding system further performs steps S33 to S34 in addition to steps S30 to S32.
- after the image decoding system performs image reconstruction based on the low-level features and high-level semantic information to obtain the predicted image, it can further use the difference information obtained by decoding the third code stream to perform image enhancement processing on the predicted image, so as to obtain the enhanced image.
- the enhanced image can be used as the final reconstructed image in the embodiment of the present application, and displayed to the user for viewing.
- the image decoding system can perform lossy decoding on the third code stream. For example, VVC decoding is performed on the third code stream.
- the image encoding and decoding method provided by the embodiment of the present application can determine the difference information between the predicted image and the original image, and then use the difference information to enhance the reconstructed predicted image at the receiving end, so that the final reconstructed enhanced image is closer to the original image, improving the accuracy of image reconstruction.
- the image encoding system 111 may decide whether to encode and transmit the difference information based on the network bandwidth of the sending end 110 .
- if the network bandwidth of the sending end is lower than a bandwidth threshold, the image encoding system 111 may not perform encoding and transmission of the difference information; for example, after the difference information is obtained, the image encoding system 111 may skip step S27 and not encode the difference information. Furthermore, the image coding system 111 may wait until the network bandwidth of the sending end is higher than the bandwidth threshold before encoding and transmitting the difference information to the receiving end.
- if the network bandwidth of the sending end is higher than the bandwidth threshold, the embodiment of the present application can perform encoding and transmission of the difference information. For example, after obtaining the difference information, the image coding system 111 may directly encode the difference information to generate a third code stream, and the sending end transmits the third code stream to the receiving end.
- the receiving end may play video images based on the information continuously transmitted by the sending end. If the current network bandwidth is low, the embodiment of the present application can reduce the definition of the video images played by the receiving end; at this time, the sending end may not encode and transmit the difference information, and only transmits the first code stream and the second code stream to the receiving end. The receiving end then performs image reconstruction based on the low-level features and high-level semantic information and displays the reconstructed image, which ensures that the receiving end can continuously play video images at reduced definition. If the current network bandwidth is relatively high, the embodiment of the present application can improve the definition of the video images played by the receiving end.
- the sending end can encode and transmit the difference information; that is, the first code stream, the second code stream, and the third code stream are all encoded and transmitted.
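- The bandwidth-gated choice of which code streams to transmit can be sketched as a small selection function; the threshold value is a hypothetical parameter:

```python
def streams_to_transmit(bandwidth_mbps, bandwidth_threshold_mbps=8.0):
    # The first code stream (high-level semantics) and the second code
    # stream (low-level features) are always transmitted; the third code
    # stream (difference information) is transmitted only when the network
    # bandwidth is higher than the threshold.
    streams = ["first", "second"]
    if bandwidth_mbps > bandwidth_threshold_mbps:
        streams.append("third")
    return streams
```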
- FIG. 3B exemplarily shows another structural diagram of the image coding system 111 provided by the embodiment of the present application.
- the image encoding system 111 may further include: a predictor 222 , a comparator 214 and a third encoder 215 .
- the high-level semantic information extracted by the semantic extractor 210 and the low-level features extracted by the feature extractor 211 can be imported into the predictor 222 .
- the predictor 222 can be used to perform image reconstruction on the original image according to low-level features and high-level semantic information to obtain a predicted image.
- the predicted image obtained by the predictor 222 can be input into the comparator 214, and the comparator 214 can be used to compare the predicted image with the original image to obtain difference information between the predicted image and the original image.
- the difference information can be used to enhance the predicted image in the image reconstruction task at the receiving end.
- comparator 214 may include a subtractor. The subtractor can perform residual processing on the predicted image and the original image to obtain residual information between the predicted image and the original image, and the residual information can be used as the above-mentioned difference information.
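- The subtractor and the corresponding image enhancement at the receiving end can be sketched in NumPy; with lossless handling of the residual, the enhanced image recovers the original exactly (lossy VVC coding of the residual would make the recovery approximate):

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
predicted = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Subtractor (comparator 214): residual information between the predicted
# image and the original image, used as the difference information.
residual = original.astype(np.int16) - predicted.astype(np.int16)

# Image enhancement at the receiving end: add the residual back onto the
# reconstructed predicted image.
enhanced = (predicted.astype(np.int16) + residual).astype(np.uint8)
assert np.array_equal(enhanced, original)
```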
- the difference information obtained by the comparator 214 can be input into the third encoder 215, and the third encoder 215 can be used to encode the difference information to generate a third code stream.
- the third encoder 215 may comprise a lossy encoder.
- the lossy coder is, for example, a VVC (Versatile Video Coding) coder.
- the embodiment of the present application can determine the encoding loss of the difference information based on the negative correlation between the image quality requirement (QP) and the encoding loss of the difference information, so as to obtain a third code stream adapted to the image quality requirement.
- the sending end 110 may transmit the third code stream to the receiving end 120 .
- FIG. 3C further shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application on the basis of FIG. 3B .
- after the image coding system 111 generates the first code stream, the second code stream and the third code stream, these code streams may be transmitted to the image decoding system 121 of the receiving end 120. The image decoding system 121 can then realize image decoding based on the structure shown in FIG. 3C, so as to be compatible with both the image analysis task and the image reconstruction task of the original image.
- the image decoding system 121 shown in FIG. 3C further includes: a third decoder 223 and an image intensifier 224 .
- the third decoder 223 is used to decode the third code stream to obtain difference information (eg residual information) between the original image and the predicted image.
- the third decoder 223 may be different from the first decoder 220 and the second decoder 221 , for example, the third decoder 223 may include a lossy decoder. Lossy decoders such as VVC decoders.
- the image enhancer 224 may be configured to perform image enhancement processing on the predicted image obtained by the predictor 222 according to the difference information, so as to obtain an enhanced image. Since the difference information expresses the difference between the predicted image and the original image, the embodiment of the present application further introduces the difference information to perform image enhancement processing on the predicted image, which can make the enhanced image closer to the original image, thereby improving the accuracy of the image reconstruction task.
- the enhanced image obtained by the image intensifier 224 can be used as the final image displayed to the user, so as to meet the viewing requirements of the original image oriented to the user's vision.
- the sending end 110 can transmit the first code stream, the second code stream and the third code stream to the receiving end 120. Furthermore, the image decoding system 121 can determine, based on the task instruction of the receiving end 120, whether to decode the first code stream to support the image analysis task, or to decode the first code stream, the second code stream and the third code stream to perform the image reconstruction task.
- the image decoding system 121 can decode the first code stream to obtain high-level semantic information, and transmit the high-level semantic information to the image analysis logic to support the execution of the image analysis tasks.
- the image decoding system 121 can respectively decode the first code stream, the second code stream and the third code stream to obtain high-level semantic information, low-level features and difference information; thus, the image decoding system 121 can use the high-level semantic information and low-level features for image reconstruction to obtain a predicted image, and further use the difference information to perform image enhancement processing on the predicted image to obtain an enhanced image.
- the image decoding system 121 can respectively decode the first code stream, the second code stream and the third code stream, use the high-level semantic information to perform the image analysis task, and use the high-level semantic information, low-level features and difference information to perform the image reconstruction task.
- the task instruction is adapted to the user's current needs; for example, the user triggers different task instructions at the receiving end.
- the task instruction may also be set by default by the system. It can be understood that selecting which code streams to decode based on the task instruction, as described in this paragraph, is also applicable to the situation where the image coding system shown in FIG. 2C transmits only the first code stream and the second code stream. In this case, the image decoding system 121 does not perform the decoding of the third code stream or the image enhancement processing; other processes may refer to the descriptions of the corresponding parts above, and are not expanded here.
- the image encoding and decoding system provided by the embodiment of the present application can use a set of corresponding encoding and decoding frameworks to realize the fusion of image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision; moreover, the difference information between the original image and the first reconstructed predicted image is used in the image reconstruction task, which can further improve the accuracy of image reconstruction.
- the embodiment of the present application provides an optional image reconstruction method, including the following steps: obtaining high-level semantic information and low-level features of the original image; performing image reconstruction on the original image according to the low-level features and high-level semantic information to obtain a predicted image; and acquiring difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.
- the embodiment of the present application may use high-level semantic information as guidance information for image reconstruction, and reconstruct specific image details expressed by low-level features to obtain the predicted image.
- FIG. 4A exemplarily shows a flowchart of a method for obtaining a predicted image according to an embodiment of the present application.
- the flow of the method can be implemented by the predictor 222, and the predictor 222 referred to here can be set in the image encoding system, and can also be set in the image decoding system. Referring to FIG. 4A , the method flow may include the following steps.
- step S40 low-level features are processed to obtain target features for further processing with high-level semantic information.
- the embodiment of the present application may upsample low-level features to obtain target features for further processing (for example, stacking processing) with high-level semantic information.
- the upsampling may be nearest neighbor upsampling with a set multiple, for example, nearest neighbor upsampling by 8 times.
- the embodiment of the present application can process (for example, upsample) the low-level features based on the resolution of the high-level semantic information, so that the resolution of the target features obtained by processing the low-level features is the same as the resolution of the high-level semantic information.
- the embodiment of the present application may also support that the resolution of the target features obtained by processing the low-level features is different from that of the high-level semantic information, and is not limited to the two resolutions being the same.
- step S41 convolution processing is performed after stacking target features and high-level semantic information to obtain convolution features.
- the target features obtained in step S40 can be stacked with high-level semantic information to obtain stacked features, and the stacked features can be input into a convolutional network for convolution processing to obtain convolutional features output by the convolutional network.
- the convolutional network may first perform multiple convolutions on the stacked features to obtain the first convolutional features; then the convolutional network may perform multiple filtering processes on the first convolutional features to obtain Filtering features; the filtering features can be further subjected to multiple convolutions by the convolutional network to obtain second convolutional features; then, the second convolutional features can be output as convolutional features by the activation function of the convolutional network.
- step S42 the convolutional features are combined with low-level features to obtain a predicted image.
- the embodiment of the present application can combine the convolutional features with low-level features to obtain a predicted image.
- convolutional features may be added to low-level features to obtain a predicted image.
- the combination of convolutional features and low-level features is not limited to addition, and the embodiments of the present application may also support other combinations.
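- As an illustrative sketch (not part of the application), steps S40 to S42 can be reproduced in NumPy with a placeholder standing in for the convolutional network; the 3-channel low-level features and the 8x nearest-neighbour upsampling follow the examples above, while adding the convolutional features to the upsampled low-level features is an assumption made here so that the shapes match:

```python
import numpy as np

def nearest_upsample(x, factor=8):
    # Step S40: nearest-neighbour upsampling by a set multiple (8x here).
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def predict_image(low_level, semantics, conv_net):
    target = nearest_upsample(low_level)                    # step S40
    stacked = np.concatenate([target, semantics], axis=-1)  # step S41: stack
    conv_features = conv_net(stacked)                       # step S41: convolve
    # Step S42: combine by addition (the application also allows other
    # combinations); the upsampled low-level features are used here.
    return target + conv_features

rng = np.random.default_rng(0)
low = rng.random((4, 4, 3))        # low-level features (3 channels, as in FIG. 5B)
sem = rng.random((32, 32, 1))      # semantic map at the 8x resolution
zero_net = lambda s: np.zeros(s.shape[:2] + (3,))  # placeholder conv network
pred = predict_image(low, sem, zero_net)
assert pred.shape == (32, 32, 3)
```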
- FIG. 4B exemplarily shows a schematic structural diagram of the predictor 222 provided in the embodiment of the present application.
- the predictor 222 may include: an upsampler 410 , a stacker 420 , a convolutional network 430 and an adder 440 .
- the upsampler 410 can be used to upsample low-level features to obtain target features for further processing with high-level semantic information.
- the upsampler 410 may be used to upsample low-level features to obtain target features with the same resolution as high-level semantic information.
- the embodiment of the present application may also support that the resolution of the target feature obtained by upsampling the low-level feature is different from that of the high-level semantic information.
- the upsampler may perform nearest neighbor upsampling by a set factor (for example, 8 times nearest neighbor upsampling) on low-level features.
- the stacker 420 can stack the high-level semantic information and the target features obtained by the upsampler 410 to obtain stacked features.
- the stacked features may be input into a convolutional network 430 .
- the convolutional network 430 can perform convolution processing on the stacked features to obtain convolutional features.
- the adder 440 can be used to add the low-level features and the convolutional features output by the convolutional network 430 to obtain a predicted image.
- the predictor 222 provided in the embodiment of the present application can combine high-level semantic information and low-level features of the original image to reconstruct the original image, so that the reconstructed predicted image has higher accuracy.
- FIG. 4C exemplarily shows a schematic structural diagram of the convolutional network 430 provided by the embodiment of the present application.
- the convolutional network 430 may include: a first set of convolutional layers 431 , a plurality of convolutional filter banks 432 , a second set of convolutional layers 433 , and an activation function 434 .
- the first group of convolutional layers 431 may include multiple layers of convolutional layers connected in sequence, and the convolution configuration of each convolutional layer in the first group of convolutional layers 431 may be different.
- a convolution filter bank 432 may include multiple layers of convolution filters, and the filtering configuration of each layer of convolution filters in a convolution filter bank 432 may be the same.
- the second group of convolutional layers 433 may include multiple layers of convolutional layers connected in sequence, and the convolution configuration of each convolutional layer in the second group of convolutional layers 433 may be different.
- the convolution configuration of the convolutional layer may include one or more of the number of output filters, the size of the convolution kernel, the convolution step size, whether to set a normalization function, whether to set a linear rectification unit, etc. item.
- this embodiment of the present application can also support the same convolution configuration of some convolution layers in the first group of convolution layers 431 and the second group of convolution layers 433 , or even the same convolution configuration of all convolution layers.
- the stacked features obtained after stacking the target features and high-level semantic information can be input to the first set of convolutional layers 431, and the first set of convolutional layers 431 can perform convolution processing on the stacked features through multiple convolutional layers to obtain the first convolutional features.
- the first set of convolutional layers 431 can output the first convolutional features to multiple convolutional filter banks 432 , and the multiple convolutional filter banks 432 can filter the first convolutional features to obtain filtered features.
- the plurality of convolutional filter banks 432 can output the filtering features to the second set of convolutional layers 433, and the second set of convolutional layers 433 can perform convolution processing on the filtering features through multiple convolutional layers to obtain the second convolutional features.
- the second set of convolutional layers 433 can output the second convolutional features to the activation function 434 , and the activation function 434 outputs the corresponding convolutional features.
- the number of convolutional layers set in the first group of convolutional layers 431 and the second group of convolutional layers 433, and the convolution configuration of each convolutional layer, can be determined according to actual conditions, which are not limited by this embodiment of the present application.
- the number of convolution filters set in one convolution filter bank 432 and the filtering configuration of each convolution filter may be determined according to actual conditions, which are not limited by this embodiment of the present application.
- the embodiment of the present application can define the number of convolutional layers in the first set of convolutional layers 431 and the second set of convolutional layers 433, and the output filtering of each convolutional layer The number of filters, the size of the convolution kernel, the convolution step size, whether to set a normalization function, and whether to set a linear rectification unit, so as to configure the first set of convolutional layers 431 and the second set of convolutional layers 433.
- FIG. 4D exemplarily shows another schematic structural diagram of the predictor 222 provided in the embodiment of the present application. As shown in FIG. 4C and FIG. 4D, FIG. 4D gives a specific structure of the convolutional network 430 in the predictor 222.
- the first set of convolutional layers 431 may include 4 sequentially connected convolutional layers, namely: Conv60 7x7 s1 Norm ReLU, Conv120 3x3 s2 Norm ReLU, Conv240 3x3 s2 Norm ReLU, Conv480 3x3 s2 Norm ReLU.
- Conv60 indicates that the number of output filters of the convolutional layer is 60; 7x7 indicates the convolution kernel size of the output filters; s1 indicates that the convolution step size is 1 (correspondingly, s2 means that the convolution step size is 2, and so on); Norm means that a normalization function is set (correspondingly, if there is no Norm in the convolution configuration, no normalization function is set in the convolutional layer); and ReLU means that a linear rectification unit is set (correspondingly, if there is no ReLU in the convolution configuration, no linear rectification unit is set in the convolutional layer).
- the first convolutional layer includes 60 output filters, a normalization function and a linear rectification unit, and the convolution kernel size of each output filter is 7x7, with a convolution step size of 1.
- the specific configurations of other convolutional layers in the first group of convolutional layers 431 can be explained in the same way.
- the plurality of convolutional filter banks 432 may include nine sequentially connected convolutional filter banks.
- a convolution filter bank can include 2 layers of convolution filters; each layer of convolution filters can include 480 convolution filters, with a convolution kernel size of 3x3 and a convolution step size of 1.
- the second group of convolutional layers 433 may include 4 sequentially connected convolutional layers, namely: ConvT240 3x3 s2 Norm ReLU, ConvT120 3x3 s2 Norm ReLU, ConvT60 3x3 s2 Norm ReLU, Conv3 7x7 s1.
- ConvT indicates deconvolution; for example, ConvT240 indicates that the number of output filters of the convolutional layer performing deconvolution is 240.
- the last convolutional layer (Conv3 7x7 s1) does not set a normalization function or a linear rectification unit.
- the configuration meaning of each convolutional layer in the second group of convolutional layers 433 can be explained similarly with reference to the previous description, and will not be expanded here.
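- The layer notation used above (for example "Conv60 7x7 s1 Norm ReLU" or "ConvT240 3x3 s2 Norm ReLU") can be parsed mechanically into a convolution configuration; the following Python helper is illustrative only and not part of the application:

```python
def parse_layer_spec(spec):
    # Parse a layer description such as "Conv60 7x7 s1 Norm ReLU"
    # into a convolution configuration dictionary.
    tokens = spec.split()
    head = tokens[0]
    transposed = head.startswith("ConvT")   # ConvT indicates deconvolution
    filters = int(head[5:] if transposed else head[4:])
    kh, kw = (int(v) for v in tokens[1].split("x"))
    return {
        "transposed": transposed,
        "filters": filters,                 # number of output filters
        "kernel": (kh, kw),                 # convolution kernel size
        "stride": int(tokens[2][1:]),       # convolution step size
        "norm": "Norm" in tokens,           # normalization function set?
        "relu": "ReLU" in tokens,           # linear rectification unit set?
    }
```

For example, the last layer "Conv3 7x7 s1" parses with `norm` and `relu` both false, matching the description above.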
- the specific structure of the convolutional network 430 shown in FIG. 4D is only an optional structure.
- the structure of the convolutional network 430 shown in FIG. 4D can also be deformed, adjusted or replaced, as long as the convolution processing of the stacked features is still supported.
- the convolutional network 430 provided in the embodiment of the present application can perform convolution processing on the stacked features of target features and high-level semantic information, and the obtained convolutional features can express richer image information.
- the feature extractor 211 of the image coding system 111 may use multiple convolutional layers and activation functions to extract low-level features of the original image.
- the multi-layer convolutional layers in the feature extractor 211 can extract features from the original image, which are then output as low-level features by the activation function.
- FIG. 5A shows a schematic structural diagram of the feature extractor 211 provided by the embodiment of the present application.
- the feature extractor 211 may include sequentially connected multi-layer convolutional layers and activation functions. The convolutional configurations of multiple convolutional layers can be different or partially the same.
- the multi-layer convolutional layer can extract features from the original image, and the features extracted by the multi-layer convolutional layer can be output as low-level features through an activation function.
- FIG. 5B shows another schematic structural diagram of the feature extractor 211 provided by the embodiment of the present application.
- the multi-layer convolutional layer may include 5 layers of convolutional layers, which are: Conv60 7x7 s1 Norm ReLU, Conv120 3x3 s2 Norm ReLU, Conv240 3x3 s2 Norm ReLU, Conv480 3x3 s2 Norm ReLU, Conv3 3x3 s1 Norm ReLU.
- the specific number of multi-layer convolutional layers and the specific configuration of each convolutional layer shown in FIG. 5B are only an optional example, which is not limited by this embodiment of the present application.
- the low-level features can compactly express the detailed features of the original image, for example, the low-level features can represent the compact features of the original image with a set scale.
- the low-level features may represent compact features with a size of 1/64 of the original image (for example, the low-level features may be 1/8 of the original image in both the width and height dimensions).
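- The 1/64 size follows from the strides of the five convolutional layers shown in FIG. 5B (s1, s2, s2, s2, s1): each stride-2 layer halves the width and height, giving 1/8 per dimension and hence 1/64 of the pixel count. A minimal sketch, assuming dimensions divisible by 8:

```python
def feature_resolution(h, w, strides=(1, 2, 2, 2, 1)):
    # Strides of the five convolutional layers in FIG. 5B; each stride-2
    # layer halves both spatial dimensions (dimensions assumed divisible).
    for s in strides:
        h, w = h // s, w // s
    return h, w

h, w = feature_resolution(512, 768)
assert (h, w) == (64, 96)          # 1/8 in each dimension
assert h * w == (512 * 768) // 64  # 1/64 of the original pixel count
```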
- the embodiment of the present application can use a ReflectionPad (mirror padding) of a first set size before the first convolutional layer and the last layer of the feature extractor, for example, a ReflectionPad with a size of 3; and use a ReflectionPad of a second set size before the remaining network layers of the feature extractor (such as the remaining convolutional layers), for example, a ReflectionPad whose length and width are 1.
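- The mirror padding described above can be reproduced with NumPy's reflect padding mode; this sketch only illustrates the padding behaviour itself, not the feature extractor:

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)

# ReflectionPad of the first set size (3) before the first/last layer ...
padded3 = np.pad(x, pad_width=3, mode="reflect")
# ... and ReflectionPad of the second set size (1) before the remaining layers.
padded1 = np.pad(x, pad_width=1, mode="reflect")

assert padded3.shape == (10, 10)
assert padded1.shape == (6, 6)
# Mirror padding reflects interior pixels without repeating the edge pixel:
assert padded1[0, 1] == x[1, 0]
```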
- this embodiment of the present application may use ChannelNorm technology to obtain compact low-level features that express fine details of the image.
- the embodiment of the present application can determine the low-level features of the original image corresponding to each channel based on the channel normalization technology, so as to determine the low-level features of the original image according to the low-level features of the original image corresponding to each channel.
- the feature extractor 211 can use multi-layer convolutional layers based on channel normalization technology to determine the low-level features of the original image corresponding to each channel; furthermore, the low-level features corresponding to the channels can be combined by the activation function to obtain the low-level features of the original image.
- the embodiment of the present application can determine, for the current channel of the original image, the unit pixel corresponding to the current channel at a two-dimensional position (h, w), the mean corresponding to all channels of the original image at that position, and the mean square error corresponding to all channels at that position; thus, the embodiment of the present application can determine the low-level feature of the original image corresponding to the current channel based on the unit pixel of the current channel at the position, the mean and mean square error of all channels at the position, and the learned offsets of the current channel.
- the embodiment of the present application may use the following formula, based on the channel normalization technology, to calculate the low-level feature of the original image corresponding to the c-th channel:
- f'_chw = α_c · (f_chw − μ_hw) / σ_hw + β_c
- f'_chw indicates the low-level feature corresponding to the c-th channel of the original image at position (h, w)
- c indicates the c-th channel of the original image, with c belonging to 1 to M
- f_chw indicates the unit pixel corresponding to the c-th channel of the original image at position (h, w)
- μ_hw represents the mean of the M channels of the original image at position (h, w), and σ_hw represents the mean square error of the M channels at position (h, w)
- α_c and β_c represent the learned offsets of the c-th channel
- the embodiment of the present application uses the channel normalization technology to determine the low-level features of the original image corresponding to each channel, and then combines the low-level features corresponding to the channels to obtain the low-level features of the original image; this ensures that the low-level features richly express the original image while their size is significantly reduced. Further, the embodiment of the present application can reduce the encoding and decoding overhead of the low-level features, and the transmission overhead of the second code stream corresponding to the low-level features between the sending end and the receiving end.
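- A minimal NumPy sketch of the channel normalization described above, assuming the standard ChannelNorm form (per-position mean and mean square error over the M channels, with learned per-channel offsets α_c and β_c); the small epsilon is an added numerical-stability assumption:

```python
import numpy as np

def channel_norm(f, alpha, beta, eps=1e-6):
    # f has shape (H, W, M): unit pixels f_chw for M channels.
    # At every position (h, w), normalize the M-channel vector by the
    # channel mean mu_hw and the channel mean square error sigma_hw,
    # then apply the learned per-channel offsets alpha_c and beta_c.
    mu = f.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((f - mu) ** 2).mean(axis=-1, keepdims=True) + eps)
    return alpha * (f - mu) / sigma + beta

rng = np.random.default_rng(0)
f = rng.random((8, 8, 480))   # M = 480 channels, as in the filter banks above
alpha = np.ones(480)          # illustrative offsets (learned in practice)
beta = np.zeros(480)
out = channel_norm(f, alpha, beta)
assert out.shape == f.shape
```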
- the image coding system 111 may extract corresponding high-level semantic information based on the type of the image analysis task.
- the image coding system 111 may set a semantic extractor corresponding to the type of the image analysis task, so as to extract high-level semantic information corresponding to the type of the image analysis task from the original image.
- one high-level semantic information can support at least one type of image analysis tasks, that is, one high-level semantic information can support one type or multiple types of image analysis tasks at the same time.
- high-level semantic information can include any one of: instance segmentation map information, which supports instance segmentation tasks, image classification tasks, and object detection tasks; and stick figure information, which supports human body pose recognition tasks.
- FIG. 6 shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.
- the semantic extractor 210 set in the image coding system 111 may be an instance segmenter 610 , and the instance segmenter 610 may extract instance segmentation map information from an original image based on an instance segmentation technique.
- the instance segmentation map information may be used as high-level semantic information of the original image.
- the instance segmentation map information can be imported into the predictor 222 and the FLIF encoder 620 , and the FLIF encoder 620 can be the same lossless encoder used by the first encoder 212 and the second encoder 213 .
- the FLIF encoder 620 can perform lossless encoding on the instance segmentation map information to generate a first code stream.
- the first code stream can be transmitted to the image decoding system 121 .
- the feature extractor 211 extracts low-level features of the original image, and the low-level features can be imported into the predictor 222 and the FLIF encoder 620 .
- the FLIF encoder 620 can perform lossless encoding on low-level features to generate a second code stream.
- the second code stream can be transmitted to the image decoding system 121 .
- the predictor 222 can perform image reconstruction according to instance segmentation map information and low-level features to obtain a predicted image.
- the subtractor 630 may determine residual information of the predicted image and the original image.
- the residual information can be imported into the VVC encoder 640 .
- the VVC encoder 640 can perform lossy encoding on the residual information to generate a third code stream.
- the third code stream can be transmitted to the image decoding system 121 .
- the subtractor 630 may be an optional form of the comparator 214
- the VVC encoder 640 may be a lossy encoder used by the third encoder 215 .
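The encoder-side data flow of FIG. 6 described above can be sketched as follows; every component is passed in as a placeholder callable (the real system would use an instance segmenter, a feature extractor, FLIF lossless coding, and VVC lossy coding), so this illustrates only the wiring, not the codecs themselves:

```python
def encode_image(original, instance_segmenter, feature_extractor,
                 flif_encode, vvc_encode, predictor):
    """Sketch of the encoder-side wiring of FIG. 6 (components are stand-ins)."""
    semantic = instance_segmenter(original)   # high-level semantic information
    low = feature_extractor(original)         # low-level features
    stream1 = flif_encode(semantic)           # lossless -> first code stream
    stream2 = flif_encode(low)                # lossless -> second code stream
    predicted = predictor(semantic, low)      # image reconstruction
    residual = original - predicted           # subtractor 630
    stream3 = vvc_encode(residual)            # lossy -> third code stream
    return stream1, stream2, stream3
```

Note that the same lossless encoder produces both the first and second code streams, matching the shared FLIF encoder 620 in the figure.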
- the FLIF decoder 650 can decode the first code stream and the second code stream to obtain instance segmentation map information and low-level features.
- the FLIF decoder 650 may be the same lossless decoder used by the first decoder 220 and the second decoder 221 .
- the instance segmentation map information obtained by the FLIF decoder 650 can be imported into instance segmentation logic, image classification logic, and object detection logic to implement image analysis tasks such as instance segmentation, image classification, and object detection of the original image.
- the instance segmentation map information and low-level features obtained by the FLIF decoder 650 can be imported into the predictor 222 in the image decoding system 121 .
- the predictor 222 may perform image reconstruction on the original image according to the instance segmentation map information and low-level features to obtain a predicted image.
- the predicted image obtained by the predictor 222 can be imported into the image enhancer 224 .
- the VVC decoder 660 can decode the third code stream to obtain residual information of the predicted image and the original image.
- the VVC decoder 660 may be a lossy decoder used by the third decoder 223 .
- the residual information may be introduced into the image enhancer 224 .
- the image enhancer 224 may perform image enhancement processing on the predicted image according to the residual information to obtain an enhanced image.
- the enhanced image can be displayed to the user to meet the user's vision-oriented image reconstruction requirements.
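The decoder-side flow just described can be sketched symmetrically; again all components are placeholder callables, and the enhancement step is assumed to combine the predicted image with the decoded residual:

```python
def decode_image(stream1, stream2, stream3,
                 flif_decode, vvc_decode, predictor, enhance):
    """Sketch of the decoder-side wiring of FIG. 6 (components are stand-ins)."""
    semantic = flif_decode(stream1)          # e.g. instance segmentation map info
    low = flif_decode(stream2)               # low-level features
    predicted = predictor(semantic, low)     # image reconstruction
    residual = vvc_decode(stream3)
    enhanced = enhance(predicted, residual)  # image enhancement
    # semantic feeds the analysis logic; enhanced is displayed to the user
    return semantic, enhanced
```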
- FIG. 6 uses the instance segmentation map information as high-level semantic information, and specifically illustrates the implementation process of the image encoding system and the image decoding system provided by the embodiment of the present application in the image reconstruction task and the instance segmentation task. It can be understood that, on the basis of the image coding system shown in FIG. 2B and the image decoding system shown in FIG. 2C , the embodiment of the present application can also use instance segmentation map information as high-level semantic information.
- the high-level semantic information of the original image can also be the stick figure information of the original image, which is used to support the image analysis task of human body gesture recognition; in this case, the embodiment of the present application can replace the instance segmenter 610 in the architecture shown in FIG. 6 with a semantic extractor that supports extraction of stick figure information, and replace the instance segmentation map information with stick figure information.
- stick figure information can also be used as high-level semantic information on the basis of the image coding system shown in FIG. 2B and the image decoding system shown in FIG. 2C .
- the image coding system 111 may set multiple semantic extractors 210 to extract multiple high-level semantic information from the original image, so as to support different types of image analysis tasks.
- FIG. 7 shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application. As shown in FIG. 6 and FIG. 7, the image encoding system 111 shown in FIG. 7 may be provided with multiple semantic extractors (for example, n semantic extractors 2101 to 210n shown in FIG. 7).
- Multiple semantic extractors are used to extract high-level semantic information from the original image one by one to obtain multiple high-level semantic information (for example, high-level semantic information 1 to n).
- the semantic extractor 2101 can extract the high-level semantic information 1 of the original image
- the semantic extractor 210n can extract the high-level semantic information n of the original image.
- a piece of high-level semantic information extracted by a semantic extractor can be used to support at least one type of image analysis task of the original image; for example, any high-level semantic information among the high-level semantic information 1 to n can support one or more types of image analysis tasks of the original image.
- the plurality of high-level semantic information may include instance segmentation map information and stick figure information, wherein the instance segmentation map information may support one or more image analysis tasks such as instance segmentation, image classification, and object detection of the original image, and the stick figure information can support the human body pose recognition task of the original image.
- the FLIF encoder 620 may perform lossless encoding on the multiple high-level semantic information to obtain multiple first code streams. For example, after the semantic extractors 2101 to 210n extract high-level semantic information from the original image one by one to obtain the high-level semantic information 1 to n, the FLIF encoder 620 can perform lossless encoding on the high-level semantic information 1 to n respectively to generate the first code stream 1 to n.
- one high-level semantic information among the multiple high-level semantic information (for example, one of the instance segmentation map information, the stick figure information, etc.) can be imported into the predictor 222 (that is, one of the high-level semantic information 1 to n can be imported into the predictor 222).
- the low-level features of the original image extracted by the feature extractor 211 can also be imported into the predictor 222; therefore, the predictor 222 can perform image reconstruction according to the low-level features and one high-level semantic information among the multiple high-level semantic information, so as to obtain a predicted image.
- the predictor 222 may also perform image reconstruction according to low-level features and at least two high-level semantic information among multiple high-level semantic information, so as to obtain a predicted image.
- the predictor 222 may further introduce at least one high-level semantic information among multiple high-level semantic information during image reconstruction according to low-level features and instance segmentation map information, so as to make the reconstructed predicted image more accurate.
- the embodiment of the present application can import the high-level semantic information with the highest semantic level, or at least two high-level semantic information with the highest semantic levels, among the multiple high-level semantic information into the predictor 222; thus, the predictor 222 can use the one or more high-level semantic information with the highest semantic level for image reconstruction, so as to obtain a more accurate predicted image.
- the embodiment of the present application can also combine any at least two high-level semantic information among the multiple high-level semantic information with the low-level features for image reconstruction of the original image, without being limited by the semantic level of the high-level semantic information used for image reconstruction.
- a plurality of first code streams generated by encoding the multiple high-level semantic information 1 to n, a second code stream generated by encoding the low-level features, and a third code stream generated by encoding the residual information of the predicted image and the original image can be transmitted to the image decoding system 121.
- the FLIF decoder 650 can decode any first code stream among the multiple first code streams and the second code stream, so as to obtain one high-level semantic information and the low-level features; the high-level semantic information and low-level features can be imported into the predictor 222; the predictor 222 can perform image reconstruction according to the high-level semantic information and low-level features to obtain a predicted image; the image enhancer 224 can use the residual information obtained by decoding the third code stream to perform image enhancement on the predicted image, so as to obtain an enhanced image and complete the image reconstruction task.
- the FLIF decoder 650 can select a corresponding first code stream from the multiple first code streams for decoding based on the current image analysis task, so as to obtain high-level semantic information suitable for the current image analysis task to support execution of the current image analysis task.
- the receiving end 120 can obtain a task instruction indicated by the user, and the task instruction can indicate the current image analysis task that the receiving end needs to perform, so that the image decoding system 121 can select, from the multiple first code streams, the first code stream corresponding to the high-level semantic information used by the current image analysis task for decoding, so as to obtain the high-level semantic information suitable for the current image analysis task.
- the task instruction may indicate multiple current image analysis tasks, so that the image decoding system 121 may select the first code stream corresponding to each current image analysis task from the multiple first code streams for decoding, in order to obtain high-level semantic information suitable for each current image analysis task.
- for example, if the current image analysis task is any one of instance segmentation, image classification, and object detection, the embodiment of the present application can decode the first code stream corresponding to the instance segmentation map information; if the current image analysis task is human body posture recognition, the embodiment of the present application can decode the first code stream corresponding to the stick figure information.
- the current image analysis tasks can be of one type or more types, and the details can be determined by user requirements or system settings.
- the FLIF decoder 650 can select at least two first code streams from the multiple first code streams to decode, so as to obtain high-level semantic information applicable to multiple types of current image analysis tasks.
- the FLIF decoder 650 can also decode multiple first code streams, so as to simultaneously support multiple types of image analysis tasks.
- the image coding system and the image decoding system provided by the embodiments of the present application can extract multiple high-level semantic information from the original image through multiple semantic extractors, so as to support multiple types of image analysis tasks of the original image, and any high-level semantic information among the multiple high-level semantic information can be combined with the low-level features of the original image to support the image reconstruction task of the original image. It can be seen that in the embodiment of the present application, one set of an image encoding system and a corresponding image decoding system can be used to realize fused image reconstruction tasks and multiple types of image analysis tasks.
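The task-driven selection of first code streams described above can be illustrated with a small dispatch table; the task names and stream labels here are hypothetical, chosen only to mirror the instance-segmentation-map and stick-figure examples in the text:

```python
# Hypothetical mapping from an image analysis task to the label of the
# first code stream carrying the high-level semantic information it needs.
STREAM_FOR_TASK = {
    "instance_segmentation": "instance_map",
    "image_classification": "instance_map",
    "object_detection": "instance_map",
    "pose_recognition": "stick_figure",
}

def select_target_streams(tasks, first_streams):
    """Pick the first code stream label(s) needed by the requested task(s)."""
    return {STREAM_FOR_TASK[t] for t in tasks} & set(first_streams)
```

Requesting object detection alone selects only the instance-map stream, so the decoder never pays for decoding the stick-figure stream it does not need.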
- FIG. 8 exemplarily shows a flow chart of the image analysis method provided by the embodiment of the present application.
- the method flow can be implemented by the receiver. As shown in FIG. 8 , the method flow may include the following steps.
- step S80: based on the image analysis task of the original image, a target code stream is obtained from multiple code streams.
- the sending end can transmit multiple code streams to the receiving end.
- the multiple code streams may include a first code stream corresponding to high-level semantic information of the original image, and a second code stream corresponding to low-level features of the original image.
- the plurality of code streams may further include a third code stream corresponding to difference information between the predicted image and the original image.
- the number of first code streams in the plurality of code streams may be at least one (that is, one or more first code streams), and the at least one first code stream may correspond to at least one high-level semantic information of the original image, with one high-level semantic information corresponding to one first code stream.
- the receiving end can obtain a target code stream suitable for the image analysis task from the multiple code streams based on the current image analysis task to be performed on the original image. Since in the embodiment of the present application the high-level semantic information carried by the first code stream is used for the image analysis task of the original image, the embodiment of the present application can specifically obtain the target code stream from the at least one first code stream among the multiple code streams.
- the target code stream may be the first code stream whose high-level semantic information is applicable to the image analysis task in the at least one first code stream.
- for example, when the at least one first code stream comprises at least two first code streams, the embodiment of the present application may obtain, from the at least two first code streams, the target code stream applicable to the image analysis task.
- for example, the at least two first code streams include the first code stream corresponding to the instance segmentation map information and the first code stream corresponding to the stick figure information.
- if the image analysis task currently to be performed is any one of instance segmentation, image classification, and object detection, the embodiment of the present application can obtain the first code stream corresponding to the instance segmentation map information from the at least two first code streams as the target code stream; if the image analysis task currently to be performed is human body posture recognition, the embodiment of the present application may obtain the first code stream corresponding to the stick figure information from the at least two first code streams as the target code stream.
- the image analysis task currently to be performed may be a preset fixed image analysis task; for example, any task among instance segmentation, image classification, object detection, and human gesture recognition may be set as an image analysis task fixedly performed by the receiving end.
- only one first code stream among the multiple code streams can be used as the target code stream.
- step S81: the target code stream is decoded to obtain high-level semantic information suitable for the image analysis task.
- the embodiments of the present application may perform lossless decoding on the target code stream to obtain high-level semantic information suitable for image analysis tasks.
- step S82: an image analysis task is performed according to the decoded high-level semantic information.
- the receiver can perform image analysis tasks on the original image through specific image analysis logic (such as image analysis logic that performs tasks such as image classification, object detection, and instance segmentation).
- steps S80 to S81 can be implemented by an image decoding system at the receiving end, and step S82 can be implemented by an image analysis logic configured at the receiving end to perform an image analysis task.
- the image analysis method provided by the embodiment of the present application can perform specific image analysis tasks under the encoding and decoding scheme and framework that fuse the image reconstruction task and the image analysis task, thereby providing technical support for the image analysis task at the receiving end, and is applicable under different types of image analysis tasks.
- the image encoding and decoding scheme provided by the embodiment of the present application has higher image reconstruction quality and better image analysis quality.
- VCM (Video Coding for Machines, machine-oriented video coding)
- the image encoding and decoding scheme provided by the embodiment of the present application is compared with the VTM scheme, and the effect comparison diagrams shown in Fig. 9A, Fig. 9B, Fig. 9C and Fig. 9D can be obtained.
- the comparison process between the image encoding and decoding scheme provided by the embodiment of the present application and the VTM scheme is as follows.
- Image reconstruction quality comparison: the scheme provided by the embodiment of the present application and the VTM scheme are used to perform image compression and reconstruction respectively, to obtain SSIM (Structural Similarity) and the corresponding BPP (Bits Per Pixel), as well as PSNR (Peak Signal-to-Noise Ratio) and the corresponding BPP.
- a schematic diagram of the effect comparison between the solution provided by the embodiment of the present application and the VTM solution drawn using SSIM and the corresponding BPP is shown in FIG. 9A; a schematic diagram drawn using PSNR and the corresponding BPP is shown in FIG. 9B.
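The two reconstruction-quality metrics used in the comparison are standard and can be computed directly; the sketch below assumes 8-bit images (peak value 255) and measures BPP from the compressed stream size in bytes:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two images."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def bpp(stream_bytes, height, width):
    """Bits per pixel of a compressed code stream."""
    return 8.0 * stream_bytes / (height * width)
```

For the layered scheme described here, the BPP on the horizontal axis would be the total of the first, second, and third code streams.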
- Performance comparison on machine vision image analysis tasks: taking the object detection task as the image analysis task and mAP (a common indicator of the object detection task) as the performance index, the performance of the object detection task using images compressed by the scheme of the embodiment of the present application is compared with the performance of the object detection task using images compressed by VTM, to obtain the mAP and the corresponding BPP; the mAP and the corresponding BPP are used to draw a schematic diagram of the effect comparison between the scheme provided by the embodiment of the present application and the VTM scheme, as shown in Fig. 9C.
- similarly, the performance of the instance segmentation task using images compressed by the scheme of the embodiment of the present application is compared with the performance of the instance segmentation task using images compressed by VTM, to obtain the mAP and the corresponding BPP; the mAP and the corresponding BPP are used to draw a schematic diagram of the effect comparison between the scheme provided by the embodiment of the present application and the VTM scheme, as shown in FIG. 9D.
- compared with the VTM scheme of VVC, which currently offers strong performance, the image encoding and decoding scheme provided by the embodiment of the present application has higher image reconstruction quality (for example, the scheme provided by the embodiment of the present application is 4 dB higher than VTM in PSNR) and better image analysis quality (for example, in terms of mAP, the solution provided by the embodiment of the present application completely surpasses VTM).
- the image encoding and decoding scheme provided in the embodiment of the present application proposes a reasonable and efficient encoding and decoding framework, which can effectively deal with the fusion problem of VCM in user vision-oriented image reconstruction tasks and machine vision-oriented image analysis tasks.
- the image encoding and decoding solutions provided by the embodiments of the present application can effectively integrate image reconstruction tasks and image analysis tasks for images generated by intelligent hardware. It is understandable that with the rise of smart cities and deep learning, smart hardware generates massive numbers of images every day to better meet various information interaction needs; the images generated by smart hardware can be presented to users for viewing, for purposes such as image and video surveillance, and machine vision analysis systems can also rely on them for corresponding analysis and decision-making tasks (for example, license plate recognition and road planning in intelligent transportation systems; object detection and road tracking in automatic driving systems; face and expression detection and analysis, and abnormal behavior detection, in smart medical systems); therefore, a set of encoding and decoding schemes and frameworks is needed to integrate image reconstruction tasks and image analysis tasks for images generated by intelligent hardware.
- Fig. 10 shows an application example diagram provided by the embodiment of the present application.
- the camera 910 is intelligent hardware capable of capturing video, and can collect traffic video images.
- the traffic video images collected by the camera 910 can be encoded by the image encoding system 111 provided in the embodiment of the present application, so as to output the first code stream, the second code stream and the third code stream to the traffic command center 920 .
- the traffic command center 920 can use the image decoding system 121 provided by the embodiment of the present application to perform decoding processing, so as to reconstruct the traffic video image and output the high-level semantic information of the traffic video image.
- the traffic video image reconstructed by the image decoding system 121 can be displayed on the monitoring screen of the traffic command center 920 to realize traffic video monitoring.
- the high-level semantic information of the traffic video image output by the image decoding system 121 can be imported into the license plate recognition system of the traffic command center 920 to realize the license plate recognition of vehicles in the traffic video image.
- FIG. 10 is only an optional application example of the image encoding and decoding scheme provided by the embodiment of the present application, and the embodiment of the present application can be applied in any scene requiring image reconstruction tasks and image analysis tasks.
- the embodiment of the present application is compatible with image coding for user vision and machine vision tasks.
- the encoding and decoding scheme provided by the embodiment of the present application can be widely used in the processing of image data in smart city systems, thereby effectively improving the compression efficiency of image data, reducing the burden on network bandwidth, reducing the workload of cloud services, and reducing the storage consumption of image data, thereby reducing the operating cost of smart cities.
- the embodiment of the present application also provides an electronic device, such as the sending end 110 or the receiving end 120 .
- Fig. 11 shows a block diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 11 , the electronic device may include: at least one processor 1 , at least one communication interface 2 , at least one memory 3 and at least one communication bus 4 .
- there are at least one processor 1, communication interface 2, memory 3, and communication bus 4, and the processor 1, communication interface 2, and memory 3 communicate with each other through the communication bus 4.
- the communication interface 2 may be an interface of a communication module for network communication.
- the processor 1 may be a CPU (central processing unit), a GPU (graphics processing unit), an NPU (embedded neural network processor), an FPGA (field programmable gate array), a TPU (tensor processing unit), an AI chip, an ASIC (application specific integrated circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
- the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
- the memory 3 stores one or more computer-executable instructions
- the processor 1 invokes the one or more computer-executable instructions to execute the image encoding method provided by the embodiment of the present application, or to execute the image decoding method provided by the embodiment of the present application.
- the embodiment of the present application also provides a storage medium, which can store one or more computer-executable instructions; when the one or more computer-executable instructions are executed, the image encoding method provided by the embodiment of the present application, or the image decoding method provided by the embodiment of the present application, or the image reconstruction method provided by the embodiment of the present application, or the image analysis method provided by the embodiment of the present application is implemented.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Provided in the embodiments of the present application are image encoding, decoding, reconstruction, and analysis methods, a system, and an electronic device, the image encoding method comprising: acquiring an original image; extracting high-level semantic information of the original image, the high-level semantic information being used for an image analysis task of the original image; extracting a low-level feature of the original image, the low-level feature and the high-level semantic information being used for an image reconstruction task of the original image; encoding the high-level semantic information to generate a first code stream; and encoding the low-level feature to generate a second code stream. In the embodiments of the present application, an encoding scheme is used for fusing user-vision-oriented image reconstruction tasks and machine-vision-oriented image analysis tasks.
Description
This application claims the priority of the Chinese patent application with application number 202110860294.5, entitled "Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device", filed on July 28, 2021, the entire content of which is incorporated by reference in this application.
The embodiments of the present application relate to the field of image technology, and specifically relate to image encoding, decoding, reconstruction, and analysis methods, a system, and an electronic device.
As data with visual effects, images can be divided into static images and image frames of dynamic videos. After an image is generated (for example, after it is captured or produced), the image may be encoded using image coding technology to facilitate its digital transmission. Image coding, also known as image compression, refers to the technology of representing the information contained in an image with fewer bits while satisfying a certain image quality. After the image is encoded and transmitted to the receiving end, the receiving end can decode the image to realize image reconstruction. Based on the reconstructed image, the user can view the image at the receiving end, meeting the user's image viewing needs; typical application scenarios include picture display and video playback.
With the rise of technologies and requirements such as the Internet of Things, smart cities, and smart offices, images need to meet the image analysis needs of machine vision (also known as computer vision) in addition to traditional user viewing needs. However, there is currently no image coding scheme compatible with both user vision-oriented image reconstruction tasks and machine vision-oriented image analysis tasks; therefore, how to provide a new image coding scheme that fuses image reconstruction tasks and image analysis tasks has become a technical problem urgently needing to be solved by those skilled in the art.
Summary of the Invention
In view of this, the embodiments of the present application provide image encoding, decoding, reconstruction, and analysis methods, a system, and an electronic device, to fuse user vision-oriented image reconstruction tasks and machine vision-oriented image analysis tasks.
To achieve the above purpose, the embodiments of the present application provide the following technical solutions.
In a first aspect, an embodiment of the present application provides an image encoding method, including:
acquiring an original image;
extracting high-level semantic information of the original image, the high-level semantic information being used for an image analysis task of the original image; and
extracting low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image;
encoding the high-level semantic information to generate a first code stream; and
encoding the low-level features to generate a second code stream.
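The encoding steps of the first aspect can be sketched as follows. All of the concrete choices here are illustrative assumptions, not the application's method: the "high-level semantic information" is reduced to a coarse per-row label, the "low-level features" to a subsampled copy of the pixels, and `zlib` stands in for the lossless entropy coder mentioned later in the description.

```python
# Minimal sketch of the two-stream encoding method (hypothetical stand-ins).
import zlib

def extract_high_level_semantics(image):
    # Stand-in: one coarse "object present" label per row; a real system
    # would use a semantic segmentation or detection network.
    return bytes(1 if sum(row) > 0 else 0 for row in image)

def extract_low_level_features(image):
    # Stand-in: keep every other pixel of every other row as detail features.
    return bytes(p for row in image[::2] for p in row[::2])

def encode(image):
    semantics = extract_high_level_semantics(image)
    features = extract_low_level_features(image)
    first_stream = zlib.compress(semantics)   # carries the analysis task
    second_stream = zlib.compress(features)   # carries reconstruction detail
    return first_stream, second_stream

image = [[0, 0, 0, 0], [10, 20, 30, 40], [0, 0, 0, 0], [50, 60, 70, 80]]
s1, s2 = encode(image)
# Both streams decode losslessly back to what was extracted.
assert zlib.decompress(s1) == extract_high_level_semantics(image)
assert zlib.decompress(s2) == extract_low_level_features(image)
```

The point of the two separate streams is that a receiver interested only in analysis can fetch and decode the first stream alone.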
In a second aspect, an embodiment of the present application provides an image encoding system, including:
a semantic extractor, configured to extract high-level semantic information of an original image, the high-level semantic information being used for an image analysis task of the original image;
a feature extractor, configured to extract low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image;
a first encoder, configured to encode the high-level semantic information to generate a first code stream; and
a second encoder, configured to encode the low-level features to generate a second code stream.
In a third aspect, an embodiment of the present application provides an image decoding method, including:
acquiring a first code stream corresponding to high-level semantic information of an original image, and a second code stream corresponding to low-level features of the original image;
decoding the first code stream to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image;
decoding the second code stream to obtain the low-level features; and
performing image reconstruction of the original image according to the low-level features and the high-level semantic information, to obtain a predicted image.
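The decoding steps of the third aspect can be sketched in the same toy setting. `zlib` again stands in for the lossless codec, the byte payloads are hypothetical placeholders, and the "reconstruction" is a trivial merge of the two decoded representations; the application's actual predictor is a learned network.

```python
# Minimal sketch of the two-stream decoding method (hypothetical stand-ins).
import zlib

def decode(first_stream, second_stream):
    semantics = zlib.decompress(first_stream)    # serves the analysis task
    features = zlib.decompress(second_stream)    # detail for reconstruction
    # Stand-in reconstruction: combine both representations into one result.
    predicted = semantics + b"|" + features
    return semantics, features, predicted

first = zlib.compress(b"labels:person,car")
second = zlib.compress(b"texture-bytes")
semantics, features, predicted = decode(first, second)
assert semantics == b"labels:person,car"
assert predicted == b"labels:person,car|texture-bytes"
```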
In a fourth aspect, an embodiment of the present application provides an image decoding system, including:
a first decoder, configured to decode a first code stream corresponding to high-level semantic information of an original image, to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image;
a second decoder, configured to decode a second code stream corresponding to low-level features of the original image, to obtain the low-level features; and
a predictor, configured to perform image reconstruction of the original image according to the low-level features and the high-level semantic information, to obtain a predicted image.
In a fifth aspect, an embodiment of the present application provides an image reconstruction method, including:
acquiring high-level semantic information and low-level features of an original image;
performing image reconstruction of the original image according to the low-level features and the high-level semantic information, to obtain a predicted image; and
acquiring difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.
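The residual step of the fifth aspect can be sketched as follows. The pixel-wise difference representation is an assumption for illustration; the application does not fix a concrete format for the difference information.

```python
# Minimal sketch of the fifth-aspect enhancement: the difference between
# the predicted image and the original is computed, and adding it back
# to the prediction enhances (here: exactly recovers) the original.
def difference_info(original, predicted):
    return [o - p for o, p in zip(original, predicted)]

def enhance(predicted, residual):
    return [p + r for p, r in zip(predicted, residual)]

original = [12, 200, 37, 90]
predicted = [10, 198, 40, 90]     # imperfect reconstruction
residual = difference_info(original, predicted)
assert enhance(predicted, residual) == original
```

In practice the residual would itself be compressed, trading bitrate for how closely the enhanced prediction matches the original.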
In a sixth aspect, an embodiment of the present application provides an image analysis method, including:
obtaining a target code stream from multiple code streams based on an image analysis task of an original image, the multiple code streams including at least one first code stream corresponding to at least one item of high-level semantic information of the original image, and a second code stream corresponding to low-level features of the original image, wherein each item of high-level semantic information of the original image corresponds to one first code stream, and the target code stream is the first code stream, among the at least one first code stream, whose high-level semantic information is applicable to the image analysis task;
decoding the target code stream to obtain the high-level semantic information applicable to the image analysis task; and
performing the image analysis task according to the decoded high-level semantic information.
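The stream selection of the sixth aspect can be sketched as follows. The task names, tags, and payloads are illustrative assumptions: each first code stream is keyed by the analysis task its semantics serve, and the receiver decodes only the stream matching the requested task.

```python
# Minimal sketch of sixth-aspect target-stream selection (hypothetical tags).
import zlib

streams = {
    "detection": zlib.compress(b"boxes:person@10,20"),    # a first code stream
    "segmentation": zlib.compress(b"mask:run-lengths"),   # another first code stream
    "low_level": zlib.compress(b"texture-bytes"),         # the second code stream
}

def select_target_stream(task, streams):
    # The target code stream is the first code stream whose high-level
    # semantic information suits the requested image analysis task.
    return streams[task]

target = select_target_stream("detection", streams)
semantics = zlib.decompress(target)
assert semantics.startswith(b"boxes:")  # ready for the detection task
```

Note that the second code stream is never touched: an analysis-only receiver avoids the cost of decoding the low-level features entirely.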
In a seventh aspect, an embodiment of the present application provides an electronic device, including at least one memory and at least one processor, wherein the memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to execute the image encoding method according to the first aspect, the image decoding method according to the third aspect, the image reconstruction method according to the fifth aspect, or the image analysis method according to the sixth aspect.
In an eighth aspect, an embodiment of the present application provides a storage medium, wherein the storage medium stores one or more computer-executable instructions, and when the one or more computer-executable instructions are executed, the image encoding method according to the first aspect, the image decoding method according to the third aspect, the image reconstruction method according to the fifth aspect, or the image analysis method according to the sixth aspect is implemented.
The image encoding method provided by the embodiments of the present application can extract high-level semantic information and low-level features from an original image separately. Since the high-level semantic information can express the semantics of the original image at the conceptual level, it can be used for the image analysis task of the original image, while the high-level semantic information and the low-level features can be combined for the image reconstruction task of the original image. Further, the embodiments of the present application encode the high-level semantic information and the low-level features separately to generate a first code stream and a second code stream, so that both can be transmitted to the receiving end in code-stream form. The embodiments of the present application thus realize, with a single coding scheme, the integration of user-vision-oriented image reconstruction tasks and machine-vision-oriented image analysis tasks.
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a block diagram of an image transmission system provided by an embodiment of the present application.
FIG. 2A is a flowchart of an image encoding and decoding method provided by an embodiment of the present application.
FIG. 2B is a schematic structural diagram of an image encoding system provided by an embodiment of the present application.
FIG. 2C is a schematic structural diagram of an image encoding and decoding system provided by an embodiment of the present application.
FIG. 3A is another flowchart of the image encoding and decoding method provided by an embodiment of the present application.
FIG. 3B is another schematic structural diagram of the image encoding system provided by an embodiment of the present application.
FIG. 3C is another schematic structural diagram of the image encoding and decoding system provided by an embodiment of the present application.
FIG. 4A is a flowchart of a method for obtaining a predicted image according to an embodiment of the present application.
FIG. 4B is a schematic structural diagram of a predictor provided by an embodiment of the present application.
FIG. 4C is a schematic structural diagram of a convolutional network provided by an embodiment of the present application.
FIG. 4D is another schematic structural diagram of the predictor provided by an embodiment of the present application.
FIG. 5A is a schematic structural diagram of a feature extractor provided by an embodiment of the present application.
FIG. 5B is another schematic structural diagram of the feature extractor provided by an embodiment of the present application.
FIG. 6 is yet another schematic structural diagram of the image encoding and decoding system provided by an embodiment of the present application.
FIG. 7 is still another schematic structural diagram of the image encoding and decoding system provided by an embodiment of the present application.
FIG. 8 is a flowchart of an image analysis method provided by an embodiment of the present application.
FIG. 9A is a schematic diagram of an effect comparison provided by an embodiment of the present application.
FIG. 9B is a schematic diagram of another effect comparison provided by an embodiment of the present application.
FIG. 9C is a schematic diagram of yet another effect comparison provided by an embodiment of the present application.
FIG. 9D is a schematic diagram of still another effect comparison provided by an embodiment of the present application.
FIG. 10 is a diagram of an application example provided by an embodiment of the present application.
FIG. 11 is a block diagram of an electronic device provided by an embodiment of the present application.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
FIG. 1 exemplarily shows a block diagram of an image transmission system 100 provided by an embodiment of the present application. As shown in FIG. 1, the image transmission system 100 may include a sending end 110 and a receiving end 120 of an image. The sending end 110 may be provided with an image encoding system 111, so as to implement image encoding using the image encoding scheme provided by the embodiments of the present application; the receiving end 120 may be provided with an image decoding system 121, so as to implement image decoding using the image decoding scheme provided by the embodiments of the present application.
The image transmission system 100 shown in FIG. 1 can be applied to image transmission between arbitrary devices, including but not limited to image transmission between terminals, between a terminal and a server, and between servers, where a terminal may include smart hardware (for example, smart hardware with image acquisition capability, such as a smart camera) and user equipment such as a mobile phone or a computer. In the embodiments of the present application, a device is not fixed as the sending end or the receiving end, but takes either role depending on whether it is sending or receiving an image; for example, a device acts as the sending end when sending an image and as the receiving end when receiving an image. In some embodiments, whichever of the terminal and the server sends an image acts as the sending end, using the image encoding scheme provided by the embodiments of the present application to implement image encoding; whichever of the terminal and the server receives the image acts as the receiving end, using the image decoding scheme provided by the embodiments of the present application to implement image decoding.
In the embodiments of the present application, when encoding an image, the sending end 110 needs to support both the user-vision-oriented image reconstruction task and the machine-vision-oriented image analysis task of the receiving end 120. The user-vision-oriented image reconstruction task can be understood as implementing image reconstruction at the receiving end 120, so that a user can view the image at the receiving end 120. The machine-vision-oriented image analysis task can be understood as the receiving end analyzing and processing the image from the perspective of a computer; for example, the receiving end implements image classification, object detection, instance segmentation, and the like. In some embodiments, application scenarios of machine-vision-oriented image analysis tasks include, but are not limited to: license plate recognition and road planning in intelligent transportation systems; object detection and road tracking in autonomous driving systems; and face and expression detection and analysis, and abnormal behavior detection, in smart medical systems.
In the embodiments of the present application, the sending end 110 may use the image encoding system 111 to implement image encoding while supporting the user-vision-oriented image reconstruction task and the machine-vision-oriented image analysis task; the receiving end 120 may use the image decoding system 121 to implement image decoding, specifically perform the user-vision-oriented image reconstruction task, and provide processing data for the receiving end 120 to perform the image analysis task. In some embodiments, FIG. 2A exemplarily shows a flowchart of the image encoding and decoding method provided by an embodiment of the present application. In this method flow, the image encoding process may be performed by the image encoding system 111, and the image decoding process may be performed by the image decoding system 121. As shown in FIG. 2A, the method flow may include the following steps.
In step S20, the image encoding system acquires an original image.
The original image can be regarded as an image that needs to be encoded, for example, an image captured by the sending end 110 using a camera, or an image produced by drawing or programming. The original image may be input into the image encoding system 111, which encodes the original image.
In step S21, the image encoding system extracts high-level semantic information of the original image.
In step S22, the image encoding system extracts low-level features of the original image.
The high-level semantic information of an image may be the semantics expressed by the image, for example, the objects the image depicts (such as people, plants, animals, and manufactured objects), the types of the objects, and the relationships between the objects. The low-level features of an image can be understood as detailed feature information of the image, such as its color and texture.
In some embodiments, the low-level features of an image can express rich visual detail information of the image, but carry little semantic information at the conceptual level; for example, the low-level features can accurately express the colors of objects in the image, but lack conceptual-level semantic information such as the types of the objects. As an optional implementation, the low-level features may be features of the visual layer of the image, and the high-level semantic information may be information of the conceptual layer of the image.
In the embodiments of the present application, the high-level semantic information can express the semantics of the original image at the conceptual level, and can therefore be used for the image analysis task of the original image. Since the low-level features express rich visual detail information of the original image and the high-level semantic information expresses its conceptual-level semantics, the low-level features and the high-level semantic information can be combined for the image reconstruction task of the original image.
In step S23, the image encoding system encodes the high-level semantic information to generate a first code stream.
In step S24, the image encoding system encodes the low-level features to generate a second code stream.
In the embodiments of the present application, the image analysis task and the image reconstruction task of the original image may be performed by the receiving end 120. Thus, after obtaining the high-level semantic information and the low-level features, the image encoding system may encode them separately to obtain the first code stream corresponding to the high-level semantic information and the second code stream corresponding to the low-level features. The first code stream and the second code stream may be transmitted from the sending end 110 to the receiving end 120.
In some embodiments, the high-level semantic information and the low-level features may be encoded using the same encoding method; for example, they may be encoded using the same encoder in the image encoding system. This shared encoder may be a lossless encoder, so as to achieve lossless encoding of both the high-level semantic information and the low-level features.
In step S30, the image decoding system decodes the first code stream to obtain the high-level semantic information, which is used to perform the image analysis task of the original image.
After acquiring the first code stream transmitted by the sending end 110, the image decoding system may decode it to obtain the high-level semantic information. In some embodiments, the high-level semantic information obtained by the image decoding system may provide processing data for the receiving end 120 to perform the image analysis task. In some further embodiments, the receiving end 120 may be provided with image analysis logic (for example, an image analysis model) for performing image analysis tasks; the image decoding system may feed the decoded high-level semantic information into the image analysis logic, which performs specific image analysis tasks based on the high-level semantic information, thereby satisfying the machine-vision-oriented image analysis needs for the original image. As an optional implementation, the image analysis logic may perform image analysis tasks such as image classification, object detection, and instance segmentation. As a possible alternative implementation, the image analysis logic may also be deployed on an external device communicatively connected to the receiving end 120.
In step S31, the image decoding system decodes the second code stream to obtain the low-level features.
In step S32, the image decoding system performs image reconstruction of the original image according to the low-level features and the high-level semantic information, to obtain a predicted image.
Since neither the low-level features nor the high-level semantic information alone can completely express the information of the original image, in the embodiments of the present application the two can be combined for the image reconstruction task of the original image. On this basis, after decoding the first code stream and the second code stream, the image decoding system may perform image reconstruction of the original image based on the decoded high-level semantic information and low-level features, so as to accomplish the image reconstruction task of the original image.
In some embodiments, the image decoding system uses the high-level semantic information as guidance information for image reconstruction, and reconstructs the specific image details expressed by the low-level features to obtain the predicted image. During reconstruction of the original image, the embodiments of the present application combine the high-level semantic information and low-level features of the original image. Since the high-level semantic information can precisely express the boundaries of objects in the original image and the relationships between objects (for example, occlusion relationships), performing image reconstruction based on the high-level semantic information ensures that the basic structure of the predicted image is similar to, or even consistent with, that of the original image. Meanwhile, by combining the rich image details expressed by the low-level signals, the embodiments of the present application can accurately reconstruct the specific details of the original image (for example, the color and texture of the objects in it), thereby ensuring the accuracy of local details in the predicted image. Therefore, reconstructing the original image by combining its high-level semantic information and low-level features enables the reconstructed predicted image to achieve high accuracy.
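The division of labor in step S32 can be illustrated with a toy example. Both representations are hypothetical stand-ins for the application's learned predictor: the high-level semantics supply the structure (which region each pixel belongs to, hence object boundaries), and the low-level features supply the detail (here, one intensity value per region).

```python
# Minimal illustration: semantics guide structure, low-level features
# fill in detail (hypothetical representations, not the patented predictor).
def reconstruct(semantic_map, region_detail):
    # semantic_map: a region label per pixel (conceptual-level structure).
    # region_detail: low-level detail per region (e.g. mean intensity).
    return [[region_detail[label] for label in row] for row in semantic_map]

semantic_map = [
    [0, 0, 1],
    [0, 1, 1],
]
region_detail = {0: 34, 1: 180}
predicted = reconstruct(semantic_map, region_detail)
assert predicted == [[34, 34, 180], [34, 180, 180]]
```

Even in this toy form, the object boundary in the predicted image is exactly where the semantic map places it, while the pixel values come entirely from the low-level detail.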
From the image encoding scheme provided by the embodiments of the present application, it can be seen that high-level semantic information and low-level features can be extracted from the original image separately. Since the high-level semantic information can express the semantics of the original image at the conceptual level, it can be used for the image analysis task of the original image, while the high-level semantic information and the low-level features can be combined for the image reconstruction task. Further, by encoding the high-level semantic information and the low-level features separately to generate the first code stream and the second code stream, both can be transmitted to the receiving end in code-stream form. The embodiments of the present application thus realize, with a single coding scheme, the integration of user-vision-oriented image reconstruction tasks and machine-vision-oriented image analysis tasks.
Based on the method flow shown in FIG. 2A, from the perspective of the image encoding system 111, FIG. 2B exemplarily shows a schematic structural diagram of the image encoding system 111 provided by an embodiment of the present application. As shown in FIG. 2B, the image encoding system 111 may include a semantic extractor 210, a feature extractor 211, a first encoder 212, and a second encoder 213.
As shown in FIG. 2B, after the original image is input into the image encoding system 111, the semantic extractor 210 is used to extract the high-level semantic information of the original image, and the feature extractor 211 is used to extract its low-level features.
In some embodiments, the semantic extractor 210 and the feature extractor 211 may be different network layers in a convolutional neural network, with the semantic extractor 210 located at a higher level of the network than the feature extractor 211. The higher a network layer sits in the convolutional neural network, the more its output tends toward the semantic information of the image; conversely, the lower the layer, the more its output tends toward the low-level detail features of the image.
As an optional implementation, the convolutional neural network may include a backbone network, and the backbone network may include standard convolutional layers at the lower levels and a high-level semantic information extraction layer at the higher levels. When an image is input into the convolutional neural network for processing, the embodiments of the present application may take the output of the standard convolutional layers as the low-level features of the image, and the output of the high-level semantic information extraction layer as the high-level semantic information of the image. It should be noted that the structure described in this paragraph is only one option; the embodiments of the present application may also use convolutional neural networks of other structures, taking the output of a lower-level network layer as the low-level features and the output of a higher-level network layer as the high-level semantic information.
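Tapping a backbone at two depths, as this optional implementation describes, can be sketched as follows. The three stages are toy functions standing in for real convolutional layers, chosen only so the contrast is visible: the early stage keeps per-element detail, while the final stage collapses everything into a single global value.

```python
# Minimal sketch of extracting low-level features from an early backbone
# stage and high-level semantics from the top stage (toy stages).
def run_backbone(image, low_tap=1):
    stages = [
        lambda x: [v * 2 for v in x],   # early stage: detail-preserving
        lambda x: [v + 1 for v in x],   # middle stage
        lambda x: [sum(x)],             # final stage: global, concept-level
    ]
    x, taps = image, []
    for stage in stages:
        x = stage(x)
        taps.append(x)
    low_level = taps[low_tap - 1]   # output of a low-level layer
    high_level = taps[-1]           # output of the highest layer
    return low_level, high_level

low, high = run_backbone([1, 2, 3])
assert low == [2, 4, 6]   # rich per-element detail survives
assert high == [15]       # collapsed, summary-like representation
```

In a framework like PyTorch the same idea is typically realized with forward hooks on intermediate layers, but the choice of which layer to tap for each representation remains a design decision.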
In the embodiments of the present application, the high-level semantic information can express the semantics of the original image at the conceptual level, and can therefore be used for the image analysis task of the original image. That is, after the receiving end 120 obtains the high-level semantic information of the original image, it can analyze and process the original image according to that information, so as to accomplish image analysis tasks such as image classification, object detection, and instance segmentation. Since neither the low-level features nor the high-level semantic information alone can completely express the information of the original image, in the embodiments of the present application the two can be combined for the image reconstruction task of the original image. In some embodiments, since the low-level features express rich visual detail information of the original image and the high-level semantic information expresses its conceptual-level semantics, the embodiments of the present application can perform image reconstruction of the original image according to its low-level features and high-level semantic information, so as to accomplish the image reconstruction task. For example, after the receiving end 120 obtains the low-level features and high-level semantic information of the original image, it may reconstruct the original image according to them.
Referring back to FIG. 2B, after the semantic extractor 210 extracts the high-level semantic information of the original image, the first encoder 212 may encode the high-level semantic information to generate a first code stream, which can be transmitted to the receiving end 120. After the feature extractor 211 extracts the low-level features of the original image, the second encoder 213 may encode the low-level features to generate a second code stream, which can likewise be transmitted to the receiving end 120.
In some embodiments, the first encoder 212 and the second encoder 213 may be the same encoder; that is, the embodiments of the present application may use one encoder to encode the high-level semantic information and the low-level features separately. For example, the first encoder 212 and the second encoder 213 may be the same lossless encoder, such as a FLIF (Free Lossless Image Format) encoder. It should be noted that a lossless encoder is only one optional form in which the first encoder 212 and the second encoder 213 are the same encoder; the embodiments of the present application may also support other forms of shared encoder. In other embodiments, the first encoder 212 and the second encoder 213 may be different encoders.
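The shared lossless encoder can be sketched as follows; zlib stands in for the FLIF codec named above, since any lossless codec gives the same round-trip guarantee:

```python
import zlib
import numpy as np

def lossless_encode(arr):
    # Stand-in lossless encoder (zlib instead of FLIF) shared by both streams.
    return zlib.compress(arr.tobytes())

def lossless_decode(data, shape, dtype):
    # Inverse of lossless_encode: exact recovery of the encoded array.
    return np.frombuffer(zlib.decompress(data), dtype=dtype).reshape(shape)

high_level = np.arange(16, dtype=np.uint8).reshape(4, 4)
low_level = np.arange(1024, dtype=np.uint8).reshape(32, 32)

first_stream = lossless_encode(high_level)   # first code stream (semantics)
second_stream = lossless_encode(low_level)   # second code stream (features)
decoded_semantics = lossless_decode(first_stream, (4, 4), np.uint8)
decoded_features = lossless_decode(second_stream, (32, 32), np.uint8)
```

Because the coding is lossless, the decoded semantics and features are bit-exact copies of the encoder's inputs, which is what lets the receiving end run analysis and reconstruction on them directly.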
In some further embodiments, after the image coding system 111 generates the first code stream and the second code stream, the sending end 110 may transmit the first code stream and the second code stream to the receiving end 120. The receiving end 120 can then decode the first code stream and the second code stream, using the high-level semantic information for the image analysis task and using both the high-level semantic information and the low-level features for the image reconstruction task.
Based on the method flow shown in FIG. 2A, FIG. 2C further shows, on the basis of FIG. 2B, a schematic structural diagram of the image encoding and decoding systems provided by the embodiment of the present application. It can be understood that the decoding process of the image decoding system 121 shown in FIG. 2C may be the inverse of the encoding process of the image encoding system 111. As shown in FIG. 2C, the image decoding system 121 may include a first decoder 220, a second decoder 221, and a predictor 222.
The first decoder 220 is configured to decode the first code stream to obtain the high-level semantic information of the original image, which can be used for image analysis tasks on the original image. In some further embodiments, the high-level semantic information output by the first decoder 220 may be fed into image analysis logic (for example, an image analysis model) that performs the specific image analysis tasks based on the high-level semantic information. The image analysis logic may reside at the receiving end 120 or on an external device communicatively connected to the receiving end 120.
The second decoder 221 is configured to decode the second code stream to obtain the low-level features of the original image.
In some embodiments, the first decoder 220 and the second decoder 221 may be the same decoder, for example the same lossless decoder, such as a FLIF decoder.
The high-level semantic information obtained by the first decoder 220 and the low-level features obtained by the second decoder 221 can be fed into the predictor 222. The predictor 222 is configured to reconstruct the original image according to the low-level features and the high-level semantic information, so as to obtain a predicted image. The predicted image obtained by the predictor 222 can be displayed to the user at the receiving end 120, thereby meeting the user's viewing needs for the original image.
The image encoding and decoding systems provided by the embodiments of the present application can use one pair of corresponding encoding and decoding frameworks to fuse the user-vision-oriented image reconstruction task with the machine-vision-oriented image analysis task.
In some further embodiments, if the image decoding system 121 reconstructs the original image purely from the high-level semantic information and the low-level features, the reconstructed image may deviate considerably from the original. Therefore, the embodiments of the present application may first perform image reconstruction in the image encoding system 111 using the high-level semantic information and the low-level features to obtain a predicted image of the original image, then determine the difference information between the predicted image and the original image, and transmit that difference information to the image decoding system 121. The image decoding system 121 can then, after reconstructing the image by combining the high-level semantic information and the low-level features, further use the difference information to enhance the reconstructed image, making the final reconstructed image more accurate. On this basis, FIG. 3A exemplarily shows another flowchart of the image encoding and decoding method provided by the embodiment of the present application. As shown in FIG. 2A and FIG. 3A, the method flow of FIG. 3A further includes the following steps on the basis of the method flow of FIG. 2A.
In step S25, the image coding system reconstructs the original image according to the low-level features and the high-level semantic information, so as to obtain a predicted image.
In step S26, the image coding system determines difference information between the predicted image and the original image; the difference information is used to enhance the predicted image in the image reconstruction task.
In step S27, the image coding system encodes the difference information to generate a third code stream.
In the embodiment of the present application, in addition to performing steps S20 to S24, the image coding system further performs steps S25 to S27.
In some embodiments, the image coding system may use the high-level semantic information as guidance for image reconstruction, reconstructing the specific image details expressed by the low-level features to obtain a predicted image. The image coding system may compare the predicted image with the original image to determine the difference information between them; in the image reconstruction task at the receiving end, this difference information is used to enhance the predicted image reconstructed there, so that the enhanced image is closer to the original image. To enable the difference information to be transmitted to the receiving end, after the image coding system obtains the difference information it may encode it to generate a third code stream, which can be transmitted from the sending end to the receiving end.
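A minimal sketch of the difference step, assuming the difference information is a simple residual (original minus predicted) as the residual embodiment suggests; all values are synthetic:

```python
import numpy as np

# Minimal residual sketch: the difference information is assumed to be the
# residual original - predicted, computed at the encoder (step S26).
original = np.linspace(0.0, 1.0, 64).reshape(8, 8)
predicted = original + 0.05         # stand-in imperfect prediction (step S25)
residual = original - predicted     # difference information (step S26)

# If the residual survives coding intact, adding it back at the receiving
# end recovers the original exactly (enhancement with zero coding loss).
enhanced = predicted + residual
```

With lossy coding of the residual (as in the VVC embodiment below in the text), the recovery is approximate rather than exact, with the error bounded by the coding loss.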
In some embodiments, the above difference information may be residual information between the predicted image and the original image.
In some embodiments, the image coding system may perform lossy coding on the difference information to generate the third code stream, for example VVC (Versatile Video Coding). As an optional implementation, since lossy coding entails a certain coding loss, the embodiments of the present application may determine the coding loss of the difference information based on a negative correlation between the image quality requirement (QP) of the reconstructed image and the coding loss of the difference information, so as to obtain a third code stream adapted to the image quality requirement.
As an optional implementation, the embodiments of the present application may control the coding loss of the difference information based on the image quality requirement of the reconstructed image. For example, the higher the image quality requirement of the reconstructed image, the lower the coding loss applied to the difference information; the lower the image quality requirement, the higher the coding loss. In some further embodiments, the image quality requirement may be set based on the network bandwidth; for example, the network bandwidth and the image quality requirement are positively correlated, that is, the higher the network bandwidth, the higher the image quality requirement. Thus, the embodiments of the present application can encode third code streams of different data sizes according to different image quality requirements, so as to adapt to different network bandwidth conditions.
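The two correlations above (bandwidth to quality requirement, quality requirement to coding loss) can be sketched with an illustrative QP mapping; the linear formulas and the QP range 10-51 are assumptions, though 51 matches the usual VVC QP ceiling:

```python
def quality_requirement(bandwidth_mbps, max_bandwidth=100.0):
    # Assumed positive correlation: higher bandwidth -> higher image quality
    # requirement, normalized to [0, 1]. Threshold values are illustrative.
    return min(bandwidth_mbps, max_bandwidth) / max_bandwidth

def residual_qp(quality, qp_min=10, qp_max=51):
    # Assumed negative correlation: a higher quality requirement maps to a
    # lower QP, i.e. a lower coding loss for the lossy residual coder.
    return round(qp_max - quality * (qp_max - qp_min))

best = residual_qp(quality_requirement(100.0))  # ample bandwidth -> least loss
worst = residual_qp(quality_requirement(0.0))   # no bandwidth -> most loss
```

A lower QP produces a larger third code stream, which is exactly the adaptation to bandwidth the text describes.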
In step S33, the image decoding system decodes the third code stream to obtain the difference information.
In step S34, the image decoding system performs image enhancement processing on the predicted image according to the difference information, so as to obtain an enhanced image for display.
In the embodiment of the present application, the image decoding system further performs steps S33 and S34 in addition to steps S30 to S32.
In some embodiments, after the image decoding system reconstructs the image based on the low-level features and the high-level semantic information to obtain the predicted image, it may further use the difference information obtained by decoding the third code stream to perform image enhancement processing on the predicted image, so as to obtain an enhanced image. The enhanced image can serve as the final reconstructed image in the embodiment of the present application and be displayed to the user.
In some embodiments, the image decoding system may perform lossy decoding on the third code stream, for example VVC decoding.
The image encoding and decoding methods provided by the embodiments of the present application determine the difference information between the predicted image and the original image and then use it at the receiving end to enhance the reconstructed predicted image, so that the final reconstructed enhanced image is closer to the original image, improving the accuracy of image reconstruction.
In some further embodiments, the image encoding system 111 may decide whether to encode and transmit the difference information based on the network bandwidth of the sending end 110.
If the network bandwidth is lower than a set bandwidth threshold, then, since the network bandwidth is limited, the image encoding system 111 may skip encoding and transmitting the difference information in order to guarantee the user a continuous image playback experience at the receiving end; for example, after obtaining the difference information, the image encoding system 111 may cancel step S27 and leave the difference information unencoded. The image encoding system 111 may then wait until the network bandwidth of the sending end exceeds the bandwidth threshold before encoding and transmitting the difference information to the receiving end.
It should be noted that if the network bandwidth of the sending end is higher than the bandwidth threshold, then, since the network bandwidth is ample, more code stream information can be transmitted between the sending end and the receiving end, allowing the receiving end to provide a high-definition image playback experience on top of continuous playback. In this case, the embodiments of the present application may encode and transmit the difference information. For example, after obtaining the difference information, the image encoding system 111 may directly encode it to generate the third code stream, which the sending end transmits to the receiving end.
As an implementation example, the receiving end may play video images based on the information continuously transmitted by the sending end. If the current network bandwidth is low, the embodiments of the present application may reduce the definition of the video images played at the receiving end; the sending end may then skip encoding the difference information and transmit only the first and second code streams. The receiving end reconstructs the image based on the low-level features and the high-level semantic information and displays the reconstructed image, guaranteeing continuous video playback at reduced definition. If the current network bandwidth is high, the embodiments of the present application may increase the definition of the video images played at the receiving end; the sending end then encodes and transmits the difference information, sending the first, second, and third code streams. The receiving end can reconstruct the image based on the low-level features and the high-level semantic information and use the difference information to enhance the reconstructed image, so that the enhanced image has higher definition, guaranteeing continuous playback of high-definition video images.
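The bandwidth gate described in the last three paragraphs can be sketched as a simple selector; the threshold value is an illustrative assumption:

```python
def streams_to_send(bandwidth_mbps, threshold_mbps=20.0):
    # Below the threshold, only the first (semantics) and second (features)
    # code streams are sent, dropping the residual to keep playback continuous;
    # above it, the third (residual) stream is added for higher definition.
    streams = ["first", "second"]
    if bandwidth_mbps > threshold_mbps:
        streams.append("third")
    return streams

low_bw = streams_to_send(5.0)    # limited bandwidth: skip the residual
high_bw = streams_to_send(50.0)  # ample bandwidth: send all three streams
```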
Based on the method flow shown in FIG. 3A, and from the perspective of the image encoding system 111, FIG. 3B exemplarily shows another schematic structural diagram of the image encoding system 111 provided by the embodiment of the present application. As shown in FIG. 2B and FIG. 3B, the image encoding system 111 may further include a predictor 222, a comparator 214, and a third encoder 215.
In the embodiment of the present application, the high-level semantic information extracted by the semantic extractor 210 and the low-level features extracted by the feature extractor 211 can be fed into the predictor 222, which may reconstruct the original image according to the low-level features and the high-level semantic information to obtain a predicted image.
The predicted image obtained by the predictor 222 can be input into the comparator 214, which may compare the predicted image with the original image to obtain the difference information between them. The difference information can be used to enhance the predicted image in the image reconstruction task at the receiving end. In some embodiments, the comparator 214 may include a subtractor, which performs residual processing on the predicted image and the original image to obtain residual information between them; this residual information may serve as the difference information referred to above.
The difference information obtained by the comparator 214 can be input into the third encoder 215, which may encode the difference information to generate a third code stream. In some embodiments, the third encoder 215 may include a lossy encoder, for example a VVC (Versatile Video Coding) encoder. In the optional implementation where the difference information is encoded with a lossy coding scheme, the embodiments of the present application may determine the coding loss of the difference information based on a negative correlation between the image quality requirement (QP) and the coding loss, so as to obtain a third code stream adapted to the image quality requirement.
In some further embodiments, after the image encoding system 111 generates the third code stream, the sending end 110 may transmit it to the receiving end 120.
Based on the method flow shown in FIG. 3A, FIG. 3C further shows, on the basis of FIG. 3B, another schematic structural diagram of the image encoding and decoding systems provided by the embodiment of the present application. After the image encoding system 111 generates the first, second, and third code streams, they can be transmitted to the image decoding system 121 of the receiving end 120. The image decoding system 121 can then perform image decoding based on the structure shown in FIG. 3C, accommodating both the image analysis task and the image reconstruction task of the original image. As shown in FIG. 3B, FIG. 2C, and FIG. 3C, the image decoding system 121 of FIG. 3C further includes a third decoder 223 and an image enhancer 224.
The third decoder 223 is configured to decode the third code stream to obtain the difference information (for example, residual information) between the original image and the predicted image. In some embodiments, the third decoder 223 may differ from the first decoder 220 and the second decoder 221; for example, the third decoder 223 may include a lossy decoder, such as a VVC decoder.
The image enhancer 224 may perform image enhancement processing on the predicted image obtained by the predictor 222 according to the difference information, so as to obtain an enhanced image. Since the difference information expresses the difference between the predicted image and the original image, introducing it into the enhancement of the predicted image makes the enhanced image closer to the original image, improving the accuracy of the image reconstruction task. In the architecture shown in FIG. 3C, the enhanced image obtained by the image enhancer 224 can serve as the image finally displayed to the user, thereby meeting the user's viewing needs for the original image.
In some embodiments, after the image encoding system 111 generates the first, second, and third code streams, the sending end 110 may transmit all three to the receiving end 120. The image decoding system 121 may then determine, based on a task instruction of the receiving end 120, whether to decode only the first code stream to support the image analysis task, or to decode all three code streams to perform the image reconstruction task.
In some further embodiments, if the task instruction indicates an image analysis task, the image decoding system 121 may decode the first code stream to obtain the high-level semantic information and transmit it to the image analysis logic to support performing the image analysis task.
If the task instruction indicates an image reconstruction task, the image decoding system 121 may decode the first, second, and third code streams respectively to obtain the high-level semantic information, the low-level features, and the difference information; it may then reconstruct the image using the high-level semantic information and the low-level features to obtain a predicted image, and further use the difference information to perform image enhancement processing on the predicted image to obtain an enhanced image.
If the task instruction indicates both an image analysis task and an image reconstruction task, the image decoding system 121 may decode the first, second, and third code streams respectively, use the high-level semantic information to perform the image analysis task, and use the high-level semantic information, the low-level features, and the difference information to perform the image reconstruction task.
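The three branches above can be sketched as a dispatcher; actual stream decoding is represented here by dictionary lookups, and the task names are illustrative:

```python
def decode_for_task(tasks, streams):
    # tasks: a set drawn from {"analysis", "reconstruction"}.
    outputs = {}
    if "analysis" in tasks:
        # Image analysis needs only the first stream (high-level semantics).
        outputs["semantics"] = streams["first"]
    if "reconstruction" in tasks:
        # Reconstruction needs all three streams: semantics, features,
        # and the residual used for enhancement.
        outputs["semantics"] = streams["first"]
        outputs["features"] = streams["second"]
        outputs["residual"] = streams["third"]
    return outputs

streams = {"first": "S", "second": "F", "third": "R"}
analysis_only = decode_for_task({"analysis"}, streams)
both = decode_for_task({"analysis", "reconstruction"}, streams)
```

Skipping the second and third streams for analysis-only tasks is the practical payoff of the split: the machine-vision path never pays for the detail information it does not need.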
In some further embodiments, the task instruction is adapted to the user's current needs; for example, the user may trigger different task instructions at the receiving end. In other embodiments, the task instruction may also be set by the system by default. It can be understood that selecting which code streams to decode based on the task instruction also applies to the case where the image encoding system shown in FIG. 2C transmits only the first and second code streams; in that case, the image decoding system 121 may skip decoding the third code stream and the image enhancement processing, while the other processes follow the corresponding descriptions above and are not repeated here.
The image encoding and decoding systems provided by the embodiments of the present application can use one pair of corresponding encoding and decoding frameworks to fuse the user-vision-oriented image reconstruction task with the machine-vision-oriented image analysis task. Moreover, the embodiments of the present application use the difference information between the original image and the initially reconstructed predicted image in the image reconstruction task, which can further improve the accuracy of image reconstruction.
Based on the image reconstruction task of the original image, the embodiment of the present application provides an optional image reconstruction method, including the following steps: obtaining the high-level semantic information and the low-level features of the original image; reconstructing the original image according to the low-level features and the high-level semantic information to obtain a predicted image; and obtaining difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.
In some embodiments, in an optional implementation of obtaining the predicted image, the embodiments of the present application may use the high-level semantic information as guidance for image reconstruction and reconstruct the specific image details expressed by the low-level features to obtain the predicted image. As an optional implementation, FIG. 4A exemplarily shows a flowchart of a method for obtaining a predicted image according to an embodiment of the present application. This method flow may be implemented by the predictor 222, which may reside in the image encoding system or in the image decoding system. Referring to FIG. 4A, the method flow may include the following steps.
In step S40, the low-level features are processed to obtain target features for further processing with the high-level semantic information.
In some embodiments, the embodiments of the present application may upsample the low-level features to obtain target features for further processing (for example, stacking) with the high-level semantic information. In an optional implementation, the upsampling may be nearest-neighbor upsampling by a set factor, for example 8x nearest-neighbor upsampling. As an optional implementation, the low-level features may be processed (for example, upsampled) based on the resolution of the high-level semantic information, so that the resolution of the resulting target features matches that of the high-level semantic information; of course, the embodiments of the present application may also support the target features and the high-level semantic information having different resolutions, and are not limited to the two being the same.
In step S41, the target features are stacked with the high-level semantic information and then subjected to convolution processing to obtain convolutional features.
In some embodiments, the target features obtained in step S40 can be stacked with the high-level semantic information to obtain stacked features, which can be input into a convolutional network for convolution processing to obtain the convolutional features output by that network. In some further embodiments, the convolutional network may first apply multiple convolutions to the stacked features to obtain first convolutional features; it may then apply multiple filtering operations to the first convolutional features to obtain filtered features; the filtered features may be further processed by multiple convolutions to obtain second convolutional features; finally, the second convolutional features may be output as the convolutional features through the activation function of the convolutional network.
In step S42, the convolutional features are combined with the low-level features to obtain a predicted image.

After the convolutional features are obtained, and since they can express richer image information, embodiments of the present application may combine the convolutional features with the low-level features to obtain a predicted image. In some embodiments, the convolutional features may be added to the low-level features to obtain the predicted image. Of course, the way the convolutional features and low-level features are combined is not limited to addition; embodiments of the present application may also support other combination methods.
As an optional implementation in which the predictor 222 executes the flow shown in FIG. 4A, FIG. 4B exemplarily shows a schematic structural diagram of the predictor 222 provided in an embodiment of the present application. As shown in FIG. 4B, the predictor 222 may include: an upsampler 410, a stacker 420, a convolutional network 430, and an adder 440.

The upsampler 410 may be used to upsample the low-level features to obtain target features for further processing with the high-level semantic information. In some embodiments, the upsampler 410 may upsample the low-level features to obtain target features with the same resolution as the high-level semantic information. Of course, embodiments of the present application may also support the case where the target features obtained by upsampling have a resolution different from that of the high-level semantic information. In some embodiments, the upsampler may apply nearest-neighbor upsampling by a set factor (for example, 8× nearest-neighbor upsampling) to the low-level features.

The stacker 420 may stack the high-level semantic information with the target features obtained by the upsampler 410 to obtain stacked features. The stacked features may be input into the convolutional network 430, which may perform convolution processing on them to obtain convolutional features.

The adder 440 may be used to add the low-level features to the convolutional features output by the convolutional network 430 to obtain the predicted image.

The predictor 222 provided in embodiments of the present application can combine the high-level semantic information and low-level features of the original image to reconstruct the original image, so that the reconstructed predicted image has higher accuracy.
As an optional implementation of the convolutional network 430, FIG. 4C exemplarily shows a schematic structural diagram of the convolutional network 430 provided by an embodiment of the present application. As shown in FIG. 4C, the convolutional network 430 may include: a first group of convolutional layers 431, a plurality of convolutional filter banks 432, a second group of convolutional layers 433, and an activation function 434.

The first group of convolutional layers 431 may include multiple sequentially connected convolutional layers, and the convolution configuration of each layer in the first group 431 may differ.

A convolutional filter bank 432 may include multiple layers of convolutional filters, and the filtering configuration of each filter layer within a bank 432 may be the same.

The second group of convolutional layers 433 may include multiple sequentially connected convolutional layers, and the convolution configuration of each layer in the second group 433 may differ.

In some embodiments, the convolution configuration of a convolutional layer may include one or more of: the number of output filters, the convolution kernel size, the convolution stride, whether a normalization function is set, and whether a rectified linear unit is set. Of course, embodiments of the present application may also support some convolutional layers in the first group 431 and the second group 433 sharing the same convolution configuration, or even all convolutional layers sharing the same configuration.
In an embodiment of the present application, the stacked features obtained by stacking the target features with the high-level semantic information may be input into the first group of convolutional layers 431, which may convolve the stacked features through its multiple layers to obtain the first convolutional features. The first group 431 may output the first convolutional features to the plurality of convolutional filter banks 432, which may filter them to obtain filtered features. The filter banks 432 may output the filtered features to the second group of convolutional layers 433, which may convolve them through its multiple layers to obtain the second convolutional features. The second group 433 may output the second convolutional features to the activation function 434, which outputs the corresponding convolutional features.

In some further embodiments, the number of convolutional layers in the first group 431 and the second group 433, and the convolution configuration of each layer, may be determined according to actual conditions; embodiments of the present application set no limit on this. Similarly, the number of convolutional filters in a filter bank 432 and the filtering configuration of each filter may be determined according to actual conditions, without limitation.

As an optional implementation for configuring the convolutional network 430, an embodiment of the present application may define the number of convolutional layers in the first group 431 and the second group 433, as well as each layer's number of output filters, convolution kernel size, convolution stride, whether a normalization function is set, and whether a rectified linear unit is set. As an optional implementation, FIG. 4D exemplarily shows another schematic structural diagram of the predictor 222 provided in an embodiment of the present application. As shown in FIG. 4C and FIG. 4D, the predictor 222 in FIG. 4D instantiates the structure of the convolutional network 430.
As shown in FIG. 4D, the first group of convolutional layers 431 may include four sequentially connected convolutional layers: Conv60 7x7 s1 Norm ReLU, Conv120 3x3 s2 Norm ReLU, Conv240 3x3 s2 Norm ReLU, and Conv480 3x3 s2 Norm ReLU. Taking the first of these layers as an example: Conv60 indicates that the layer has 60 output filters; 7x7 indicates the convolution kernel size of the output filters; s1 indicates a convolution stride of 1 (correspondingly, s2 indicates a stride of 2, and so on); Norm indicates that a normalization function is set (correspondingly, the absence of Norm from a configuration means no normalization function is set in that layer); and ReLU indicates that a rectified linear unit is set (correspondingly, the absence of ReLU means no rectified linear unit is set in that layer). That is, based on the above configuration, the first convolutional layer includes 60 output filters, a normalization function, and a rectified linear unit, with each output filter having a 7x7 kernel and a stride of 1. The specific configurations of the other layers in the first group 431 can be interpreted in the same way.
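The layer notation explained above is regular enough to parse mechanically. The helper below is an illustrative sketch (the function name and dictionary fields are assumptions); only the interpretation of the tokens follows the text.

```python
# Parse the layer notation used in FIG. 4D, e.g. "Conv60 7x7 s1 Norm ReLU"
# or "ConvT240 3x3 s2 Norm ReLU" (ConvT = deconvolution / transposed conv).

def parse_layer(spec):
    tokens = spec.split()
    head = tokens[0]
    transposed = head.startswith("ConvT")
    filters = int(head[5:] if transposed else head[4:])
    kh, kw = (int(x) for x in tokens[1].lower().split("x"))
    stride = int(tokens[2][1:])      # "s1" -> 1, "s2" -> 2
    return {
        "filters": filters,
        "kernel": (kh, kw),
        "stride": stride,
        "transposed": transposed,
        "norm": "Norm" in tokens,    # absent -> no normalization function
        "relu": "ReLU" in tokens,    # absent -> no rectified linear unit
    }

cfg = parse_layer("Conv60 7x7 s1 Norm ReLU")
# cfg["filters"] == 60, cfg["kernel"] == (7, 7), cfg["stride"] == 1
last = parse_layer("Conv3 7x7 s1")
# last["norm"] and last["relu"] are both False, matching the final layer of 433
```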
The plurality of convolutional filter banks 432 may include nine sequentially connected banks. Each bank may include two layers of convolutional filters, each layer may include 480 convolutional filters, and each filter has a 3x3 convolution kernel and a stride of 1.

The second group of convolutional layers 433 may include four sequentially connected convolutional layers: ConvT240 3x3 s2 Norm ReLU, ConvT120 3x3 s2 Norm ReLU, ConvT60 3x3 s2 Norm ReLU, and Conv3 7x7 s1. Here, ConvT denotes deconvolution, and ConvT240 indicates that the deconvolution layer has 240 output filters. It should be noted that, for the last layer Conv3 7x7 s1 in the second group 433, since its convolution configuration contains neither Norm nor ReLU, no normalization function or rectified linear unit is set in that layer. The meaning of each layer's configuration in the second group 433 can be interpreted with reference to the foregoing description and is not repeated here.

It should be noted that the specific structure of the convolutional network 430 shown in FIG. 4D is only one optional structure; embodiments of the present application may also modify, adjust, or replace the structure shown in FIG. 4D, as long as the convolutional network 430 can convolve the features formed by stacking the target features with the high-level semantic information, and the resulting convolutional features can express richer image information.
In some further embodiments, the feature extractor 211 of the image encoding system 111 may use multiple convolutional layers and an activation function to extract the low-level features of the original image. As an optional implementation, the multiple convolutional layers in the feature extractor 211 may extract features from the original image, which are then output as low-level features by the activation function. FIG. 5A shows a schematic structural diagram of the feature extractor 211 provided by an embodiment of the present application. As shown in FIG. 5A, the feature extractor 211 may include sequentially connected convolutional layers followed by an activation function. The convolution configurations of the layers may all differ or may be partially the same. After the original image is input into the feature extractor 211, the convolutional layers extract features from it, and the extracted features are output as low-level features through the activation function.

In an implementation example, FIG. 5B shows another schematic structural diagram of the feature extractor 211 provided by an embodiment of the present application. As shown in FIG. 5A and FIG. 5B, in the feature extractor 211 of FIG. 5B, the convolutional layers may include five layers: Conv60 7x7 s1 Norm ReLU, Conv120 3x3 s2 Norm ReLU, Conv240 3x3 s2 Norm ReLU, Conv480 3x3 s2 Norm ReLU, and Conv3 3x3 s1 Norm ReLU. It should be noted that the specific number of convolutional layers and the specific configuration of each layer shown in FIG. 5B are only an optional example, and embodiments of the present application set no limit on this.

In some further embodiments, the low-level features can compactly express the detail features of the original image; for example, the low-level features may represent compact features of the original image at a set scale. In an implementation example, the low-level features may represent compact features at 1/64 the size of the original image (for example, the low-level features may be 1/8 of the original image in both the width and height dimensions, giving 1/8 × 1/8 = 1/64 of its area).
As an optional implementation, so that the low-level features can compactly express the detail features of the original image while the feature size is preserved, embodiments of the present application may apply a ReflectionPad (mirror padding) of a first set size, for example a ReflectionPad of size 3, before the first convolutional layer and the final activation function of the feature extractor, and apply a ReflectionPad of a second set size, for example one with length and width of 1, before the remaining network layers (for example, the remaining convolutional layers) of the feature extractor.
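For clarity, mirror padding can be sketched as below. This is an illustrative pure-Python version on a single-channel 2D map, assuming the common ReflectionPad convention in which the border pixel itself is not repeated; the function names are not from the patent.

```python
# Illustrative sketch of ReflectionPad (mirror padding) for one channel.
# With pad=1, row [1, 2, 3] becomes [2, 1, 2, 3, 2]: values are mirrored
# across the border without duplicating the edge pixel.

def reflection_pad_1d(row, pad):
    left = row[1:pad + 1][::-1]      # mirror, skipping the edge element
    right = row[-pad - 1:-1][::-1]
    return left + row + right

def reflection_pad_2d(img, pad):
    rows = [reflection_pad_1d(r, pad) for r in img]
    top = rows[1:pad + 1][::-1]
    bottom = rows[-pad - 1:-1][::-1]
    return [list(r) for r in top + rows + bottom]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
padded = reflection_pad_2d(img, 1)
# padded is 5x5; padded[0] == [5, 4, 5, 6, 5]
```

Padding each layer this way keeps the spatial size of the output aligned with the chosen kernel size (size-3 padding for the 7x7 kernels, size-1 padding for the 3x3 kernels), which is why the text pairs the larger ReflectionPad with the first layer.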
In some embodiments, when the feature extractor 211 extracts the low-level features of the original image, embodiments of the present application may use channel normalization (ChannelNorm) technology to obtain compact low-level features that express the fine details of the image. As an optional implementation, embodiments of the present application may, based on channel normalization technology, determine the low-level features of the original image corresponding to each channel, and then determine the low-level features of the original image from these per-channel low-level features. For example, the feature extractor 211 may use multiple convolutional layers with channel normalization to determine the per-channel low-level features of the original image; the per-channel low-level features may then be combined at the activation function of the feature extractor 211 to obtain the low-level features of the original image.

In an implementation example, taking the original image as a two-dimensional image with two-dimensional positions, and considering the computation of the low-level features of the original image for the current channel, embodiments of the present application may determine: the unit pixel of the current channel of the original image at a two-dimensional position, the mean over all channels of the original image at that position, and the mean square error over all channels at that position. Embodiments of the present application may then determine the low-level features of the original image for the current channel from the current channel's unit pixel at the two-dimensional position, the mean and mean square error over all channels of the original image at that position, and the set offsets of the current channel.
Taking the total number of channels of the original image as M, and taking the computation of the low-level features of the original image for the c-th channel as an example, embodiments of the present application may, based on channel normalization technology, compute the low-level features of the original image for the c-th channel using the following formula:

f'_chw = α_c · (f_chw − μ_hw) / √(σ_hw²) + β_c

where f'_chw denotes the low-level feature of the original image for the c-th channel; c denotes the c-th channel of the original image, with c ranging from 1 to M; f_chw denotes the unit pixel of the c-th channel of the original image at position (h, w); μ_hw denotes the mean of the M channels at position (h, w) of the original image; σ_hw² denotes the mean square error of the M channels at position (h, w) of the original image; and α_c and β_c denote the learned offsets of the c-th channel.
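A direct pure-Python reading of the formula above can be sketched as follows. The small epsilon added for numerical stability is an assumption not stated in the patent formula, and the function name and list representation are illustrative.

```python
# Illustrative channel normalization: at each position (h, w), the pixel of
# channel c is normalized by the mean and the square root of the mean square
# error computed across all M channels, then scaled/shifted by the learned
# per-channel offsets alpha_c and beta_c.
import math

def channel_norm(f, alpha, beta, eps=1e-6):
    """f: list of M channels, each an H x W 2D list."""
    M, H, W = len(f), len(f[0]), len(f[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(M)]
    for h in range(H):
        for w in range(W):
            mu = sum(f[c][h][w] for c in range(M)) / M
            mse = sum((f[c][h][w] - mu) ** 2 for c in range(M)) / M
            for c in range(M):
                out[c][h][w] = (alpha[c] * (f[c][h][w] - mu)
                                / math.sqrt(mse + eps) + beta[c])
    return out

# Two channels, 1x1 spatial extent: values 1.0 and 3.0 -> mean 2.0, mse 1.0
f = [[[1.0]], [[3.0]]]
g = channel_norm(f, alpha=[1.0, 1.0], beta=[0.0, 0.0])
# g[0][0][0] is approximately -1.0 and g[1][0][0] approximately 1.0
```

Because the statistics are computed over channels rather than over the whole feature map, the normalization is independent of spatial size, which fits the goal of keeping the low-level features compact.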
By using channel normalization technology to determine the per-channel low-level features of the original image, and then combining them to obtain the low-level features of the original image, embodiments of the present application can significantly reduce the size of the low-level features while ensuring that they express the rich detail information of the original image. In turn, embodiments of the present application can reduce the encoding and decoding overhead of the low-level features, as well as the transmission overhead, between the sending end and the receiving end, of the second code stream corresponding to the low-level features.

In some further embodiments, when extracting the high-level semantic information, the image encoding system 111 may extract high-level semantic information corresponding to the type of the image analysis task. The image encoding system 111 may set a semantic extractor corresponding to the type of image analysis task, so as to extract from the original image the high-level semantic information corresponding to that task type. As an optional implementation, one piece of high-level semantic information may support at least one type of image analysis task, that is, one piece of high-level semantic information may support one type, or simultaneously multiple types, of image analysis tasks. For example, the high-level semantic information may include any one of: instance segmentation map information supporting instance segmentation, image classification, and object detection tasks; and stick-figure map information supporting human pose recognition tasks.
FIG. 6 shows yet another schematic structural diagram of the image encoding and decoding systems provided by an embodiment of the present application. As shown in FIG. 6, the semantic extractor 210 set in the image encoding system 111 may be an instance segmenter 610, which may extract instance segmentation map information from the original image based on instance segmentation technology. In embodiments of the present application, the instance segmentation map information may serve as the high-level semantic information of the original image.

The instance segmentation map information may be imported into the predictor 222 and the FLIF encoder 620, where the FLIF encoder 620 may be the same lossless encoder used by the first encoder 212 and the second encoder 213. The FLIF encoder 620 may losslessly encode the instance segmentation map information to generate the first code stream, which may be transmitted to the image decoding system 121.

The feature extractor 211 extracts the low-level features of the original image, which may be imported into the predictor 222 and the FLIF encoder 620. The FLIF encoder 620 may losslessly encode the low-level features to generate the second code stream, which may be transmitted to the image decoding system 121.

In the image encoding system 111, the predictor 222 may perform image reconstruction according to the instance segmentation map information and the low-level features to obtain a predicted image. The subtractor 630 may determine the residual information between the predicted image and the original image. The residual information may be imported into the VVC encoder 640, which may lossily encode it to generate the third code stream. The third code stream may be transmitted to the image decoding system 121. Here, the subtractor 630 may be an optional form of the comparator 214, and the VVC encoder 640 may be the lossy encoder used by the third encoder 215.

In the image decoding system 121, the FLIF decoder 650 may decode the first code stream and the second code stream to obtain the instance segmentation map information and the low-level features. The FLIF decoder 650 may be the same lossless decoder used by the first decoder 220 and the second decoder 221. The instance segmentation map information obtained by the FLIF decoder 650 may be imported into instance segmentation logic, image classification logic, and object detection logic, so as to perform image analysis tasks such as instance segmentation, image classification, and object detection on the original image. Meanwhile, the instance segmentation map information and low-level features obtained by the FLIF decoder 650 may be imported into the predictor 222 in the image decoding system 121. The predictor 222 may reconstruct the original image according to the instance segmentation map information and the low-level features to obtain a predicted image, which may be imported into the image enhancer 224.

In the image decoding system 121, the VVC decoder 660 may decode the third code stream to obtain the residual information between the predicted image and the original image. The VVC decoder 660 may be the lossy decoder used by the third decoder 223. The residual information may be imported into the image enhancer 224, which may perform image enhancement processing on the predicted image according to the residual information to obtain an enhanced image. The enhanced image may be displayed to users, so as to satisfy user-vision-oriented image reconstruction requirements.
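The residual path described above (subtractor 630 at the encoder, image enhancer 224 at the decoder) can be illustrated numerically. The sketch below treats the lossy VVC coding of the residual as an identity so as to show only the data flow, not the codec; all names are illustrative assumptions.

```python
# Encoder side: residual = original - predicted (FIG. 6, subtractor 630).
# Decoder side: enhanced = predicted + decoded residual (image enhancer 224).
# VVC encoding/decoding of the residual is omitted (treated as lossless here).

def compute_residual(original, predicted):
    return [[o - p for o, p in zip(ro, rp)]
            for ro, rp in zip(original, predicted)]

def enhance(predicted, residual):
    return [[p + r for p, r in zip(rp, rr)]
            for rp, rr in zip(predicted, residual)]

original  = [[10, 20], [30, 40]]
predicted = [[ 9, 21], [28, 43]]
residual  = compute_residual(original, predicted)   # [[1, -1], [2, -3]]
enhanced  = enhance(predicted, residual)
# with lossless residual transport, enhanced equals original
```

In the actual system the residual is carried through lossy VVC coding, so the enhanced image approximates rather than exactly equals the original; the better the predictor, the smaller the residual and hence the third code stream.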
FIG. 6, using instance segmentation map information as the high-level semantic information, specifically illustrates how the image encoding system and image decoding system provided by embodiments of the present application implement the image reconstruction task and the instance segmentation task. It can be understood that embodiments of the present application may also use instance segmentation map information as high-level semantic information on the basis of the image encoding system shown in FIG. 2B and the image decoding system shown in FIG. 2C.

In other possible implementations, the high-level semantic information of the original image may also be stick-figure map information of the original image, used to support the image analysis task of human pose recognition. In this case, embodiments of the present application may replace the instance segmenter 610 in the architecture shown in FIG. 6 with a semantic extractor that supports extracting stick-figure map information, and replace the instance segmentation map information with stick-figure map information; the other processes may be understood with reference to the description of FIG. 6. Of course, embodiments of the present application may also use stick-figure map information as high-level semantic information on the basis of the image encoding system shown in FIG. 2B and the image decoding system shown in FIG. 2C.
在进一步的一些实施例中,图像编码系统111可设置多个语义提取器210,以对原始图像提取多个高层语义信息,从而支持不同类型的图像分析任务。图7示出了本申请实施例提供的图像编、解码系统的又一结构示意图。结合图6和图7所示,图7所示图像编码系统111可以设置多个语义提取器(例如图7所示n个语义提取器2101至210n)。In some further embodiments, the image coding system 111 may set multiple semantic extractors 210 to extract multiple high-level semantic information from the original image, so as to support different types of image analysis tasks. FIG. 7 shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application. As shown in FIG. 6 and FIG. 7 , the image encoding system 111 shown in FIG. 7 may be provided with multiple semantic extractors (for example, n semantic extractors 2101 to 210 n shown in FIG. 7 ).
多个语义提取器用于逐一对原始图像提取高层语义信息,以得到多个高层语义信息(例如高层语义信息1至n)。例如,语义提取器2101可提取原始图像的高层语义信息1,以此类推,语义提取器210n可提取原始图像的高层语义信息n。一个语义提取器提取的一个高层语义信息可用于支持原始图像的至少一类图像分析任务,例如,高层语义信息1至n中的任一高层语义信息可支持原始图像的一类或多类图像分析任务。Multiple semantic extractors are used to extract high-level semantic information from the original image one by one to obtain multiple high-level semantic information (for example, high-level semantic information 1 to n). For example, the semantic extractor 2101 can extract the high-level semantic information 1 of the original image, and so on, the semantic extractor 210n can extract the high-level semantic information n of the original image. A high-level semantic information extracted by a semantic extractor can be used to support at least one type of image analysis task of the original image, for example, any high-level semantic information in the high-level semantic information 1 to n can support one or more types of image analysis of the original image Task.
作为一种实现示例,多个高层语义信息可以包括实例分割图信息和火柴人图信息,其中,实例分割图信息可支持原始图像的实例分割、图像分类、物体检测等之中的一类或多类图像分析任务,火柴人图信息可支持原始图像的人体姿态识别任务。As an implementation example, the plurality of high-level semantic information may include instance segmentation map information and stickman map information, wherein the instance segmentation map information may support one or more of instance segmentation, image classification, and object detection of the original image. Similar to image analysis tasks, the stick figure information can support the human body pose recognition task of the original image.
在多个语义提取器对原始图像提取多个高层语义信息之后,FLIF编码器620可分别对多个高层语义信息进行无损编码,以得到多个第一码流。例如,在语义提取器2101至210n逐一对原始图像提取高层语义信息,得到高层语义信息1至n后,FLIF编码器620可分别对高层语义信息1至n进行无损编码,以生成第一码流1至n。After multiple semantic extractors extract multiple high-level semantic information from the original image, the FLIF encoder 620 may perform lossless encoding on the multiple high-level semantic information to obtain multiple first code streams. For example, after the semantic extractors 2101 to 210n extract high-level semantic information from the original image one by one to obtain the high-level semantic information 1 to n, the FLIF encoder 620 can perform lossless encoding on the high-level semantic information 1 to n respectively to generate the first code stream 1 to n.
在图像编码系统111中,多个高层语义信息中的一个高层语义信息(例如实例分割图信息、火柴人图信息等之中的一个高层语义信息)可导入预测器222(即,高层语义信息1至n中的一个高层语义信息可导入预测器222),特征提取器211提取原始图像的低层特征可导入预测器222。从而,预测器222可根据低层特征以及多个高层语义信息中的一个高层语义信息,进行图像重建,以得到预测图像。图像编码系统111端的其他实现过程可同理参照前文相应部分的描述,此处不再展开说明。In the image coding system 111, one high-level semantic information among multiple high-level semantic information (for example, one high-level semantic information among instance segmentation map information, stickman map information, etc.) can be imported into the predictor 222 (that is, high-level semantic information 1 One of the high-level semantic information in to n can be imported into the predictor 222), and the low-level features extracted by the feature extractor 211 of the original image can be imported into the predictor 222. Therefore, the predictor 222 can perform image reconstruction according to the low-level features and one high-level semantic information among multiple high-level semantic information, so as to obtain a predicted image. For other implementation processes at the image coding system 111 end, similarly, reference may be made to the description of the corresponding part above, and no further description is given here.
In some embodiments, the predictor 222 may also perform image reconstruction based on the low-level features and at least two pieces of the multiple pieces of high-level semantic information to obtain the predicted image. For example, while reconstructing the image from the low-level features and the instance segmentation map information, the predictor 222 may further introduce at least one additional piece of high-level semantic information so that the reconstructed predicted image is more accurate. As an optional implementation, the embodiments of the present application may feed the piece (or at least two pieces) of high-level semantic information with the highest semantic level into the predictor 222, so that the predictor 222 reconstructs the image from the low-level features and the highest-level semantic information, yielding a more accurate predicted image. In other possible implementations, any at least two pieces of the multiple pieces of high-level semantic information may be combined with the low-level features for image reconstruction of the original image; the semantic level of the high-level semantic information used for image reconstruction is not limited.
In the image encoding system 111 shown in FIG. 7, the multiple first code streams generated by encoding high-level semantic information 1 to n, the second code stream generated by encoding the low-level features, and the third code stream generated by encoding the residual information between the predicted image and the original image can be transmitted to the image decoding system 121.
In the image decoding system 121, for the image reconstruction task, the FLIF decoder 650 can decode any one of the multiple first code streams together with the second code stream to obtain one piece of high-level semantic information and the low-level features, which are fed into the predictor 222. The predictor 222 performs image reconstruction based on that high-level semantic information and the low-level features to obtain the predicted image. The image enhancer 224 then enhances the predicted image using the residual information that the VVC decoder 660 obtains by decoding the third code stream, producing an enhanced image and completing the image reconstruction task.
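The reconstruct-then-enhance flow can be sketched with toy arithmetic. The `predict` function below is a hypothetical stand-in for the learned predictor 222; a real system would use FLIF decoding for the first and second code streams and VVC decoding for the third, none of which is modeled here.

```python
def predict(low_features, semantic_map):
    # Hypothetical predictor: the semantic guidance modulates the
    # low-level detail to form a coarse predicted image.
    return [l + s for l, s in zip(low_features, semantic_map)]


def enhance(predicted, residual):
    # Image enhancer: add back the residual carried by the third code stream.
    return [p + r for p, r in zip(predicted, residual)]


# Encoder side: derive the residual that the third stream would carry.
original = [10, 20, 30, 40]
low      = [8, 17, 28, 36]   # as decoded from the second code stream
semantic = [1, 2, 1, 3]      # as decoded from one chosen first code stream
predicted = predict(low, semantic)
residual  = [o - p for o, p in zip(original, predicted)]

# Decoder side: in this toy setup the enhanced image matches the original
# exactly; with lossy residual coding it would only approximate it.
assert enhance(predicted, residual) == original
```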
For the image analysis task, the FLIF decoder 650 can select, based on the current image analysis task, the corresponding first code stream from the multiple first code streams and decode it, obtaining high-level semantic information suited to the current task and thereby supporting its execution. As an optional implementation, the receiving end 120 may obtain a task instruction indicated by the user; the task instruction indicates the current image analysis task that the receiving end needs to perform, so that the image decoding system 121 selects and decodes the first code stream corresponding to the high-level semantic information used by that task. In some further embodiments, the task instruction may indicate multiple current image analysis tasks, in which case the image decoding system 121 selects and decodes the first code stream corresponding to each task, obtaining the high-level semantic information suited to each current image analysis task.
In one example, if the current image analysis task is any one of instance segmentation, image classification, object detection, and so on, the embodiments of the present application decode the first code stream corresponding to the instance segmentation map information; if the current task is human pose recognition, they decode the first code stream corresponding to the stick-figure map information. The current image analysis tasks may be of one or more types, as determined by user requirements or system settings.
For multiple types of current image analysis tasks, the FLIF decoder 650 can select at least two of the multiple first code streams to decode, obtaining high-level semantic information suited to those tasks. In some further embodiments, the FLIF decoder 650 may also decode all of the first code streams so as to support multiple types of image analysis tasks simultaneously.
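The task-driven stream selection described above amounts to a mapping from analysis tasks to first code streams. A minimal sketch, with illustrative task names and stream identifiers that are not from the original:

```python
# Hypothetical mapping from analysis task to the first code stream whose
# high-level semantic information supports it.
TASK_TO_STREAM = {
    "instance_segmentation": "seg_map_stream",
    "image_classification":  "seg_map_stream",
    "object_detection":      "seg_map_stream",
    "pose_recognition":      "stick_figure_stream",
}


def select_streams(current_tasks):
    """Return the set of first code streams to decode for the given tasks;
    a task instruction naming several tasks may require several streams."""
    return {TASK_TO_STREAM[t] for t in current_tasks}


# One task needs one stream; mixed tasks need both.
assert select_streams(["object_detection"]) == {"seg_map_stream"}
assert select_streams(["object_detection", "pose_recognition"]) == {
    "seg_map_stream", "stick_figure_stream"}
```

Decoding only the selected subset is what saves work compared with always decoding every first code stream.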
The image encoding system and image decoding system provided by the embodiments of the present application can extract multiple pieces of high-level semantic information from the original image through multiple semantic extractors, thereby supporting multiple types of image analysis tasks on the original image; and any one of these pieces of high-level semantic information can be combined with the low-level features of the original image to support the image reconstruction task. A single image encoding system and its corresponding image decoding system can therefore unify the image reconstruction task with multiple types of image analysis tasks.
For image analysis at the receiving end, an embodiment of the present application provides an image analysis method. As an optional implementation, FIG. 8 exemplarily shows a flowchart of the image analysis method provided by the embodiment of the present application. The method flow may be performed by the receiving end and, as shown in FIG. 8, may include the following steps.
In step S80, based on the image analysis task of the original image, a target code stream is obtained from multiple code streams.
Based on the image encoding scheme provided by the embodiments of the present application, the sending end can transmit multiple code streams to the receiving end. In some embodiments, the multiple code streams include a first code stream corresponding to the high-level semantic information of the original image and a second code stream corresponding to the low-level features of the original image. In some further embodiments, the multiple code streams may further include a third code stream corresponding to the difference information between the predicted image and the original image. The number of first code streams may be at least one (that is, one or more first code streams); the at least one first code stream corresponds to at least one piece of high-level semantic information of the original image, and each piece of high-level semantic information corresponds to one first code stream.
After obtaining the multiple code streams, the receiving end can, based on the image analysis task currently to be performed on the original image, obtain from them a target code stream suited to that task. Since, in the embodiments of the present application, the high-level semantic information carried by a first code stream is used for the image analysis task of the original image, the target code stream is specifically obtained from the at least one first code stream among the multiple code streams: it is the first code stream whose high-level semantic information is applicable to the image analysis task.
In some embodiments, if at least two first code streams exist among the multiple code streams, the embodiments of the present application can obtain the target code stream from them based on the high-level semantic information that the image analysis task requires. For example, suppose the at least two first code streams include a first code stream corresponding to instance segmentation map information and a first code stream corresponding to stick-figure map information. If the image analysis task currently to be performed is any one of instance segmentation, image classification, and object detection, the first code stream corresponding to the instance segmentation map information is obtained as the target code stream; if the task is human pose recognition, the first code stream corresponding to the stick-figure map information is obtained as the target code stream.
In some embodiments, if only one first code stream exists among the multiple code streams, the image analysis task currently to be performed may be a preset, fixed image analysis task; for example, any one of instance segmentation, image classification, object detection, and human pose recognition may be set as the image analysis task fixedly performed by the receiving end. In this case, the single first code stream among the multiple code streams is used as the target code stream.
In step S81, the target code stream is decoded to obtain high-level semantic information suited to the image analysis task.
In some embodiments, the target code stream is losslessly decoded to obtain the high-level semantic information suited to the image analysis task.
In step S82, the image analysis task is performed according to the decoded high-level semantic information.
In some embodiments, based on the decoded high-level semantic information, the receiving end performs the image analysis task on the original image through specific image analysis logic (for example, logic that performs image classification, object detection, or instance segmentation).
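Steps S80 to S82 can be sketched as one small orchestration function. The `decode` and `task_logic` callables are hypothetical pluggable stand-ins for the lossless decoder and the receiving end's analysis logic; the stream container and task names are illustrative.

```python
def analyze(streams, task, decode, task_logic):
    """S80: select the target code stream for the task;
    S81: losslessly decode it into high-level semantic information;
    S82: run the task-specific analysis logic on that information."""
    target = streams[task]            # S80: obtain the target code stream
    semantic_info = decode(target)    # S81: lossless decoding
    return task_logic(semantic_info)  # S82: execute the analysis task


# Toy usage with trivial decoding and a trivial detector.
streams = {"object_detection": b"boxes"}
result = analyze(streams, "object_detection",
                 decode=lambda s: s.decode(),
                 task_logic=lambda info: f"detected:{info}")
assert result == "detected:boxes"
```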
In some embodiments, steps S80 and S81 may be performed by the image decoding system at the receiving end, and step S82 may be performed by the image analysis logic configured at the receiving end for executing image analysis tasks.
The image analysis method provided by the embodiments of the present application can perform specific image analysis tasks within an encoding and decoding scheme and framework that unifies the image reconstruction task and the image analysis task, thereby providing technical support for image analysis at the receiving end and remaining applicable across different types of image analysis tasks.
Compared with other image encoding and decoding schemes such as the VTM (VVC test model), the scheme provided by the embodiments of the present application achieves higher image reconstruction quality and better image analysis quality. In the VCM (Video Coding for Machines) standard test environment, comparing the scheme provided by the embodiments of the present application against the VTM scheme yields the comparison diagrams shown in FIG. 9A, FIG. 9B, FIG. 9C, and FIG. 9D. With the same images and the same QP used for quantization, the comparison proceeds as follows.
Image reconstruction quality comparison: images are compressed and reconstructed with the scheme of the embodiments of the present application and with the VTM scheme respectively, and SSIM (Structural Similarity) with the corresponding BPP (Bits Per Pixel), as well as PSNR (Peak Signal-to-Noise Ratio) with the corresponding BPP, are obtained. Plotting SSIM against BPP gives the comparison diagram in FIG. 9A; plotting PSNR against BPP gives the comparison diagram in FIG. 9B.
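The two rate-distortion quantities behind these plots can be computed directly. A minimal sketch (SSIM is omitted since it requires windowed local statistics); the pixel values and stream size are illustrative:

```python
import math


def psnr(orig, recon, peak=255):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel lists."""
    mse = sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)
    return 10 * math.log10(peak ** 2 / mse)


def bpp(stream_bits, width, height):
    """Bits per pixel of a compressed representation."""
    return stream_bits / (width * height)


# Toy rate-distortion point: every pixel off by 10, and a 2048-bit
# stream for a 64x64 image.
assert round(psnr([0, 0, 0, 0], [10, 10, 10, 10]), 2) == 28.13
assert bpp(2048, 64, 64) == 0.5
```

Sweeping QP traces out a (BPP, PSNR) curve per codec, which is what FIG. 9B compares.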
Performance comparison on machine-vision image analysis tasks: taking object detection as the image analysis task and mAP (a common metric for object detection) as the performance metric, the performance of the object detection task using the embodiments of the present application is compared with that obtained using VTM-compressed images, yielding mAP and the corresponding BPP; plotting mAP against BPP gives the comparison diagram in FIG. 9C. Similarly, taking instance segmentation as the image analysis task and mAP as the performance metric, the performance of the instance segmentation task using the embodiments of the present application is compared with that obtained using VTM-compressed images, yielding mAP and the corresponding BPP; plotting mAP against BPP gives the comparison diagram in FIG. 9D.
As can be seen from FIG. 9A, FIG. 9B, FIG. 9C, and FIG. 9D, at the same BPP the image encoding and decoding scheme provided by the embodiments of the present application achieves higher image reconstruction quality than the currently well-performing VTM scheme of VVC (for example, about 4 dB higher in PSNR), and at the same BPP it achieves better image analysis quality than VTM (for example, in mAP the scheme provided by the embodiments of the present application consistently surpasses VTM).
The image encoding and decoding scheme provided by the embodiments of the present application proposes a reasonable and efficient encoding and decoding framework that can effectively address the VCM problem of unifying user-vision-oriented image reconstruction tasks with machine-vision-oriented image analysis tasks.
In some application examples, the scheme provided by the embodiments of the present application can effectively unify image reconstruction tasks and image analysis tasks for images generated by intelligent hardware. With the rise of smart cities and deep learning, intelligent hardware generates massive numbers of images every day to better satisfy various information-interaction needs. These images may be presented to users for viewing (for example, for image or video surveillance), and may also be fed to machine-vision analysis systems for corresponding analysis and decision-making tasks (for example, license plate recognition and road planning in intelligent transportation systems; object detection and road tracking in autonomous driving systems; face and expression detection and analysis and abnormal-behavior detection in smart medical systems). A single encoding and decoding scheme and framework is therefore needed to unify the image reconstruction task and the image analysis task for images generated by intelligent hardware.
Taking a camera in an intelligent transportation system as an example of intelligent hardware, an application example of the scheme provided by the embodiments of the present application is introduced below. FIG. 10 shows an application example diagram provided by an embodiment of the present application. As shown in FIG. 10, the camera 910, as intelligent hardware with video capture capability, collects traffic video images. These images are encoded by the image encoding system 111 provided by the embodiments of the present application, which outputs the first code stream, the second code stream, and the third code stream to the traffic command center 920.
The traffic command center 920 can decode these streams with the image decoding system 121 provided by the embodiments of the present application, thereby reconstructing the traffic video images and outputting their high-level semantic information. The reconstructed traffic video images can be displayed on the monitoring screens of the traffic command center 920 for traffic video surveillance. The high-level semantic information output by the image decoding system 121 can be fed into the license plate recognition system of the traffic command center 920 to recognize the license plates of vehicles in the traffic video images.
Applying the scheme provided by the embodiments of the present application in an intelligent transportation system thus reconstructs traffic video images to satisfy video surveillance needs, while also performing image analysis on them to satisfy analysis needs such as vehicle license plate recognition. Of course, FIG. 10 is only one optional application example; the embodiments of the present application can be applied in any scenario that requires both image reconstruction tasks and image analysis tasks.
The embodiments of the present application are compatible with image coding for both user-vision and machine-vision tasks. The encoding and decoding scheme they provide can be widely used in the processing of image data in smart city systems, effectively improving the compression ratio of image data, reducing the network bandwidth burden, reducing the workload of cloud services, and reducing the storage consumption of image data, thereby lowering the operating cost of smart cities.
An embodiment of the present application further provides an electronic device, such as the sending end 110 or the receiving end 120. FIG. 11 shows a block diagram of the electronic device provided by an embodiment of the present application. As shown in FIG. 11, the electronic device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In the embodiments of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4.
Optionally, the communication interface 2 may be an interface of a communication module for network communication.
Optionally, the processor 1 may be a CPU (central processing unit), a GPU (Graphics Processing Unit), an NPU (neural processing unit), an FPGA (Field Programmable Gate Array), a TPU (tensor processing unit), an AI chip, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 3 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk memory.
The memory 3 stores one or more computer-executable instructions, and the processor 1 invokes the one or more computer-executable instructions to perform the image encoding method, the image decoding method, the image reconstruction method, or the image analysis method provided by the embodiments of the present application.
An embodiment of the present application further provides a storage medium that can store one or more computer-executable instructions which, when executed, implement the image encoding method, the image decoding method, the image reconstruction method, or the image analysis method provided by the embodiments of the present application.
The foregoing describes multiple embodiment solutions provided by the embodiments of the present application. The optional modes introduced in each embodiment solution may, where they do not conflict, be combined and cross-referenced with one another to derive further possible embodiment solutions, all of which are considered embodiment solutions disclosed by the embodiments of the present application.
Although the embodiments of the present application are disclosed above, the present application is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application shall be defined by the claims.
Claims (20)
- An image encoding method, comprising: obtaining an original image; extracting high-level semantic information of the original image, the high-level semantic information being used for an image analysis task of the original image; extracting low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image; encoding the high-level semantic information to generate a first code stream; and encoding the low-level features to generate a second code stream.
- The image encoding method according to claim 1, further comprising: performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image; determining difference information between the predicted image and the original image, the difference information being used to enhance the predicted image in the image reconstruction task; and encoding the difference information to generate a third code stream.
- The image encoding method according to claim 2, wherein performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain the predicted image comprises: using the high-level semantic information as guidance information for image reconstruction, reconstructing the specific image details expressed by the low-level features to obtain the predicted image.
- The image encoding method according to claim 3, wherein using the high-level semantic information as guidance information for image reconstruction and reconstructing the specific image details expressed by the low-level features to obtain the predicted image comprises: processing the low-level features to obtain target features for further processing with the high-level semantic information; stacking the target features with the high-level semantic information and performing convolution processing to obtain convolution features; and combining the convolution features with the low-level features to obtain the predicted image.
- The image encoding method according to claim 4, wherein processing the low-level features comprises: upsampling the low-level features; wherein stacking the target features with the high-level semantic information and performing convolution processing to obtain the convolution features comprises: stacking the target features with the high-level semantic information to obtain stacked features; performing convolution processing on the stacked features multiple times to obtain first convolution features; performing filtering processing on the first convolution features multiple times to obtain filtered features; performing convolution processing on the filtered features multiple times to obtain second convolution features; and outputting the second convolution features through an activation function as the convolution features; and wherein combining the convolution features with the low-level features to obtain the predicted image comprises: adding the convolution features to the low-level features to obtain the predicted image.
- The image encoding method according to claim 2, wherein determining the difference information between the predicted image and the original image comprises: performing residual processing on the predicted image and the original image to obtain residual information between the predicted image and the original image; wherein encoding the difference information to generate the third code stream comprises: performing lossy encoding on the residual information to generate the third code stream; wherein encoding the high-level semantic information to generate the first code stream comprises: performing lossless encoding on the high-level semantic information to generate the first code stream; and wherein encoding the low-level features to generate the second code stream comprises: performing lossless encoding on the low-level features to generate the second code stream.
- The image coding method according to claim 2, wherein the extracting of the low-level features of the original image comprises: determining, based on a channel normalization technique, the low-level features of the original image corresponding to each channel; and determining the low-level features of the original image according to the low-level features corresponding to each channel.
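The claim does not define the normalization; one common reading is per-channel standardisation, sketched below under that assumption:

```python
import numpy as np

def channel_normalize(image, eps=1e-6):
    """Per-channel normalisation of an (H, W, C) image:
    each channel is shifted to zero mean and scaled to unit variance.
    The per-channel maps together form the low-level features."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)
```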
- The image coding method according to any one of claims 2-7, wherein the extracting of the high-level semantic information of the original image comprises: extracting multiple pieces of high-level semantic information of the original image, where one piece of high-level semantic information is used for one or more classes of image analysis tasks of the original image; wherein the multiple pieces of high-level semantic information are encoded separately to obtain multiple first code streams, and the low-level features are combined with one piece, or with at least two pieces, of the multiple pieces of high-level semantic information to perform image reconstruction of the original image.
- The image coding method according to claim 2, wherein the high-level semantic information comprises instance segmentation map information or stick-figure map information; the instance segmentation map information is used for at least one of an instance segmentation task, an image classification task, and an object detection task of the original image, and the stick-figure map information is used for a human body pose recognition task of the original image.
- An image coding system, comprising: a semantic extractor, configured to extract high-level semantic information of an original image, the high-level semantic information being used for an image analysis task of the original image; a feature extractor, configured to extract low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image; a first encoder, configured to encode the high-level semantic information to generate a first code stream; and a second encoder, configured to encode the low-level features to generate a second code stream.
- The image coding system according to claim 10, further comprising: a predictor, configured to perform image reconstruction of the original image according to the low-level features and the high-level semantic information to obtain a predicted image; a comparator, configured to determine difference information between the predicted image and the original image, the difference information being used to enhance the predicted image in the image reconstruction task; and a third encoder, configured to encode the difference information to generate a third code stream.
- The image coding system according to claim 11, wherein the predictor comprises: an upsampler, configured to upsample the low-level features to obtain target features for further processing with the high-level semantic information; a stacker, configured to stack the high-level semantic information with the target features to obtain stacked features; a convolutional network, configured to perform convolution processing on the stacked features to obtain convolution features; and an adder, configured to add the low-level features and the convolution features to obtain a predicted image; the convolutional network comprises a first group of convolutional layers, multiple convolutional filter banks, a second group of convolutional layers, and an activation function, where the first group of convolutional layers comprises sequentially connected convolutional layers, one convolutional filter bank comprises multiple layers of convolutional filters, and the second group of convolutional layers comprises sequentially connected convolutional layers.
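The upsampler and stacker components above can be illustrated in isolation. Nearest-neighbour upsampling and channel-wise stacking are illustrative choices here; the claim does not fix the interpolation method:

```python
import numpy as np

def upsample_nearest(feat, factor=2):
    """Nearest-neighbour upsampling of an (H, W) feature map by `factor`."""
    return np.repeat(np.repeat(feat, factor, axis=0), factor, axis=1)

def stack_features(target_feat, semantic_map):
    """Stack target features and high-level semantic information channel-wise,
    producing a (2, H, W) input for the convolutional network."""
    assert target_feat.shape == semantic_map.shape
    return np.stack([target_feat, semantic_map], axis=0)
```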
- The image coding system according to claim 11, wherein the high-level semantic information comprises instance segmentation map information or stick-figure map information; the instance segmentation map information is used for at least one of an instance segmentation task, an image classification task, and an object detection task of the original image, and the stick-figure map information is used for a human body pose recognition task of the original image.
- An image decoding method, comprising: acquiring a first code stream corresponding to high-level semantic information of an original image, and a second code stream corresponding to low-level features of the original image; decoding the first code stream to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image; decoding the second code stream to obtain the low-level features; and performing image reconstruction of the original image according to the low-level features and the high-level semantic information to obtain a predicted image.
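The decoding steps above can be sketched end to end. This toy decoder assumes `zlib`-compressed streams and a trivial blending rule for "reconstruction"; the real reconstruction uses the predictor network, which the claim leaves unspecified:

```python
import zlib
import numpy as np

def decode_and_reconstruct(first_stream, second_stream, shape):
    # Decode the first code stream -> high-level semantic information.
    semantic = np.frombuffer(zlib.decompress(first_stream),
                             dtype=np.uint8).reshape(shape)
    # Decode the second code stream -> low-level features.
    low_level = np.frombuffer(zlib.decompress(second_stream),
                              dtype=np.uint8).reshape(shape)
    # Toy reconstruction: blend low-level features with the semantic map.
    predicted = low_level.astype(float) * 0.5 + semantic.astype(float) * 0.5
    return semantic, low_level, predicted
```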
- The image decoding method according to claim 14, further comprising: acquiring a third code stream corresponding to difference information between the predicted image and the original image; decoding the third code stream to obtain the difference information; and performing image enhancement processing on the predicted image according to the difference information to obtain an enhanced image for display.
- An image decoding system, comprising: a first decoder, configured to decode a first code stream corresponding to high-level semantic information of an original image to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image; a second decoder, configured to decode a second code stream corresponding to low-level features of the original image to obtain the low-level features; and a predictor, configured to perform image reconstruction of the original image according to the low-level features and the high-level semantic information to obtain a predicted image.
- The image decoding system according to claim 16, further comprising: a third decoder, configured to decode a third code stream corresponding to difference information between the predicted image and the original image to obtain the difference information; and an image enhancer, configured to perform image enhancement processing on the predicted image according to the difference information to obtain an enhanced image for display.
- An image reconstruction method, comprising: acquiring high-level semantic information and low-level features of an original image; performing image reconstruction of the original image according to the low-level features and the high-level semantic information to obtain a predicted image; and acquiring difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.
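The enhancement step, adding the decoded difference information back onto the predicted image, is the simplest piece to illustrate. The clipping to an 8-bit range is an assumption about the display pipeline, not part of the claim:

```python
import numpy as np

def enhance(predicted, difference):
    """Enhance the predicted image by adding the decoded difference
    (residual) information, clipping back to the 8-bit display range."""
    enhanced = predicted.astype(np.int16) + difference.astype(np.int16)
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```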
- An image analysis method, comprising: acquiring, based on an image analysis task of an original image, a target code stream from multiple code streams, where the multiple code streams comprise at least one first code stream corresponding to at least one piece of high-level semantic information of the original image and a second code stream corresponding to low-level features of the original image, one piece of high-level semantic information of the original image corresponds to one first code stream, and the target code stream is the first code stream, among the at least one first code stream, whose high-level semantic information is suitable for the image analysis task; decoding the target code stream to obtain the high-level semantic information suitable for the image analysis task; and performing the image analysis task according to the decoded high-level semantic information.
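The selection of the target code stream by task can be expressed as a simple lookup. The stream identifiers and task names below are hypothetical; the claim only requires that each first code stream's semantic information be matched against the requested analysis task:

```python
def select_target_stream(task, streams):
    """Pick the first code stream whose high-level semantic information
    supports `task`. `streams` maps a stream id to the set of analysis
    tasks that stream's semantic information serves (e.g. an instance
    segmentation map serves segmentation, classification and detection;
    a stick-figure map serves pose recognition)."""
    for stream_id, tasks in streams.items():
        if task in tasks:
            return stream_id
    raise KeyError(f"no code stream supports task {task!r}")
```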
- An electronic device, comprising at least one memory and at least one processor, the memory storing one or more computer-executable instructions, and the processor invoking the one or more computer-executable instructions to perform the image coding method according to any one of claims 1-9, or the image decoding method according to any one of claims 14-15, or the image reconstruction method according to claim 18, or the image analysis method according to claim 19.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110860294.5A CN113660486B (en) | 2021-07-28 | 2021-07-28 | Image coding, decoding, reconstructing and analyzing method, system and electronic equipment |
CN202110860294.5 | 2021-07-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023005740A1 true WO2023005740A1 (en) | 2023-02-02 |
Family
ID=78478912
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/106507 WO2023005740A1 (en) | 2021-07-28 | 2022-07-19 | Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113660486B (en) |
WO (1) | WO2023005740A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113660486B (en) * | 2021-07-28 | 2024-10-01 | 阿里巴巴(中国)有限公司 | Image coding, decoding, reconstructing and analyzing method, system and electronic equipment |
CN117831545A (en) * | 2022-09-29 | 2024-04-05 | 抖音视界有限公司 | Encoding method, decoding method, encoder, decoder, electronic device, and storage medium |
US20240236346A1 (en) * | 2023-01-11 | 2024-07-11 | Tencent America LLC | Efficient neural network module for image compression |
CN116847091B (en) * | 2023-07-18 | 2024-04-26 | 华院计算技术(上海)股份有限公司 | Image coding method, system, equipment and medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062754A (en) * | 2018-01-19 | 2018-05-22 | 深圳大学 | Segmentation, recognition methods and device based on dense network image |
CN111292330A (en) * | 2020-02-07 | 2020-06-16 | 北京工业大学 | Image semantic segmentation method and device based on coder and decoder |
US20200218948A1 (en) * | 2019-01-03 | 2020-07-09 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Thundernet: a turbo unified network for real-time semantic segmentation |
CN111901610A (en) * | 2020-08-03 | 2020-11-06 | 西北工业大学 | Parallel image description method based on multilayer encoder |
CN112004085A (en) * | 2020-08-14 | 2020-11-27 | 北京航空航天大学 | Video coding method under guidance of scene semantic segmentation result |
CN112750175A (en) * | 2021-01-12 | 2021-05-04 | 山东师范大学 | Image compression method and system based on octave convolution and semantic segmentation |
CN113660486A (en) * | 2021-07-28 | 2021-11-16 | 阿里巴巴(中国)有限公司 | Image coding, decoding, reconstructing and analyzing method, system and electronic equipment |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10832084B2 (en) * | 2018-08-17 | 2020-11-10 | Nec Corporation | Dense three-dimensional correspondence estimation with multi-level metric learning and hierarchical matching |
CN110147864B (en) * | 2018-11-14 | 2022-02-22 | 腾讯科技(深圳)有限公司 | Method and device for processing coding pattern, storage medium and electronic device |
CN111953989A (en) * | 2020-07-21 | 2020-11-17 | 重庆邮电大学 | Image compression method and device based on combination of user interaction and semantic segmentation technology |
CN112866715B (en) * | 2021-01-06 | 2022-05-13 | 中国科学技术大学 | Universal video compression coding system supporting man-machine hybrid intelligence |
Non-Patent Citations (1)
Title |
---|
AKBARI MOHAMMAD; LIANG JIE; HAN JINGNING: "DSSLIC: Deep Semantic Segmentation-based Layered Image Compression", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 12 May 2019 (2019-05-12), pages 2042 - 2046, XP033566098, DOI: 10.1109/ICASSP.2019.8683541 * |
Also Published As
Publication number | Publication date |
---|---|
CN113660486A (en) | 2021-11-16 |
CN113660486B (en) | 2024-10-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023005740A1 (en) | Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device | |
Duan et al. | Video coding for machines: A paradigm of collaborative compression and intelligent analytics | |
CN110300977B (en) | Method for image processing and video compression | |
US11410275B2 (en) | Video coding for machine (VCM) based system and method for video super resolution (SR) | |
US20120057640A1 (en) | Video Analytics for Security Systems and Methods | |
WO2021164216A1 (en) | Video coding method and apparatus, and device and medium | |
CN110798690A (en) | Video decoding method, and method, device and equipment for training loop filtering model | |
CN112954398B (en) | Encoding method, decoding method, device, storage medium and electronic equipment | |
CN116803079A (en) | Scalable coding of video and related features | |
CN115442609A (en) | Characteristic data encoding and decoding method and device | |
CN116437102B (en) | Method, system, equipment and storage medium for learning universal video coding | |
US11095901B2 (en) | Object manipulation video conference compression | |
CN117441333A (en) | Configurable location for inputting auxiliary information of image data processing neural network | |
CN116508320A (en) | Chroma subsampling format processing method in image decoding based on machine learning | |
CN118872266A (en) | Video decoding method based on multi-mode processing | |
CN114581460A (en) | Image processing, model training and live broadcast room background switching method | |
Chen et al. | A new image codec paradigm for human and machine uses | |
CN112383778B (en) | Video coding method and device and decoding method and device | |
WO2023193629A1 (en) | Coding method and apparatus for region enhancement layer, and decoding method and apparatus for area enhancement layer | |
TW202420815A (en) | Parallel processing of image regions with neural networks, decoding, post filtering, and rdoq | |
WO2023133889A1 (en) | Image processing method and apparatus, remote control device, system and storage medium | |
CN116847087A (en) | Video processing method and device, storage medium and electronic equipment | |
WO2023133888A1 (en) | Image processing method and apparatus, remote control device, system, and storage medium | |
CN114240750A (en) | Video resolution improving method and device, storage medium and electronic equipment | |
Sehli et al. | WeLDCFNet: Convolutional Neural Network based on Wedgelet Filters and Learnt Deep Correlation Features for depth maps features extraction |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22848345; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22848345; Country of ref document: EP; Kind code of ref document: A1