CN113066019A - Image enhancement method and related device - Google Patents
- Publication number
- CN113066019A (application CN202110221939.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- features
- tone
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Abstract
The embodiment of the application discloses an image enhancement method applied to the field of artificial intelligence, including the following steps: a terminal inputs a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, where the one or more groups of intermediate features are related to the hue component of an enhanced image of the first image; the terminal outputs a second image through the processing of a second network, where the second image is an enhanced image of the first image, and the input of the second network includes the first image and the one or more groups of intermediate features. Because the hue component fully characterizes the difference between the reflection layer and the transmission layer, the method can locate and separate reflection regions in the image and effectively remove reflections from the image.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image enhancement method and a related apparatus.
Background
With the popularization of terminal devices such as smartphones and tablet computers, taking pictures and shooting videos with these devices has become very common. Because shooting conditions are often constrained, when a user takes a picture through glass or a window, the captured image often exhibits undesirable reflections. Such reflections frequently cause distortion, occlusion, or blurring of the background scene, thereby degrading the image quality. The reduction in image quality affects the visual experience of the user and can also severely impact subsequent computer vision tasks (e.g., object detection and semantic segmentation). Therefore, eliminating reflections in an image and restoring the original image content can greatly improve imaging quality and improve the accuracy of downstream computer vision tasks.
At present, reflection removal methods in the related art rely on manually designed prior information, such as gradient sparsity, defocus-blur characteristics, and depth-of-field cues. Because such hand-crafted priors only account for reflection characteristics in specific scenes, reflection removal based on this prior information achieves good results only in those scenes. Once the actual scene violates the established assumptions, these methods cannot correctly remove the reflections in the image. Therefore, a method that can effectively remove reflections from images is needed.
Disclosure of Invention
The application provides an image enhancement method and a related device. Intermediate features extracted by a first network are acquired, where the intermediate features are related to the hue component of an enhanced image of the input image, and the intermediate features are fed into a second network that performs the image enhancement. Because the hue component fully characterizes the difference between the reflection layer and the transmission layer, introducing it into the second network allows the reflection region in the image to be located and separated more accurately during image enhancement, so that reflections in the image are effectively removed.
A first aspect of the present application provides an image enhancement method, including: the terminal inputs the first image into a first network to obtain one or more groups of intermediate features extracted by the first network. The first network may be a trained network model and is used to predict a hue image of an enhanced image of the first image. The hue image includes the hue component of the hue, saturation, value (HSV) color space. The enhanced image of the first image is specifically an image obtained by performing image enhancement on the first image. Thus, the one or more sets of intermediate features extracted by the first network are related to the hue component of the enhanced image of the first image. The hue component is the component that indicates color information in the HSV color space.
The terminal then outputs a second image through the processing of the second network, where the second image is an enhanced image of the first image and the input of the second network includes the first image and the one or more sets of intermediate features. Specifically, after obtaining the one or more groups of intermediate features extracted by the first network, the terminal feeds the first image into the second network as its original input, and feeds the one or more groups of intermediate features extracted by the first network into the second network as its intermediate input.
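For illustration only, the following is a minimal sketch of this inference flow in PyTorch-style code, under the assumption that the two networks are available as modules; the names `hue_net` and `enhance_net` and their interfaces are hypothetical and are not prescribed by this application.

```python
import torch

def enhance(first_image: torch.Tensor, hue_net, enhance_net) -> torch.Tensor:
    """first_image: (N, 3, H, W) RGB tensor with values in [0, 1]."""
    # First network: trained to predict the hue image of the enhanced result;
    # along the way it exposes one or more groups of intermediate features.
    intermediate_feats = hue_net.extract_features(first_image)  # list of feature tensors (assumed interface)
    # Second network: takes the original image as its original input and the
    # intermediate features as its intermediate input, and outputs the
    # enhanced (e.g., reflection-free) second image.
    second_image = enhance_net(first_image, intermediate_feats)
    return second_image
```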
According to this scheme, the intermediate features extracted by the first network are acquired, where the intermediate features are related to the hue component of the enhanced image of the input image, and the intermediate features are input into the second network that performs the image enhancement. Because the hue component fully characterizes the difference between the reflection layer and the transmission layer, the reflection regions in the image can be located and separated more accurately during image enhancement, and reflections in the image are effectively removed.
In a possible implementation manner, the inputting, by the terminal, of a first image into a first network to obtain one or more sets of intermediate features extracted by the first network includes: the terminal extracting the one or more groups of intermediate features through one or more serially connected first convolution layers in the first network. That is, the first network includes one or more serially connected first convolution layers, and each first convolution layer is used for extracting a set of intermediate features, so that multiple sets of intermediate features with different scales are obtained.
In one possible implementation, the first convolution layer comprises a plurality of serially connected convolution layers. For example, the first convolution layer may comprise a plurality of serially connected dilated (atrous) convolution layers.
Because the hue image exhibits salient reflection characteristics while containing relatively few details, building the first convolution layer from serially connected dilated convolution layers quickly improves the multi-scale representation ability of this hierarchical structure and helps localize the reflection region spatially.
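As a non-limiting sketch, one way to realize such a "first convolution layer" built from serially connected dilated convolution layers is shown below (PyTorch); the channel count and the dilation rates 1/2/4 are illustrative assumptions.

```python
import torch.nn as nn

class SerialDilatedBlock(nn.Module):
    """One 'first convolution layer': several dilated conv layers in series.
    Channel count and dilation rates (1, 2, 4) are illustrative assumptions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # The output serves as one group of intermediate features.
        return self.body(x)
```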
In a possible implementation manner, the outputting, by the terminal through the processing of the second network, the second image specifically includes: extracting the features of the first image through the second network; fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features; processing the one or more sets of fused features through the second network to obtain the second image.
In the scheme, the intermediate features extracted by the first network are fused with the features of the first image extracted by the second network through the second network, so that the hue components representing the difference between the reflection layer and the transmission layer can be effectively fused, the attention of the second network to the significant reflection region is enhanced, and a more accurate optimization direction is provided for network training.
In one possible implementation manner, the fusing, by the second network, of the features of the first image and the one or more sets of intermediate features to obtain fused features includes: fusing the features of the first image and the one or more groups of intermediate features through one or more serially connected feature units in the second network to obtain one or more groups of fused features. Each of the one or more feature units comprises a second convolution layer and a feature fusion module, where the input of the second convolution layer in each feature unit is the fused feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features together with the output of the second convolution layer in the same feature unit, and each feature fusion module in the one or more feature units corresponds to a respective set of intermediate features.
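The following sketch illustrates one possible form of such a feature unit (PyTorch). Concatenation followed by a 1x1 convolution is an assumed fusion strategy; the application only requires that each unit fuse its corresponding group of intermediate features with the output of its second convolution layer.

```python
import torch
import torch.nn as nn

class FeatureUnit(nn.Module):
    """One feature unit of the second network: a 'second convolution layer'
    followed by a feature fusion module (fusion by concat + 1x1 conv is an
    assumption)."""
    def __init__(self, channels: int = 64, inter_channels: int = 64):
        super().__init__()
        self.second_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(channels + inter_channels, channels, kernel_size=1)

    def forward(self, prev_fused: torch.Tensor, inter_feat: torch.Tensor) -> torch.Tensor:
        x = self.second_conv(prev_fused)                     # features of the first image
        return self.fuse(torch.cat([x, inter_feat], dim=1))  # fused features for the next unit
```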
In the scheme, one or more groups of intermediate features extracted by the first network are fused with one or more features of the first image extracted by the second network through the second network, so that the hue components representing the difference between the reflection layer and the transmission layer can be effectively fused, the emphasis of the second network on the significant reflection area is enhanced, and a more accurate optimization direction is provided for network training.
In one possible implementation, the second convolution layer comprises a plurality of convolution layers connected in parallel. For example, the second convolution layer may comprise a plurality of parallel dilated (atrous) convolution layers.
In this scheme, by using parallel dilated convolution layers to extract image features, the second convolution layer can effectively recover a clean, detail-rich transmission layer from the image.
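A minimal sketch of a "second convolution layer" composed of parallel dilated convolution layers is given below (PyTorch); the dilation rates and the 1x1 merge convolution are assumptions.

```python
import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """A 'second convolution layer' built from parallel dilated convolutions.
    Dilation rates are illustrative assumptions."""
    def __init__(self, channels: int = 64, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates])
        self.merge = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        # Each parallel branch sees a different receptive field; the branch
        # outputs are concatenated and merged back to the original width.
        return self.merge(torch.cat([b(x) for b in self.branches], dim=1))
```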
In a possible implementation manner, the second network may further include a convolutional block attention module, where the convolutional block attention module is connected to the first feature unit in the second network; the convolutional block attention module is configured to extract features of the first image, and the first feature unit in the second network performs further feature extraction on the features extracted by the convolutional block attention module. In short, the convolutional block attention module is located before the first second feature extraction module in the second network: after the convolutional block attention module extracts features from the image input into the second network, the second feature extraction module further processes the extracted features.
In this scheme, attention maps over the spatial and channel dimensions are successively inferred by the convolutional block attention module in the second network, so that the features can be adaptively refined and the image de-reflection effect is improved.
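For reference, a compact sketch of a convolutional block attention module (channel attention followed by spatial attention) is shown below (PyTorch); the reduction ratio and kernel size are conventional choices, not values prescribed by this application.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        # Shared MLP applied to average- and max-pooled channel descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Spatial attention map from channel-wise average and max projections.
        desc = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(desc))

class ConvBlockAttention(nn.Module):
    """Channel attention followed by spatial attention applied to the input features."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```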
In one possible implementation, the method further includes: the terminal processes the first image to obtain a tone image corresponding to the first image, wherein the tone image comprises a tone component of an HSV color space; and the terminal inputs the tone image corresponding to the first image into the first network so as to obtain the one or more groups of intermediate features together with the first image through the processing of the first network.
In one possible implementation, the method further includes: the terminal inputting the first image into the first network to obtain a tone image of the first image; and inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, where the tone image comprises the hue component of the HSV color space. That is, after the first image is input into the first network, the first network can convert the first image into a tone image and obtain the one or more sets of intermediate features by processing the first image together with its corresponding tone image.
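For illustration, the hue component of the HSV color space can be computed from an RGB image as sketched below (PyTorch); this is the standard RGB-to-HSV hue formula and only one possible way to obtain the tone image.

```python
import torch

def rgb_to_hue(img: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Hue channel of the HSV color space for an (N, 3, H, W) RGB tensor
    with values in [0, 1]; returns hue normalized to [0, 1], shape (N, H, W)."""
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    maxc, _ = img.max(dim=1)
    minc, _ = img.min(dim=1)
    delta = maxc - minc
    rc = (maxc - r) / (delta + eps)
    gc = (maxc - g) / (delta + eps)
    bc = (maxc - b) / (delta + eps)
    hue = torch.zeros_like(maxc)          # hue is 0 where the pixel is achromatic
    mask = delta > eps
    hue = torch.where(mask & (maxc == r), bc - gc, hue)
    hue = torch.where(mask & (maxc == g), 2.0 + rc - bc, hue)
    hue = torch.where(mask & (maxc == b), 4.0 + gc - rc, hue)
    return (hue / 6.0) % 1.0              # add .unsqueeze(1) to use as a 1-channel image
```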
In one possible implementation, the input to the second network further comprises semantic features of the first image. The semantic features may specifically refer to high-level features, in contrast to low-level features, which are detailed features such as specific contours, edges, colors, textures, and shapes in the image. High-level features describe objects in the image and can be used to identify and classify those objects.
In this scheme, introducing semantic features improves the semantic understanding of the second network and safeguards the quality of the de-reflected image at a higher level, for example ensuring that the image exhibits no color cast and that large reflection areas are removed, which ultimately improves the overall quality of the de-reflection result.
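One plausible way to obtain such semantic features is to take high-level activations from a pretrained classification backbone, as sketched below; the choice of VGG-19 and of the layer are assumptions, not requirements of this application.

```python
import torch
import torchvision.models as models

# Assumption: semantic (high-level) features taken from a pretrained VGG-19;
# this application does not fix which backbone or layer is used.
vgg_features = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
semantic_extractor = torch.nn.Sequential(*list(vgg_features.children())[:36])  # up to relu5_4

@torch.no_grad()
def semantic_features(first_image: torch.Tensor) -> torch.Tensor:
    # first_image: (N, 3, H, W), ImageNet-normalized
    return semantic_extractor(first_image)  # roughly (N, 512, H/16, W/16)
```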
In one possible implementation, the method is used to implement at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
A second aspect of the present application provides a method for training a network model, including: obtaining a pair of image samples, the pair of image samples comprising a first image and an enhanced image of the first image; inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space; outputting, by processing of the second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features; obtaining a first loss function from the enhanced image and the second image of the pair of image samples, the first loss function being indicative of a difference between the enhanced image and the second image of the pair of image samples; and training the first network and the second network according to the first loss function to obtain the trained first network and the trained second network.
In one possible implementation, the method further includes: acquiring a tone image of the second image and a tone image of an enhanced image in the image sample pair; obtaining a second loss function from the tone image of the second image and the tone image of the enhanced image in the pair of image samples, the second loss function indicating a difference between the tone image of the second image and the tone image of the enhanced image in the pair of image samples; the training the first network and the second network according to the first loss function includes: training the first network and the second network according to the first loss function and the second loss function.
In one possible implementation, the method further includes: obtaining a tone image of the enhanced image in the image sample pair and a third image output by the first network; and obtaining a third loss function from the tone image of the enhanced image in the pair of image samples and the third image, the third loss function indicating a difference between the tone image of the enhanced image in the pair of image samples and the third image. The training of the first network and the second network according to the first loss function and the second loss function includes: training the first network and the second network according to the first loss function, the second loss function, and the third loss function.
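A minimal training-step sketch combining the three losses described above is given below (PyTorch). The use of L1 distances, equal loss weights, the assumed `hue_net` output interface, and the `rgb_to_hue` helper sketched earlier are all assumptions; the application only specifies what each loss function measures.

```python
import torch
import torch.nn.functional as F

def training_step(first_image, gt_enhanced, hue_net, enhance_net, optimizer,
                  w1=1.0, w2=1.0, w3=1.0):
    """One training step. Assumes hue_net returns (intermediate features,
    predicted tone image) and that rgb_to_hue is available."""
    inter_feats, third_image = hue_net(first_image)        # third image: predicted tone image, (N, 1, H, W)
    second_image = enhance_net(first_image, inter_feats)   # predicted enhanced image

    loss1 = F.l1_loss(second_image, gt_enhanced)                          # first loss function
    loss2 = F.l1_loss(rgb_to_hue(second_image), rgb_to_hue(gt_enhanced))  # second loss function
    loss3 = F.l1_loss(third_image, rgb_to_hue(gt_enhanced).unsqueeze(1))  # third loss function

    loss = w1 * loss1 + w2 * loss2 + w3 * loss3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```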
In one possible implementation, the inputting the first image into a first network to obtain one or more sets of intermediate features extracted by the first network includes: extracting the one or more sets of intermediate features through one or more first convolution layers in series in the first network.
In one possible implementation, the first convolutional layer comprises a plurality of layers of serially connected convolutional layers.
In one possible implementation, the outputting, by the processing of the second network, the second image includes: extracting the features of the first image through the second network; fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features; processing the one or more sets of fused features through the second network to obtain the second image.
In one possible implementation manner, the fusing, by the second network, of the features of the first image and the one or more sets of intermediate features to obtain fused features includes: fusing the features of the first image and the one or more groups of intermediate features through one or more serially connected feature units in the second network to obtain one or more groups of fused features; each of the one or more feature units comprises a second convolution layer and a feature fusion module, where the input of the second convolution layer in each feature unit is the fused feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features together with the output of the second convolution layer in the same feature unit, and each feature fusion module in the one or more feature units corresponds to a respective set of intermediate features.
In one possible implementation, the second convolutional layer comprises a plurality of convolutional layers connected in parallel.
In one possible implementation, the method further includes: processing the first image to obtain a tone image corresponding to the first image, wherein the tone image comprises a tone component of an HSV color space; and inputting the tone image corresponding to the first image into the first network, so as to obtain the one or more groups of intermediate features through the processing of the first network together with the first image.
In one possible implementation, the method further includes: inputting the first image into a first network to obtain a tone image of the first image; inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, wherein the tone image comprises tone components of an HSV color space.
In one possible implementation, the input to the second network further comprises semantic features of the first image.
In one possible implementation, the method is used to implement at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
A third aspect of the present application provides an image processing apparatus comprising: and a processing unit. The processing unit is configured to: inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space; outputting, by processing of the second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features.
In a possible implementation manner, the processing unit is further configured to extract the one or more sets of intermediate features through one or more first convolution layers in series in the first network.
In one possible implementation, the first convolutional layer comprises a plurality of layers of serially connected convolutional layers.
In one possible implementation, the processing unit is further configured to: extracting the features of the first image through the second network; fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features; processing the one or more sets of fused features through the second network to obtain the second image.
In one possible implementation, the features of the first image and the one or more sets of intermediate features are fused by one or more serially connected feature units in the second network to obtain one or more sets of fused features; each of the one or more feature units comprises a second convolution layer and a feature fusion module, where the input of the second convolution layer in each feature unit is the fused feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features together with the output of the second convolution layer in the same feature unit, and each feature fusion module in the one or more feature units corresponds to a respective set of intermediate features.
In one possible implementation, the second convolutional layer comprises a plurality of convolutional layers connected in parallel.
In one possible implementation, the processing unit is further configured to: processing the first image to obtain a tone image corresponding to the first image, wherein the tone image comprises a tone component of an HSV color space; and inputting the tone image corresponding to the first image into the first network, so as to obtain the one or more groups of intermediate features through the processing of the first network together with the first image.
In one possible implementation, the processing unit is further configured to: inputting the first image into a first network to obtain a tone image of the first image; inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, wherein the tone image comprises tone components of an HSV color space.
In one possible implementation, the input to the second network further comprises semantic features of the first image.
In one possible implementation, the apparatus is configured to perform at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
The present application in a fourth aspect provides a training apparatus comprising: an acquisition unit and a processing unit. The acquisition unit is used for acquiring an image sample pair, and the image sample pair comprises a first image and an enhanced image of the first image; the processing unit is configured to: inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space; outputting, by processing of the second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features; obtaining a first loss function from the enhanced image and the second image of the pair of image samples, the first loss function being indicative of a difference between the enhanced image and the second image of the pair of image samples; and training the first network and the second network according to the first loss function to obtain the trained first network and the trained second network.
In a possible implementation, the obtaining unit is further configured to obtain a tone image of the second image and a tone image of the enhanced image in the pair of image samples; the obtaining unit is further configured to obtain a second loss function from the tone image of the second image and the tone image of the enhanced image in the image sample pair, the second loss function indicating a difference between the tone image of the second image and the tone image of the enhanced image in the image sample pair; the processing unit is further configured to train the first network and the second network according to the first loss function and the second loss function.
In a possible implementation manner, the obtaining unit is further configured to obtain a tone image of the enhanced image in the image sample pair and a third image output by the first network; the obtaining unit is further configured to obtain a third loss function from the tone image of the enhanced image in the pair of image samples and the third image, the third loss function indicating a difference between the tone image of the enhanced image in the pair of image samples and the third image; the processing unit is further configured to train the first network and the second network according to the first loss function, the second loss function, and the third loss function.
In a possible implementation manner, the processing unit is further configured to extract the one or more sets of intermediate features through one or more first convolution layers in series in the first network.
In one possible implementation, the first convolutional layer comprises a plurality of layers of serially connected convolutional layers.
In a possible implementation manner, the processing unit is further configured to extract features of the first image through the second network; fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features; processing the one or more sets of fused features through the second network to obtain the second image.
In a possible implementation manner, the processing unit is further configured to fuse the features of the first image and the one or more sets of intermediate features through one or more serially connected feature units in the second network to obtain one or more sets of fused features; each of the one or more feature units comprises a second convolution layer and a feature fusion module, where the input of the second convolution layer in each feature unit is the fused feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features together with the output of the second convolution layer in the same feature unit, and each feature fusion module in the one or more feature units corresponds to a respective set of intermediate features.
In one possible implementation, the second convolutional layer comprises a plurality of convolutional layers connected in parallel.
In a possible implementation manner, the processing unit is further configured to process the first image to obtain a hue image corresponding to the first image, where the hue image includes hue components of an HSV color space; and inputting the tone image corresponding to the first image into the first network, so as to obtain the one or more groups of intermediate features through the processing of the first network together with the first image.
In a possible implementation, the processing unit is further configured to input the first image into the first network to obtain a tone image of the first image; inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, wherein the tone image comprises tone components of an HSV color space.
In one possible implementation, the input to the second network further comprises semantic features of the first image.
In one possible implementation, the apparatus is configured to perform at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
A fifth aspect of the present application provides an image processing apparatus, which may include a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the first or second aspect. For the steps performed by the processor in each possible implementation manner of the first aspect or the second aspect, reference may be made to the first aspect and the second aspect specifically; details are not described here.
A sixth aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method of the first or second aspect.
A seventh aspect of the present application provides circuitry comprising processing circuitry configured to perform the method of the first or second aspect.
An eighth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first or second aspect.
A ninth aspect of the present application provides a chip system, which includes a processor, configured to enable a server or a threshold value obtaining apparatus to implement the functions referred to in the first aspect or the second aspect, for example, to send or process data and/or information referred to in the method. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the server or the communication device. The chip system may be formed by a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework;
FIG. 2 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application;
FIG. 4 is a diagram illustrating a system architecture according to an embodiment of the present application;
FIG. 5a is a schematic diagram illustrating comparison between before and after reflection removal for an image according to an embodiment of the present disclosure;
FIG. 5b is a schematic diagram illustrating comparison between before and after shadow removal for an image according to an embodiment of the present application;
fig. 6 is a schematic flowchart of an image enhancement method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a model of an HSV color space provided in an embodiment of the present application;
FIG. 8a is a schematic diagram illustrating a comparison between an RGB image without reflection and a tone image according to an embodiment of the present disclosure;
FIG. 8b is a schematic diagram illustrating a comparison between an RGB image with reflection phenomenon and a tone image according to an embodiment of the present disclosure;
fig. 8c is a schematic structural diagram of a first network and a second network according to an embodiment of the present application;
fig. 9a is a schematic structural diagram of a first feature extraction module according to an embodiment of the present disclosure;
fig. 9b is a schematic structural diagram of a second feature extraction module according to an embodiment of the present disclosure;
fig. 9c is a schematic structural diagram of a feature fusion module according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a convolutional block attention module according to an embodiment of the present disclosure;
fig. 11a is a schematic structural diagram of a first network according to an embodiment of the present application;
fig. 11b is a schematic structural diagram of a second network according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an image antireflection network according to an embodiment of the present disclosure;
FIG. 13 is a schematic diagram illustrating a comparison between the input and output of an image dereflection network according to an embodiment of the present disclosure;
fig. 14 is a schematic flowchart of a method for training a network model according to an embodiment of the present disclosure;
FIG. 15 is a schematic distance metric chart of a tone image according to an embodiment of the present application;
FIG. 16 is a diagram illustrating a comparison of network de-reflection effects provided by embodiments of the present application;
FIG. 17 is a comparison of the reflection reduction effect introduced into different networks according to the embodiment of the present application;
FIG. 18 is a schematic diagram illustrating a comparison of the reflection reduction effects of different image reflection reduction methods according to an embodiment of the present disclosure;
FIG. 19 is a schematic diagram illustrating a comparison between image reflection reduction effects provided by embodiments of the present application;
FIG. 20 is a schematic diagram illustrating a comparison of image de-reflection effects provided by an embodiment of the present application;
fig. 21 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 22 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 23 is a schematic structural diagram of an execution device according to an embodiment of the present application;
fig. 24 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system will be described first. Please refer to fig. 1, which shows a schematic structural diagram of an artificial intelligence main framework; the framework is explained below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the general process from data acquisition to processing, for example the processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technologies for providing and processing data) to the industrial ecology of the system.
(1) An infrastructure.
The infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) And (4) data.
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) And (6) data processing.
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) Universal capability.
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent products and industrial applications.
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, turning intelligent information decision-making into products and realizing practical deployment. The main application fields include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The method provided by the application is described from the model training side and the model application side as follows:
the model training method provided by the embodiment of the application can be particularly applied to data processing methods such as data training, machine learning and deep learning, symbolic and formal intelligent information modeling, extraction, preprocessing, training and the like are carried out on training data, and a trained neural network model (such as a target neural network model in the embodiment of the application) is finally obtained; and the target neural network model can be used for model reasoning, and specifically, input data can be input into the target neural network model to obtain output data.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) A neural network.
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes xs (i.e., input data) and an intercept of 1 as inputs, and whose output may be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field may be a region composed of several neural units.
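As a small numerical illustration of the formula above (not part of the application), a single neural unit with a sigmoid activation can be written as:

```python
import numpy as np

def neuron(xs: np.ndarray, Ws: np.ndarray, b: float) -> float:
    """Single neural unit: weighted sum of the inputs plus the bias b,
    passed through a sigmoid activation function f."""
    z = float(np.dot(Ws, xs)) + b
    return 1.0 / (1.0 + np.exp(-z))

# Example: three inputs with weights and a bias.
print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), b=0.2))
```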
(2) Convolutional Neural Networks (CNNs) are deep neural networks with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving a trainable filter with an input image or a convolved feature plane (feature map). A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal (for example, the first convolution layer and the second convolution layer in the present embodiment). In a convolutional layer of a convolutional neural network, one neuron may be connected to only some of the neurons in the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights may be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all locations in the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
Specifically, as shown in fig. 2, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
The structure formed by the convolutional layer/pooling layer 120 and the neural network layer 130 may be a first convolutional layer and a second convolutional layer described in this application, the input layer 110 is connected to the convolutional layer/pooling layer 120, the convolutional layer/pooling layer 120 is connected to the neural network layer 130, the output of the neural network layer 130 may be input to the active layer, and the active layer may perform nonlinear processing on the output of the neural network layer 130.
Convolutional layer/pooling layer 120. Convolutional layers: as shown in FIG. 2, the convolutional layer/pooling layer 120 may include, for example, layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or may be used as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same dimensions are applied. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image. The dimensions of the multiple weight matrices are the same, so the feature maps extracted by these weight matrices also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
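As a small, self-contained illustration of the description above (the numbers are chosen arbitrarily, not taken from this application), 16 convolution kernels slide over a 3-channel input with stride 1, and their outputs are stacked along the depth dimension:

```python
import torch
import torch.nn as nn

# 16 weight matrices (kernels) of size 3x3 applied to a 3-channel input;
# each kernel produces one output channel, and the 16 outputs are stacked
# along the depth dimension of the resulting feature map.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 224, 224)   # one RGB image
feature_map = conv(image)
print(feature_map.shape)              # torch.Size([1, 16, 224, 224])
```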
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract more complex features, such as features with high-level semantics, and features with higher-level semantics are more suitable for the problem to be solved.
A pooling layer: since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. That is, in the layers 121-126 illustrated by 120 in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
The neural network layer 130: after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 uses the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (131, 132, to 13n shown in fig. 2) and an output layer 140, and the parameters contained in the plurality of hidden layers may be pre-trained on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 130, i.e. as the last layer of the whole convolutional neural network 100, comes the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (the propagation from 110 to 140 in fig. 2 is the forward propagation), the backward propagation (the propagation from 140 to 110 in fig. 2 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
(3) A deep neural network.
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, where "many" has no particular metric. Dividing a DNN by the location of its different layers, the neural network inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the middle layers are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected to any neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression: y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary: the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_jk. Note that the input layer has no W parameters. In deep neural networks, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
(4) A loss function.
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired to be predicted, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, an initialization process is usually carried out before the first update, i.e. parameters are preset for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is done by loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
(5) A back propagation algorithm.
The convolutional neural network can adopt a back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is transmitted forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrices.
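The following minimal sketch (in PyTorch, with a placeholder two-layer model and an L1 loss rather than the super-resolution model discussed above) illustrates one round of forward propagation, loss computation, and back propagation with a parameter update.

```python
import torch
import torch.nn as nn

# Minimal sketch of forward propagation, loss computation and back
# propagation with a parameter update. The two-layer model and the L1 loss
# are placeholders, not the super-resolution model referred to above.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                      nn.ReLU(),
                      nn.Conv2d(8, 3, kernel_size=3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

x = torch.randn(1, 3, 64, 64)        # input image
target = torch.randn(1, 3, 64, 64)   # ideal (ground-truth) result

pred = model(x)                      # forward propagation
loss = loss_fn(pred, target)         # loss measures the prediction error
optimizer.zero_grad()
loss.backward()                      # back propagation of the error loss
optimizer.step()                     # update weights and biases to reduce the loss
```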
(6) Linear operation.
Linearity refers to a proportional, straight-line relationship between quantities, and can be understood mathematically as a function whose first derivative is a constant. Linear operations can be, but are not limited to, addition operations, null operations, identity operations, convolution operations, batch normalization (BN) operations, and pooling operations. A linear operation may also be referred to as a linear mapping, which needs to satisfy two conditions: homogeneity and additivity; if either condition is not satisfied, the mapping is nonlinear.
Here, homogeneity means f(ax) = a·f(x); additivity means f(x + y) = f(x) + f(y); for example, f(x) = ax is linear. It should be noted that x, a, and f(x) here are not necessarily scalars; they may be vectors or matrices forming a linear space of any dimension. If x and f(x) are n-dimensional vectors, then when a is a constant this is equivalent to satisfying homogeneity, and when a is a matrix it is equivalent to satisfying additivity. In contrast, a function whose graph is a straight line does not necessarily correspond to a linear mapping; for example, f(x) = ax + b satisfies neither homogeneity nor additivity and therefore belongs to the nonlinear mappings.
In the embodiment of the present application, a composite of a plurality of linear operations may be referred to as a linear operation, and each linear operation included in the linear operation may also be referred to as a sub-linear operation.
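A small numeric check of the two conditions above may make them concrete; the matrix A, the vector b, and the scalar a below are arbitrary examples, not values from this application.

```python
import numpy as np

# Illustrative numeric check (values are arbitrary) that f(x) = A·x satisfies
# homogeneity and additivity, while g(x) = A·x + b satisfies neither and is
# therefore a nonlinear mapping.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
b = rng.normal(size=4)
x, y, a = rng.normal(size=4), rng.normal(size=4), 2.5

f = lambda v: A @ v
g = lambda v: A @ v + b

print(np.allclose(f(a * x), a * f(x)))      # True  -> homogeneity
print(np.allclose(f(x + y), f(x) + f(y)))   # True  -> additivity
print(np.allclose(g(a * x), a * g(x)))      # False
print(np.allclose(g(x + y), g(x) + g(y)))   # False
```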
Fig. 4 is a schematic diagram of an architecture of a processing system 100 according to an embodiment of the present application. In fig. 4, an execution device 110 is provided with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through a client device 140.
When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation-related processing (such as implementing the functions of the neural network in the present application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may store the data, instructions, and the like obtained by the corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
Alternatively, the client device 140 may be, for example, a control unit in an automatic driving system or a functional algorithm module in a mobile phone terminal, and the functional algorithm module may be used, for example, to implement related tasks.
It should be noted that the training device 120 may generate corresponding target models/rules (e.g., target neural network models in this embodiment) based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 4, the user may manually give the input data, for example through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if the client device 140 is required to obtain the user's authorization before automatically sending the input data, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific form of presentation may be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data and storing them in the database 130. Of course, the input data input to the I/O interface 112 and the output results output from the I/O interface 112 as shown in the figure may also be stored in the database 130 as new sample data directly by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 4 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 4, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
In order to remove the reflection phenomenon in an image, traditional reflection removal methods rely on manually designed prior information, such as gradient sparsity, defocus blur characteristics, and depth-of-field characteristics, to assist in removing reflections. Because such manually designed prior information only considers the reflection characteristics of specific scenes, reflection removal based on this prior information achieves a good de-reflection effect only in those specific scenes. Once the reflection scene violates the established assumptions, these methods cannot correctly remove the reflection phenomenon in the image.
For the image de-reflection task, the widely used imaging model is: reflection-distorted image I = transmission layer T + reflection layer R, i.e. the reflection-distorted image I is a linear combination of the transmission layer T and the reflection layer R. De-reflection is the separation of the reflection layer R from the reflection-distorted image I to obtain the desired transmission layer T. Since the transmission layer T and the reflection layer R are both unknown variables, additional auxiliary information needs to be introduced to separate the reflection layer R. The reflection layers of real scenes have various sources and complex characteristics, and it is difficult to find effective auxiliary information with which effective reflection removal can be achieved.
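For illustration only, the following sketch synthesizes a reflection-distorted image according to the imaging model I = T + R; the random arrays stand in for real transmission and reflection layers, and practical synthesis pipelines may additionally weight or blur the reflection layer.

```python
import numpy as np

# Illustration of the imaging model I = T + R. The random arrays stand in for
# a real transmission layer and reflection layer normalized to [0, 1]; actual
# synthesis pipelines may additionally weight or blur the reflection layer.
T = np.random.rand(256, 256, 3)    # transmission layer
R = np.random.rand(256, 256, 3)    # reflection layer

I = np.clip(T + R, 0.0, 1.0)       # reflection-distorted image
```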
In recent years, with advances in deep learning and computer vision, convolutional neural networks have become the standard solution to most vision problems, including image de-reflection. Convolutional neural networks are based on a large number of trainable convolution kernels whose parameters are optimized in a supervised manner by means of specific loss functions. Neural-network-based methods can explore different kinds of auxiliary information, such as low-level edge information and high-level semantic information, and combine this auxiliary information with training data to help the network learn reflection removal, which greatly improves de-reflection performance. However, such auxiliary information is limited by its own characteristics and cannot sufficiently characterize the properties of the reflection layer and the transmission layer, so it is still difficult to achieve effective reflection removal in some complicated reflection scenarios. For example, edge information, as a sparse feature of an image, only considers differences between adjacent pixels and cannot describe reflections over a larger spatial extent, so reflection removal is less effective for highlight regions.
In view of this, the image enhancement method provided in the embodiments of the present application introduces color tone information of an image as auxiliary information, can sufficiently characterize the difference between a reflective layer and a transmissive layer, and helps to locate and separate a reflective region in the image, thereby effectively removing a reflection phenomenon in the image.
In addition, the image enhancement method provided by the embodiment of the application can be used for removing the reflection phenomenon in the image, and can also be used for removing the shadow in the image or removing the fog in the image. In short, when reflection, shadow or fog exists in an image, the image enhancement method provided by the embodiment of the application can be adopted to realize image enhancement, so that a reflection layer, a shadow layer or a fog layer in the image is removed. For convenience of description, the image enhancement method provided by the embodiment of the present application will be described in detail below by taking image de-reflection as an example.
For example, referring to fig. 5a, fig. 5a is a schematic diagram illustrating a comparison of an image before and after reflection removal according to an embodiment of the present application. As shown in fig. 5a, before reflection removal, a reflection phenomenon is obvious in the left area of the image, and the reflection layer and the transmission layer overlap in the image, so that part of the image is blocked. After reflection removal, the reflection phenomenon in the left area of the image is eliminated and the reflection layer in the image is removed, so that the problem of the transmission layer in the image being blocked is well solved.
Referring to fig. 5b, fig. 5b is a schematic diagram illustrating a comparison of an image before and after shadow removal according to an embodiment of the present disclosure. As shown in fig. 5b, the image represents a scene of the ground under the sun. Before shadow removal, the upper area of the image has a distinct shadow, specifically the shadow cast by a person holding an umbrella under the sun. In the image, the shadow obscures the tiles of the floor, resulting in an evident dark layer over the obscured portion of the ground. After shadow removal, the shadow phenomenon in the upper area of the image is eliminated and the shadow layer in the image is removed, so that the problem of the ground in the image being obscured is well solved.
The image enhancement method provided by the embodiment of the application can be applied to a terminal or a server. The terminal may be, for example, a digital camera, a surveillance camera, a mobile phone (mobile phone), a Personal Computer (PC), a notebook, a server, a tablet, a Mobile Internet Device (MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote surgery (remote medical supply), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety, a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), or the like. For convenience of description, the image enhancement method provided in the embodiments of the present application will be described below by taking an example of applying the image enhancement method to a terminal.
Referring to fig. 6, fig. 6 is a schematic flowchart of an image enhancement method according to an embodiment of the present disclosure. As shown in fig. 6, the image enhancement method comprises the following steps 601-602.
In this embodiment, the one or more sets of intermediate features are associated with a tone component of the enhanced image of the first image, where the tone (hue) component is the component in the hue, saturation, value (HSV) color space that indicates color information. Optionally, the first network may be a trained network model, and the first network is configured to predict a tone image of the enhanced image of the first image. The tone image includes the tone component of the HSV color space. The enhanced image of the first image is specifically an image obtained by performing image enhancement on the first image. Thus, the one or more sets of intermediate features extracted by the first network are associated with the tone component of the enhanced image of the first image.
Illustratively, performing image enhancement on an image may refer to performing a de-reflection, de-shading, or defogging operation on the image. That is to say, when the image enhancement method provided by the present embodiment is used to perform image de-reflection, the enhanced image of the first image may refer to an image obtained after performing image de-reflection on the first image.
The tone image of the enhanced image may include only the tone component of the HSV color space. The HSV color space is a color space created according to the intuitive properties of colors, and is also known as the hexcone model. For example, referring to fig. 7, fig. 7 is a schematic diagram of a model of the HSV color space according to an embodiment of the present disclosure. In the HSV color space, the parameters of a color are: hue (H), saturation (S), and value (V). Hue is measured as an angle with a value range of 0°-360°, counted counter-clockwise starting from red: red is 0°, green is 120°, and blue is 240°. Their complementary colors are: yellow at 60°, cyan at 180°, and violet at 300°. In short, the hue parameter represents the color information, i.e. the position of the spectral color.
Optionally, the first image in this embodiment may specifically be an RGB image. An RGB image refers to representing an image by three channels, Red (Red), Green (Green) and Blue (Blue), respectively. The different combinations of these three colors can form almost all other colors, so that the RGB image is actually an image obtained by combining these three colors. Generally, the RGB image and the HSV image can be converted into each other. For example, by performing color space conversion on an RGB image, a corresponding HSV image may be obtained; by performing color space conversion on the HSV image, a corresponding RGB image may be obtained.
For the tone image of the enhanced image, in the case that the enhanced image is an RGB image, the tone image of the enhanced image may be understood as follows: color space conversion is performed on the enhanced image to obtain the HSV image corresponding to the enhanced image, and then only the tone component of that HSV image is retained to obtain the tone image of the enhanced image.
Optionally, the first network includes one or more first feature extraction modules, and the one or more sets of intermediate features extracted by the first network may be extracted by the one or more first feature extraction modules. Each of the one or more first feature extraction modules comprises one first convolutional layer, i.e. the first network comprises one or more first convolutional layers in series. In short, on the basis that the first network is used for predicting the tone image of the enhanced image corresponding to the input image, the first feature extraction module in the first network is used for extracting the tone feature, so that the subsequent modules in the first network can predict the tone image of the enhanced image based on the tone feature extracted by the first feature extraction module.
Optionally, in order to ensure the prediction accuracy of the first network, the first network may extract an intermediate feature based on the first image and a tone image corresponding to the first image, so as to obtain a tone image of an enhanced image of the first image.
For example, there may be various ways to acquire the tone image corresponding to the first image.
In a possible implementation manner, before the first image is input to the first network, the terminal may perform color space conversion on the first image to obtain the tone image corresponding to the first image. For example, the terminal may execute a color space conversion function (e.g., an RGB-to-HSV conversion function) to obtain the tone image corresponding to the first image. After the tone image corresponding to the first image is obtained, the tone image is input into the first network together with the first image, so that the first network processes them to obtain the one or more sets of intermediate features.
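One possible (non-limiting) realization of such an RGB-to-HSV conversion step is sketched below using OpenCV; the assumption of an 8-bit input and the normalization of the hue channel to [0, 1] are implementation choices, not requirements of this application, and the file name in the usage note is hypothetical.

```python
import cv2
import numpy as np

def tone_image(rgb: np.ndarray) -> np.ndarray:
    """Return the tone (hue) image of an 8-bit RGB image of shape (H, W, 3).

    OpenCV is used here only as one example of a color space conversion
    function; normalizing the hue channel to [0, 1] is an implementation
    choice.
    """
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)     # hue, saturation, value channels
    hue = hsv[..., 0].astype(np.float32)
    return hue / 179.0                             # 8-bit OpenCV hue lies in [0, 179]

# Possible usage (file name is hypothetical):
# bgr = cv2.imread("first_image.png")
# rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
# h = tone_image(rgb)
```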
In another possible implementation manner, the first network includes a color space conversion module configured to convert the first image into a tone image. That is, after the first image is input into the first network, the first network can convert the first image into a tone image by means of the color space conversion module, and extract the one or more sets of intermediate features based on the first image and the corresponding tone image.
After the intermediate features extracted by the first network are obtained, the terminal takes the first image as the original input of the second network and inputs the original input into the second network; and the intermediate features extracted by the first network are used as intermediate input of the second network and input into the second network. The second network is used for extracting features of the input first image and fusing the extracted features with the input intermediate features, so that an enhanced image of the first image, namely a second image, is obtained through prediction.
Specifically, the terminal extracts the features of the first image through the second network, and fuses the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features. And finally, processing the one or more groups of fusion features through the second network to obtain the second image.
Referring to fig. 8a, fig. 8a is a schematic diagram illustrating a comparison between an RGB image without reflection phenomenon and a tone image according to an embodiment of the present disclosure. As shown in fig. 8a, in the case that there is no reflection phenomenon in the RGB image, the color tone image corresponding to the RGB image can well reflect the contour of the main object in the RGB image, such as the contour of the duck and the cradle in the RGB image.
Referring to fig. 8b, fig. 8b is a schematic diagram illustrating a comparison between an RGB image with a reflection phenomenon and a tone image according to an embodiment of the present disclosure. As shown in fig. 8b, when there is a reflection phenomenon in the RGB image, the color tone image corresponding to the RGB image does not reflect the outline of the main object in the RGB image due to the reflection region in the RGB image, for example, the color tone image does not represent the outline of the duck and the cradle in the RGB image well. That is, the corresponding tone characteristic information is included in the tone image corresponding to the RGB image, and the tone characteristic information has a correlation with the reflection area in the RGB image, so that the reflection area in the RGB image can be effectively located based on the tone characteristic information in the tone image.
Therefore, in the present embodiment, the hue features extracted by the network that predicts the tone image are acquired and input as auxiliary information into the network that predicts the enhanced image. Because the hue features can sufficiently characterize the difference between the reflection layer and the transmission layer, the reflection areas in the image can be located and separated, and the reflection phenomenon in the image is thereby effectively removed.
Optionally, the second network includes one or more feature units in series, and each feature unit includes a second feature extraction module and a feature fusion module. Each second feature extraction module may include a second convolutional layer. The input of the second convolutional layer in each feature unit is the fusion feature output by the feature fusion module in the previous feature unit, and the input of the feature fusion module in each feature unit is a corresponding set of intermediate features together with the output of the second convolutional layer in the same feature unit; the feature fusion modules in the one or more feature units correspond one-to-one with the sets of intermediate features.
Wherein, the number of the first feature extraction module in the first network and the feature unit in the second network may be the same. For example, the first network includes n first feature extraction modules, and the second network includes n second feature extraction modules and n feature fusion modules.
Referring to fig. 8c, fig. 8c is a schematic structural diagram of a first network and a second network according to an embodiment of the present disclosure. As shown in fig. 8c, the first network includes n feature extraction modules in total, namely feature extraction module A1, feature extraction module A2, ..., and feature extraction module An; the second network includes n feature extraction modules in total, namely feature extraction module B1, feature extraction module B2, ..., and feature extraction module Bn; the second network also includes n feature fusion modules, namely feature fusion module 1, feature fusion module 2, ..., and feature fusion module n.
The n feature extraction modules in the first network are sequentially connected and respectively extract intermediate feature 1, intermediate feature 2, ..., and intermediate feature n. Specifically, feature extraction module A1 in the first network performs feature extraction on the input data to obtain intermediate feature 1; feature extraction module A2 performs feature extraction on intermediate feature 1 to obtain intermediate feature 2, and so on, so that the feature extraction modules in the first network respectively extract n intermediate features.
The feature extraction module and the feature fusion module in the second network form a feature unit, and the n groups of feature units are connected in sequence. The input of the first feature extraction module in the second network is the input data of the second network, and the input data of other feature extraction modules in the second network is the feature output by the previous feature fusion module; the input of each feature fusion module in the second network is respectively from the feature extraction module in the second network and the feature extraction module in the first network, and the feature fusion module is used for fusing the features extracted by the feature extraction module in the second network and the features extracted by the feature extraction module in the first network to obtain fusion features.
Specifically, feature fusion module 1, feature fusion module 2, ..., and feature fusion module n in the second network are respectively connected with feature extraction module B1, feature extraction module B2, ..., and feature extraction module Bn in the second network. Feature extraction module B1 performs feature extraction on the input data of the second network, and feature fusion module 1 fuses the features extracted by feature extraction module B1 with intermediate feature 1 extracted by feature extraction module A1 to obtain fusion feature 1; feature extraction module B2 performs feature extraction on fusion feature 1, and feature fusion module 2 fuses the features extracted by feature extraction module B2 with intermediate feature 2 extracted by feature extraction module A2 to obtain fusion feature 2; and so on for the remaining feature units.
Optionally, the first convolutional layer in the first network includes a plurality of serially connected convolutional layers, and the second convolutional layer in the second network includes a plurality of parallel connected convolutional layers. Illustratively, the first convolutional layer may include 4 serially connected hole (dilated) convolutional layers, with step sizes (stride) of {2, 4, 8, 4}, respectively. It can be understood that, in addition to the exemplary 4 hole convolutional layers, the first convolutional layer may include another number of layers; for example, the first convolutional layer may include 6 or 8 convolutional layers in series, and the number of convolutional layers included in the first convolutional layer is not specifically limited in this embodiment. In addition, the hole convolutional layers in the first convolutional layer may also adopt other step sizes, and this embodiment likewise does not specifically limit the step sizes of the convolutional layers included in the first convolutional layer.
Specifically, referring to fig. 9a, fig. 9a is a schematic structural diagram of a first convolutional layer according to an embodiment of the present disclosure. As shown in fig. 9a, the first convolutional layer is composed of 4 hole convolutional layers connected in series. In fig. 9a, D2, D4, D8, and D4 respectively indicate that the step size of the first hole convolutional layer is 2, the step size of the second hole convolutional layer is 4, the step size of the third hole convolutional layer is 8, and the step size of the fourth hole convolutional layer is 4.
Because the tone image contains salient reflection characteristics but relatively little detail, the serial hole convolutional layer structure in the first convolutional layer can rapidly improve the multi-scale representation capability in a layer-wise manner, which is favorable for spatially locating the reflection region.
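The serial structure can be sketched as follows; interpreting D2/D4/D8/D4 as dilation rates and using 64 channels are assumptions made purely for illustration.

```python
import torch.nn as nn

# Sketch of the first convolutional layer as 4 serially connected hole
# (dilated) convolution layers. Interpreting D2/D4/D8/D4 as dilation rates and
# using 64 channels are assumptions made only for illustration.
def serial_hole_block(channels: int = 64) -> nn.Sequential:
    layers = []
    for d in (2, 4, 8, 4):
        layers += [nn.Conv2d(channels, channels, kernel_size=3,
                             padding=d, dilation=d),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```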
Illustratively, the second convolutional layer may include 3 parallel hole convolutional layers, with step sizes of {4, 8, 16}, respectively. It can be understood that, in addition to the exemplary 3 hole convolutional layers, the second convolutional layer may include another number of convolutional layers and adopt other step sizes; for example, the second convolutional layer may include 4 or 6 convolutional layers in parallel, and this embodiment does not specifically limit the number of convolutional layers included in the second convolutional layer or their step sizes. By using parallel hole convolutional layers to form the second convolutional layer, a clean and detailed transmission layer in an image can be effectively recovered.
Specifically, referring to fig. 9b, fig. 9b is a schematic structural diagram of a second convolutional layer according to an embodiment of the present disclosure. As shown in fig. 9b, the second convolutional layer is composed of 3 parallel hole convolutional layers, and one convolutional layer is connected before and after the second convolutional layer. The 3 hole convolutional layers included in the second convolutional layer are denoted D4, D8, and D16, with step sizes of {4, 8, 16}, respectively. Before the second convolutional layer, a convolutional layer with a step size of 2 (i.e. D2 in the figure) is connected, and this convolutional layer is connected to each of the 3 parallel hole convolutional layers in the second convolutional layer. The 3 parallel hole convolutional layers in the second convolutional layer are also connected to a splicing module, which splices the features extracted by the 3 parallel hole convolutional layers, and the spliced features are then processed by another convolutional layer with a step size of 2.
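A corresponding sketch of the parallel structure is given below; as before, treating {4, 8, 16} as dilation rates and the channel count of 64 are illustrative assumptions, and the pre- and post-convolution layers are folded into the merging convolution for brevity.

```python
import torch
import torch.nn as nn

# Sketch of the second convolutional layer: 3 parallel hole (dilated)
# convolution branches whose outputs are spliced (concatenated) and processed
# by a further convolution. Dilation rates {4, 8, 16} and 64 channels are
# illustrative assumptions.
class ParallelHoleBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (4, 8, 16)
        ])
        self.merge = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        return self.merge(torch.cat(feats, dim=1))   # splicing + feature processing
```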
In this embodiment, the first convolutional layer is formed by using a plurality of convolutional layers connected in series, and the second convolutional layer is formed by using a plurality of convolutional layers connected in parallel, so that the multi-scale representation capability of the entire network can be improved on a finer-grained level, and the performance of the network, that is, the reflection removing capability of the network, can be effectively improved.
For example, the feature fusion module in the second network may be specifically configured to perform a feature splicing operation on the intermediate features extracted by the first convolutional layer and the features extracted by the second convolutional layer to obtain spliced features, which serve as the fusion features. In addition, an enhancement-operation-decrement (SOS) enhancement strategy may be added to the feature fusion module to enhance the fusion operation, so that the two different kinds of feature information can be effectively utilized to obtain fusion features with better performance. For example, referring to fig. 9c, fig. 9c is a schematic structural diagram of a feature fusion module provided in an embodiment of the present application. As shown in fig. 9c, after feature 1 and feature 2 are spliced by the splicing module for enhanced fusion, the spliced features are processed by a convolutional layer with a step size of 2, and a decrement (subtraction) operation is performed between the features obtained from that convolutional layer and feature 2 to finally obtain the fusion features.
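A possible sketch of such a fusion module is shown below; the channel count and the reading of the "step length of 2" as a dilation rate (so that the subtraction stays shape-compatible) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Sketch of a feature fusion module with the enhancement-operation-decrement
# (SOS) strategy: the intermediate feature F from the first network and the
# feature T from the second network are spliced, passed through a convolution,
# and T is then subtracted from the result. The channel count and the reading
# of the "step length of 2" as a dilation rate are illustrative assumptions.
class FusionModule(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3,
                              padding=2, dilation=2)

    def forward(self, f: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        fused = self.conv(torch.cat([f, t], dim=1))   # splicing + convolution
        return fused - t                              # decrement (subtraction) step
```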
Optionally, the input of the second network further comprises semantic features of the first image. And the terminal inputs the first image, the semantic features of the first image and the intermediate features into the second network to obtain a second image output by the second network. That is, the terminal may take the first image and the semantic features of the first image together as input to the second network, such that the second network performs image enhancement based on the first image and the semantic features of the first image.
Generally, the semantics of an image are divided into low-level features and high-level features. Low-level image features refer to detail features such as contour, edge, color, texture, and shape features. Generally, the low-level features of an image carry less semantic information, but the target positions they indicate are accurate. High-level image features generally refer to what a human being can see: for example, when low-level features are extracted from a face image, feature information such as connected contours, a nose, and eyes can be obtained, whereas the high-level feature is the face itself. Briefly, high-level features describe objects in an image and can be used to identify and classify those objects. The semantic information of high-level features is usually rich, but the target localization is rough. In this embodiment, by introducing semantic features, the semantic comprehension capability of the second network can be improved, so that the second network ensures, at a relatively high level, the image quality after de-reflection; for example, the image colors exhibit no color cast and large reflection areas are removed, which ultimately improves the overall quality of the de-reflection result.
In this embodiment, the terminal may extract the semantic features of the first image through a pre-trained network, for example, through a Visual Geometry Group (VGG) network. In practical applications, the terminal may also extract the semantic features of the first image in other manners, and the manner of extracting the semantic features is not specifically limited in this embodiment.
Optionally, the second network further includes a convolutional block attention module (CBAM), where the convolutional block attention module is connected to the first feature unit in the second network. The convolutional block attention module is configured to extract features of the first image, and the first feature unit in the second network is configured to perform further feature extraction on the features extracted by the convolutional block attention module. In short, the convolutional block attention module is located before the first one of the second feature extraction modules in the second network; after the convolutional block attention module extracts features from the image input to the second network, the second feature extraction module further extracts features from the output of the convolutional block attention module.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a convolutional block attention module according to an embodiment of the present disclosure. As shown in fig. 10, the convolutional block attention module includes a channel attention module and a spatial attention module connected in sequence. The convolutional block attention module is in fact a simple and effective attention module for feedforward convolutional neural networks. Given an intermediate feature map, the convolutional block attention module infers attention maps along two independent dimensions, channel and spatial, in sequence, and then multiplies the attention maps with the input feature map for adaptive feature refinement. Since the convolutional block attention module is a lightweight and general-purpose module, it can be seamlessly integrated into a convolutional neural network while adding only negligible computational overhead. In this embodiment, the convolutional block attention module used in the second network successively infers attention maps for spatial and channel information, so that the features can be adaptively refined to improve the image de-reflection effect.
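A minimal CBAM-style sketch is given below; the reduction ratio, the 7x7 spatial kernel, and the channel count are common illustrative choices rather than parameters specified in this application.

```python
import torch
import torch.nn as nn

# Minimal CBAM-style sketch: channel attention followed by spatial attention,
# each multiplied into the feature map for adaptive refinement. The reduction
# ratio and kernel size are illustrative choices.
class CBAM(nn.Module):
    def __init__(self, channels: int = 64, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel attention: pooled descriptors -> shared MLP -> sigmoid gate
        avg = self.channel_mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.channel_mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: channel-wise mean/max maps -> conv -> sigmoid gate
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))
```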
The image enhancement method provided by the embodiment is described above, and the process of performing image de-reflection by using the image enhancement method provided by the embodiment is described below with reference to specific examples.
Referring to fig. 11a, fig. 11a is a schematic structural diagram of a first network according to an embodiment of the present disclosure. As shown in fig. 11a, the input of the first network is a reflection-distorted image and the tone image of the reflection-distorted image, the plurality of feature extraction modules in the first network respectively extract different intermediate features, and the output of the first network is the de-reflected tone image of the reflection-distorted image. By predicting the tone image of the reflection-distorted image, the first network extracts salient reflection features as intermediate features and provides them to the second network to help the second network locate and remove the reflection region, thereby effectively removing the reflection region in the image and obtaining a high-quality restored image.
Referring to fig. 11b, fig. 11b is a schematic structural diagram of a second network according to an embodiment of the present disclosure. As shown in fig. 11b, the input of the second network is the reflection-distorted image and the semantic features corresponding to the reflection-distorted image, and the output of the second network is the de-reflected image corresponding to the reflection-distorted image. Specifically, the input part of the second network introduces semantic features to improve the semantic comprehension capability of the network, and a lightweight convolutional block attention module is then used to successively infer attention maps for spatial and channel information, thereby adaptively refining the features. The refined features are further extracted by the feature extraction modules, undergo enhanced fusion with the intermediate features provided by the first network, and are further refined by a residual module to obtain the final prediction result.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an image de-reflection network according to an embodiment of the present disclosure. As shown in fig. 12, the image de-reflection network is constituted by the first network and the second network described above. A feature extraction module in the first network, a feature extraction module in the second network, and a feature fusion module together form a basic unit, and the image de-reflection network has n basic units in total.
Assume that the reflection-distorted image is I, the tone image corresponding to the reflection-distorted image is H, and the high-level semantic features corresponding to the reflection-distorted image are G. The intermediate features extracted by the first network are denoted F1, F2 ... Fn, and the output of the first network is H'.
The first network takes the reflection distortion image I and the tone image H as input, and extracts features through n feature extraction modules. Specifically, the first network extracts the feature F1 through the 1 st feature extraction module, and transmits the feature F1 to the corresponding feature fusion module in the second network; the first network extracts the feature F2 through the 2 nd feature extraction module and transmits the feature F2 to the corresponding feature fusion module in the second network; and repeating the steps until the first network extracts the characteristics Fn through the nth characteristic extraction module and transmits the characteristics Fn to the last characteristic fusion module in the second network.
Suppose that the features extracted by the feature extraction module in the second network are denoted as T1 and T2 … Tn, the features obtained by fusion of the feature fusion module are denoted as S1 and S2 … Sn, and the output of the second network is T'. Specifically, the second network refines the features through the attention module of the rolling block, then obtains the features T1 through the extraction module 1, and performs enhanced fusion on the features T1 and the features F1 transmitted by the first network through the feature fusion module 1 to obtain enhanced features S1. The second network extracts the feature S1 through the 2 nd feature extraction module to obtain a feature T2, and performs enhanced fusion on the feature T2 and the feature F2 transmitted by the first network through the feature fusion module 2 to obtain an enhanced feature S2. By analogy, the second network extracts the feature Sn-1 through the nth feature extraction module to obtain a feature Tn, and performs enhanced fusion on the feature Tn and the feature Fn transmitted by the first network through the feature fusion module n to obtain an enhanced feature Sn. And finally, refining the characteristic Sn by the second network through a residual error module to obtain final output T'.
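The overall data flow just described can be summarized by the following schematic sketch, in which extract_A, extract_B, fuse, cbam, and residual are placeholder modules; only the flow of I, H, G, F1..Fn, T1..Tn, and S1..Sn follows the text above, while the layers of the first network that produce its tone output H' are omitted.

```python
import torch

# Schematic sketch of the forward pass described above. extract_A, extract_B,
# fuse, cbam and residual are placeholder module lists/objects; only the data
# flow follows the text, and the first network's output layers for H' are
# omitted.
def de_reflection_forward(I, H, G, extract_A, extract_B, fuse, cbam, residual):
    # First network: extract intermediate features F1..Fn from (I, H).
    feats_F, x = [], torch.cat([I, H], dim=1)
    for module_A in extract_A:                 # feature extraction modules A1..An
        x = module_A(x)
        feats_F.append(x)                      # F1, F2, ..., Fn

    # Second network: refine (I, G) with the attention module, then extract and fuse.
    s = cbam(torch.cat([I, G], dim=1))
    for module_B, fusion, F in zip(extract_B, fuse, feats_F):
        t = module_B(s)                        # T1, T2, ..., Tn
        s = fusion(F, t)                       # enhanced fusion -> S1, S2, ..., Sn
    return residual(s)                         # refined to the final output T'
```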
Referring to fig. 13, fig. 13 is a schematic diagram illustrating a comparison between the input and the output of an image de-reflection network according to an embodiment of the present disclosure. As shown in fig. 13, the reflection-distorted image and the tone image of the reflection-distorted image are input to the first network to obtain the output image of the first network. The reflection-distorted image and the corresponding high-level semantic features are input to the second network; the intermediate features extracted by the first network (intermediate feature 1, intermediate feature 2, and so on) are also input to the second network, finally yielding the output image of the second network.
The image enhancement method provided by the embodiment of the present application is introduced above, and a training method of a network model is introduced below. By the training method of the network model, the network for performing image enhancement described in the above embodiment can be trained.
Referring to fig. 14, fig. 14 is a schematic flowchart of a method for training a network model according to an embodiment of the present application. As shown in fig. 14, the method for training a network model according to the embodiment of the present application includes the following steps 1401-1405.
In this embodiment, before the training method of the network model is executed, an image sample pair set may be obtained in advance, where the image sample pair set includes a plurality of image sample pairs. Each image sample pair includes a first image and an enhanced image corresponding to the first image. For example, the first image in the image sample pair may be a reflection-distorted image, and the enhanced image of the first image may be the ground-truth image, i.e. the de-reflected image corresponding to the reflection-distorted image.
There are various ways to obtain the image sample pair set.
In one possible implementation, pairs of image samples are obtained by acquiring different images of the same scene under different circumstances. Taking reflection-distorted image sample pairs as an example, during image acquisition, an image of a normal scene may be acquired as the enhanced image (i.e. the de-reflected image) corresponding to the first image, and an image of the same scene with a reflection phenomenon may be acquired as the first image. For example, after an image of a static scene is captured by the terminal as the de-reflected image, the position of the terminal is kept unchanged, a piece of transparent, light-reflecting glass is placed in front of the terminal, and an image of the same scene is then captured by the terminal as the reflection-distorted image.
In another possible implementation manner, after the image without the reflection phenomenon is acquired, noise can be added to the image through a specific algorithm, so that a distorted image corresponding to the image is obtained. For example, in the case where reflection noise is added to an image, the image before the reflection noise is added may be used as an enhanced image of the first image, and the image after the reflection noise is added may be used as the first image.
In this embodiment, step 1402-.
After the predicted image corresponding to the first image is predicted by the second network, a first loss function indicating a difference between the enhanced image in the pair of image samples and the first predicted image may be found based on the enhanced image in the pair of image samples and the obtained predicted image.
Illustratively, the first loss function may include one or more of a content loss function, a perceptual loss function, and a countering loss function. The content loss function is used to represent the content fidelity of the image; for example, the content loss function may be expressed in the L1 paradigm. Specifically, the content loss function can be represented by formula 1:

L1 = || T - T' ||_1    (formula 1)

where L1 represents the content loss function, T represents the enhanced image in the image sample pair, and T' represents the predicted image obtained by the second network.
The perceptual loss function is used to provide supervision constraints related to high-level semantic features. Illustratively, the perceptual loss function may be represented by formula 2:

Lc = λ1·d(φ1(T), φ1(T')) + λ2·d(φ2(T), φ2(T')) + λ3·d(φ3(T), φ3(T'))    (formula 2)

where Lc represents the perceptual loss function, T represents the enhanced image in the image sample pair, T' represents the predicted image obtained by the second network, φ1, φ2, and φ3 respectively represent the features extracted from the conv1_2, conv3_2, and conv4_2 layers of VGG-19, λ1, λ2, and λ3 respectively represent weight coefficients, and d(·,·) denotes a distance between feature maps (for example, the L1 distance).
The countering loss function is used to enhance the realism of the generated transmission layer results and to suppress color cast and reduce artifacts.
Illustratively, the countering loss function may be expressed, for example, as L_adv = -log(D(I, T'))    (formula 3), where L_adv represents the countering loss function, T represents the enhanced image in the image sample pair, T' represents the predicted image obtained by the second network, and D(I, T') represents the probability that the image T' generated given the input image I is a true transmission-layer image.
In this embodiment, the network is constrained by adopting the content loss, the perceptual loss, and the countering loss, so that the realism of the images restored by the network can be improved, color cast can be suppressed, and artifacts can be reduced.
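A hedged sketch of these three loss terms is given below; the VGG-19 feature indices (taken to correspond to conv1_2, conv3_2, and conv4_2), the L1 distance for the perceptual term, the unit weights, and the -log form of the adversarial term are illustrative assumptions, and ImageNet normalization of the VGG input is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

# Hedged sketch of the three training losses above: L1 content loss, a VGG-19
# perceptual loss, and a generic adversarial term. Feature indices, unit
# lambda weights and the -log adversarial form are illustrative assumptions.
vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vgg_feats(x, indices=(2, 12, 21)):   # assumed to match conv1_2, conv3_2, conv4_2
    feats, out = [], x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in indices:
            feats.append(out)
        if i >= max(indices):
            break
    return feats

def image_losses(pred, target, disc_prob, lambdas=(1.0, 1.0, 1.0)):
    l_content = nn.functional.l1_loss(pred, target)               # content term (formula 1)
    l_perc = sum(lam * nn.functional.l1_loss(fp, ft)              # perceptual term (formula 2)
                 for lam, fp, ft in zip(lambdas, vgg_feats(pred), vgg_feats(target)))
    l_adv = -torch.log(disc_prob + 1e-8).mean()                   # adversarial term
    return l_content, l_perc, l_adv
```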
For the first network and the second network obtained after the training in step 1405, reference may be made to the description in the embodiment corresponding to fig. 6, and details are not repeated here.
In one possible embodiment, since the first network is used for predicting the tone image, in order to ensure the prediction accuracy of the first network, the terminal may acquire a loss function based on the output and the input of the first network, and train the first network and the second network together based on the loss function.
Specifically, the terminal may acquire the tone image of the first predictive image and the tone image of the enhanced image in the pair of image samples, and acquire a second loss function indicating a difference between the tone image of the first predictive image and the tone image of the enhanced image in the pair of image samples, from the tone image of the first predictive image and the tone image of the enhanced image in the pair of image samples. Finally, the terminal may train the first network and the second network according to the first loss function and the second loss function.
Since the second loss function is actually derived from two tone images, how to accurately measure the difference between two tone images is crucial for network training. Referring to fig. 15, fig. 15 is a schematic diagram illustrating the distance measurement of tone images according to an embodiment of the present disclosure. As shown in fig. 15, hue is cyclically distributed in the HSV color space. For any two hue values, the angle between them is by convention at most 180°, the maximum value of 180° being reached only in the extreme case. The distance between two tone images therefore cannot be directly measured by conventional distance measures (e.g. the L1 distance or the L2 distance) and needs to be adapted to the cyclic characteristics of hue.
Illustratively, the embodiment of the application provides a cyclic tone loss, which can measure the distance between two tone images. Specifically, with the hue values normalized to [0, 1], the distance measure between tone images is shown in formula 4 and formula 5:

D_hue(H_a, H_b) = Σ_(i,j) [ M_(i,j) · |H_a(i,j) - H_b(i,j)| + (1 - M_(i,j)) · (1 - |H_a(i,j) - H_b(i,j)|) ]    (formula 4)

M_(i,j) = 1 when |H_a(i,j) - H_b(i,j)| ≤ 0.5
M_(i,j) = 0 when |H_a(i,j) - H_b(i,j)| > 0.5    (formula 5)

where H_a(i,j) represents the hue value at any position (i,j) in tone image a, H_b(i,j) represents the hue value at the same position in tone image b, D_hue(H_a, H_b) represents the loss between tone image a and tone image b, and M_(i,j) represents the weight coefficient.
In brief, for the hue difference at any position in the two tone images, the difference itself is taken when it is less than or equal to 0.5, and the absolute value of the difference between 1 and that value is taken when it is greater than 0.5. The differences so obtained at all positions are then summed to obtain the distance between the two tone images, i.e. the cyclic tone loss.
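A direct sketch of this cyclic tone loss (with hue values normalized to [0, 1]) might look as follows; whether to sum or average over positions is a minor implementation choice.

```python
import torch

def cyclic_hue_loss(h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
    """Cyclic tone loss sketch for hue images normalized to [0, 1].

    For each position the absolute difference is used when it is at most 0.5;
    otherwise it is wrapped around the hue circle as 1 minus that value, as
    described above. Summing over positions follows the text; averaging would
    be a minor implementation variant.
    """
    diff = torch.abs(h_a - h_b)
    wrapped = torch.where(diff <= 0.5, diff, 1.0 - diff)
    return wrapped.sum()
```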
In a possible embodiment, since the tone image can represent the reflection phenomenon in an image, in order to improve the prediction accuracy of the second network and ensure that the image predicted by the second network accurately eliminates the reflection phenomenon, the cyclic tone loss may also be introduced to constrain the second network.
Illustratively, the terminal may acquire the tone image of the enhanced image in the image sample pair and the third image output by the first network, and then acquire a third loss function from them, where the third loss function indicates the difference between the tone image of the enhanced image in the image sample pair and the third image. The manner of obtaining the third loss function is similar to that of the second loss function; reference may be made to the description of the second loss function, which is not repeated here. Finally, the terminal trains the first network and the second network according to the first loss function, the second loss function, and the third loss function. That is, the terminal may determine a final total loss function based on the first loss function, the second loss function, and the third loss function, and train the first network and the second network according to the total loss function.
Illustratively, the total loss function is specifically shown in equation 6 and equation 7.
L = α1·L_hue + α2·L1 + α2·Lc + α3·L_adv    (Equation 6)
wherein L represents the total loss function; L_hue represents the overall hue loss function, which is composed, as shown in equation 7, of the second loss function (corresponding to the tone image of the first predicted image and the tone image of the enhanced image in the pair of image samples) and the third loss function (corresponding to the tone image of the enhanced image in the pair of image samples and the third image); L1 represents the content loss function; Lc represents the perceptual loss function; L_adv represents the adversarial loss function; and α1, α2, α3 and β are weight coefficients, which may be 50, 0.2, 0.01 and 2, respectively.
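As a hedged illustration of how the weighted total loss of equation 6 might be assembled, the following sketch treats the individual loss terms as precomputed scalar tensors and uses the example weights given above; the composition of the overall hue loss according to equation 7 (with weight β) is not reproduced here.

```python
import torch

def total_loss(l_hue: torch.Tensor, l1: torch.Tensor, lc: torch.Tensor,
               l_adv: torch.Tensor, a1: float = 50.0, a2: float = 0.2,
               a3: float = 0.01) -> torch.Tensor:
    """Weighted total loss of equation 6: L = a1*L_hue + a2*L1 + a2*Lc + a3*L_adv."""
    return a1 * l_hue + a2 * l1 + a2 * lc + a3 * l_adv
```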
Taking image de-reflection as an example, the method of this embodiment is tested on a synthetic data set and compared with existing image de-reflection algorithms.
First, 3500 pairs of reflection images and ground-truth (GT) images are obtained by image synthesis in this embodiment. A reflection image is an image containing a reflection area, and the corresponding GT image is the reflection-removed image. In addition, several groups of test data are divided from the synthesized images, namely Nature-20, Real-20 and Wild-55, where Nature-20 comprises 20 pairs of reflection images and GT images, Real-20 comprises 20 pairs of reflection images and GT images, and Wild-55 comprises 55 pairs of reflection images and GT images.
Referring to fig. 16, fig. 16 is a schematic diagram illustrating a comparison of network reflection reduction effects according to an embodiment of the present disclosure. As shown in fig. 16, this embodiment uses the same three groups of test data to evaluate a conventional network and the network provided by the present application, and obtains the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the de-reflected images produced by each. PSNR is an engineering term for the ratio between the maximum possible power of a signal and the power of the destructive noise that affects its representation accuracy, and can be used to measure the quality of an image. SSIM is an index measuring the similarity of two images and is used to evaluate the quality of an output image processed by an algorithm. Generally, the higher the PSNR of an image, the higher its quality; the larger the SSIM of two images, the more similar they are. As can be seen from fig. 16, compared with the conventional network, the network provided by the present application yields larger PSNR and SSIM, which demonstrates that the tone feature information provided by the embodiment of the present application can be effectively used for image de-reflection.
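For reference, a minimal sketch of how PSNR could be computed for such a comparison is given below, assuming 8-bit images stored as NumPy arrays; SSIM is typically computed with a library implementation and is omitted here.

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two images of the same shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")      # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```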
Referring to fig. 17, fig. 17 is a schematic diagram illustrating a comparison of the de-reflection effects obtained when different network components are introduced, according to an embodiment of the present application. As shown in fig. 17, when the network provided in the embodiment of the present application includes only the second network, the prediction effect is poor; when it includes both the first network and the second network, the prediction effect improves; and when it includes the first network and the second network and the cyclic hue loss is used to constrain the network, the prediction effect is the best. That is to say, both the first network capable of extracting hue feature information and the cyclic hue loss for constraining the network proposed in the embodiment of the present application bring positive gains to the image de-reflection task and can effectively improve the image de-reflection effect.
Referring to fig. 18, fig. 18 is a schematic diagram illustrating a comparison of the reflection removing effect of different image reflection removing methods according to an embodiment of the present disclosure. Fig. 18 shows the evaluation results of the image dereflection method provided in the embodiment of the present application and the existing image dereflection method on the public data set, wherein the data in the last row shows the objective index corresponding to the image dereflection method provided in the embodiment of the present application. It can be seen that objective indexes corresponding to the image dereflection method provided by the embodiment of the application all exceed objective indexes corresponding to the existing image dereflection method, that is, compared with the existing image dereflection method, the image dereflection method provided by the embodiment of the application can have a better image dereflection effect.
Referring to fig. 19 and fig. 20, fig. 19 is a schematic diagram illustrating a comparison of image de-reflection effects according to an embodiment of the present disclosure; fig. 20 is a schematic diagram illustrating another comparison of image de-reflection effects according to an embodiment of the present application. As can be seen from fig. 19, the image de-reflection method provided in the embodiment of the present application predicts images of higher visual quality that are closer to the ground-truth image, and removes most of the reflections, especially dense fringe reflections (e.g., the enlarged region of the second set of images) and regional reflections (e.g., the whole background of the third set of images). As can be seen from fig. 20, compared with existing image de-reflection methods, the image de-reflection method provided by the embodiment of the present application produces the cleanest de-reflection results and the highest visual quality, which further verifies its superiority.
Referring to fig. 21, fig. 21 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 21, an image processing apparatus according to an embodiment of the present application includes: a processing unit 2101. The processing unit 2101 is configured to: inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space; outputting, by processing of the second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features.
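A minimal sketch of the flow implemented by the processing unit 2101 is given below, assuming both networks are available as PyTorch modules; the assumption that the first network returns its intermediate features together with a predicted hue image, and the module interfaces shown, are illustrative rather than prescribed by this embodiment.

```python
import torch
import torch.nn as nn

def enhance(first_net: nn.Module, second_net: nn.Module,
            first_image: torch.Tensor) -> torch.Tensor:
    """Two-stage enhancement: hue-related intermediate features guide the second network."""
    with torch.no_grad():
        # First network: one or more sets of intermediate features related to the
        # hue component of the enhanced image (a predicted hue image is ignored here).
        intermediate_feats, _hue_pred = first_net(first_image)
        # Second network: takes the first image together with the intermediate
        # features and outputs the second (enhanced) image.
        second_image = second_net(first_image, intermediate_feats)
    return second_image
```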
In one possible implementation, the processing unit 2101 is further configured to extract the one or more sets of intermediate features through one or more first convolution layers in series in the first network.
In one possible implementation, the first convolutional layer comprises a plurality of layers of serially connected convolutional layers.
In one possible implementation, the processing unit 2101 is further configured to: extracting the features of the first image through the second network; fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features; processing the one or more sets of fused features through the second network to obtain the second image.
In one possible implementation, the feature of the first image and the one or more sets of intermediate features are fused by one or more feature units in series in the second network to obtain one or more sets of fused features; each feature unit in the one or more feature units comprises a second convolution layer and a feature fusion module, wherein the input of the second convolution layer in each feature unit is the fusion feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features and the output of the second convolution layer in the same feature unit, and the feature fusion modules in the one or more feature units are in one-to-one correspondence with the one or more sets of intermediate features.
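The feature unit described above can be illustrated with the hedged PyTorch sketch below, in which the fusion module is assumed to be a concatenation followed by a 1x1 convolution and the second convolution layer is simplified to a single 3x3 convolution (as noted next, it may instead comprise several parallel convolutional layers); the layer hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class FeatureUnit(nn.Module):
    """One feature unit: a second convolution layer followed by a feature fusion module."""

    def __init__(self, channels: int, intermediate_channels: int):
        super().__init__()
        # second convolution layer (simplified to a single convolution here)
        self.second_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # assumed fusion: concatenate the intermediate features with the convolution
        # output and reduce back to `channels` with a 1x1 convolution
        self.fusion = nn.Conv2d(channels + intermediate_channels, channels, kernel_size=1)

    def forward(self, prev_fused: torch.Tensor, intermediate: torch.Tensor) -> torch.Tensor:
        conv_out = self.second_conv(prev_fused)   # input: previous unit's fused feature
        return self.fusion(torch.cat([conv_out, intermediate], dim=1))
```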
In one possible implementation, the second convolutional layer comprises a plurality of convolutional layers connected in parallel.
In one possible implementation, the processing unit 2101 is further configured to: processing the first image to obtain a tone image corresponding to the first image, wherein the tone image comprises a tone component of an HSV color space; and inputting the tone image corresponding to the first image into the first network, so as to obtain the one or more groups of intermediate features through the processing of the first network together with the first image.
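A minimal sketch of such a tone-image computation is shown below, assuming an 8-bit RGB input and OpenCV's HSV conversion; any conversion that yields the hue component (here normalized to [0, 1]) would serve the same purpose.

```python
import cv2
import numpy as np

def hue_image(rgb_image: np.ndarray) -> np.ndarray:
    """Extract the hue component of an 8-bit RGB image, normalized to [0, 1]."""
    hsv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV)   # OpenCV hue is in [0, 179] for uint8
    return hsv[:, :, 0].astype(np.float32) / 179.0     # keep only the hue channel
```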
In one possible implementation, the processing unit 2101 is further configured to: inputting the first image into a first network to obtain a tone image of the first image; inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, wherein the tone image comprises tone components of an HSV color space.
In one possible implementation, the input to the second network further comprises semantic features of the first image.
In one possible implementation, the apparatus is configured to perform at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
Referring to fig. 22, fig. 22 is a schematic structural diagram of a model training device according to an embodiment of the present disclosure. As shown in fig. 22, an embodiment of the present application provides a model training apparatus, including: an acquisition unit 2201 and a processing unit 2202. The acquiring unit 2201 is configured to acquire a pair of image samples, where the pair of image samples includes a first image and an enhanced image of the first image; the processing unit 2202 is configured to: inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space; outputting, by processing of the second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features; obtaining a first loss function from the enhanced image and the second image of the pair of image samples, the first loss function being indicative of a difference between the enhanced image and the second image of the pair of image samples; and training the first network and the second network according to the first loss function to obtain the trained first network and the trained second network.
In a possible implementation manner, the acquiring unit 2201 is further configured to acquire a tone image of the second image and a tone image of the enhanced image in the image sample pair; the acquiring unit 2201 is further configured to obtain a second loss function from the tone image of the second image and the tone image of the enhanced image in the image sample pair, the second loss function indicating a difference between the tone image of the second image and the tone image of the enhanced image in the image sample pair; the processing unit 2202 is further configured to train the first network and the second network according to the first loss function and the second loss function.
In a possible implementation manner, the acquiring unit 2201 is further configured to acquire a tone image of the enhanced image in the image sample pair and a third image output by the first network; the acquiring unit 2201 is further configured to obtain a third loss function from the tone image of the enhanced image in the pair of image samples and the third image, the third loss function being indicative of a difference between the tone image of the enhanced image in the pair of image samples and the third image; the processing unit 2202 is further configured to train the first network and the second network according to the first loss function, the second loss function, and the third loss function.
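Putting the pieces together, one training step of the model training apparatus could look like the hedged sketch below, assuming PyTorch modules, an L1 content loss as the first loss, a differentiable RGB-to-hue conversion, and an unweighted sum of the three losses; the interfaces and weighting are illustrative assumptions (see equation 6 for the weighted form).

```python
import torch
import torch.nn.functional as F

def rgb_to_hue(rgb: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Differentiable hue channel of an RGB tensor of shape (N, 3, H, W) in [0, 1]."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    maxc = rgb.max(dim=1).values
    minc = rgb.min(dim=1).values
    delta = maxc - minc + eps
    hue = torch.where(maxc == r, (g - b) / delta,
          torch.where(maxc == g, 2.0 + (b - r) / delta, 4.0 + (r - g) / delta))
    return (hue / 6.0) % 1.0                              # hue normalized to [0, 1]

def cyclic_hue_distance(ha: torch.Tensor, hb: torch.Tensor) -> torch.Tensor:
    """Cyclic hue distance of equations 4 and 5."""
    d = torch.abs(ha - hb)
    return torch.where(d <= 0.5, d, 1.0 - d).sum()

def train_step(first_net, second_net, optimizer, first_image, enhanced_gt):
    """One training step using the first, second and third loss functions."""
    optimizer.zero_grad()
    feats, third_image = first_net(first_image)          # hue features + predicted hue image
    second_image = second_net(first_image, feats)        # predicted enhanced image

    hue_gt = rgb_to_hue(enhanced_gt)
    loss1 = F.l1_loss(second_image, enhanced_gt)                    # first loss
    loss2 = cyclic_hue_distance(rgb_to_hue(second_image), hue_gt)   # second loss
    loss3 = cyclic_hue_distance(third_image, hue_gt)                # third loss

    total = loss1 + loss2 + loss3        # unweighted here; see equation 6 for weights
    total.backward()
    optimizer.step()
    return total.item()
```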
In one possible implementation, the processing unit 2202 is further configured to extract the one or more sets of intermediate features through one or more first convolution layers in series in the first network.
In one possible implementation, the first convolutional layer comprises a plurality of layers of serially connected convolutional layers.
In a possible implementation manner, the processing unit 2202 is further configured to extract features of the first image through the second network; fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features; processing the one or more sets of fused features through the second network to obtain the second image.
In a possible implementation, the processing unit 2202 is further configured to fuse the feature of the first image and the one or more sets of intermediate features through one or more feature units in series in the second network to obtain one or more sets of fused features; each feature unit in the one or more feature units comprises a second convolution layer and a feature fusion module, wherein the input of the second convolution layer in each feature unit is the fusion feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features and the output of the second convolution layer in the same feature unit, and the feature fusion modules in the one or more feature units are in one-to-one correspondence with the one or more sets of intermediate features.
In one possible implementation, the second convolutional layer comprises a plurality of convolutional layers connected in parallel.
In a possible implementation manner, the processing unit 2202 is further configured to process the first image to obtain a tone image corresponding to the first image, where the tone image includes tone components of an HSV color space; and inputting the tone image corresponding to the first image into the first network, so as to obtain the one or more groups of intermediate features through the processing of the first network together with the first image.
In one possible implementation, the processing unit 2202 is further configured to input the first image into the first network to obtain a tone image of the first image; inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, wherein the tone image comprises tone components of an HSV color space.
In one possible implementation, the input to the second network further comprises semantic features of the first image.
In one possible implementation, the apparatus is configured to perform at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
Referring to fig. 23, fig. 23 is a schematic structural diagram of an execution device provided in the embodiment of the present application. The execution device 2300 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, and the like, which is not limited herein. The image processing apparatus described in the foregoing embodiments may be disposed on the execution device 2300 to implement the image processing functions described above. Specifically, the execution device 2300 includes: a receiver 2301, a transmitter 2302, a processor 2303 and a memory 2304 (where the number of processors 2303 in the execution device 2300 may be one or more, with one processor taken as an example in fig. 23), and the processor 2303 may include an application processor 23031 and a communication processor 23032. In some embodiments of the application, the receiver 2301, the transmitter 2302, the processor 2303 and the memory 2304 may be connected by a bus or other means.
The memory 2304 may include both read-only memory and random access memory, and provides instructions and data to the processor 2303. A portion of the memory 2304 may also include non-volatile random access memory (NVRAM). The memory 2304 stores operating instructions executable by the processor, executable modules or data structures, or a subset or an expanded set thereof, where the operating instructions may include various operating instructions for performing various operations.
The processor 2303 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The methods disclosed in the embodiments of the present application may be implemented in the processor 2303 or implemented by the processor 2303. The processor 2303 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2303. The processor 2303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 2303 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 2304, and the processor 2303 reads the information in the memory 2304 and completes the steps of the above method in combination with its hardware.
The receiver 2301 may be used to receive input numeric or character information and generate signal inputs related to performing device related settings and function control. The transmitter 2302 may be used to output numeric or character information through a first interface; the transmitter 2302 may also be used to send instructions to the disk groups through the first interface to modify data in the disk groups; the transmitter 2302 may also include a display screen or the like.
In one embodiment of the application, the processor 2303 is configured to execute the image enhancement method executed by the execution device in the corresponding embodiment of fig. 6.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer-executable instructions stored by the storage unit to cause the chip in the execution device to execute the image enhancement method described in the above embodiments, or to cause the chip in the training device to execute the image enhancement method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 24, fig. 24 is a schematic structural diagram of a chip provided in the embodiment of the present application, where the chip may be represented as a neural network processor NPU 2400, and the NPU 2400 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 2403, and the controller 2404 controls the arithmetic circuit 2403 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 2403 includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 2403 is a two-dimensional systolic array. The arithmetic circuit 2403 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2403 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 2402 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 2401, performs a matrix operation with matrix B, and stores the partial results or final results of the obtained matrix in an accumulator 2408.
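Purely as a conceptual illustration of accumulating partial results (not a model of the hardware itself), the sketch below computes C = A·B by tiling the inner dimension and adding each partial product into an accumulator, assuming NumPy arrays.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 16) -> np.ndarray:
    """Compute C = A @ B by accumulating partial products over tiles of the inner dimension."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions of A and B must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))     # plays the role of the accumulator
    for start in range(0, k, tile):
        end = min(start + tile, k)
        # each pass multiplies a slice of A with the matching slice of B and
        # adds the partial result into C
        c += a[:, start:end] @ b[start:end, :]
    return c
```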
The unified memory 2406 is used for storing input data and output data. The weight data is transferred directly to the weight memory 2402 through a direct memory access controller (DMAC) 2405. The input data is also carried into the unified memory 2406 through the DMAC.
The bus interface unit (BIU) 2413 is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer (IFB) 2409. The bus interface unit 2413 is configured for the instruction fetch buffer 2409 to obtain instructions from the external memory, and is further configured for the memory access controller 2405 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2406 or transfer weight data to the weight memory 2402 or transfer input data to the input memory 2401.
The vector calculation unit 2407 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit 2403 if necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 2407 can store the processed output vector to the unified memory 2406. For example, the vector calculation unit 2407 may apply a linear function or a non-linear function to the output of the arithmetic circuit 2403, such as linear interpolation of the feature planes extracted by the convolutional layers, or to a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 2407 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 2403, for example for use in subsequent layers of a neural network.
An instruction fetch buffer 2409 connected to the controller 2404 is configured to store instructions used by the controller 2404. The unified memory 2406, the input memory 2401, the weight memory 2402, and the instruction fetch buffer 2409 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is usually preferable. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired connection (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless connection (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
Claims (16)
1. An image enhancement method, comprising:
inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space;
outputting, by processing of a second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features.
2. The method of claim 1, wherein inputting the first image into a first network, resulting in one or more sets of intermediate features extracted by the first network, comprises:
extracting the one or more sets of intermediate features through one or more first convolution layers in series in the first network.
3. The method of claim 2, wherein the first convolutional layer comprises a plurality of layers of serially connected convolutional layers.
4. The method of any of claims 1-3, wherein the processing over the second network to output a second image comprises:
extracting the features of the first image through the second network;
fusing the features of the first image and the one or more groups of intermediate features through the second network to obtain one or more groups of fused features;
processing the one or more sets of fused features through the second network to obtain the second image.
5. The method of claim 4, wherein said fusing the features of the first image and the one or more sets of intermediate features through the second network to obtain one or more sets of fused features comprises:
fusing the features of the first image and the one or more groups of intermediate features through one or more feature units in series in the second network to obtain one or more groups of fused features;
each feature unit in the one or more feature units comprises a second convolution layer and a feature fusion module, wherein the input of the second convolution layer in each feature unit is the fusion feature output by the feature fusion module in the previous feature unit, the input of the feature fusion module in each feature unit is a corresponding set of intermediate features and the output of the second convolution layer in the same feature unit, and the feature fusion modules in the one or more feature units are in one-to-one correspondence with the one or more sets of intermediate features.
6. The method of claim 5, wherein the second convolutional layer comprises a plurality of convolutional layers connected in parallel.
7. The method according to any one of claims 1-6, further comprising:
processing the first image to obtain a tone image corresponding to the first image, wherein the tone image comprises a tone component of an HSV color space;
and inputting the tone image corresponding to the first image into the first network, so as to obtain the one or more groups of intermediate features through the processing of the first network together with the first image.
8. The method according to any one of claims 1-6, further comprising:
inputting the first image into a first network to obtain a tone image of the first image;
inputting the tone image into the first network to be processed by the first network together with the first image to obtain the one or more sets of intermediate features, wherein the tone image comprises tone components of an HSV color space.
9. The method of any of claims 1-8, wherein the input to the second network further comprises semantic features of the first image.
10. The method according to any of claims 1-9, wherein the method is used to implement at least one of the following image enhancement tasks: image de-reflection, image de-shading, and image de-fogging.
11. A method for training a network model, comprising:
obtaining a pair of image samples, the pair of image samples comprising a first image and an enhanced image of the first image;
inputting a first image into a first network to obtain one or more groups of intermediate features extracted by the first network, wherein the one or more groups of intermediate features are related to hue components of an enhanced image of the first image, and the hue components are components indicating color information in hue, saturation and brightness (HSV) color space;
outputting, by processing of the second network, a second image, the second image being an enhanced image of the first image, the input of the second network comprising the first image and the one or more sets of intermediate features;
obtaining a first loss function from the enhanced image and the second image of the pair of image samples, the first loss function being indicative of a difference between the enhanced image and the second image of the pair of image samples;
and training the first network and the second network according to the first loss function to obtain the trained first network and the trained second network.
12. The method of claim 11, further comprising:
acquiring a tone image of the second image and a tone image of an enhanced image in the image sample pair;
obtaining a second loss function from the tone image of the second image and the tone image of the enhanced image in the pair of image samples, the second loss function indicating a difference between the tone image of the second image and the tone image of the enhanced image in the pair of image samples;
the training the first network and the second network according to the first loss function includes:
training the first network and the second network according to the first loss function and the second loss function.
13. The method of claim 12, further comprising:
obtaining a tone image of an enhanced image in an image sample pair and a third image output by the first network;
obtaining a third loss function from the tone image of the enhanced image in the pair of image samples and the third image, the third loss function being indicative of a difference between the tone image of the enhanced image in the pair of image samples and the third image;
the training the first network and the second network according to the first loss function and the second loss function includes:
training the first network and the second network according to the first loss function, the second loss function, and the third loss function.
14. An image processing apparatus, comprising a memory and a processor; the memory stores code, the processor is configured to execute the code, and when executed, the image processing apparatus performs the method of any of claims 1 to 13.
15. A computer storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 13.
16. A computer program product having stored thereon instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 13.