CN116563304A - Image processing method and device and training method and device of image processing model - Google Patents
Image processing method and device and training method and device of image processing model
- Publication number
- CN116563304A (application number CN202210108947.9A)
- Authority
- CN
- China
- Prior art keywords
- decoding
- image
- mask
- initial
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/20—Contour coding, e.g. using detection of edges
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The application relates to an image processing method, an image processing apparatus, a computer device, a storage medium and a computer program product, which are applicable to scenes such as cloud technology, artificial intelligence, intelligent traffic, driving assistance, conferencing, live broadcast, video and art design. The method comprises the following steps: performing feature encoding processing on an original image to obtain a plurality of feature maps of different scales; performing first decoding processing based on the feature maps of the different scales to obtain a mask image, the mask image segmenting a target subject from the original image at a semantic level; performing second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels; updating the candidate mask maps of the different levels according to edge transition features between the target subject and the background region included in each candidate mask map to obtain a target mask map; and determining a matting result corresponding to the target subject according to the target mask map. By adopting the method, the accuracy of matting can be improved.
Description
Technical Field
The present application relates to the field of computer technology, and in particular, to an image processing method, an apparatus, a computer device, a storage medium, and a computer program product, and a training method, an apparatus, a computer device, a storage medium, and a computer program product for an image processing model.
Background
With the development of computer technology, an image segmentation technology has emerged, through which a subject or a background can be segmented from an original image to meet different use requirements. For example, a portrait in an image or video is individually segmented for composition into other images or videos.
Traditional image segmentation typically uses a deep convolutional neural network to extract high-level semantic features of an image and then uses this high-level semantic information to detect and segment a target. However, detecting and segmenting the target by relying only on high-level semantic information easily loses fine detail features, so the resulting segmentation is often inaccurate.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve matting accuracy.
In one aspect, the present application provides an image processing method. The method comprises the following steps:
performing feature encoding processing on an original image to obtain a plurality of feature maps of different scales;
performing first decoding processing based on the feature maps of the different scales to obtain a mask image, wherein the mask image is used for segmenting a target subject from the original image at a semantic level;
performing second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels;
updating the candidate mask maps of the different levels according to edge transition features between the target subject and a background region included in each candidate mask map to obtain a target mask map;
and determining a matting result corresponding to the target subject according to the target mask map.
In one embodiment, the present application also provides an image processing apparatus. The device comprises:
the encoding module is used for performing feature encoding processing on an original image to obtain a plurality of feature maps of different scales;
the first decoding module is used for performing first decoding processing based on the feature maps of the different scales to obtain a mask image, the mask image being used for segmenting a target subject from the original image at a semantic level;
the second decoding module is used for performing second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels;
the updating module is used for updating the candidate mask maps of the different levels according to edge transition features between the target subject and a background region included in each candidate mask map to obtain the target mask map;
and the result determining module is used for determining a matting result corresponding to the target subject according to the target mask map.
In one embodiment, the operations of the apparatus are performed by an image processing model comprising an encoding structure composed of a plurality of encoding units, a first decoding structure composed of a plurality of first decoding units, and a second decoding structure composed of a plurality of second decoding units; the encoding structure is used for performing feature encoding processing on an original image, the first decoding structure is used for performing the first decoding processing, and the second decoding structure is used for performing the second decoding processing.
In one embodiment, the encoding module is further configured to acquire an original image and input the original image into the encoding structure; and sequentially perform encoding processing by a plurality of encoding units in the encoding structure based on input data corresponding to the respective encoding units to obtain feature maps output by the respective encoding units; the input data of the initial encoding unit in the encoding structure is the original image, and the feature maps output by different encoding units in the encoding structure differ in scale.
In one embodiment, the first decoding module is further configured to perform, based on the feature maps of the multiple different scales, a first decoding process by using multiple first decoding units in a first decoding structure sequentially based on input data corresponding to each of the first decoding units until a mask image is output by using a last first decoding unit, where the input data of a current first decoding unit in the first decoding structure includes a decoding result corresponding to a previous first decoding unit and a feature map output by an encoding unit corresponding to the current first decoding unit, and the decoding result corresponding to the current first decoding unit is used to form the input data of a subsequent first decoding unit.
In one embodiment, the second decoding module is further configured to determine input data corresponding to each second decoding unit in the second decoding structure, where the input data corresponding to the second decoding unit includes a feature map output by a corresponding encoding unit and a candidate mask map output by a previous second decoding unit before the second decoding unit, and at least a part of the input data corresponding to the second decoding units further includes the mask image; and sequentially perform second decoding processing by a plurality of second decoding units in the second decoding structure based on the input data corresponding to the respective second decoding units, so as to obtain candidate mask maps output by the respective second decoding units, wherein the levels of the candidate mask maps output by the second decoding units are different.
In one embodiment, the updating module is further configured to enlarge the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level, to obtain an enlarged candidate mask map; extract an edge transition image between the target subject and the background region in the enlarged candidate mask map, wherein the edge transition image includes the edge transition features; perform fusion processing based on the edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map; take the updated mask map as the candidate mask map of the current level in the next round, return to the step of enlarging the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level, and continue until the updated mask map corresponding to the candidate mask map of the last level is obtained; and take the updated mask map corresponding to the candidate mask map of the last level as the target mask map.
In one embodiment, the updating module is further configured to perform morphological processing on the edge transition image to obtain a processed edge transition image, where the morphological processing includes at least one of an erosion operation and a dilation operation;
and perform fusion processing on the processed edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map.
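As a non-limiting illustration, the morphological processing and fusion described in this embodiment might be realised as in the following sketch. The kernel size, iteration counts, the erosion-then-dilation order and the fusion rule are assumptions introduced here for illustration; the arrays are taken to be single-channel maps with values in [0, 1].

```python
import cv2
import numpy as np

def refine_edge_transition(edge_transition: np.ndarray,
                           finer_candidate: np.ndarray,
                           coarse_candidate: np.ndarray) -> np.ndarray:
    """Illustrative morphological processing of the edge transition image followed by fusion."""
    kernel = np.ones((3, 3), np.uint8)
    band = (edge_transition * 255).astype(np.uint8)
    band = cv2.erode(band, kernel, iterations=1)    # erosion operation removes isolated noise
    band = cv2.dilate(band, kernel, iterations=2)   # dilation operation widens the transition band
    band = band.astype(np.float32) / 255.0
    # Assumed fusion rule: trust the finer-level candidate mask map inside the processed band.
    return band * finer_candidate + (1.0 - band) * coarse_candidate
```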
In one embodiment, the second decoding module is further configured to perform second decoding processing based on the feature maps of the different scales and the mask image to obtain a foreground color map that includes the target subject;
and the result determining module is further configured to perform fusion processing on the foreground color map and the target mask map to obtain the matting result corresponding to the target subject.
In one embodiment, the operations of the apparatus are performed by an image processing model, and the apparatus further comprises:
the acquisition module is used for acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
the model determining module is used for determining an image processing model to be trained, and the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
The first training module is used for carrying out first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
the second training module is used for performing second training on the initial second decoding structure through the second sample image set based on the training-completed coding structure and the first decoding structure until a second stopping condition is met, so as to obtain a training-completed second decoding structure and obtain a training-completed image processing model; the trained image processing model is used for determining a matting result corresponding to a target main body in an original image.
In one embodiment, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
performing feature encoding processing on an original image to obtain a plurality of feature maps of different scales;
performing first decoding processing based on the feature maps of the different scales to obtain a mask image, wherein the mask image is used for segmenting a target subject from the original image at a semantic level;
performing second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels;
updating the candidate mask maps of the different levels according to edge transition features between the target subject and a background region included in each candidate mask map to obtain a target mask map;
and determining a matting result corresponding to the target subject according to the target mask map.
In one embodiment, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
performing feature encoding processing on an original image to obtain a plurality of feature maps of different scales;
performing first decoding processing based on the feature maps of the different scales to obtain a mask image, wherein the mask image is used for segmenting a target subject from the original image at a semantic level;
performing second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels;
updating the candidate mask maps of the different levels according to edge transition features between the target subject and a background region included in each candidate mask map to obtain a target mask map;
and determining a matting result corresponding to the target subject according to the target mask map.
In one embodiment, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
performing feature encoding processing on an original image to obtain a plurality of feature maps of different scales;
performing first decoding processing based on the feature maps of the different scales to obtain a mask image, wherein the mask image is used for segmenting a target subject from the original image at a semantic level;
performing second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels;
updating the candidate mask maps of the different levels according to edge transition features between the target subject and a background region included in each candidate mask map to obtain a target mask map;
and determining a matting result corresponding to the target subject according to the target mask map.
According to the image processing method, the apparatus, the computer device, the storage medium and the computer program product, a plurality of feature maps of different scales are obtained by performing feature encoding processing on the original image. A large-scale feature map contains more low-level semantic features, and a small-scale feature map contains more high-level semantic features. The first decoding processing is performed based on the feature maps of the different scales, so the target subject in the original image can be identified based on semantic features of different levels, and the position of the target subject in the original image is represented at the semantic level by the mask image so as to preliminarily segment the target subject. The mask image provides strong semantic information, and performing the second decoding processing based on the feature maps of the different scales and the mask image effectively guides the interpretation of the edge transition region between the target subject and the background region. In addition, the candidate mask maps have different levels, and a higher-level candidate mask map contains more edge transition information between the target subject and the background region; updating the candidate mask maps of the different levels according to the edge transition features in the candidate mask maps effectively achieves strong attention regression toward low-level features, so that a target mask map with clearer target-subject texture and finer edge transition details is obtained. Further, the matting result corresponding to the target subject can be accurately obtained according to the target mask map, thereby improving the matting accuracy.
In another aspect, the present application further provides a training method of an image processing model, where the method includes:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
performing second training on the initial second decoding structure through the second sample image set based on the trained encoding structure and first decoding structure until a second stopping condition is met, to obtain a trained second decoding structure and thereby a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
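The staged training described above might be organised as in the following sketch. The function is illustrative only: the loss functions, optimiser settings, epoch counts and the decision to freeze the trained encoding and first decoding structures during the second stage are assumptions rather than requirements of the embodiments, and the stopping conditions are stood in for by fixed epoch counts.

```python
import torch
from torch import nn

def train_image_processing_model(encoder, first_decoder, second_decoder,
                                 first_sample_loader, second_sample_loader,
                                 epochs_stage1=10, epochs_stage2=10, lr=1e-4):
    """Illustrative two-stage training of the image processing model (assumed details)."""
    # First training: encoding structure + initial first decoding structure on segmentation labels.
    params1 = list(encoder.parameters()) + list(first_decoder.parameters())
    opt1 = torch.optim.Adam(params1, lr=lr)
    seg_criterion = nn.BCELoss()
    for _ in range(epochs_stage1):                       # stands in for the first stopping condition
        for image, segmentation_label in first_sample_loader:
            mask_image = first_decoder(encoder(image))
            loss = seg_criterion(mask_image, segmentation_label)
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # Second training: only the initial second decoding structure, on matting labels,
    # with the trained encoding structure and first decoding structure held fixed (assumption).
    for p in params1:
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(second_decoder.parameters(), lr=lr)
    for _ in range(epochs_stage2):                       # stands in for the second stopping condition
        for image, matting_label in second_sample_loader:
            feature_maps = encoder(image)
            mask_image = first_decoder(feature_maps)
            candidate_masks = second_decoder(feature_maps, mask_image)
            prediction_mask = candidate_masks[-1]        # coarse-to-fine update omitted for brevity
            loss = (prediction_mask - matting_label).abs().mean()   # L1 mask loss (assumed)
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```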
In one embodiment, the application further provides a training device of the image processing model. The device comprises:
the acquisition module is used for acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
the model determining module is used for determining an image processing model to be trained, and the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
the first training module is used for carrying out first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
the second training module is used for performing second training on the initial second decoding structure through the second sample image set based on the trained encoding structure and first decoding structure until a second stopping condition is met, so as to obtain a trained second decoding structure and thereby a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
In one embodiment, the first training module is further configured to perform feature encoding processing on the first sample image through the initial encoding structure to obtain a plurality of first feature maps of different scales; perform first decoding processing based on the first feature maps of the different scales through the initial first decoding structure to obtain a first mask image corresponding to the first sample image; and train the initial encoding structure and the initial first decoding structure according to the difference between the first mask image and the segmentation label corresponding to the first sample image until the first stopping condition is met, to obtain a trained encoding structure and a trained first decoding structure.
In one embodiment, the second training module is further configured to perform feature encoding processing on the second sample image through the encoding structure to obtain a plurality of second feature maps of different scales; perform first decoding processing based on the second feature maps of the different scales through the first decoding structure to obtain a second mask image corresponding to the second sample image; perform second decoding processing based on the second feature maps of the different scales and the second mask image through the initial second decoding structure to obtain a plurality of sample mask maps of different levels; update the sample mask maps of the different levels based on sample edge transition features between the sample subject and the background region included in each sample mask map through the initial second decoding structure, to obtain a prediction mask map; and train the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until the second stopping condition is met, to obtain a trained second decoding structure.
In one embodiment, the mask loss includes a global loss and a scale loss; the second training module is further configured to determine the global loss between the prediction mask map and the matting label corresponding to the second sample image; adjust the prediction mask map to different scales and determine the scale loss of the prediction mask map at the different scales; construct a target loss function based on the global loss and the scale loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
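One possible (assumed) form of the global loss and scale loss is sketched below: the prediction mask map and the matting label are compared at full resolution and after being adjusted to several smaller scales. The L1 distance, the specific scales and the unweighted sum are assumptions and do not limit the embodiments.

```python
import torch
import torch.nn.functional as F

def mask_loss_with_scales(prediction_mask, matting_label, scales=(0.5, 0.25, 0.125)):
    """Illustrative target loss: global loss plus scale losses at assumed scales (L1 distances)."""
    global_loss = (prediction_mask - matting_label).abs().mean()
    scale_loss = 0.0
    for s in scales:
        pred_s = F.interpolate(prediction_mask, scale_factor=s,
                               mode="bilinear", align_corners=False)
        label_s = F.interpolate(matting_label, scale_factor=s,
                                mode="bilinear", align_corners=False)
        scale_loss = scale_loss + (pred_s - label_s).abs().mean()
    return global_loss + scale_loss
```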
In one embodiment, the second training module is further configured to determine a segmentation loss between the second mask image and a segmentation label corresponding to the second sample image; determine a mask loss between the prediction mask map and the matting label corresponding to the second sample image, and construct a target loss function based on the segmentation loss and the mask loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
In one embodiment, the second sample image set further includes a foreground color label corresponding to the second sample image, and the second training module is further configured to perform, through the initial second decoding structure, second decoding processing based on the second feature maps of the different scales and the second mask image to obtain a predicted foreground color map; determine a foreground color loss between the predicted foreground color map and the foreground color label corresponding to the second sample image; determine a mask loss between the prediction mask map and the matting label corresponding to the second sample image; construct a target loss function based on the mask loss and the foreground color loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
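The combination of the mask loss and the foreground color loss might, for example, take the following form. The L1 distances, the weighting and the restriction of the foreground color loss to the subject region (via the ground-truth alpha) are assumptions for illustration only.

```python
import torch

def matting_objective(prediction_mask, matting_label,
                      predicted_foreground, foreground_color_label,
                      foreground_weight=1.0):
    """Illustrative target loss combining the mask loss with a foreground color loss."""
    mask_loss = (prediction_mask - matting_label).abs().mean()
    # The foreground color is only meaningful where the subject is present, so the
    # color loss is weighted by the ground-truth alpha here (an assumption).
    fg_loss = (matting_label * (predicted_foreground - foreground_color_label)).abs().mean()
    return mask_loss + foreground_weight * fg_loss
```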
In one embodiment, the second sample images include composite images; the second training module is further configured to determine, over the color channels, a channel loss between the prediction mask map corresponding to a composite image and the corresponding matting label; determine a mask loss between the prediction mask map and the matting label corresponding to the second sample image, and construct a target loss function based on the mask loss and the channel loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
In one embodiment, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
performing second training on the initial second decoding structure through the second sample image set based on the trained encoding structure and first decoding structure until a second stopping condition is met, to obtain a trained second decoding structure and thereby a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
In one embodiment, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
performing second training on the initial second decoding structure through the second sample image set based on the trained encoding structure and first decoding structure until a second stopping condition is met, to obtain a trained second decoding structure and thereby a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
In one embodiment, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
performing second training on the initial second decoding structure through the second sample image set based on the trained encoding structure and first decoding structure until a second stopping condition is met, to obtain a trained second decoding structure and thereby a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
According to the training method, the apparatus, the computer device, the storage medium and the computer program product of the image processing model, a first sample image set and a second sample image set are acquired, the first sample image set including a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set including a second sample image and a matting label corresponding to the second sample image; an image processing model to be trained is determined, the image processing model to be trained including an initial encoding structure, an initial first decoding structure and an initial second decoding structure; first training is performed on the initial encoding structure and the initial first decoding structure through the first sample image set until the first stopping condition is met, to obtain a trained encoding structure and a trained first decoding structure; and second training is performed on the initial second decoding structure through the second sample image set based on the trained encoding structure and first decoding structure until the second stopping condition is met, to obtain a trained second decoding structure. In this staged manner, the encoding structure and the first decoding structure first learn to segment the subject at the semantic level from the segmentation labels, and the second decoding structure then learns fine matting from the matting labels on that basis, so that the trained image processing model can accurately determine the matting result corresponding to the target subject in the original image, improving the matting accuracy. Moreover, since the matting result is determined by the image processing model, the matting efficiency can be effectively improved.
Drawings
FIG. 1 is a diagram of an application environment for an image processing method in one embodiment;
FIG. 2 is a flow chart of an image processing method in one embodiment;
FIG. 3 is a schematic illustration of a mask pattern provided in one embodiment;
FIG. 4 is a schematic illustration of a foreground color map provided in another embodiment;
FIG. 5 is a flow chart of an image processing method in one embodiment;
FIG. 6 is a schematic diagram of an image processing model in one embodiment;
FIG. 7 is a schematic diagram of a comparison of a mask map and a corresponding edge transition image in one embodiment;
FIG. 8 is a flow chart of a training method of an image processing model in one embodiment;
FIG. 9 is a diagram illustrating a comparison of segmentation results and matting results in one embodiment;
FIG. 10 is a flow diagram of segmentation task training in one embodiment;
FIG. 11 is a flow diagram of a matting task training in one embodiment;
FIG. 12 is a schematic diagram showing a comparison of a matting result obtained by an image processing model and a processing result obtained in a conventional manner in one embodiment;
FIG. 13 is a comparative schematic of no foreground color synthesis and foreground color synthesis in one embodiment;
FIG. 14 is a block diagram showing the structure of an image processing apparatus in one embodiment;
FIG. 15 is a block diagram of a training apparatus for an image processing model in one embodiment;
fig. 16 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The embodiments of the present application may be applied to various scenarios including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like. For example, it is applied to the field of artificial intelligence (Artificial Intelligence, AI) technology, where artificial intelligence is a theory, method, technique and application system that simulates, extends and expands human intelligence, perceives the environment, acquires knowledge and uses the knowledge to obtain the best result using a digital computer or a machine controlled by a digital computer. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The scheme provided by the embodiment of the application relates to an image processing method of artificial intelligence, and specifically is described through the following embodiments.
The image processing method provided by the embodiments of the present application can be applied to the application environment shown in fig. 1. The terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process, and may be integrated on the server 104 or located on a cloud or other network server. The terminal 102 and the server 104 may each independently perform the image processing method provided in the embodiments of the present application, or may cooperate to perform it. When the terminal 102 and the server 104 cooperate to perform the image processing method provided in the embodiments of the present application, the terminal 102 acquires an original image and transmits the original image to the server 104. The server 104 performs feature encoding processing on the original image to obtain a plurality of feature maps of different scales, and performs first decoding processing based on the feature maps of the different scales to obtain a mask image, the mask image being used for segmenting a target subject from the original image at a semantic level. The server 104 performs second decoding processing based on the feature maps of the different scales and the mask image to obtain a plurality of candidate mask maps of different levels, and updates the candidate mask maps of the different levels according to edge transition features between the target subject and the background region included in each candidate mask map to obtain a target mask map. The server 104 determines a matting result corresponding to the target subject according to the target mask map, and returns the matting result to the terminal 102. The terminal 102 may be, but is not limited to, various desktop computers, notebook computers, smart phones, tablet computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, and the like. An application may run on the terminal 102, which may be a communication application, a mail application, a video application, a music application, an image processing application, or the like. The server 104 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
It should be noted that terms such as "a plurality of" mentioned in the embodiments of the present application each refer to at least two.
In one embodiment, the training method of the image processing model may also be applied in the application environment as shown in fig. 1. Both the terminal 102 and the server 104 may independently perform the training method of the image processing model provided in the embodiments of the present application. The terminal 102 and the server 104 may also cooperate to perform the training method of the image processing model provided in the embodiments of the present application. When the terminal 102 and the server 104 cooperate to perform the training method of the image processing model provided in the embodiment of the present application, the terminal 102 acquires a first sample image set and a second sample image set, where the first sample image set includes a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set includes a second sample image and a matting label corresponding to the second sample image. The terminal 102 sends the first sample image set and the second sample image set to the server 104. The server 104 determines an image processing model to be trained that includes an initial encoding structure, an initial first decoding structure, and an initial second decoding structure. The server 104 performs a first training on the initial encoding structure and the initial first decoding structure through the first sample image set until a first stop condition is satisfied, and obtains a trained encoding structure and a trained first decoding structure. The server 104 performs a second training on the initial second decoding structure based on the trained encoding structure and the first decoding structure through the second sample image set, and stops until a second stopping condition is met, so as to obtain a trained second decoding structure, and a trained image processing model is obtained and is used for determining a matting result corresponding to the target main body in the original image.
In one embodiment, as shown in fig. 2, an image processing method is provided, which is described by taking as an example that the method is applied to a computer device (the computer device may be a terminal or a server in fig. 1), and includes the following steps:
step S202, performing feature encoding processing on the original image to obtain a plurality of feature images with different scales.
The original image is an image to be matted, and may be any one of an RGB (Red, Green, Blue) image, a gray-scale image, a depth image, an image corresponding to the Y component of a YUV image, and the like, but is not limited thereto. The "Y" in a YUV image represents luminance (Luma), that is, the gray-scale value, and "U" and "V" represent chrominance (Chroma), which describes the color and saturation of the image and is used to specify the color of a pixel. The original image may be an image captured of an arbitrary scene, for example, a person image, a landscape image or an industrial device image, but is not limited thereto. Matting refers to separating a certain part of an image or video from the original image or original video into a separate layer for subsequent image synthesis or video synthesis, for example, matting a target subject out of the original image. A subject refers to any of various objects such as a person, a flower, a cat, a dog, a cow, the blue sky, clouds or the background. The target subject is the subject that is required, and can be selected according to requirements. Subject detection (salient object detection) refers to automatically processing regions of interest while selectively ignoring regions of no interest when facing a scene, the region of interest being referred to as the subject region.
The feature map is an image which is obtained by feature encoding the original image and contains key information of the original image. The original image may be composed of a foreground region, which is a region of the original image where the target subject is located, and a background region, which is the rest of the original image except the region where the target subject is located.
Specifically, the computer device may acquire an original image, and perform feature encoding processing on the original image to obtain a plurality of feature maps with different scales. The feature encoding process may include a convolution process and may further include at least one of a pooling process and an activation process. Further, the computer device may perform convolution processing on the original image, and perform next convolution processing on the feature map obtained by the current convolution processing, and may obtain feature maps of different scales based on multiple convolution processing. Alternatively, the computer device may perform at least one of pooling and activation of the feature map obtained by the current convolution process and then perform the next convolution process to obtain a plurality of feature maps of different scales.
In one embodiment, the feature map obtained by the convolution process may be subjected to pooling, the feature map obtained by the pooling may be subjected to next convolution process, and a plurality of feature maps with different scales may be obtained based on the multiple convolution processes and the multiple pooling processes.
In one embodiment, the feature map obtained by the convolution processing may be pooled, the feature map obtained by the pooling processing may be activated, the next convolution processing may be performed on the feature map obtained by the activation processing, and a plurality of feature maps with different scales may be obtained based on the plurality of convolution processes, the plurality of pooling processes, and the plurality of activation processes.
It will be appreciated that the pooling process may be performed after each convolution process, or may be performed once after a plurality of convolutions and then performed again. For example, it is necessary to perform the convolution processing 5 times, pool the feature map of each convolution processing first and then perform the next convolution processing, or pool the feature map obtained after the two convolution processing, pool the feature map after the pooling processing and then perform the next convolution processing.
Similarly, the activation process may be performed after each pooling process, or may be performed once after a plurality of pooling processes, and then the next convolution process may be performed.
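As a non-limiting illustration of the feature encoding processing described above, the following sketch stacks convolution, activation and pooling operations and keeps the intermediate feature map produced at each scale. The class name, channel widths, number of stages and layer choices are assumptions introduced here for illustration only and do not limit the embodiments.

```python
import torch
from torch import nn

class MultiScaleEncoder(nn.Module):
    """Illustrative encoder: each stage halves the spatial scale and keeps its output."""
    def __init__(self, in_channels=3, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, prev = [], in_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, padding=1),  # convolution processing
                nn.ReLU(inplace=True),                          # activation processing
                nn.MaxPool2d(2),                                # pooling processing
            ))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feature_maps = []
        for stage in self.stages:
            x = stage(x)
            feature_maps.append(x)   # one feature map per scale, largest first
        return feature_maps

# Example: a 3-channel original image produces feature maps at 1/2, 1/4, 1/8 and 1/16 scale.
encoder = MultiScaleEncoder()
maps = encoder(torch.randn(1, 3, 256, 256))
print([m.shape for m in maps])
```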
Step S204, performing first decoding processing based on the feature maps with different scales to obtain a mask image, wherein the mask image is used for segmenting the target subject from the original image on a semantic level.
The mask image is an image filter template used to identify the subject in the original image; it can shield the remaining parts of the original image so as to screen out the subject in the image. The first decoding processing is used to preliminarily identify the target subject from the original image.
Specifically, the computer device may perform a first decoding process according to the feature maps of the plurality of different scales to identify a target subject in the original image and a position of the target subject in the original image. The feature maps of different scales represent semantic features of different levels, and mask images can be obtained through first decoding processing of the semantic features of different levels, and can be used for determining the positions of the target main body in the original image. The computer device may mask the remaining portions of the original image except for the target subject based on the mask image to screen out the target subject in the original image such that the target subject is initially segmented from the original image at a semantic level by the mask image.
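Continuing the sketch above, one possible (assumed) realisation of the first decoding processing fuses the encoder feature maps from the coarsest to the finest scale through successive first decoding units and predicts a single-channel mask image. The layer choices and names are again illustrative only.

```python
import torch
from torch import nn
import torch.nn.functional as F

class FirstDecodingUnit(nn.Module):
    """One illustrative first decoding unit: fuses the previous decoding result with the
    feature map output by the corresponding encoding unit."""
    def __init__(self, prev_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(prev_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, prev, skip):
        prev = F.interpolate(prev, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return F.relu(self.conv(torch.cat([prev, skip], dim=1)))

class FirstDecoder(nn.Module):
    """Illustrative first decoding structure producing a semantic-level mask image."""
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        w = list(widths)   # encoder channel widths, finest scale first
        self.units = nn.ModuleList([
            FirstDecodingUnit(w[i + 1], w[i], w[i])
            for i in reversed(range(len(w) - 1))
        ])
        self.head = nn.Conv2d(w[0], 1, kernel_size=1)

    def forward(self, feature_maps):
        x = feature_maps[-1]                        # initial input: coarsest feature map
        for unit, skip in zip(self.units, reversed(feature_maps[:-1])):
            x = unit(x, skip)
        return torch.sigmoid(self.head(x))          # mask image, values in [0, 1]

first_decoder = FirstDecoder()
mask_image = first_decoder(maps)   # `maps` from the encoder sketch above
```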
Step S206, performing a second decoding process based on the feature maps and the mask images of the different scales to obtain candidate mask maps of different levels.
The mask map is a single-channel gray-scale image of the same size as the original image, used to represent the position of the subject in the original image. The mask map may be an Alpha map, in which a region with a value of 1 (corresponding to a pixel value of 255) represents the foreground subject, a region with a value of 0 (corresponding to a pixel value of 0) represents the background, and regions with values between 0 and 1 (corresponding to pixel values between 0 and 255) represent translucent parts of the foreground subject, which typically occur at the edges of a person, such as hair, or at hand-held transparent objects. The candidate mask map refers to a mask map obtained by the second decoding processing. The candidate mask maps of a plurality of different levels are candidate mask maps at different levels, which may specifically be candidate mask maps of different scales. The second decoding processing mattes the target subject out of the original image on the basis of the mask image obtained by the first decoding processing.
Specifically, the computer device may perform the second decoding processing based on the feature maps of the different scales and the mask image to identify the target subject in the original image and the edge transition region between the target subject and the background region, so as to obtain a plurality of candidate mask maps of different levels. The candidate mask maps have different levels, i.e., different scales: a large-scale candidate mask map contains more low-level information, for example more texture information of the target subject, while a small-scale candidate mask map contains more high-level semantic information, for example more edge contour information of the target subject, i.e., edge transition information between the target subject and the background region.
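A corresponding (assumed) sketch of the second decoding processing is given below: each second decoding unit receives the feature map of its corresponding encoding unit, the candidate mask map output by the previous unit and the mask image, and emits a candidate mask map at its own level. In this sketch every unit receives the mask image, although, as noted above, the mask image may only form part of the input data for some of the units; layer choices and channel counts are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SecondDecoder(nn.Module):
    """Illustrative second decoding structure emitting candidate mask maps of different levels."""
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        w = list(widths)
        # +2 input channels: previous candidate mask map and the semantic mask image
        self.units = nn.ModuleList([
            nn.Conv2d(w[i] + 2, 16, kernel_size=3, padding=1)
            for i in reversed(range(len(w)))
        ])
        self.heads = nn.ModuleList([nn.Conv2d(16, 1, kernel_size=1) for _ in self.units])

    def forward(self, feature_maps, mask_image):
        candidates, prev = [], torch.zeros_like(mask_image[:, :1])
        for unit, head, feat in zip(self.units, self.heads, reversed(feature_maps)):
            size = feat.shape[-2:]
            prev_r = F.interpolate(prev, size=size, mode="bilinear", align_corners=False)
            mask_r = F.interpolate(mask_image, size=size, mode="bilinear", align_corners=False)
            x = F.relu(unit(torch.cat([feat, prev_r, mask_r], dim=1)))
            prev = torch.sigmoid(head(x))
            candidates.append(prev)     # one candidate mask map per level, smallest scale first
        return candidates

second_decoder = SecondDecoder()
candidate_masks = second_decoder(maps, mask_image)   # using the sketches above
```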
Step S208, updating the candidate mask maps of the different levels according to edge transition features between the target subject and the background region included in each candidate mask map to obtain a target mask map.
Specifically, the computer device may determine edge transition features between the target subject and the background region in each candidate mask map, and update the candidate mask maps of the plurality of different levels according to the edge transition features to obtain the target mask map.
In this embodiment, the computer device may determine the edge transition feature between the target subject and the background region in the candidate mask map of the current level, and update the candidate mask map of the next level based on the edge transition feature in the candidate mask map of the current level, to obtain an updated mask map. The updated mask map is then taken as the candidate mask map of the current level in the next round, and the steps of determining the edge transition feature and updating the candidate mask map of the adjacent next level are continued until the update of the candidate mask map of the last level is completed, yielding the target mask map.
In one embodiment, the candidate mask maps of different levels have different scales: the higher the level of a candidate mask map, the smaller its scale, and the lower the level, the larger its scale.
In one embodiment, the computer device may update a low-level candidate mask map with a high-level candidate mask map; the high-level candidate mask map has a smaller scale and the low-level candidate mask map has a larger scale, i.e., the large-scale candidate mask map is updated with the small-scale candidate mask map.
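The coarse-to-fine update described above can be sketched as follows. The thresholds used to extract the edge transition region and the fusion rule (taking the finer-level prediction inside the transition band) are assumptions for illustration; the morphological processing described earlier could be inserted on the transition band before the fusion step.

```python
import torch
import torch.nn.functional as F

def extract_edge_transition(mask, low=0.05, high=0.95):
    """Illustrative edge-transition extraction: pixels that are neither clearly
    foreground nor clearly background (thresholds are assumed values)."""
    return ((mask > low) & (mask < high)).float()

def coarse_to_fine_update(candidate_masks):
    """candidate_masks: candidate mask maps ordered from the highest level
    (smallest scale) to the lowest level (largest scale)."""
    current = candidate_masks[0]
    for finer in candidate_masks[1:]:
        # enlarge the current-level map to the scale of the next adjacent level
        current = F.interpolate(current, size=finer.shape[-2:],
                                mode="bilinear", align_corners=False)
        transition = extract_edge_transition(current)
        # assumed fusion rule: keep the coarse prediction in confident regions and
        # take the finer-level prediction inside the edge transition region
        current = transition * finer + (1.0 - transition) * current
    return current   # updated mask map of the last level, i.e. the target mask map
```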
Step S210, determining a matting result corresponding to the target subject according to the target mask map.
The matting result is the image of the target subject extracted from the original image.
Specifically, the computer device may perform fusion processing on the target mask map and the original image, thereby matting the target subject out of the original image.
In one embodiment, the computer device may perform the second decoding processing based on the feature maps of the different scales and the mask image to obtain a foreground color map including the target subject, and perform fusion processing on the foreground color map and the target mask map to obtain the matting result corresponding to the target subject. The foreground color map is an image of the same size as the original image and is used to represent the color components belonging to the foreground in the pixels of the original image. Since the target subject belongs to the foreground, the foreground color map represents the color components belonging to the target subject in the pixels of the original image.
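For the fusion of the foreground color map and the target mask map, one simple (assumed) realisation is to multiply the foreground color map by the target mask map and attach the mask as an alpha channel, so that the matting result can later be composited onto any new background. The function names and the compositing rule below are illustrative only.

```python
import torch

def matting_result(foreground_color_map: torch.Tensor,
                   target_mask_map: torch.Tensor) -> torch.Tensor:
    """Fuse the foreground color map (N,3,H,W) with the target mask map (N,1,H,W)
    into an RGBA matting result (N,4,H,W)."""
    return torch.cat([foreground_color_map * target_mask_map, target_mask_map], dim=1)

def composite(matting_rgba: torch.Tensor, new_background: torch.Tensor) -> torch.Tensor:
    """Example use of the matting result: place the extracted subject on a new background."""
    rgb, alpha = matting_rgba[:, :3], matting_rgba[:, 3:]
    return rgb + (1.0 - alpha) * new_background   # rgb already equals foreground * alpha
```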
In this embodiment, feature encoding processing is performed on the original image to obtain a plurality of feature maps of different scales; a large-scale feature map contains more low-level semantic features, and a small-scale feature map contains more high-level semantic features. The first decoding processing is performed based on the feature maps of the different scales, so the target subject in the original image can be identified based on semantic features of different levels, and the position of the target subject in the original image is represented at the semantic level by the mask image so as to preliminarily segment the target subject. The mask image provides strong semantic information, and performing the second decoding processing based on the feature maps of the different scales and the mask image effectively guides the interpretation of the edge transition region between the target subject and the background region. In addition, the candidate mask maps have different levels, and a higher-level candidate mask map contains more edge transition information between the target subject and the background region; updating the candidate mask maps of the different levels according to the edge transition features in the candidate mask maps effectively achieves strong attention regression toward low-level features, so that a target mask map with clearer target-subject texture and finer edge transition details is obtained. Further, the matting result corresponding to the target subject can be accurately obtained according to the target mask map, thereby improving the matting accuracy.
In one embodiment, the method is performed by an image processing model comprising an encoding structure comprised of a plurality of encoding units, a first decoding structure comprised of a plurality of first decoding units, a second decoding structure comprised of a plurality of second decoding units; the encoding structure is used for performing feature encoding processing on the original image, the first decoding structure is used for performing first decoding processing, and the second decoding structure is used for performing second decoding processing.
In particular, the image processing method may be performed by an image processing model, which may be deployed to run on a computer device, for example on a terminal or server. The image processing model includes an encoding structure, a first decoding structure, and a second decoding structure. The encoding structure includes a plurality of encoding units, the first decoding structure includes a plurality of first decoding units, and the second decoding structure includes a plurality of second decoding units.
The computer equipment inputs the obtained original image into the coding structure of the image processing model, and performs feature coding processing on the original image through the coding structure to obtain a plurality of feature images with different scales. Inputting a plurality of feature images with different scales into a first decoding structure, and performing first decoding processing based on the feature images with different scales to obtain a mask image, wherein the mask image is used for segmenting a target subject from an original image on a semantic level. And taking the mask image and the feature images with different scales as input data of a second decoding structure, and performing second decoding processing based on the feature images with different scales and the mask image through the second decoding structure to obtain candidate mask images with different layers.
And the image processing model updates the candidate mask images of different layers according to the edge transition characteristics between the target main body and the background area included in each candidate mask image to obtain the target mask image. And the image processing model determines a matting result corresponding to the target main body according to the target mask image and outputs the matting result.
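For illustration only, the data flow described above may be sketched as follows; the module names (ImageProcessingModel, encoder, first_decoder, second_decoder) and their interfaces are assumptions made for this sketch and are not prescribed by this embodiment.

```python
import torch
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    """Minimal sketch of the encoder / dual-decoder data flow (names are illustrative)."""

    def __init__(self, encoder: nn.Module, first_decoder: nn.Module, second_decoder: nn.Module):
        super().__init__()
        self.encoder = encoder                # encoding structure: multi-scale feature maps
        self.first_decoder = first_decoder    # first decoding structure: mask image
        self.second_decoder = second_decoder  # second decoding structure: candidate mask maps

    def forward(self, image: torch.Tensor):
        # Feature encoding: a list of feature maps at different scales.
        feats = self.encoder(image)
        # First decoding: coarse mask image that segments the target subject on a semantic level.
        mask = self.first_decoder(feats)
        # Second decoding: candidate mask maps of different levels, guided by the mask image.
        candidate_masks = self.second_decoder(feats, mask)
        return mask, candidate_masks
```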
In this embodiment, a plurality of encoding units in the encoding structure correspond to a plurality of first decoding units in the first decoding structure, and a plurality of encoding units in the encoding structure correspond to a plurality of second decoding units in the second decoding structure. It can be understood that the feature map output by each encoding unit is used as input data of the corresponding first decoding unit, and the feature map output by each encoding unit is used as input data of the corresponding second decoding unit.
In one embodiment, the feature map output by each coding unit in the coding structure is of different scale.
In this embodiment, the image processing method is performed by an image processing model including an encoding structure constituted by a plurality of encoding units, a first decoding structure constituted by a plurality of first decoding units, and a second decoding structure constituted by a plurality of second decoding units. Feature encoding processing is performed on the original image through the encoding structure, so that semantic features of different levels can be obtained; the first decoding structure then performs the first decoding processing based on the semantic features of different levels, and the position of the target main body in the original image is initially predicted and embodied through the mask image. The mask image indicates the rough position of the target main body in the original image, and the second decoding structure performs the second decoding processing using the semantic features of different levels together with the mask image, so that the second decoding structure can be effectively guided to pay attention to the interpretation of the edge transition regions between the target main body and the background region; candidate mask maps of different levels are thereby obtained through decoding, and strong attention regression to the low-level features is achieved. The candidate mask map not only comprises the position of the target main body in the original image, but also comprises the feature information of the edge transition regions between the target main body and the background region, so that a target mask map with finer edge transition details can be obtained through the candidate mask maps of different levels.
In one embodiment, performing feature encoding processing on an original image to obtain feature graphs with a plurality of different scales, including:
acquiring an original image, and inputting the original image into a coding structure; sequentially carrying out coding processing by a plurality of coding units in the coding structure based on input data corresponding to the coding units respectively to obtain characteristic diagrams output by the coding units respectively; the input data of the initial coding unit in the coding structure is an original image, and the scales of the feature graphs output by different coding units in the coding structure are different.
Specifically, the image processing model includes a coding structure, and a plurality of coding units may be included in the coding structure. Each coding unit carries out coding processing on the respective input data to obtain a characteristic diagram which is respectively output by each coding unit. The input data of the following coding unit includes the feature map output by the preceding coding unit, and the input data of the initial coding unit in the coding structure is the original image. The feature images output by each coding unit have different scales, and a plurality of feature images with different scales can be obtained through a plurality of coding units of the coding structure.
For example, the computer device may obtain an original image, and input the original image to an initial encoding unit of the encoding structure, where the initial encoding unit performs encoding processing on the original image to obtain a feature map output by the initial encoding unit. The initial coding unit refers to the first coding unit in the coding structure, the feature map output by the initial coding unit is used as the input data of the second coding unit, and the feature map output by the previous coding unit is used as the input data of the next coding unit from the second coding unit, so that the coding units perform coding processing on the respective input data to obtain the feature map output by each coding unit.
In one embodiment, the encoding process includes a convolution process, and may further include at least one of a pooling process and an activation process. The coding unit carries out convolution processing on input data, and carries out at least one of pooling processing and activation processing on a characteristic diagram obtained by the convolution processing, so as to obtain the characteristic diagram output by the coding unit. And taking the characteristic diagram output by the coding unit as input data of the next coding unit.
In one embodiment, the coding unit may include a convolutional layer and may further include at least one of a pooling layer and an activation layer. The convolution layer is used for executing convolution processing, and the pooling layer is used for pooling the feature map output by the convolution layer. The activation layer may be used to activate the feature map output by the convolution layer or to activate the feature map output by the pooling layer.
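As a purely illustrative sketch, one possible form of such an encoding unit (a convolution layer followed by pooling and activation) is given below; the kernel size, pooling factor and activation function are assumptions rather than values fixed by this embodiment.

```python
import torch.nn as nn

class EncodingUnit(nn.Module):
    """One encoding unit: convolution, then pooling and activation (illustrative parameters)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)   # halves the spatial scale
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)    # convolution processing
        x = self.pool(x)    # pooling processing: the output feature map is smaller in scale
        return self.act(x)  # activation processing
```

Stacking several such units yields feature maps whose scales decrease unit by unit, which matches the feature maps of different scales described above.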
In this embodiment, the encoding structure includes a plurality of encoding units, input data of an initial encoding unit in the encoding structure is an original image, and a feature map output by a preceding encoding unit is used as input data of a following encoding unit, so that semantic information in the original image can be extracted layer by layer through different encoding units. The feature images output by each coding unit are different in scale, so that semantic information of different layers in an original image can be represented through the feature images of different scales. The large-scale feature map contains rich low-level information, the small-scale feature map is a deep feature map, and can reflect high-level semantic information so as to decode by combining semantic information of different levels later, and the decoding accuracy is improved.
In one embodiment, performing a first decoding process based on a plurality of feature maps of different scales to obtain a mask image includes:
and based on the characteristic diagrams of a plurality of different scales, sequentially performing first decoding processing through a plurality of first decoding units in the first decoding structure based on the input data corresponding to each first decoding unit until a mask image is output through a last first decoding unit, wherein the input data of the current first decoding unit in the first decoding structure comprises a decoding result corresponding to the previous first decoding unit and the characteristic diagram output by the encoding unit corresponding to the current first decoding unit, and the decoding result corresponding to the current first decoding unit is used for forming the input data of the subsequent first decoding unit.
Specifically, a plurality of encoding units in the encoding structure correspond to a plurality of first decoding units in the first decoding structure, and a feature map output by each encoding unit serves as input data corresponding to the first decoding unit. The input data of the current first decoding unit comprises the feature map output by the corresponding encoding unit and also comprises the decoding result output by the previous decoding unit. The decoding result corresponding to the current first decoding unit is used to construct the input data of the following first decoding unit.
Specifically, the current first decoding unit performs first decoding processing based on current input data to obtain a decoding result output by the current first decoding unit, and takes the decoding result output by the current first decoding unit as input data of a next first decoding unit, so that the next first decoding unit performs first decoding processing based on a previous decoding result and a corresponding feature map until a decoding result output by a last first decoding unit is obtained. The decoding result output by the last first decoding unit is the mask image.
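A minimal sketch of this decoding loop is given below, assuming the encoder feature maps are ordered from shallow (large scale) to deep (small scale) and that each first decoding unit fuses its inputs by upsampling and channel concatenation; these are assumptions of the sketch, not requirements of this embodiment.

```python
import torch
import torch.nn.functional as F

def first_decode(feats, first_decoding_units):
    """feats: encoder feature maps ordered shallow -> deep; returns the mask image (sketch)."""
    # Start from the deepest feature map and decode back towards the larger scales.
    result = feats[-1]
    for unit, skip in zip(first_decoding_units, reversed(feats[:-1])):
        # Enlarge the previous decoding result to the scale of the corresponding encoder feature map.
        result = F.interpolate(result, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        # Input data of the current unit: previous decoding result + corresponding feature map.
        result = unit(torch.cat([result, skip], dim=1))
    return result  # decoding result of the last first decoding unit, i.e. the mask image
```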
In this embodiment, the input data of the current first decoding unit includes the decoding result of the previous first decoding unit and the feature map output by the corresponding encoding unit, so that the current first decoding unit decodes again based on the previous decoding result and the corresponding encoding feature map, and the feature map characterizes the semantic information of the original image, so that the target subject in the original image can be identified through multiple times of decoding, and the position of the target subject in the original image is identified through the mask image on the semantic level, so as to initially segment the target subject.
In one embodiment, performing a second decoding process based on the feature maps and the mask images of the plurality of different scales to obtain candidate mask maps of the plurality of different levels, including:
determining input data corresponding to each second decoding unit in the second decoding structure, wherein the input data corresponding to the second decoding units comprises a feature map output by a corresponding encoding unit and a candidate mask map output by a previous second decoding unit before the second decoding units, and at least part of the input data corresponding to the second decoding units also comprises mask images; and sequentially performing second decoding processing by a plurality of second decoding units in the second decoding structure based on the input data corresponding to the second decoding units, so as to obtain candidate mask graphs output by the second decoding units respectively, wherein the levels of the candidate mask graphs output by the second decoding units are different.
Specifically, the second decoding structure includes a plurality of second decoding units, and the input data corresponding to each second decoding unit includes a feature map output by a corresponding encoding unit and a candidate mask map output by a previous second decoding unit before the second decoding unit. And at least part of the input data corresponding to the second decoding unit also comprises a mask image obtained by the first decoding process. And each second decoding unit performs second decoding processing based on the respective input data to obtain candidate mask graphs respectively output by each second decoding processing unit. And, the levels of the candidate mask patterns outputted by each second decoding unit are different.
Further, the image processing model may determine input data corresponding to the current second decoding unit in the second decoding structure, and when the input data of the current second decoding unit includes the feature map and the candidate mask map, perform second decoding processing based on the feature map and the candidate mask map, to obtain a candidate mask map output by the current second decoding unit, and use the candidate mask map as input data of the next second decoding unit. And when the input data of the current second decoding unit comprises the feature map, the candidate mask map and the mask image, performing second decoding processing based on the feature map, the candidate mask map and the mask image to obtain the candidate mask map output by the current second decoding unit, and taking the candidate mask map as the input data of the next second decoding unit until the candidate mask map respectively output by each second decoding unit is obtained.
In one embodiment, when the input data of the second decoding unit includes the feature map and the candidate mask map, the feature map and the candidate mask map are fused to obtain the candidate mask map output by the second decoding unit. When the input data of the second decoding unit comprises the feature map, the candidate mask map and the mask image, the feature map and the mask image can be fused to obtain fusion features, and then the fusion features and the candidate mask map are fused to obtain the candidate mask map output by the second decoding unit. And according to the same processing mode, obtaining the candidate mask map output by each second decoding unit respectively.
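The input handling described above may be sketched as follows; the concatenation-based fusion and the resizing of the inputs to a common scale are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def second_decode_step(unit, feat, prev_candidate, mask_image=None):
    """One second decoding unit (sketch): fuse the encoder feature map, the previous
    candidate mask map and, for part of the units, the mask image from the first decoding."""
    prev_candidate = F.interpolate(prev_candidate, size=feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
    if mask_image is not None:
        mask_image = F.interpolate(mask_image, size=feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        feat = torch.cat([feat, mask_image], dim=1)   # fuse feature map with the mask image
    x = torch.cat([feat, prev_candidate], dim=1)      # then fuse with the candidate mask map
    return unit(x)                                    # candidate mask map output by this unit
```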
In this embodiment, the input data corresponding to a part of the second decoding units in the second decoding structure includes a feature map output by a corresponding encoding unit and a candidate mask map output by a previous second decoding unit before the second decoding unit, so that a part of the second decoding units combine the corresponding encoding features and the candidate mask map obtained by the previous decoding to further decode, so as to obtain a more accurate candidate mask map. And at least a part of input data corresponding to the second decoding unit also comprises a mask image, the mask image can provide stronger semantic information, the second decoding unit is guided to focus on the edge transition region decoding, and the interpretation accuracy of the edge transition region of the target main body and the background region is further improved. And, the levels of the candidate mask graphs output by the second decoding units are different, the candidate mask graph of the lower level contains more texture information of the target main body, and the candidate mask graph of the higher level contains more edge transition information between the target main body and the background area, so that the target mask graph with clearer target main body texture and clearer edge transition can be obtained based on the candidate mask graphs of a plurality of different levels.
In one embodiment, updating the candidate mask map of the plurality of different levels according to edge transition characteristics between the target main body and the background area included in each candidate mask map to obtain the target mask map includes:
amplifying the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level to obtain an amplified candidate mask map; extracting an edge transition image between a target main body and a background area in the amplified candidate mask image, wherein the edge transition image comprises edge transition characteristics; fusion processing is carried out on the basis of the edge transition image and the candidate mask image of the next adjacent layer, so that an updated mask image is obtained; taking the updated mask pattern as a candidate mask pattern of the current layer in the next round, returning to the step of amplifying the candidate mask pattern of the current layer to the same scale as the candidate mask pattern of the adjacent next layer, and continuing to execute until the updated mask pattern corresponding to the candidate mask pattern of the last layer is obtained; and taking the updated mask diagram corresponding to the candidate mask diagram of the last layer as a target mask diagram.
Specifically, the scales of the candidate mask patterns of different levels are different, and the scale corresponding to the candidate mask pattern with the higher level is smaller, and the scale corresponding to the candidate mask pattern with the lower level is larger.
The levels corresponding to the candidate mask patterns of a plurality of different levels are sequentially reduced, and the corresponding scales are sequentially increased. The computer device may determine a candidate mask map of a current level and determine a candidate mask map of a next level adjacent to the candidate mask map of the current level, the current level being higher than the next level adjacent, i.e., the scale corresponding to the current level being smaller than the scale corresponding to the next level adjacent.
The computer equipment enlarges the candidate mask map of the current layer to the same scale as the candidate mask map of the adjacent next layer, and obtains the enlarged candidate mask map. The computer device determines edge transition features between the target subject and the background region in the enlarged candidate mask map, and extracts an edge transition image between the target subject and the background region from the enlarged candidate mask map based on the edge transition features. And the computer equipment performs fusion processing on the edge transition image and the candidate mask map of the next adjacent layer to obtain an updated mask map. Then, the computer equipment takes the updated mask pattern as the candidate mask pattern of the current layer in the next round, returns to the step of enlarging the candidate mask pattern of the current layer to the same scale as the candidate mask pattern of the adjacent next layer and continues to execute until the updated mask pattern corresponding to the candidate mask pattern of the last layer is obtained. The computer equipment takes the updated mask map corresponding to the candidate mask map of the last layer as a target mask map.
In one embodiment, the computer device may select a candidate mask map of a highest level as a candidate mask map of a current level, to gradually update a candidate mask map of a lower level from the candidate mask map of the highest level until the candidate mask map of the lowest level is updated, and take a mask map obtained by updating the candidate mask map of the lowest level as a target mask map.
In one embodiment, the candidate mask patterns of the different layers are candidate mask patterns of different scales. Amplifying the candidate mask map of the current scale to the same scale as the candidate mask map of the next adjacent scale to obtain an amplified candidate mask map, wherein the current scale is smaller than the next adjacent scale; extracting an edge transition image between a target main body and a background area in the amplified candidate mask image, wherein the edge transition image comprises edge transition characteristics; fusion processing is carried out on the basis of the edge transition image and the candidate mask image of the adjacent next scale, so that an updated mask image is obtained; taking the updated mask pattern as a candidate mask pattern of the current scale in the next round, returning to the step of amplifying the candidate mask pattern of the current scale to the same scale as the candidate mask pattern of the adjacent next scale, and continuing to execute until the updated mask pattern corresponding to the candidate mask pattern of the last scale is obtained; and taking the updated mask map corresponding to the candidate mask map of the last scale as a target mask map.
In this embodiment, the low-level candidate mask map includes more bottom layer information, i.e., includes more texture information of the target subject, and the high-level candidate mask map includes more depth information, i.e., includes more edge transition information between the target subject and the background region. The candidate mask map of the current level is enlarged to the same scale as the candidate mask map of the next level adjacent to the candidate mask map of the current level, so that an edge transition image between a target main body and a background area in the enlarged candidate mask map is extracted, and edge transition information can be extracted from the candidate mask map of the high level. And the edge transition information in the low-level candidate mask image can be updated through the high-level edge transition information based on the fusion processing of the edge transition image and the candidate mask image of the next adjacent level, so that edge transition details between a target main body and a background area in the updated low-level candidate mask image are clearer, meanwhile, enough texture information of the target main body can be reserved, and the accuracy of the matting is improved.
In one embodiment, the fusing process is performed based on the edge transition image and the candidate mask map of the next adjacent layer to obtain an updated mask map, which includes:
Performing morphological processing on the edge transition image to obtain a processed edge transition image, wherein the morphological processing comprises at least one of corrosion operation and expansion operation; and carrying out fusion processing on the processed edge transition image and the candidate mask image of the next adjacent layer to obtain an updated mask image.
Wherein morphological processing refers to processing of the shape features of an image, and may include at least one of an erosion operation and a dilation operation. The morphological processing may also be an opening operation, which refers to erosion followed by dilation, or a closing operation, which refers to dilation followed by erosion. Dilation enlarges the boundary of each connected component of 1-valued pixels in the mask map by one layer, so as to fill holes (isolated 0-valued pixels) inside the edge. Erosion removes the boundary points of the connected components of 1-valued pixels in the mask map so as to shrink them by one layer, which extracts the backbone information, removes burrs and removes isolated 1-valued noise pixels.
Specifically, the computer device may perform at least one of a corroding operation and an expanding operation on the edge transition image to obtain a processed edge transition image. The computer device may determine an edge transition region in the candidate mask map of the next adjacent level, and fuse the processed edge transition image with the edge transition region in the candidate mask map of the next adjacent level to obtain an updated mask map.
Further, the computer device may perform a filtering operation on the morphologically processed edge transition image to obtain a processed edge transition image. The filtering operation may be median filtering.
In one embodiment, the computer device may perform an erosion operation on the edge transition image before a dilation operation, so as to remove noise. The computer device may then perform guided filtering processing on the morphologically processed edge transition image to achieve edge-preserving filtering, thereby obtaining the processed edge transition image.
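As an illustrative sketch with OpenCV (the kernel size, iteration counts and filter radius are assumptions), the processing of the edge transition image could look like this; the guided-filter call assumes the opencv-contrib ximgproc module is available.

```python
import cv2
import numpy as np

def process_edge_transition(edge_img: np.ndarray, guide: np.ndarray) -> np.ndarray:
    """edge_img, guide: float32 arrays in [0, 1]. Erode then dilate the edge transition image,
    median-filter it, and apply guided filtering for a softer edge (illustrative parameters)."""
    kernel = np.ones((3, 3), np.uint8)
    x = cv2.erode(edge_img, kernel, iterations=1)   # erosion first to remove noise
    x = cv2.dilate(x, kernel, iterations=2)         # dilation to obtain an expanded edge region
    x = cv2.medianBlur(x, 5)                        # median filtering
    # Guided filtering (edge-preserving smoothing); requires opencv-contrib (cv2.ximgproc).
    x = cv2.ximgproc.guidedFilter(guide=guide, src=x, radius=8, eps=1e-4)
    return x
```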
In this embodiment, the morphological processing and the filtering processing ensure that the obtained edge transition image has few or no noise points and a softer edge. Fusing the processed edge transition image with the candidate mask map of the adjacent next level therefore makes the edge transition details of the target main body in the resulting updated mask map finer and more natural.
In one embodiment, the method further comprises: performing a second decoding process based on the feature maps and the mask images of the plurality of different scales to obtain a foreground color map including the target subject;
determining a matting result corresponding to the target main body according to the target mask map, including: and carrying out fusion processing on the foreground color map and the target mask map to obtain a matting result corresponding to the target main body.
Specifically, the computer device may perform a second decoding process based on the feature maps and the mask images of a plurality of different scales, so as to identify color components belonging to the foreground in the original image, and obtain a foreground color map. The target subject belongs to a foreground region in the original image, and the foreground color map comprises color components corresponding to the target subject in the original image. And the computer equipment performs fusion processing on the foreground color map and the target mask map to obtain a matting result corresponding to the target main body.
The original image may be regarded as being composed of the following parts according to the formula below:
C=A*F+(1-A)*B
wherein C is an original image, A is an Alpha image, F is a foreground color image, and B is a background image. The Alpha map is shown in fig. 3, and the corresponding foreground color map is shown in fig. 4.
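A small NumPy illustration of this formula, and of how a new background can then be composited using the predicted Alpha map and foreground color map (the array shapes are assumptions):

```python
import numpy as np

# C = A * F + (1 - A) * B: compose an image from the Alpha map, foreground and background.
def composite(alpha: np.ndarray, fg: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """alpha: HxWx1 in [0, 1]; fg, bg: HxWx3. Returns the composited image C."""
    return alpha * fg + (1.0 - alpha) * bg

# Matting result placed onto a new (e.g. virtual) background using predicted alpha and foreground.
def replace_background(alpha: np.ndarray, fg: np.ndarray, new_bg: np.ndarray) -> np.ndarray:
    return alpha * fg + (1.0 - alpha) * new_bg
```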
In one embodiment, the image processing model includes an encoding structure composed of a plurality of encoding units, a first decoding structure composed of a plurality of first decoding units, a second decoding structure composed of a plurality of second decoding units; the encoding structure is used for performing feature encoding processing on the original image, the first decoding structure is used for performing first decoding processing, and the second decoding structure is used for performing second decoding processing.
Performing a second decoding process based on the feature maps and the mask images of the plurality of different scales to obtain a foreground color map including the target subject, including: determining input data corresponding to each second decoding unit in the second decoding structure, wherein the input data corresponding to the second decoding units comprises a feature map output by a corresponding encoding unit and foreground color features output by a previous second decoding unit before the second decoding units, and at least part of the input data corresponding to the second decoding units also comprises mask images; and sequentially performing second decoding processing by a plurality of second decoding units in the second decoding structure based on the input data corresponding to each second decoding unit until a foreground color map output by the last second decoding unit is obtained.
In one embodiment, when the input data of the second decoding unit includes the feature map and the foreground color feature of the corresponding encoding unit, the feature map and the foreground color feature are fused to obtain the foreground color feature output by the second decoding unit, and the foreground color feature output by the second decoding unit is used as the input of the next second decoding unit. When the input data of the second decoding unit comprises a feature map, a foreground color feature and a mask image, the feature map and the mask image can be subjected to fusion processing to obtain a fusion feature, the fusion feature and the foreground color feature are subjected to fusion processing to obtain the foreground color feature output by the second decoding unit, and the foreground color feature output by the second decoding unit is used as the input of the next second decoding unit until the foreground color map is output by the last second decoding unit.
In one embodiment, the second decoding structure may include processing parameters corresponding to the mask map and processing parameters corresponding to the foreground color map. When candidate mask maps of a plurality of different levels need to be obtained, the processing parameters corresponding to the mask map can be selected, so that the second decoding structure performs the second decoding processing on the feature maps of different scales and the mask image based on the processing parameters corresponding to the mask map, and obtains the candidate mask maps of a plurality of different levels. When the foreground color map needs to be obtained, the processing parameters corresponding to the foreground color map can be selected, so that the second decoding structure performs the second decoding processing on the feature maps of different scales and the mask image based on the processing parameters corresponding to the foreground color map, and obtains the foreground color map.
In one embodiment, the second decoding structure may include a first branch and a second branch, the processing parameters of the first branch and the second branch being different. The first branch is used for performing second decoding processing based on the feature images and the mask images of a plurality of different scales to obtain candidate mask images of a plurality of different levels. The second branch is used for carrying out second decoding processing based on the feature images and the mask images with different scales to obtain a foreground color image comprising the target main body.
In this embodiment, the first branch and the second branch have the same structure, but the processing parameters are different, that is, the first branch and the second branch are each formed by a plurality of second decoding units, but the processing parameters of the second decoding units in the first branch are different from the processing parameters of the second decoding units in the second branch.
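One way to realize two branches with the same structure but separate processing parameters is sketched below; the constructor argument and the branch names are assumptions for illustration.

```python
import copy
import torch.nn as nn

class SecondDecodingStructure(nn.Module):
    """Sketch: two branches with identical structure but independent processing parameters."""

    def __init__(self, branch_template: nn.Module):
        super().__init__()
        self.alpha_branch = branch_template                       # outputs candidate mask maps
        self.foreground_branch = copy.deepcopy(branch_template)   # same structure, separate weights

    def forward(self, feats, mask_image):
        candidate_masks = self.alpha_branch(feats, mask_image)
        foreground_color = self.foreground_branch(feats, mask_image)
        return candidate_masks, foreground_color
```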
In this embodiment, the second decoding process is performed based on the feature maps of different scales and the mask image, so that the color components corresponding to the foreground region in the original image can be extracted accurately, and in particular the color components corresponding to the edge transition regions between the foreground and the background can be obtained accurately, so as to obtain the foreground color map including the target subject. Fusing the foreground color map with the target mask map makes the edge details of the obtained matting result richer and the edge colors more accurate.
As shown in fig. 5, a flowchart of an image processing method in one embodiment is shown. The image processing model includes a shared feature Encoder (Encoder), a segmentation Decoder (Decoder), and a matting Decoder. The shared feature encoder is the encoding structure, the segmentation decoder is the first decoding structure, and the matting decoder is the second decoding structure. The original RGB image (namely the original image) is input into the shared feature encoder, and feature encoding processing is carried out on the original RGB image through the shared feature encoder to obtain feature maps of different scales. The feature maps of different scales are respectively taken as input data of the segmentation decoder and the matting decoder. The segmentation decoder performs segmentation decoding processing on the feature maps of different scales to obtain a rough portrait segmentation Mask image, namely the Mask image. Then, the rough portrait segmentation Mask is used as input data of the matting decoder, the matting decoder carries out matting decoding processing based on the feature maps of different scales and the rough portrait segmentation Mask, and finally the matting decoder outputs a fine matting Alpha map and a portrait foreground color map. The fine matting Alpha map is the target mask map. The architecture diagram of the image processing model may be as shown in fig. 6.
As shown in fig. 6, the shared feature encoder is composed of a convolution layer, a pooling layer and an activation layer, in the convolution process, the size of a feature map becomes smaller as the network deepens, and features with different scales represent semantic information of different layers. Large feature maps contain rich low-level information such as color, edge, geometry, and texture information. Small scale features, i.e., deep feature maps, more reflect high level semantic features, such as foreground and background differentiation. The shared feature encoder design in this embodiment has sufficient depth to ensure that it has the ability to mine for valid features.
In one embodiment, the coding structure is made up of a plurality of coding units, i.e. the shared feature encoder is made up of a plurality of coding units, which are made up of a convolutional layer, a pooling layer and an activation layer.
Because feature extraction has a certain generality, and the segmentation task and the matting task have a certain similarity, this embodiment only needs to design one encoder so that the two tasks share features. The shared feature encoder also reduces redundant network computation, so that real-time performance can still be ensured when inference for both tasks is completed. The shared feature encoder is finally followed by an Atrous Spatial Pyramid Pooling (ASPP) unit to help segment objects of different sizes.
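For reference, a compact ASPP unit of the kind referred to here might look as follows; the dilation rates and channel counts are assumptions and not values specified by this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions plus image-level pooling."""

    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        size = x.shape[-2:]
        outs = [branch(x) for branch in self.branches]           # dilated convolution branches
        pooled = F.interpolate(self.image_pool(x), size=size,    # image-level context branch
                               mode="bilinear", align_corners=False)
        outs.append(pooled)
        return self.project(torch.cat(outs, dim=1))
```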
The segmentation decoder is also composed of convolution layers, pooling layers and activation layers; during decoding, the feature maps are enlarged layer by layer, and the feature map of each layer is derived from the feature maps of the shared feature encoder through skip connections. In this embodiment, the segmentation output layer is restored only to 1/2 of the scale of the original image, so as to reduce the network inference time. The segmentation decoder finally outputs a Mask map, i.e. the Mask image, in which positions with value 1 represent the target subject and positions with value 0 represent the background region.
In one embodiment, the first decoding structure is constituted by a plurality of first decoding units, i.e. the segmentation decoder is constituted by a plurality of first decoding units, and each first decoding unit is constituted by a convolution layer, a pooling layer and an activation layer.
The matting decoder needs to take the Mask image output by the segmentation decoder and the feature maps of different scales as inputs. The second decoding structure is composed of a plurality of second decoding units, namely the matting decoder is composed of a plurality of second decoding units, and the plurality of second decoding units are connected with the plurality of encoding units through skip connection structures. The input data corresponding to a second decoding unit comprises the feature map output by the corresponding encoding unit and the candidate Mask map output by the previous second decoding unit, and at least part of the input data corresponding to the second decoding units also comprises the Mask map output by the segmentation decoder. When the input data of a second decoding unit comprises the feature map, the candidate Mask map and the Mask map, the feature map and the Mask map can be fused to obtain fusion features, and then the fusion features and the candidate Mask map are fused to obtain the candidate Mask map output by the second decoding unit. In the fusion process, the segmentation mask can provide strong semantic information and guide the matting decoder to focus on interpreting edge position information. According to the same processing mode, the candidate mask map output by each second decoding unit is obtained. A fine Alpha map, i.e. the target mask map, may be derived based on each candidate mask map.
And the matting decoder takes the Mask image and the feature images with different scales output by the segmentation decoder as inputs to decode, and can also obtain the foreground color image of the target main body.
The deep feature maps provide deep semantic information, and the shallow feature maps provide low-level semantic information; that is, the larger the scale, the clearer textures such as portrait hair strands become, and the smaller the scale, the more complete the subject contour, such as the outline of a portrait, becomes. In order to both preserve contours and add fine edge textures while enlarging the feature map scale, the following fine-scale magnification strategy may be used to scale up the candidate alpha maps of different scales from small to large:
1) Enlarge the small-scale alpha map to the same size as the large-scale alpha map to obtain the alpha_s map; the alpha_s map is shown in (a) of fig. 7;
2) Extract the edge positions alpha_small of the small-scale alpha_s map, i.e. the positions where 0 < alpha_small < 1;
3) Perform morphological processing and median filtering on the extracted edge position image of the small-scale alpha_s map to obtain a complete, smooth and expanded edge region mask_edge, i.e. the edge transition image, as shown in (b) of fig. 7;
4) Update the large-scale alpha map by the following formula to obtain the updated mask map alpha_big:
alpha_big = alpha_big * mask_edge + alpha_small * (1 - mask_edge)
5) Take the updated mask map alpha_big as the small-scale alpha map of the next round and continue to execute steps 1)-4) until the update of the last large-scale alpha map is completed, obtaining the target mask map.
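A minimal NumPy/OpenCV sketch of the above strategy is given below, under the assumption that the candidate alpha maps are float arrays in [0, 1] ordered from the smallest to the largest scale; the edge extraction, kernel size and filter size are illustrative, and the enlarged small-scale map alpha_s is used outside the edge region.

```python
import cv2
import numpy as np

def refine_alpha(alpha_maps):
    """alpha_maps: candidate alpha maps ordered from small scale to large scale (float32, [0, 1])."""
    alpha_small = alpha_maps[0]
    for alpha_big in alpha_maps[1:]:
        h, w = alpha_big.shape[:2]
        # 1) enlarge the small-scale alpha map to the size of the large-scale alpha map
        alpha_s = cv2.resize(alpha_small, (w, h), interpolation=cv2.INTER_LINEAR)
        # 2) extract the edge positions, i.e. where 0 < alpha < 1
        edge = ((alpha_s > 0.0) & (alpha_s < 1.0)).astype(np.float32)
        # 3) morphological processing and median filtering -> expanded edge region mask_edge
        kernel = np.ones((5, 5), np.uint8)
        mask_edge = cv2.dilate(edge, kernel, iterations=2)
        mask_edge = cv2.medianBlur(mask_edge, 5)
        # 4) keep fine large-scale detail inside the edge region, the small-scale contour elsewhere
        alpha_big = alpha_big * mask_edge + alpha_s * (1.0 - mask_edge)
        # 5) the updated map becomes the small-scale alpha map of the next round
        alpha_small = alpha_big
    return alpha_small  # update of the last large-scale alpha map, i.e. the target mask map
```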
In one embodiment, the method is performed by an image processing model, the image processing model being obtained by a training step comprising:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image; determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure; performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed; performing second training on the initial second decoding structure based on the training-completed coding structure and the first decoding structure through the second sample image set until a second stopping condition is met, and obtaining a training-completed second decoding structure so as to obtain a training-completed image processing model; the trained image processing model is used for determining a matting result corresponding to the target subject in the original image.
The first sample image and the second sample image may each be any one of an RGB (Red, Green, Blue) image, a gray-scale image, a depth image, an image corresponding to the Y component in a YUV image, and the like, but are not limited thereto. The matting label is a true mask corresponding to the second sample image, or a true value representing the mask corresponding to the second sample image. Meeting the first stop condition may be reaching a preset number of training rounds, reaching a preset number of iterations, a loss value less than or equal to a loss threshold, and so forth.
In particular, the computer device may acquire a first set of sample images and a second set of sample images. The first sample image set is a set including first sample images and a respective corresponding split label for each first sample image. The second sample image set is a set comprising second sample images and a matting label corresponding to each second sample image. The segmentation labels are real mask images corresponding to the first sample image or real values characterizing the mask images corresponding to the first sample image.
In one embodiment, the same image may be present in the first sample image set and the second sample image set, which may include both composite and non-composite images. The composite image may be a fusion of a subject region in one image and a background region in another image.
The computer device may determine an image processing model to be trained that includes an initial encoding structure, an initial first decoding structure, and an initial second decoding structure. The initial encoding structure is used for performing feature encoding processing on the sample image, the initial first decoding structure is used for performing first decoding processing, and the initial second decoding structure is used for performing second decoding processing.
In one embodiment, the initial encoding structure is comprised of a plurality of encoding units, the initial first decoding structure is comprised of a plurality of first decoding units, and the initial second decoding structure is comprised of a plurality of second decoding units.
The computer equipment carries out first training on the initial coding structure and the initial first decoding structure through each first sample image and the corresponding segmentation label in the first sample image set so as to adjust parameters of the initial coding structure and the initial first decoding structure in the first training process, and continues training after the parameters are adjusted until the first stopping condition is met, so that the trained coding structure and the trained first decoding structure are obtained.
Specifically, the computer equipment performs second training on the initial second decoding structure through each second sample image in the second sample image set and the corresponding matting label, so as to adjust parameters of the initial second decoding structure in a second training process, and continues training after the parameters are adjusted until a second stopping condition is met, so that the trained second decoding structure is obtained. The trained encoded structure, the first decoded structure, and the second decoded structure form a trained image processing model. The trained image processing model is used for determining a matting result corresponding to a target subject in the original image.
In this embodiment, a first sample image set and a second sample image set are obtained; the first sample image set includes first sample images and the segmentation labels corresponding to the first sample images, and the second sample image set includes second sample images and the matting labels corresponding to the second sample images. An image processing model to be trained is determined, which includes an initial encoding structure, an initial first decoding structure and an initial second decoding structure. The initial encoding structure and the initial first decoding structure are first trained through the first sample image set until the first stopping condition is met, so that the trained encoding structure and first decoding structure are obtained. On the basis of ensuring the processing precision of the encoding structure and the first decoding structure, the initial second decoding structure is then trained through the second sample image set based on the trained encoding structure and first decoding structure until the second stopping condition is met, so that the trained second decoding structure is obtained. The trained encoding structure, first decoding structure and second decoding structure form the trained image processing model, so that the image processing model can accurately determine the matting result corresponding to the target subject in the original image. Moreover, using the image processing model to determine the matting result corresponding to the target main body in the original image can effectively improve matting efficiency.
In one embodiment, the image processing method may be applied to video conference or live broadcast scenes. Image frames are acquired in real time in a video conference, the portrait in each image frame is matted out through the above processing method, and the matted portrait is synthesized with a virtual background in real time, so that the virtual background can replace the real background of the person in the video conference or live broadcast, which effectively protects the privacy of the user. The synthesized image frame obtained by the image processing method handles the details of the portrait edge more naturally; in particular, the transition of hair strands is more natural and fine. When the foreground color map is also used, the problem that a ring of artifacts may appear around the portrait when a dark background map is used can be effectively avoided, which improves the virtual background effect and the visual perception of the user. In addition, using a virtual background in a live broadcast scene can reduce cost, and live broadcasting is no longer restricted by the venue.
In one embodiment, the image processing method can be applied to a cloud rendering platform, mainly in studios and in the home environment of an anchor. The portrait of the anchor is matted out in real time through the image processing method and synthesized in real time with the background image to be used, so that the use of venues is not restricted and the cost can be effectively reduced.
In one embodiment, the image processing method can be applied to video explanation scenes, where the aim is to remove the background in the anchor's video stream when it is mixed with the explained video stream, so that the anchor can be seamlessly fused into the scene of the explained video.
In one embodiment, the image processing method can be applied to any scene needing to be scratched, such as poster design activities, and the automatic scratching can be realized through the image processing method, so that the time of manual processing can be reduced, and the working efficiency can be improved.
It should be noted that, the application scenarios mentioned in the foregoing embodiments are not limited to the image processing method according to the embodiments of the present application, but are merely used to schematically illustrate the application scenarios of the image processing method provided by the embodiments of the present application.
In one embodiment, as shown in fig. 8, a training method of an image processing model is provided, and the method is applied to a computer device (the computer device may be a terminal or a server in fig. 1) for illustration, and includes the following steps:
step S802, a first sample image set and a second sample image set are acquired, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image.
In particular, the computer device may acquire a first set of sample images and a second set of sample images. The first sample image set is a set including first sample images and a respective corresponding split label for each first sample image. The second sample image set is a set comprising second sample images and a matting label corresponding to each second sample image. The segmentation labels are real mask images corresponding to the first sample image or real values characterizing the mask images corresponding to the first sample image.
In one embodiment, the same image may be present in the first sample image set and the second sample image set, which may include both composite and non-composite images. The composite image may be a fusion of a subject region in one image and a background region in another image.
In step S804, an image processing model to be trained is determined, and the image processing model to be trained includes an initial encoding structure, an initial first decoding structure, and an initial second decoding structure.
In particular, the computer device may determine an image processing model to be trained that includes an initial encoding structure, an initial first decoding structure, and an initial second decoding structure. The initial encoding structure is used for performing feature encoding processing on the sample image, the initial first decoding structure is used for performing first decoding processing, and the initial second decoding structure is used for performing second decoding processing.
In one embodiment, the initial encoding structure is comprised of a plurality of encoding units, the initial first decoding structure is comprised of a plurality of first decoding units, and the initial second decoding structure is comprised of a plurality of second decoding units.
Step S806, performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until the first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed.
Wherein meeting the first stopping condition may be reaching a preset number of training rounds, reaching a preset number of iterations, a loss value smaller than or equal to a loss threshold, and the like.
Specifically, the computer device performs first training on the initial coding structure and the initial first decoding structure through each first sample image in the first sample image set and the corresponding segmentation label, so as to adjust parameters of the initial coding structure and the initial first decoding structure in a first training process, and continues training after the parameters are adjusted until a first stopping condition is met, and then the trained coding structure and the trained first decoding structure are obtained.
For example, when the training of the initial coding structure and the initial first decoding structure does not reach the preset training times, the parameters of the initial coding structure and the initial first decoding structure are adjusted and training is continued until the training reaches the preset training times, and the coding structure and the first decoding structure after the training is completed are obtained.
Step S808, performing second training on the initial second decoding structure based on the trained encoding structure and the first decoding structure through the second sample image set until a second stopping condition is met, and obtaining a trained second decoding structure so as to obtain a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in the original image.
Wherein meeting the second stopping condition may be reaching a preset number of training rounds, reaching a preset number of iterations, a loss value smaller than or equal to a loss threshold, and the like.
Specifically, the computer equipment performs second training on the initial second decoding structure through each second sample image in the second sample image set and the corresponding matting label, so as to adjust parameters of the initial second decoding structure in a second training process, and continues training after the parameters are adjusted until a second stopping condition is met, so that the trained second decoding structure is obtained. The trained encoded structure, the first decoded structure, and the second decoded structure form a trained image processing model. The trained image processing model is used for determining a matting result corresponding to a target subject in the original image.
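A high-level sketch of this two-stage procedure is given below; the optimizer, learning rate, loss functions and the freezing of the first-stage parameters during the second stage are assumptions of the sketch (consistent with only the second decoding structure being adjusted), and the attribute names follow the earlier model sketch.

```python
import torch

def train_two_stages(model, seg_loader, mat_loader, seg_loss_fn, mat_loss_fn, epochs1, epochs2):
    """Stage 1: encoder + first decoder on segmentation labels.
       Stage 2: second decoder on matting labels, first-stage parameters kept fixed (sketch)."""
    opt1 = torch.optim.Adam(list(model.encoder.parameters()) +
                            list(model.first_decoder.parameters()), lr=1e-4)
    for _ in range(epochs1):                      # first training, until the first stop condition
        for image, seg_label in seg_loader:
            feats = model.encoder(image)
            mask = model.first_decoder(feats)
            loss = seg_loss_fn(mask, seg_label)
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    for p in model.encoder.parameters():          # keep the trained encoder / first decoder fixed
        p.requires_grad = False
    for p in model.first_decoder.parameters():
        p.requires_grad = False

    opt2 = torch.optim.Adam(model.second_decoder.parameters(), lr=1e-4)
    for _ in range(epochs2):                      # second training, until the second stop condition
        for image, matting_label in mat_loader:
            feats = model.encoder(image)
            mask = model.first_decoder(feats)
            pred_mask = model.second_decoder(feats, mask)
            loss = mat_loss_fn(pred_mask, matting_label)
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```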
In this embodiment, a first sample image set and a second sample image set are obtained; the first sample image set includes first sample images and the segmentation labels corresponding to the first sample images, and the second sample image set includes second sample images and the matting labels corresponding to the second sample images. An image processing model to be trained is determined, which includes an initial encoding structure, an initial first decoding structure and an initial second decoding structure. The initial encoding structure and the initial first decoding structure are first trained through the first sample image set until the first stopping condition is met, so that the trained encoding structure and first decoding structure are obtained. On the basis of ensuring the processing precision of the encoding structure and the first decoding structure, the initial second decoding structure is then trained through the second sample image set based on the trained encoding structure and first decoding structure until the second stopping condition is met, so that the trained second decoding structure is obtained. The trained encoding structure, first decoding structure and second decoding structure form the trained image processing model, so that the image processing model can accurately determine the matting result corresponding to the target subject in the original image. Moreover, using the image processing model to determine the matting result corresponding to the target main body in the original image can effectively improve matting efficiency.
In one embodiment, performing a first training on the initial encoding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, to obtain a trained encoding structure and a trained first decoding structure, including:
performing feature coding processing on the first sample image through an initial coding structure to obtain a plurality of first feature images with different scales; performing first decoding processing on the basis of a plurality of first feature images with different scales through an initial first decoding structure to obtain a first mask image corresponding to a first sample image; training the initial coding structure and the initial first decoding structure according to the difference between the segmentation labels corresponding to the first mask image and the first sample image until the first stopping condition is met, and obtaining the coding structure and the first decoding structure after training.
Specifically, the computer device may perform feature encoding processing on the first sample image through the initial encoding structure to obtain a first feature map, and perform next feature encoding processing on the obtained first feature map to obtain a corresponding first feature map. And obtaining first feature graphs with different scales through multiple feature coding processes. The initial first decoding structure performs a first decoding process according to a plurality of first feature maps of different scales to identify a first subject in a first sample image and a position of the first subject in the first sample image. The first feature maps of different scales represent semantic features of different levels, and a first mask image can be obtained through first decoding processing of the semantic features of different levels, and the first mask image can be used for determining the position of a first main body in the first sample image. The computer device may calculate a difference between the segmentation labels corresponding to the first mask image and the first sample image, adjust parameters of the initial encoding structure and the initial first decoding structure based on the difference, and continue training until the trained initial encoding structure and initial first decoding structure meet a first stop condition, thereby obtaining a trained encoding structure and first decoding structure.
In one embodiment, the initial encoding structure is comprised of a plurality of encoding units and the initial first decoding structure is comprised of a plurality of first decoding units. Sequentially carrying out coding processing by a plurality of coding units in the initial coding structure based on input data corresponding to the coding units respectively to obtain first characteristic diagrams output by the coding units respectively; the input data of the initial coding unit in the initial coding structure is a first sample image, and the scales of the first feature images output by different coding units in the initial coding structure are different.
And based on the first feature graphs of a plurality of different scales, sequentially performing first decoding processing based on input data corresponding to each first decoding unit through a plurality of first decoding units in an initial first decoding structure until a mask image is output through a last first decoding unit, wherein the input data of the current first decoding unit in the initial first decoding structure comprises decoding results corresponding to the previous first decoding unit and the first feature graph output by the encoding unit corresponding to the current first decoding unit, and the decoding results corresponding to the current first decoding unit are used for forming the input data of the subsequent first decoding unit.
For example, the difference between the first mask image and the segmentation label corresponding to the first sample image may be calculated by a cross entropy loss function:
L_seg = -[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
where y_i represents the true value of the first sample image i, i.e. the segmentation label; the positive class is 1 and the negative class is 0, where the positive class refers to the sample subject, e.g. a portrait, and the negative class refers to the background. p_i represents the probability that the first sample image i is predicted to be of the positive class.
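As a minimal illustrative sketch (PyTorch is assumed here; the function and tensor names are hypothetical and not part of this embodiment), the cross entropy loss above could be computed as follows:

import torch.nn.functional as F

def segmentation_loss(pred_mask, seg_label):
    # pred_mask: predicted probability map in [0, 1], shape (N, 1, H, W)
    # seg_label: segmentation label with 1 for the sample subject (e.g. a portrait)
    #            and 0 for the background, same shape as pred_mask
    # Binary cross entropy: -[y * log(p) + (1 - y) * log(1 - p)], averaged over pixels
    return F.binary_cross_entropy(pred_mask, seg_label)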
In this embodiment, feature encoding processing is performed on a first sample image through an initial encoding structure to obtain a plurality of first feature maps with different scales, first decoding processing is performed on the basis of the plurality of first feature maps with different scales through an initial first decoding structure to obtain a first mask image corresponding to the first sample image, and training is performed on the initial encoding structure and the initial first decoding structure according to differences between segmentation labels corresponding to the first mask image and the first sample image until a first stop condition is met, so that the trained encoding structure can accurately extract semantic information with different levels. The large-scale feature map contains rich low-level information, and the small-scale feature map can reflect high-level semantic information, so that the initial first decoding structure can decode by combining semantic information of different levels, and the decoding accuracy is improved.
In one embodiment, performing a second training on the initial second decoding structure through the second sample image set and based on the trained encoding structure and the first decoding structure until a second stop condition is met, to obtain a trained second decoding structure, including:
performing feature coding processing on the second sample image through the coding structure to obtain a plurality of second feature images with different scales; performing first decoding processing on the basis of a plurality of second feature images with different scales through a first decoding structure to obtain a second mask image corresponding to a second sample image; performing second decoding processing based on a plurality of second feature images with different scales and a second mask image through an initial second decoding structure to obtain a plurality of sample mask images with different layers; updating the sample mask images of a plurality of different layers based on sample edge transition characteristics between a sample main body and a background area included in each sample mask image through an initial second decoding structure to obtain a prediction mask image; and training the initial second decoding structure based on the mask loss between the prediction mask image and the matting label corresponding to the second sample image until the second stopping condition is met, to obtain a trained second decoding structure.
Specifically, the computer device may perform feature encoding processing on the second sample image through the encoding structure to obtain a second feature map, and perform next feature encoding processing on the obtained second feature map to obtain a corresponding second feature map. And obtaining second feature graphs with different scales through multiple feature coding processes. The second decoding structure performs a second decoding process according to a plurality of second feature maps with different scales to identify a sample body in the second sample image and a position of the sample body in the second sample image, and generates a second mask image for representing the position of the sample body in the second sample image.
The initial second decoding structure may perform a second decoding process based on the feature maps and the mask images of the plurality of different scales to identify a sample body in the second sample image and a sample edge transition region between the sample body and the background region, thereby obtaining a plurality of sample mask maps of different levels. The initial second decoding structure can determine sample edge transition characteristics between the sample main body and the background area in each sample mask map, and update the sample mask maps of the plurality of different layers according to the sample edge transition characteristics to obtain a prediction mask map.
Further, the initial second decoding structure may determine sample edge transition features between the sample body and the background region in the sample mask map of the current level, and update the sample mask map of the next adjacent level based on the sample edge transition features in the sample mask map of the current level, resulting in an updated mask map. And taking the updated mask image as a sample mask image of the current layer of the next round, and continuously executing the steps of determining sample edge transition characteristics and updating the sample mask image of the next adjacent layer until the updating of the sample mask image of the last layer is completed, and stopping to obtain a prediction mask image.
The computer device may calculate a mask loss between the prediction mask map and the matting label corresponding to the second sample image, adjust parameters of the initial second decoding structure based on the mask loss, and continue training until the trained initial second decoding structure meets a second stop condition, to obtain a trained second decoding structure.
In one embodiment, the encoding structure is comprised of a plurality of encoding units, the first decoding structure is comprised of a plurality of first decoding units, and the initial second decoding structure is comprised of a plurality of second decoding units. Determining input data corresponding to each second decoding unit in the initial second decoding structure, wherein the input data corresponding to the second decoding units comprises a second characteristic diagram output by a corresponding encoding unit and a sample mask diagram output by a previous second decoding unit before the second decoding units, and at least part of the input data corresponding to the second decoding units also comprises a second mask image; and sequentially performing second decoding processing by a plurality of second decoding units in the initial second decoding structure based on the input data corresponding to the second decoding units, so as to obtain sample mask graphs output by the second decoding units respectively, wherein the sample mask graphs output by the second decoding units are different in hierarchy.
In one embodiment, the initial second decoding structure includes initial parameters corresponding to the mask map and initial parameters corresponding to the foreground color map. And the computer equipment performs second training on the initial second decoding structure based on the trained coding structure and the first decoding structure through the second sample image set so as to adjust initial parameters corresponding to the mask graph in the second decoding structure, and stops until a second stopping condition is met, so that processing parameters corresponding to the mask graph in the trained second decoding structure are obtained.
In one embodiment, the initial second decoding structure may include an initial first branch and an initial second branch. The computer equipment carries out second training on the initial first branch in the initial second decoding structure based on the trained coding structure and the first decoding structure through the second sample image set so as to adjust initial parameters of the initial first branch, and stops when a second stopping condition is met, so that the trained first branch and processing parameters corresponding to the first branch are obtained. The processing parameters corresponding to the first branch refer to the processing parameters corresponding to the mask map.
In this embodiment, the second decoding process is performed by the initial second decoding structure based on the second feature maps and the second mask images with different scales, to obtain a plurality of sample mask maps with different levels, including: performing second decoding processing on the basis of a plurality of second feature images with different scales and second mask images through an initial first branch in an initial second decoding structure to obtain a plurality of sample mask images with different levels;
Training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, includes: training the initial first branch based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until the second stopping condition is met, to obtain the trained first branch and the processing parameters corresponding to the first branch.
In this embodiment, the second decoding structure performs the second decoding process by using semantic features and mask images of different levels, so that the second decoding structure can be effectively guided to focus on the interpretation of the edge transition regions between the target main body and the background region; sample mask maps of different levels are thus obtained by decoding, and strong attention regression on low-level features is realized in the training process. The sample mask map not only comprises the position of the sample main body in the original image, but also comprises the characteristic information of the edge transition areas between the sample main body and the background area, so that a prediction mask map with finer edge transition details can be obtained through the sample mask maps of different layers. The initial second decoding structure is trained based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image, so that the parameters of the second decoding structure are continuously adjusted, the trained second decoding structure can predict finer edge transition details in the mask map, and the matting is more accurate.
In one embodiment, the mask loss includes a global loss and a scale loss; training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, includes:
determining a global loss between the prediction mask map and the matting label corresponding to the second sample image; adjusting the prediction mask map to different scales, and determining the scale loss of the prediction mask map under the different scales; constructing a target loss function based on the global loss and the scale loss; and training the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
Specifically, the mask loss includes a global loss and a scale loss. The global loss refers to the overall difference between the prediction mask map and the corresponding matting label, while the scale loss refers to the loss of the prediction mask map at different scales. The computer device may calculate a difference between the prediction mask map and the matting label corresponding to the second sample image, and determine the global loss between the prediction mask map and the matting label based on the difference. For example, the difference between the prediction mask map and the matting label, or the absolute value of the difference, is taken as the global loss corresponding to the second sample image.
It will be appreciated that the second sample image set comprises a plurality of second sample images, and that in this manner the global penalty for each second sample image may be calculated and summed and averaged, with the average being taken as the global penalty in the current training.
For example, the computer device may calculate the global loss by the following formula:
L_l1 = |y_i - p_i|
where y_i represents the alpha truth value of the sample image i, a value between 0 and 1, and p_i represents the opacity of the predicted foreground of the sample image i, i.e. the predicted alpha value.
The computer device can adjust the prediction mask map corresponding to the second sample image to different scales to obtain prediction mask maps at different scales, and calculate the global loss between the prediction mask map at each scale and the matting label corresponding to the second sample image, obtaining the global losses of the second sample image at the different scales. The computer device may calculate the scale loss corresponding to the second sample image based on the global losses at the different scales and the number of prediction mask maps at different scales corresponding to the second sample image.
For example, the computer device may calculate the scale loss by the following formula:
L_lap = (1 / J) * Σ_j L_l1^(j)
where j is the scale of the prediction mask map, J is the number of prediction mask maps at different scales, and L_l1^(j) is the global loss between the prediction mask map at scale j and the matting label adjusted to the same scale.
The computer device can take the sum of the global loss and the scale loss as the target loss function, or multiply the global loss and the scale loss by corresponding weights respectively and then sum the products to obtain the target loss function. The initial second decoding structure is trained through the target loss function: the parameters of the initial second decoding structure are adjusted based on the target loss generated in training, and training continues until the trained initial second decoding structure meets the second stopping condition, so as to obtain a trained second decoding structure.
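A minimal sketch of how the global loss, the scale loss and their weighted combination into the target loss function could look (PyTorch assumed; the scale factors and weights are illustrative assumptions, not values prescribed by this embodiment):

import torch
import torch.nn.functional as F

def global_loss(pred_alpha, gt_alpha):
    # L1 difference between the prediction mask map and the matting label
    return torch.abs(pred_alpha - gt_alpha).mean()

def scale_loss(pred_alpha, gt_alpha, scale_factors=(0.5, 0.25)):
    # Adjust the prediction mask map (and the label) to different scales,
    # compute the global loss at each scale, and average over the number of scales
    losses = []
    for s in scale_factors:
        p = F.interpolate(pred_alpha, scale_factor=s, mode="bilinear", align_corners=False)
        g = F.interpolate(gt_alpha, scale_factor=s, mode="bilinear", align_corners=False)
        losses.append(global_loss(p, g))
    return sum(losses) / len(losses)

def target_loss(pred_alpha, gt_alpha, w_global=1.0, w_scale=1.0):
    # Weighted sum of the global loss and the scale loss
    return w_global * global_loss(pred_alpha, gt_alpha) + w_scale * scale_loss(pred_alpha, gt_alpha)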
In this embodiment, the computer device may calculate, during each training, a target loss value generated by the training through a target loss function, and when the target loss value is greater than a loss threshold, adjust parameters of the initial second decoding structure and continue training. Stopping when the target loss value is less than or equal to the loss threshold value, and obtaining the trained second decoding structure.
In this embodiment, the target loss function is constructed by combining the difference between the prediction mask map and the real matting label and the difference between the prediction mask map of different scales and the real matting label, so that the influence of the prediction results of different scales on the image processing model can be considered, and the trained second decoding structure can be compatible with the processing of the images of different scales, so that the generated prediction mask map is more accurate.
In one embodiment, the method further includes: determining a segmentation loss between the second mask image and the segmentation label corresponding to the second sample image;
training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, includes:
determining the mask loss between the prediction mask map and the matting label corresponding to the second sample image, and constructing a target loss function based on the mask loss and the segmentation loss; and training the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
Specifically, the computer device may determine the difference between the second mask image and the segmentation label corresponding to the second sample image, i.e. the segmentation loss, by a cross entropy loss function. The computer device may calculate the difference between the prediction mask map and the matting label corresponding to the second sample image, and determine the mask loss between the prediction mask map and the matting label based on the difference. The computer device may take the sum of the mask loss and the segmentation loss as the target loss function, or multiply the mask loss and the segmentation loss by corresponding weights respectively and then sum the products to obtain the target loss function. The initial second decoding structure is trained through the target loss function: the parameters of the initial second decoding structure are adjusted based on the target loss generated in training, and training continues until the trained initial second decoding structure meets the second stopping condition, so as to obtain a trained second decoding structure.
In this embodiment, the segmentation loss can represent the loss generated by the trained encoding structure and the first decoding structure, while the mask loss can represent the loss generated by the initial second decoding structure. Combining the segmentation loss and the mask loss to train the initial second decoding structure can further improve the accuracy of the second decoding structure, so that the mask map predicted by the trained second decoding structure is more accurate.
In one embodiment, the second sample image set further includes a foreground color label corresponding to the second sample image, and the method further includes:
performing second decoding processing based on a plurality of second feature maps with different scales and a second mask image through an initial second decoding structure to obtain a predicted foreground color map; determining a foreground color loss between the predicted foreground color map and a foreground color label corresponding to the second sample image;
training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, includes:
determining a mask loss between the prediction mask map and the matting label corresponding to the second sample image; constructing a target loss function based on the mask loss and the foreground color loss; and training the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
Specifically, the second sample image set further includes a foreground color label corresponding to each second sample image. The computer equipment performs second decoding processing based on a plurality of second feature images with different scales and a second mask image through an initial second decoding structure so as to identify color components belonging to the foreground in the second sample image and obtain a predicted foreground color image. The sample main body belongs to a foreground region in the second sample image, and the foreground color image comprises color components corresponding to the sample main body in the second sample image. The computer device may calculate a difference between the predicted foreground color map corresponding to the second sample image and the corresponding foreground color label, determine a foreground color loss between the predicted foreground color map and the corresponding foreground color label based on the difference.
For example, the foreground color loss may be calculated by the following formula:
L_fore = Σ_c |y_i,c - p_i,c|
where y_i,c represents the foreground color truth value of the sample image i on channel c, p_i,c represents the foreground color prediction value of the sample image i on channel c, and c represents the RGB channel index.
The computer device may calculate the difference between the prediction mask map and the matting label corresponding to the second sample image, and determine the mask loss between the prediction mask map and the matting label based on the difference. The computer device may take the sum of the mask loss and the foreground color loss as the target loss function, or multiply the mask loss and the foreground color loss by corresponding weights respectively and then sum them to obtain the target loss function. The initial second decoding structure is trained through the target loss function: the parameters of the initial second decoding structure are adjusted based on the target loss generated in training, and training continues until the trained initial second decoding structure meets the second stopping condition, so as to obtain a trained second decoding structure.
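An illustrative sketch of the foreground color loss and its combination with the mask loss (PyTorch assumed; restricting the color loss to foreground pixels via the alpha label is an assumption of this sketch rather than a requirement of the embodiment):

import torch

def foreground_color_loss(pred_fg, gt_fg, gt_alpha=None):
    # pred_fg, gt_fg: predicted / ground-truth foreground color maps, shape (N, 3, H, W)
    # L1 difference summed over the RGB channels
    diff = torch.abs(pred_fg - gt_fg).sum(dim=1)
    if gt_alpha is not None:
        # optionally weight by the alpha label so only foreground pixels contribute
        diff = diff * gt_alpha.squeeze(1)
    return diff.mean()

def combined_loss(mask_loss, fg_color_loss, w_mask=1.0, w_fore=1.0):
    # Target loss function: weighted sum of the mask loss and the foreground color loss
    return w_mask * mask_loss + w_fore * fg_color_loss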
In one embodiment, the encoding structure is comprised of a plurality of encoding units, the first decoding structure is comprised of a plurality of first decoding units, and the initial second decoding structure is comprised of a plurality of second decoding units. Determining input data corresponding to each second decoding unit in the initial second decoding structure, wherein the input data corresponding to the second decoding units comprises a second feature map output by a corresponding encoding unit and foreground color features output by a previous second decoding unit before the second decoding units, and at least part of the input data corresponding to the second decoding units also comprises a second mask image; and sequentially performing second decoding processing on the basis of the input data corresponding to each second decoding unit through a plurality of second decoding units in the initial second decoding structure until a predicted foreground color map output by the last second decoding unit is obtained.
In one embodiment, the initial second decoding structure includes initial parameters corresponding to the mask map and initial parameters corresponding to the foreground color map. And the computer equipment performs second training on the initial second decoding structure based on the trained encoding structure and the first decoding structure through the second sample image set so as to adjust initial parameters corresponding to the foreground color map in the second decoding structure, and stops until a second stopping condition is met, so that processing parameters corresponding to the foreground color map in the trained second decoding structure are obtained.
In one embodiment, the initial second decoding structure may include an initial first branch and an initial second branch. And the computer equipment carries out second training on the initial second branch in the initial second decoding structure based on the trained coding structure and the first decoding structure through the second sample image set so as to adjust initial parameters of the initial second branch, and stops when a second stopping condition is met, so that the trained second branch and processing parameters corresponding to the second branch are obtained. The processing parameters corresponding to the second branch refer to the processing parameters corresponding to the foreground color map.
In this embodiment, performing, by the initial second decoding structure, a second decoding process based on a plurality of second feature maps of different scales and a second mask image, to obtain a predicted foreground color map, including: performing second decoding processing based on a plurality of second feature maps with different scales and a second mask image through an initial second branch in an initial second decoding structure to obtain a predicted foreground color map;
training the initial second decoding structure through the target loss function until a second stopping condition is met, and obtaining a trained second decoding structure, wherein the training comprises the following steps: and training the initial second branch through the target loss function until the second stopping condition is met, and obtaining the trained second branch and the processing parameters corresponding to the second branch. The trained first branch and second branch form a trained second decoding structure.
In this embodiment, the second decoding process is performed based on the feature maps of different scales and the mask image, so that the color components corresponding to the foreground region in the original image can be accurately extracted, and in particular the color components corresponding to the edge transition regions between the foreground and the background can be accurately obtained, so as to obtain the predicted foreground color map including the sample main body. The initial second decoding structure is trained by combining the foreground color loss and the mask loss, so that the parameters used by the second decoding structure to predict the foreground color map and the mask map of an image can be trained at the same time. The trained second decoding structure can thus share the output data of the encoding structure and the first decoding structure, and independently predict the foreground color map and the mask map of the image based on the shared output data, so that the edge details of the obtained matting result are richer and the edge color is more accurate.
In one embodiment, the second sample images include a composite image; the method further includes: determining a channel loss, on the color channels, between the prediction mask map corresponding to the composite image and the corresponding matting label;
training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, includes:
determining the mask loss between the prediction mask map and the matting label corresponding to the second sample image, and constructing a target loss function based on the mask loss and the channel loss; and training the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
Specifically, the second sample images in the second sample image set may be composite images, and may also be non-composite images. When composite images are used in training, after the prediction mask map corresponding to a composite image is obtained through the initial second decoding structure, the channel loss between the prediction mask map and the corresponding matting label on the color channels is determined. The channel loss refers to the sum of the losses over the various color channels, such as the sum of the losses over the red, green and blue channels.
For example, the channel loss can be calculated by the following formula:
L_comp = Σ_c |y_i,c - p_i,c|
where L_comp is the channel loss, y_i,c represents the channel truth value of the sample image i on channel c, p_i,c represents the channel prediction value of the sample image i on channel c, and c represents the RGB channel index.
The computer device may calculate the difference between the prediction mask map and the matting label corresponding to the second sample image, and determine the mask loss between the prediction mask map and the matting label based on the difference. The computer device may take the sum of the mask loss and the channel loss as the target loss function, or multiply the mask loss and the channel loss by corresponding weights respectively and then sum them to obtain the target loss function. The initial second decoding structure is trained through the target loss function: the parameters of the initial second decoding structure are adjusted based on the target loss generated in training, and training continues until the trained initial second decoding structure meets the second stopping condition, so as to obtain a trained second decoding structure.
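One common reading of the channel loss, sketched here under the assumption that the predicted alpha is used to re-composite the foreground and background and the result is compared with the composite sample image channel by channel (PyTorch assumed; names are hypothetical):

import torch

def channel_loss(pred_alpha, fg, bg, composite):
    # pred_alpha: predicted mask map, shape (N, 1, H, W), values in [0, 1]
    # fg, bg, composite: foreground, background and composite images, shape (N, 3, H, W)
    recomposed = pred_alpha * fg + (1.0 - pred_alpha) * bg
    # sum of L1 differences over the red, green and blue channels
    return torch.abs(recomposed - composite).sum(dim=1).mean()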
In this embodiment, composite images participate in training, which can improve the richness and complexity of the training data and effectively improve the recognition capability of the image processing model. Moreover, the channel losses on the different color channels and the mask loss are combined to train the second decoding structure, which can further improve the decoding accuracy of the second decoding structure.
In one embodiment, the computer device may construct the target loss function L_total based on the foreground color loss L_fore, the global loss L_l1 and the scale loss L_lap; it may also construct the target loss function L_total based on the foreground color loss L_fore, the global loss L_l1, the scale loss L_lap and the segmentation loss L_seg; and it may also construct the target loss function L_total based on the global loss L_l1, the scale loss L_lap, the segmentation loss L_seg and the channel loss L_comp.
In other embodiments, the computer device may also construct the target loss function L_total based on the foreground color loss L_fore, the global loss L_l1, the scale loss L_lap, the segmentation loss L_seg and the channel loss L_comp. The constructed target loss function L_total is as follows:
L_total = L_seg + L_l1 + L_comp + L_lap + L_fore
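A small sketch of this last combination, assuming the individual loss terms have already been computed (for example with helper functions like those sketched earlier); the optional per-term weights are an assumption, while the formula above uses an unweighted sum:

def total_loss(l_seg, l_l1, l_comp, l_lap, l_fore,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    # L_total = L_seg + L_l1 + L_comp + L_lap + L_fore, with optional per-term weights
    w_seg, w_l1, w_comp, w_lap, w_fore = weights
    return (w_seg * l_seg + w_l1 * l_l1 + w_comp * l_comp
            + w_lap * l_lap + w_fore * l_fore)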
In one embodiment, the training of the image processing model includes two phases: segmentation task training and matting task training. Segmentation task training mainly trains the shared feature encoder and the segmentation decoder, while matting task training mainly trains the matting decoder.
As can be seen from the segmentation result and the matting result in fig. 9, the segmentation result has rough edges and the composite background looks cracked, while the matting result is natural and realistic. However, matting data sets are very small because matting labels are difficult to annotate, whereas segmentation data is available in a relatively large amount. The size of the training data set directly affects model generalization, so in this embodiment segmentation data is used in training to improve the generalization of the image processing model.
1. Segmentation task training: the shared feature encoder and the segmentation decoder are trained by using the segmentation dataset and adopting a gradient descent method, and the specific flow is as shown in fig. 10:
1) Constructing a segmented dataset
Training data can be sourced in two ways: downloading open data and annotating data. There are many open source portrait segmentation labels; dedicated portrait label data includes Supervisely, AISegment and the like. Semantic segmentation datasets may also be used to extract portrait labels by setting the portrait class to 1 and uniformly resetting the other label categories to background 0; available semantic segmentation datasets include COCO, Pascal VOC, etc. The same approach also applies to human parsing datasets, where the human body parts can be reset to 1. The number of data pictures collected in this way can reach 100k, far more than the order of magnitude of traditional matting data, which can greatly improve the robustness of the model.
2) Model parameter random initialization
Since the model is trained from scratch at this stage, its parameters can be randomly initialized.
3) Batch data entry
The computer cannot calculate all data at once, so that data needs to be imported into the memory in batches. Each batch of data will run in its entirety in steps 4), 5) below.
4) Calculating segmentation loss
The data is input into the network, and a predicted mask is output after one forward propagation; to measure the accuracy of the predicted mask, it needs to be compared with the true mask. The function used to calculate the difference between the prediction map and the truth map is the loss function; the segmentation loss function used here is a two-class cross entropy loss function, with the following formula:
L_seg = -[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
5) Updating model parameters
After the loss is obtained in the previous steps, the gradients of the network parameters are obtained by running one back-propagation pass, and the model parameters are updated according to the gradient descent method, i.e., the parameters of the shared feature encoder and the segmentation decoder are updated, completing one full training iteration. If the preset number of training rounds has not been reached, the process returns to step 3) and a new training round is started. The model parameters are updated once for each pass through steps 3), 4) and 5), so the model gradually converges after multiple rounds of cyclic updating. After the number of training rounds reaches the preset number, training ends and the flow goes to step 6).
6) Preserving model parameters
After training, the quality of the model is evaluated and the best model parameters are selected for saving. Since the matting decoder does not participate in training at this stage, only the parameters of the shared feature encoder and the segmentation decoder are updated and saved.
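A compact sketch of the segmentation-stage flow in steps 2) to 6) (PyTorch assumed; `encoder`, `seg_decoder` and the data loader are hypothetical modules standing in for the shared feature encoder and the segmentation decoder of this embodiment, and the decoder is assumed to output sigmoid probabilities):

import torch

def train_segmentation_stage(encoder, seg_decoder, dataloader, num_epochs=50, lr=1e-3):
    params = list(encoder.parameters()) + list(seg_decoder.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)        # gradient descent
    bce = torch.nn.BCELoss()                          # two-class cross entropy
    for epoch in range(num_epochs):                   # preset number of training rounds
        for images, seg_labels in dataloader:         # 3) batch data import
            pred_mask = seg_decoder(encoder(images))  # forward propagation
            loss = bce(pred_mask, seg_labels)         # 4) segmentation loss
            optimizer.zero_grad()
            loss.backward()                           # back propagation
            optimizer.step()                          # 5) update model parameters
    # 6) save only the parameters trained at this stage
    torch.save({"encoder": encoder.state_dict(),
                "seg_decoder": seg_decoder.state_dict()},
               "segmentation_stage.pth")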
2. Training of matting task
The shared feature encoder and the matting decoder are trained with the matting dataset using a gradient descent method; the specific flow is shown in fig. 11:
1) Constructing a matting training dataset
Open source portrait matting data is relatively scarce; only a few hundred samples can be obtained. Since the goal of training the matting decoder is to force the network to learn low-level detail texture features, the data set should contain as much texture-rich data as possible, such as hair strands. In addition to foreground label data, more background data needs to be collected so that the richness of the data set can be increased by background synthesis during training, avoiding model overfitting.
2) Reading shared feature encoder and partition decoder parameters
After the segmentation training described above, the shared feature encoder can already extract some robust features, and the matting training will reuse these shared features. Thus, the parameters of the shared feature encoder and the segmentation decoder are read from the parameters saved after the segmentation task training is completed, while the matting decoder parameters are still randomly initialized.
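A small sketch of this step, assuming the checkpoint layout used in the segmentation-stage sketch above (the file name and keys are assumptions):

import torch

def load_pretrained_parts(encoder, seg_decoder, path="segmentation_stage.pth"):
    # Read the shared feature encoder and segmentation decoder parameters saved
    # after segmentation training; the matting decoder is not in this checkpoint,
    # so it keeps its random initialization
    checkpoint = torch.load(path)
    encoder.load_state_dict(checkpoint["encoder"])
    seg_decoder.load_state_dict(checkpoint["seg_decoder"])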
3) Setting a learning rate
After the segmentation task training, the shared feature encoder and the segmentation decoder already have relatively converged parameters, so a lower learning rate should be set for them at this stage, while the matting decoder should be given a larger learning rate to help it converge quickly.
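This step could be realized with per-module parameter groups, for example (PyTorch assumed; the concrete learning-rate values are illustrative only):

import torch

def build_optimizer(encoder, seg_decoder, matting_decoder,
                    pretrained_lr=1e-5, matting_lr=1e-3):
    # Lower learning rate for the already-trained encoder and segmentation decoder,
    # larger learning rate for the randomly initialized matting decoder
    return torch.optim.Adam([
        {"params": encoder.parameters(), "lr": pretrained_lr},
        {"params": seg_decoder.parameters(), "lr": pretrained_lr},
        {"params": matting_decoder.parameters(), "lr": matting_lr},
    ])

Freezing the pretrained parts entirely would be an alternative design; using a small but non-zero learning rate lets the shared features adapt slightly to the matting task.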
4) Batch data entry
The computer cannot process all of the data at once, so the data needs to be imported into memory in batches. Each batch of data runs through the following steps in its entirety.
5) Random background synthesis
Because the amount of foreground data is insufficient, overfitting easily occurs during training, so as many data enhancement methods as possible are added in training; background synthesis is one such enhancement method. After each batch of data is loaded into memory, several sample images are randomly selected with a certain probability, a background is randomly selected from the background image set, and the background in the original image is replaced. In this way, the richness of the data can be effectively increased.
Image synthesis may be achieved by the following formula:
C=A*F+(1-A)*B
wherein C is a synthetic image, A is an Alpha image, F is a foreground color image, and B is a background image.
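A sketch of this step using the synthesis formula above (NumPy assumed; images are float arrays in [0, 1], alpha carries a trailing channel axis for broadcasting, and the probability value and background sizing are assumptions):

import random
import numpy as np

def random_background_synthesis(image, foreground, alpha, background_set, prob=0.5):
    # image: original sample image, (H, W, 3); foreground: foreground color image F
    # alpha: alpha map A with shape (H, W, 1); background_set: list of (H, W, 3) images
    # assumed to be already resized to the sample resolution
    if random.random() >= prob:
        return image                      # keep the original background
    background = random.choice(background_set)
    # C = A * F + (1 - A) * B
    return np.clip(alpha * foreground + (1.0 - alpha) * background, 0.0, 1.0)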
6) Computing foreground color tags
Many matting datasets only provide alpha maps without foreground color maps, so models cannot be trained on the foreground color regression task with such data. This embodiment proposes to estimate the foreground color with a traditional method and add it to network training as a truth-value label, thereby solving the pain point of missing labels. The traditional foreground color label estimation method may be a Closed-Form-based or multi-level method. The computation does not need to be completed in advance and stored as images; it can be completed in memory.
7) Calculating the matting loss
The matting loss is used to calculate the difference between the prediction and the true value, and includes the Alpha loss and the foreground color loss; in this step, the segmentation loss L_seg is also calculated again.
Alpha loss: L_l1 = |y_i - p_i|
Foreground color loss: L_fore = Σ_c |y_i,c - p_i,c|
8) Updating matting decoder parameters
The network obtains an output mask map through one forward propagation, and the segmentation loss L_seg is calculated first. Then the mask is fused with the encoding features and sent into the matting decoder, which outputs a predicted alpha map and a predicted foreground color map; the losses are calculated according to the formulas in 7), and finally all losses are summed to obtain the total loss L_total. All parameters of the matting decoder are then updated according to the gradient descent method.
L_total = L_seg + L_l1 + L_comp + L_lap + L_fore
Steps 4), 5), 6), 7) and 8) constitute a complete process of updating the parameters of the matting decoder, and this process is executed repeatedly while the number of training rounds is less than the preset number. After the number of training rounds reaches the preset number, training ends and the flow goes to step 9).
9) Preserving model parameters
All structures are involved in this training stage, so all model parameters are saved, namely the parameters of the shared feature encoder, the segmentation decoder and the matting decoder, and a trained image processing model is obtained.
In one embodiment, there is provided an image processing method including training and use of an image processing model, the image processing method being applied to a computer device, comprising:
training process:
the method comprises the steps of acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image.
An image processing model to be trained is determined, the image processing model to be trained comprising an initial encoding structure, an initial first decoding structure, and an initial second decoding structure.
And performing feature coding processing on the first sample image through the initial coding structure to obtain a plurality of first feature images with different scales.
And performing first decoding processing on the basis of a plurality of first feature maps with different scales through an initial first decoding structure to obtain a first mask image corresponding to the first sample image.
Training the initial coding structure and the initial first decoding structure according to the difference between the segmentation labels corresponding to the first mask image and the first sample image until the first stopping condition is met, and obtaining the coding structure and the first decoding structure after training.
Performing feature coding processing on the second sample image through the coding structure to obtain a plurality of second feature images with different scales; and performing first decoding processing based on a plurality of second feature maps with different scales through the first decoding structure to obtain a second mask image corresponding to the second sample image.
And performing second decoding processing based on a plurality of second feature images with different scales and a second mask image through an initial second decoding structure to obtain a predicted foreground color image.
And performing second decoding processing based on a plurality of second feature maps with different scales and a second mask image through an initial second decoding structure to obtain a plurality of sample mask maps with different layers.
And updating the sample mask images of a plurality of different layers based on sample edge transition characteristics between a sample main body and a background area included in each sample mask image through an initial second decoding structure to obtain a prediction mask image.
A foreground color loss between the predicted foreground color map and a foreground color label corresponding to the second sample image is determined.
And determining the global loss between the prediction mask map and the matting label corresponding to the second sample image.
And adjusting the prediction mask map to different scales, and determining the scale loss of the prediction mask map under the different scales.
A segmentation loss between the second mask image and the segmentation label corresponding to the second sample image is determined.
And when the second sample image is a composite image, determining channel losses of the prediction mask image corresponding to the composite image and the corresponding matting labels on the color channels.
A target loss function is constructed based on the foreground color loss, the global loss, the scale loss, the segmentation loss and the channel loss.
And training the initial second decoding structure through the target loss function until the second stopping condition is met, and stopping to obtain a trained second decoding structure so as to obtain a trained image processing model.
Use of image processing models:
the image processing model includes an encoding structure composed of a plurality of encoding units, a first decoding structure composed of a plurality of first decoding units, and a second decoding structure composed of a plurality of second decoding units; the encoding structure is used for performing feature encoding processing on the original image, the first decoding structure is used for performing first decoding processing, and the second decoding structure is used for performing second decoding processing.
Acquiring an original image, and inputting the original image into a coding structure; sequentially carrying out coding processing by a plurality of coding units in the coding structure based on input data corresponding to the coding units respectively to obtain characteristic diagrams output by the coding units respectively; the input data of the initial coding unit in the coding structure is an original image, and the scales of the feature graphs output by different coding units in the coding structure are different.
And based on the characteristic diagrams of a plurality of different scales, sequentially performing first decoding processing through a plurality of first decoding units in the first decoding structure based on the input data corresponding to each first decoding unit until a mask image is output through a last first decoding unit, wherein the input data of the current first decoding unit in the first decoding structure comprises a decoding result corresponding to the previous first decoding unit and the characteristic diagram output by the encoding unit corresponding to the current first decoding unit, and the decoding result corresponding to the current first decoding unit is used for forming the input data of the subsequent first decoding unit.
Determining input data corresponding to each second decoding unit in the second decoding structure, wherein the input data corresponding to the second decoding units comprises a feature map output by a corresponding encoding unit and a candidate mask map output by a previous second decoding unit before the second decoding units, and at least part of the input data corresponding to the second decoding units also comprises mask images;
and sequentially performing second decoding processing by a plurality of second decoding units in the second decoding structure based on the input data corresponding to the second decoding units, so as to obtain candidate mask graphs output by the second decoding units respectively, wherein the levels of the candidate mask graphs output by the second decoding units are different.
And based on the determination of the input data corresponding to each second decoding unit in the second decoding structure, wherein the input data corresponding to the second decoding units comprises a feature map output by the corresponding encoding unit and foreground color features output by the second decoding unit before the second decoding unit, and at least part of the input data corresponding to the second decoding units also comprises a mask image; and sequentially performing second decoding processing by a plurality of second decoding units in the second decoding structure based on the input data corresponding to each second decoding unit until a foreground color image output by the last second decoding unit is obtained.
Amplifying the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level to obtain an amplified candidate mask map; extracting an edge transition image between a target main body and a background area in the amplified candidate mask image, wherein the edge transition image comprises edge transition characteristics; performing morphological processing on the edge transition image to obtain a processed edge transition image, wherein the morphological processing comprises at least one of corrosion operation and expansion operation; carrying out fusion processing on the processed edge transition image and the candidate mask image of the next adjacent layer to obtain an updated mask image; taking the updated mask pattern as a candidate mask pattern of the current layer in the next round, returning to the step of amplifying the candidate mask pattern of the current layer to the same scale as the candidate mask pattern of the adjacent next layer, and continuing to execute until the updated mask pattern corresponding to the candidate mask pattern of the last layer is obtained; and taking the updated mask diagram corresponding to the candidate mask diagram of the last layer as a target mask diagram.
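A hedged sketch of the update loop described above (OpenCV and NumPy assumed; the transition thresholds, kernel size and fusion rule are assumptions made for illustration, not values fixed by this embodiment):

import cv2
import numpy as np

def refine_candidate_masks(candidate_masks):
    # candidate_masks: single-channel float masks in [0, 1], ordered from the
    # first (smallest) level to the last (largest) level
    current = candidate_masks[0]
    kernel = np.ones((3, 3), np.uint8)
    for next_mask in candidate_masks[1:]:
        h, w = next_mask.shape[:2]
        # amplify the current-level mask to the scale of the next level
        upscaled = cv2.resize(current, (w, h), interpolation=cv2.INTER_LINEAR)
        # edge transition image: pixels that are neither clearly subject nor background
        transition = ((upscaled > 0.05) & (upscaled < 0.95)).astype(np.uint8)
        # morphological processing (erosion followed by dilation)
        transition = cv2.dilate(cv2.erode(transition, kernel), kernel)
        # fusion: take the next-level prediction inside the transition region,
        # keep the upscaled coarse prediction elsewhere
        current = np.where(transition > 0, next_mask, upscaled)
    return current  # target mask map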
And carrying out fusion processing on the foreground color map and the target mask map to obtain a matting result corresponding to the target main body.
And synthesizing the matting result and the background image to obtain a synthesized image.
In this embodiment, a target loss function is constructed based on the foreground color loss, the global loss, the scale loss, the segmentation loss and the channel loss, so that the initial second decoding structure is trained through the target loss function to obtain a trained second decoding structure; the trained encoding structure, first decoding structure and second decoding structure form a trained image processing model, so that the segmentation precision and matting precision of the image processing model can be effectively improved.
The trained image processing model includes an encoding structure composed of a plurality of encoding units, a first decoding structure composed of a plurality of first decoding units, and a second decoding structure composed of a plurality of second decoding units. The original image is subjected to feature encoding processing through the encoding structure, semantic features of different levels can be obtained, so that the first decoding structure performs first decoding processing based on the semantic features of different levels, and the position of the target main body in the original image is primarily predicted and embodied through the mask image. The mask image represents the rough position of the target main body in the original image, and the second decoding structure uses semantic features of different layers and the mask image to perform second decoding processing, so that the second decoding structure can be effectively guided to pay attention to interpretation of the edge transition regions of the target main body and the background region, and therefore candidate mask images of different layers are obtained through decoding, and strong attention regression to low-level features is achieved. The candidate mask images not only comprise the positions of the target main body in the original image, but also comprise the characteristic information of the edge transition areas of the target main body and the background area, so that the strong attention regression of low-level characteristics can be effectively realized through the update processing of the candidate mask images of different levels, the target mask images with clearer textures of the target main body and finer edge transition details can be obtained, further, the matting result corresponding to the target main body can be accurately obtained according to the target mask images, and the matting accuracy is further improved.
In addition, the second decoding process is performed based on the feature maps of different scales and the mask image, so that the color components corresponding to the foreground region in the original image can be accurately extracted, and in particular the color components corresponding to the edge transition regions between the foreground and the background can be accurately obtained, thereby obtaining a foreground color map that includes the target subject. The foreground color map and the target mask map are fused, so that the edge details of the obtained matting result are richer and the edge color is more accurate. The matting result is then combined with a background image to obtain a composite image with clear, natural edge detail transitions and natural color transitions, so that the composite image is more natural and realistic.
As shown in fig. 12, a schematic diagram is provided comparing the matting result obtained by the image processing model of the present embodiment with the processing result obtained in the conventional manner. As can be seen from fig. 12, the image processing model of the present embodiment processes the original image to obtain an intermediate segmentation result, i.e. a mask image, and then obtains a final result, i.e. a target mask map, based on the intermediate segmentation result; the hair texture of the final result is clear and fine, achieving hair-level matting. The traditional method obtains the segmentation result directly, false detections occur in the background, and regions of the background whose color is similar to the hair are mistakenly segmented, so the segmentation is not accurate enough.
As shown in fig. 13, a conventional composite image without foreground color synthesis and a composite image with foreground color synthesis in the present application are provided. It is obvious that the lack of foreground color leads to color artifacts in the edge transition area during image synthesis, and the visual effect is poor. In the present method, a foreground color map is obtained through matting, so that the color transition of the edge area is more natural and finer during image synthesis, and the matting and synthesis quality is higher.
Moreover, the image obtained by using this image processing method is superior to existing methods on all objective metrics; the comparison is as follows:
MODNet is a conventional real-time portrait background replacement algorithm; MAD (Mean Absolute Difference) refers to the mean absolute error, and MSE (Mean Squared Error) refers to the mean squared error. As can be seen from the table, the mean absolute error and mean squared error of the present method are smaller than those of the traditional processing method, i.e., this image processing method achieves higher precision and accuracy.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiments of the present application also provide an image processing apparatus for implementing the above-mentioned image processing method, and a training apparatus for implementing the above-mentioned image processing model training method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitations in the embodiments of the one or more image processing devices provided below may be referred to as limitations of the image processing method, and the specific limitations in the embodiments of the training device of the one or more image processing models provided below may be referred to as limitations of the training method of the image processing model, which are not repeated herein.
In one embodiment, as shown in fig. 14, an image processing apparatus 1400 is provided, which may employ software modules or hardware modules, or a combination of both, as part of a computer device, the apparatus specifically comprising: an encoding module 1402, a first decoding module 1404, a second decoding module 1406, an updating module 1408, and a result determination module 1410, wherein:
the encoding module 1402 is configured to perform feature encoding processing on the original image to obtain a plurality of feature maps with different scales.
The first decoding module 1404 is configured to perform a first decoding process based on a plurality of feature maps with different scales, to obtain a mask image, where the mask image is used to partition a target subject from an original image on a semantic level.
And a second decoding module 1406, configured to perform a second decoding process based on the feature maps and the mask images with different scales, to obtain candidate mask maps with different levels.
And an updating module 1408, configured to update the candidate mask maps of the multiple different layers according to edge transition characteristics between the target main body and the background area included in each candidate mask map, so as to obtain the target mask map.
The result determining module 1410 is configured to determine a matting result corresponding to the target subject according to the target mask.
In this embodiment, feature encoding processing is performed on an original image to obtain a plurality of feature maps with different scales, a large-scale feature map contains more low-level semantic features, and a small-scale feature map contains more high-level semantic features. The first decoding process is performed based on a plurality of feature maps with different scales, so that a target subject in an original image can be identified based on semantic features with different levels, and the position of the target subject in the original image is represented on a semantic level through a mask image so as to initially segment the target subject. The mask image can provide stronger semantic information, and the second decoding process is performed based on the feature images and the mask image with different scales, so that the layer interpretation of the edge transition region of the target main body and the background region can be effectively guided. In addition, the levels of the candidate mask images are different, the candidate mask images of the high level contain edge transition information between more target main bodies and background areas, the candidate mask images of the different levels are updated according to the edge transition characteristics in the candidate mask images, strong attention regression to the low level characteristics can be effectively realized, and therefore the target mask images with clearer textures of the target main bodies and finer edge transition details are obtained, further, the matting results corresponding to the target main bodies can be accurately obtained according to the target mask images, and the matting accuracy is further improved.
In one embodiment, the operations of the apparatus are performed by an image processing model, the image processing model including an encoding structure constituted by a plurality of encoding units, a first decoding structure constituted by a plurality of first decoding units, and a second decoding structure constituted by a plurality of second decoding units; the encoding structure is used for performing the feature encoding processing on the original image, the first decoding structure is used for performing the first decoding processing, and the second decoding structure is used for performing the second decoding processing.
In this embodiment, the image processing method is performed by an image processing model including an encoding structure constituted by a plurality of encoding units, a first decoding structure constituted by a plurality of first decoding units, and a second decoding structure constituted by a plurality of second decoding units. Feature encoding processing is performed on the original image through the encoding structure to obtain semantic features of different levels, so that the first decoding structure performs the first decoding processing based on the semantic features of different levels, and the position of the target subject in the original image is preliminarily predicted and embodied through the mask image. The mask image indicates the rough position of the target subject in the original image, and the second decoding structure performs the second decoding processing using the semantic features of different levels together with the mask image, which effectively guides the second decoding structure to focus on parsing the edge transition region between the target subject and the background area, so that candidate mask maps of different levels are obtained by decoding and strong attention regression to low-level features is achieved. The candidate mask maps include not only the position of the target subject in the original image but also the characteristic information of the edge transition region between the target subject and the background area, so that a target mask map with finer edge transition details can be obtained through the candidate mask maps of different levels.
In one embodiment, the encoding module 1402 is further configured to obtain an original image and input the original image into the encoding structure; and to perform encoding processing sequentially by the plurality of encoding units in the encoding structure, each based on its corresponding input data, to obtain the feature maps output by the respective encoding units; the input data of the initial encoding unit in the encoding structure is the original image, and the scales of the feature maps output by different encoding units in the encoding structure are different.
In this embodiment, the encoding structure includes a plurality of encoding units, the input data of the initial encoding unit in the encoding structure is the original image, and the feature map output by a preceding encoding unit is used as the input data of the following encoding unit, so that semantic information in the original image can be extracted layer by layer through the different encoding units. The feature maps output by the encoding units differ in scale, so that semantic information of different levels in the original image can be represented through feature maps of different scales. A large-scale feature map contains rich low-level information, while a small-scale feature map is a deep feature map that reflects high-level semantic information, so that subsequent decoding can combine semantic information of different levels, which improves decoding accuracy.
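The encoding structure described above can be pictured with a short sketch. The following is a minimal PyTorch-style example, not taken from the patent: the number of encoding units, the channel widths and the pooling-based downsampling are all assumptions made for illustration.

```python
# Illustrative sketch only: a minimal encoding structure that outputs feature
# maps at several scales. Layer widths and depth are assumed, not from the patent.
import torch
import torch.nn as nn

class EncodingUnit(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # halves the spatial scale at each unit
        )

    def forward(self, x):
        return self.block(x)

class EncodingStructure(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        self.units = nn.ModuleList(
            EncodingUnit(channels[i], channels[i + 1]) for i in range(len(channels) - 1)
        )

    def forward(self, image):
        feats, x = [], image  # the initial encoding unit takes the original image
        for unit in self.units:
            x = unit(x)       # each later unit takes the previous unit's feature map
            feats.append(x)   # feature maps differ in scale (1/2, 1/4, 1/8, 1/16)
        return feats

# usage: feats = EncodingStructure()(torch.randn(1, 3, 256, 256))
```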
In one embodiment, the first decoding module 1404 is further configured to, based on the plurality of feature maps of different scales, perform the first decoding process sequentially by the plurality of first decoding units in the first decoding structure, each based on its corresponding input data, until a mask image is output by the last first decoding unit, where the input data of the current first decoding unit in the first decoding structure includes the decoding result corresponding to the previous first decoding unit and the feature map output by the encoding unit corresponding to the current first decoding unit, and the decoding result corresponding to the current first decoding unit is used to form the input data of the subsequent first decoding unit.
In this embodiment, the input data of the current first decoding unit includes the decoding result of the previous first decoding unit and the feature map output by the corresponding encoding unit, so that the current first decoding unit decodes again based on the previous decoding result and the corresponding encoding feature map, and the feature map characterizes the semantic information of the original image, so that the target subject in the original image can be identified through multiple times of decoding, and the position of the target subject in the original image is identified through the mask image on the semantic level, so as to initially segment the target subject.
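As a rough illustration of how a first decoding unit can combine the previous decoding result with the corresponding encoder feature map, consider the sketch below; the upsample-concatenate-convolve design and all channel sizes are assumptions for illustration, not the patent's definition.

```python
# Illustrative sketch only: one first decoding unit. It upsamples the previous
# decoding result, concatenates the matching encoder feature map, and refines it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstDecodingUnit(nn.Module):
    def __init__(self, dec_ch, enc_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dec_ch + enc_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, prev_decoding, enc_feat):
        # bring the previous decoding result to the scale of the encoder feature map
        up = F.interpolate(prev_decoding, size=enc_feat.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.conv(torch.cat([up, enc_feat], dim=1))

# the last first decoding unit would map its output to a 1-channel mask image,
# e.g. torch.sigmoid(nn.Conv2d(out_ch, 1, 1)(x))
```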
In one embodiment, the second decoding module 1406 is further configured to determine the input data corresponding to each second decoding unit in the second decoding structure, where the input data corresponding to a second decoding unit includes the feature map output by the corresponding encoding unit and the candidate mask map output by the preceding second decoding unit, and the input data corresponding to at least some of the second decoding units further includes the mask image; and to perform the second decoding process sequentially by the plurality of second decoding units in the second decoding structure, each based on its corresponding input data, so as to obtain the candidate mask maps output by the respective second decoding units, where the candidate mask maps output by different second decoding units differ in level.
In this embodiment, the input data corresponding to some of the second decoding units in the second decoding structure includes the feature map output by the corresponding encoding unit and the candidate mask map output by the preceding second decoding unit, so that these second decoding units combine the corresponding encoding features with the candidate mask map obtained by the previous decoding to decode further and obtain a more accurate candidate mask map. The input data corresponding to at least some of the second decoding units further includes the mask image; the mask image provides strong semantic information and guides the second decoding units to focus on decoding the edge transition region, which further improves the parsing accuracy of the edge transition region between the target subject and the background area. Moreover, the candidate mask maps output by the second decoding units differ in level: a lower-level candidate mask map contains more texture information of the target subject, and a higher-level candidate mask map contains more edge transition information between the target subject and the background area, so that a target mask map with clearer target-subject texture and clearer edge transition can be obtained based on the candidate mask maps of a plurality of different levels.
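A hypothetical second decoding unit could assemble its input data as sketched below; whether a given unit receives the mask image, and the exact layer layout, are assumptions for illustration only.

```python
# Illustrative sketch only: a second decoding unit whose input combines the
# corresponding encoder feature map, the previous candidate mask map, and
# (for at least some units) the mask image from the first decoding branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondDecodingUnit(nn.Module):
    def __init__(self, enc_ch, prev_ch, use_mask_image, out_ch):
        super().__init__()
        in_ch = enc_ch + prev_ch + (1 if use_mask_image else 0)
        self.use_mask_image = use_mask_image
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, enc_feat, prev_candidate, mask_image=None):
        target = enc_feat.shape[-2:]
        prev = F.interpolate(prev_candidate, size=target, mode="bilinear", align_corners=False)
        inputs = [enc_feat, prev]
        if self.use_mask_image:
            # the mask image supplies semantic guidance for the edge transition region
            inputs.append(F.interpolate(mask_image, size=target, mode="bilinear", align_corners=False))
        return self.conv(torch.cat(inputs, dim=1))
```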
In one embodiment, the updating module 1408 is further configured to enlarge the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level, to obtain an enlarged candidate mask map; extract an edge transition image between the target subject and the background area in the enlarged candidate mask map, where the edge transition image includes edge transition characteristics; perform fusion processing based on the edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map; take the updated mask map as the candidate mask map of the current level in the next round and return to the step of enlarging the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level, continuing until the updated mask map corresponding to the candidate mask map of the last level is obtained; and take the updated mask map corresponding to the candidate mask map of the last level as the target mask map.
In this embodiment, a low-level candidate mask map includes more bottom-layer information, i.e., more texture information of the target subject, and a high-level candidate mask map includes more deep information, i.e., more edge transition information between the target subject and the background area. The candidate mask map of the current level is enlarged to the same scale as the candidate mask map of the next adjacent level, and an edge transition image between the target subject and the background area is extracted from the enlarged candidate mask map, so that edge transition information can be extracted from the high-level candidate mask map. Through the fusion processing based on the edge transition image and the candidate mask map of the next adjacent level, the edge transition information in the low-level candidate mask map can be updated with the high-level edge transition information, so that the edge transition details between the target subject and the background area in the updated low-level candidate mask map are clearer while enough texture information of the target subject is retained, which improves the matting accuracy.
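The level-by-level update can be pictured with the following sketch; the thresholds used to extract the edge transition band and the fusion rule are assumptions, not the patent's formulas.

```python
# Illustrative sketch only: updating candidate mask maps level by level.
import torch.nn.functional as F

def extract_edge_transition(mask, low=0.05, high=0.95):
    # pixels that are neither clearly foreground nor clearly background
    return ((mask > low) & (mask < high)).float()

def update_candidate_masks(candidates):
    """candidates: candidate mask maps ordered from high level (small scale)
    to low level (large scale), each of shape (N, 1, H, W)."""
    current = candidates[0]
    for nxt in candidates[1:]:
        # enlarge the current-level map to the scale of the next adjacent level
        up = F.interpolate(current, size=nxt.shape[-2:], mode="bilinear", align_corners=False)
        band = extract_edge_transition(up)   # edge transition image
        # assumed fusion rule: refine the transition band with the high-level
        # information and keep the next level's texture elsewhere
        current = band * up + (1.0 - band) * nxt
    return current  # updated mask map of the last level, i.e. the target mask map
```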
In one embodiment, the updating module 1408 is further configured to perform morphological processing on the edge transition image to obtain a processed edge transition image, where the morphological processing includes at least one of an erosion operation and a dilation operation; and perform fusion processing on the processed edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map.
In this embodiment, the morphological processing and the filtering processing ensure that the obtained edge transition image contains few or no noise points and has softer edges. Fusion processing is performed on the processed edge transition image and the candidate mask map of the next adjacent level, so that the edge transition details of the target subject in the obtained updated mask map are finer.
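A sketch of the morphological and filtering processing on the edge transition image is given below, using OpenCV; the kernel sizes, iteration counts and the choice of a Gaussian filter are assumptions for illustration.

```python
# Illustrative sketch only: softening the edge transition image with erosion,
# dilation and blur before fusion. Kernel sizes are assumed, not from the patent.
import cv2
import numpy as np

def refine_edge_transition(band):
    """band: float32 array in [0, 1] of shape (H, W) marking the edge transition region."""
    kernel = np.ones((3, 3), np.uint8)
    band = cv2.erode(band, kernel, iterations=1)    # erosion removes stray noise points
    band = cv2.dilate(band, kernel, iterations=2)   # dilation restores a wider band
    band = cv2.GaussianBlur(band, (5, 5), 0)        # filtering makes the band edge softer
    return np.clip(band, 0.0, 1.0)
```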
In one embodiment, the second decoding module 1406 is further configured to perform the second decoding process based on the feature maps of the plurality of different scales and the mask image, to obtain a foreground color map including the target subject;
the result determining module 1410 is further configured to perform fusion processing on the foreground color map and the target mask map, and obtain a matting result corresponding to the target subject.
In this embodiment, the second decoding process is performed based on the feature maps of different scales and the mask image, so that the color components corresponding to the foreground region in the original image can be accurately extracted, and in particular the color components corresponding to the edge transition region between the foreground and the background can be accurately obtained, yielding a foreground color map including the target subject. Fusion processing is performed on the foreground color map and the target mask map, so that the edge details of the obtained matting result are richer and the edge colors are more accurate.
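A simple way to picture the fusion of the foreground color map and the target mask map is standard alpha-style blending, sketched below; the patent's exact fusion processing may differ.

```python
# Illustrative sketch only: composing the matting result from the foreground
# color map and the target mask map, assuming both are normalised to [0, 1].
import torch

def compose_matting_result(foreground_color, target_mask, background=None):
    """foreground_color: (N, 3, H, W); target_mask: (N, 1, H, W) alpha-like mask."""
    if background is None:
        background = torch.zeros_like(foreground_color)
    # alpha-style blending; the patent's exact fusion rule may differ
    return target_mask * foreground_color + (1.0 - target_mask) * background
```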
In one embodiment, the operations of the apparatus are performed by an image processing model, and the apparatus further includes:
the acquisition module is used for acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image.
The model determining module is used for determining an image processing model to be trained, and the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure.
And the first training module is used for carrying out first training on the initial coding structure and the initial first decoding structure through the first sample image set until the first stopping condition is met, and obtaining the coding structure and the first decoding structure after training.
The second training module is used for performing second training on the initial second decoding structure through the second sample image set based on the trained coding structure and the first decoding structure until a second stopping condition is met, so as to obtain a trained second decoding structure and obtain a trained image processing model; the trained image processing model is used for determining a matting result corresponding to the target subject in the original image.
In this embodiment, a first sample image set and a second sample image set are obtained; the first sample image set includes a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set includes a second sample image and a matting label corresponding to the second sample image. An image processing model to be trained is determined, the image processing model to be trained including an initial encoding structure, an initial first decoding structure and an initial second decoding structure. A first training is performed on the initial encoding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, so that a trained encoding structure and a trained first decoding structure are obtained. On the basis of the ensured processing precision of the trained encoding structure and first decoding structure, a second training is then performed on the initial second decoding structure through the second sample image set until a second stopping condition is met, so that a trained second decoding structure is obtained, and the trained encoding structure, first decoding structure and second decoding structure form the trained image processing model. Through this staged training with the segmentation labels and the matting labels, the image processing model can accurately determine the matting result corresponding to the target subject in the original image. Moreover, since the image processing model is used for determining the matting result corresponding to the target subject in the original image, the matting efficiency can be effectively improved.
In one embodiment, as shown in fig. 15, there is provided an image processing model training apparatus 1500, which may employ software modules or hardware modules, or a combination of both, as part of a computer device, the apparatus specifically comprising: the acquisition module 1502, the model determination module 1504, the first training module 1506, and the second training module 1508, wherein,
the acquiring module 1502 is configured to acquire a first sample image set and a second sample image set, where the first sample image set includes a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set includes a second sample image and a matting label corresponding to the second sample image.
The model determination module 1504 is configured to determine an image processing model to be trained, where the image processing model to be trained includes an initial encoding structure, an initial first decoding structure, and an initial second decoding structure.
The first training module 1506 is configured to perform a first training on the initial encoding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, thereby obtaining a trained encoding structure and a trained first decoding structure.
The second training module 1508 is configured to perform a second training on the initial second decoding structure through the second sample image set and based on the trained encoding structure and first decoding structure, until a second stopping condition is met, to obtain a trained second decoding structure and thereby a trained image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
In this embodiment, a first sample image set and a second sample image set are obtained; the first sample image set includes a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set includes a second sample image and a matting label corresponding to the second sample image. An image processing model to be trained is determined, the image processing model to be trained including an initial encoding structure, an initial first decoding structure and an initial second decoding structure. A first training is performed on the initial encoding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, so that a trained encoding structure and a trained first decoding structure are obtained. On the basis of the ensured processing precision of the trained encoding structure and first decoding structure, a second training is then performed on the initial second decoding structure through the second sample image set until a second stopping condition is met, so that a trained second decoding structure is obtained, and the trained encoding structure, first decoding structure and second decoding structure form the trained image processing model. Through this staged training with the segmentation labels and the matting labels, the image processing model can accurately determine the matting result corresponding to the target subject in the original image. Moreover, since the image processing model is used for determining the matting result corresponding to the target subject in the original image, the matting efficiency can be effectively improved.
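The staged training can be sketched as follows. The model attributes (`encoder`, `first_decoder`, `second_decoder`) and helper methods (`encode`, `first_decode`, `second_decode`), the optimizers, the loss functions and the use of fixed epoch counts in place of the stopping conditions are all assumptions; freezing the trained structures during the second stage is likewise only one possible reading of "based on the trained encoding structure and first decoding structure".

```python
# Illustrative sketch only: a two-stage training schedule with assumed model
# attributes and placeholder loss functions.
import torch

def train_two_stages(model, first_set, second_set, seg_loss_fn, matting_loss_fn, epochs1, epochs2):
    # stage 1: train the initial encoding structure and initial first decoding structure
    opt1 = torch.optim.Adam(
        list(model.encoder.parameters()) + list(model.first_decoder.parameters()), lr=1e-4)
    for _ in range(epochs1):                      # stand-in for the first stopping condition
        for image, seg_label in first_set:
            mask_image = model.first_decode(model.encode(image))
            loss = seg_loss_fn(mask_image, seg_label)
            opt1.zero_grad(); loss.backward(); opt1.step()

    # stage 2: keep the trained structures fixed, train only the initial second decoding structure
    for p in model.encoder.parameters(): p.requires_grad_(False)
    for p in model.first_decoder.parameters(): p.requires_grad_(False)
    opt2 = torch.optim.Adam(model.second_decoder.parameters(), lr=1e-4)
    for _ in range(epochs2):                      # stand-in for the second stopping condition
        for image, matting_label in second_set:
            feats = model.encode(image)
            mask_image = model.first_decode(feats)
            pred_mask = model.second_decode(feats, mask_image)
            loss = matting_loss_fn(pred_mask, matting_label)
            opt2.zero_grad(); loss.backward(); opt2.step()
    return model
```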
In one embodiment, the first training module 1506 is further configured to perform feature encoding processing on the first sample image through the initial encoding structure to obtain a plurality of first feature maps with different scales; performing first decoding processing on the basis of a plurality of first feature images with different scales through an initial first decoding structure to obtain a first mask image corresponding to a first sample image; training the initial coding structure and the initial first decoding structure according to the difference between the segmentation labels corresponding to the first mask image and the first sample image until the first stopping condition is met, and obtaining the coding structure and the first decoding structure after training.
In this embodiment, feature encoding processing is performed on the first sample image through the initial encoding structure to obtain a plurality of first feature maps of different scales, a first decoding process is performed based on the plurality of first feature maps of different scales through the initial first decoding structure to obtain a first mask image corresponding to the first sample image, and the initial encoding structure and the initial first decoding structure are trained according to the difference between the first mask image and the segmentation label corresponding to the first sample image until the first stopping condition is met, so that the trained encoding structure can accurately extract semantic information of different levels. A large-scale feature map contains rich low-level information, and a small-scale feature map reflects high-level semantic information, so that the initial first decoding structure can decode by combining semantic information of different levels, which improves decoding accuracy.
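For example, the difference between the first mask image and the segmentation label could be measured with a binary cross-entropy loss, as in the sketch below; the specific loss form is an assumption, and a gradient descent method would then minimise it.

```python
# Illustrative sketch only: one possible choice for the first-stage difference.
import torch.nn.functional as F

def first_stage_loss(first_mask_logits, segmentation_label):
    # segmentation_label: float binary map of the same shape marking the target subject region
    return F.binary_cross_entropy_with_logits(first_mask_logits, segmentation_label)
```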
In one embodiment, the second training module 1508 is further configured to perform feature encoding processing on the second sample image through the encoding structure, to obtain a plurality of second feature maps of different scales; perform the first decoding process based on the plurality of second feature maps of different scales through the first decoding structure, to obtain a second mask image corresponding to the second sample image; perform the second decoding process based on the plurality of second feature maps of different scales and the second mask image through the initial second decoding structure, to obtain sample mask maps of a plurality of different levels; update, through the initial second decoding structure, the sample mask maps of the plurality of different levels based on sample edge transition characteristics between the sample subject and the background area included in each sample mask map, to obtain a prediction mask map; and train the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until the second stopping condition is met, to obtain a trained second decoding structure.
In this embodiment, the initial second decoding structure performs the second decoding process using semantic features of different levels together with the second mask image, which effectively guides the second decoding structure to focus on parsing the edge transition region between the sample subject and the background area, so that sample mask maps of different levels are obtained by decoding and strong attention regression to low-level features is achieved during training. The sample mask maps include not only the position of the sample subject in the image but also the characteristic information of the edge transition region between the sample subject and the background area, so that a prediction mask map with finer edge transition details can be obtained through the sample mask maps of different levels. Training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image allows the parameters of the second decoding structure to be continuously adjusted, so that the trained second decoding structure can predict finer edge transition details in the mask map, making the matting more accurate.
In one embodiment, the mask loss includes a global loss and a scale loss; the second training module 1508 is further configured to determine the global loss between the prediction mask map and the matting label corresponding to the second sample image; adjust the prediction mask map to different scales, and determine the scale loss of the prediction mask map at the different scales; construct a target loss function based on the global loss and the scale loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
In this embodiment, the target loss function is constructed by combining the difference between the prediction mask map and the real matting label at the original scale with the differences between the prediction mask map and the matting label at different scales, so that the influence of prediction results at different scales on the image processing model can be taken into account, and the trained second decoding structure is compatible with processing images of different scales, making the generated prediction mask map more accurate.
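One possible form of such a target loss function is sketched below; the L1 distance and the particular set of scales are assumptions.

```python
# Illustrative sketch only: global loss plus scale losses at resized resolutions.
import torch.nn.functional as F

def global_and_scale_loss(pred_mask, matting_label, scales=(0.5, 0.25)):
    loss = F.l1_loss(pred_mask, matting_label)                      # global loss at the original scale
    for s in scales:
        p = F.interpolate(pred_mask, scale_factor=s, mode="bilinear", align_corners=False)
        g = F.interpolate(matting_label, scale_factor=s, mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(p, g)                               # scale loss at scale s
    return loss
```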
In one embodiment, the second training module 1508 is further configured to determine a segmentation loss between the second mask image and the segmentation label corresponding to the second sample image; determine the mask loss between the prediction mask map and the matting label corresponding to the second sample image, and construct a target loss function based on the segmentation loss and the mask loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
In this embodiment, the segmentation loss represents the loss produced by the trained encoding structure and first decoding structure, while the mask loss represents the loss produced by the initial second decoding structure. Combining the segmentation loss and the mask loss to train the initial second decoding structure can further improve the accuracy of the second decoding structure, so that the mask map predicted by the trained second decoding structure is more accurate.
In one embodiment, the second sample image set further includes a foreground color label corresponding to the second sample image, and the second training module 1508 is further configured to perform the second decoding process based on the plurality of second feature maps of different scales and the second mask image through the initial second decoding structure, to obtain a predicted foreground color map; determine a foreground color loss between the predicted foreground color map and the foreground color label corresponding to the second sample image; determine a mask loss between the prediction mask map and the matting label corresponding to the second sample image; construct a target loss function based on the mask loss and the foreground color loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
In this embodiment, the second decoding process is performed based on the feature maps of different scales and the mask image, so that the color components corresponding to the foreground region in the image can be accurately extracted, and in particular the color components corresponding to the edge transition region between the foreground and the background can be accurately obtained, yielding a predicted foreground color map including the sample subject. Training the initial second decoding structure by combining the foreground color loss and the mask loss allows the parameters used to predict the foreground color map and the mask map of an image to be trained at the same time, so that the trained second decoding structure can share the output data of the encoding structure and the first decoding structure and independently predict the foreground color map and the mask map of the image based on the shared output data, making the edge details of the obtained matting result richer and the edge colors more accurate.
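A minimal sketch of a target loss combining the foreground color loss and the mask loss follows; the L1 form and the weighting factor are assumptions.

```python
# Illustrative sketch only: joint foreground-color and mask loss.
import torch.nn.functional as F

def color_and_mask_loss(pred_foreground, foreground_label, pred_mask, matting_label, w=1.0):
    foreground_loss = F.l1_loss(pred_foreground, foreground_label)  # foreground color loss
    mask_loss = F.l1_loss(pred_mask, matting_label)                 # mask loss
    return mask_loss + w * foreground_loss                          # target loss function
```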
In one embodiment, the second sample images include a composite image; the second training module 1508 is further configured to determine a channel loss, on the color channels, between the prediction mask map corresponding to the composite image and the corresponding matting label; determine the mask loss between the prediction mask map and the matting label corresponding to the second sample image, and construct a target loss function based on the mask loss and the channel loss; and train the initial second decoding structure through the target loss function until the second stopping condition is met, to obtain a trained second decoding structure.
In this embodiment, using the composite image in training can improve the richness and complexity of the training data, which effectively improves the recognition capability of the image processing model. Moreover, combining the channel losses on the different color channels with the mask loss to train the initial second decoding structure can further improve the decoding accuracy of the second decoding structure.
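The embodiment above does not spell the channel loss out; one plausible reading, sketched below under that assumption, is a per-color-channel re-composition loss evaluated only on composite samples and combined with the mask loss. The `foreground`, `background` and `is_composite` inputs are hypothetical and would only be available for synthetic data.

```python
# Illustrative sketch only: per-channel re-composition loss for composite samples
# plus the mask loss; the exact channel term in the patent may differ.
import torch.nn.functional as F

def channel_and_mask_loss(pred_mask, matting_label, composite, foreground, background, is_composite):
    """pred_mask, matting_label: (N, 1, H, W); composite, foreground, background: (N, 3, H, W);
    is_composite: (N,) boolean flags marking synthetic samples."""
    mask_loss = F.l1_loss(pred_mask, matting_label)
    # re-compose each color channel with the predicted mask and compare it to the composite image
    recomposed = pred_mask * foreground + (1.0 - pred_mask) * background
    per_channel = (recomposed - composite).abs().mean(dim=(2, 3))        # (N, 3)
    channel_loss = (per_channel.mean(dim=1) * is_composite.float()).mean()
    return mask_loss + channel_loss
```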
The respective modules in the image processing apparatus and the training apparatus of the image processing model described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device, which may be a terminal or a server, is provided, and this embodiment is described by taking the computer device as a server as an example, and the internal structure thereof may be as shown in fig. 16. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing image processing data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application is applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to fall within the scope of this description.
The above examples only represent a few embodiments of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, and these shall all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.
Claims (20)
1. An image processing method, the method comprising:
performing feature encoding processing on the original image to obtain a plurality of feature maps of different scales;
performing first decoding processing based on the feature maps of the different scales to obtain a mask image, wherein the mask image is used for segmenting a target subject from the original image at a semantic level;
performing second decoding processing based on the feature maps of the different scales and the mask image to obtain candidate mask maps of a plurality of different levels;
updating the candidate mask maps of the plurality of different levels according to edge transition characteristics between the target subject and a background area included in each candidate mask map to obtain a target mask map;
and determining a matting result corresponding to the target subject according to the target mask map.
2. The method of claim 1, wherein the method is performed by an image processing model comprising an encoding structure comprised of a plurality of encoding units, a first decoding structure comprised of a plurality of first decoding units, a second decoding structure comprised of a plurality of second decoding units; the encoding structure is used for performing feature encoding processing on an original image, the first decoding structure is used for performing first decoding processing, and the second decoding structure is used for performing second decoding processing.
3. The method according to claim 2, wherein the performing feature encoding processing on the original image to obtain a plurality of feature maps of different scales includes:
acquiring an original image, and inputting the original image into the encoding structure;
sequentially carrying out encoding processing by a plurality of encoding units in the encoding structure based on input data corresponding to the encoding units respectively, to obtain feature maps output by the encoding units respectively; the input data of the initial encoding unit in the encoding structure is the original image, and the scales of the feature maps output by different encoding units in the encoding structure are different.
4. The method according to claim 2, wherein the performing a first decoding process based on the feature maps of the plurality of different scales to obtain a mask image includes:
and based on the feature maps of the different scales, sequentially performing first decoding processing by a plurality of first decoding units in the first decoding structure based on the input data corresponding to each first decoding unit until a mask image is output by the last first decoding unit, wherein the input data of the current first decoding unit in the first decoding structure comprises the decoding result corresponding to the previous first decoding unit and the feature map output by the encoding unit corresponding to the current first decoding unit, and the decoding result corresponding to the current first decoding unit is used for forming the input data of a subsequent first decoding unit.
5. The method according to claim 2, wherein the performing a second decoding process based on the feature maps of the plurality of different scales and the mask image to obtain a plurality of candidate mask maps of different levels includes:
determining input data corresponding to each second decoding unit in the second decoding structure, wherein the input data corresponding to a second decoding unit comprises the feature map output by the corresponding encoding unit and the candidate mask map output by the preceding second decoding unit, and the input data corresponding to at least some of the second decoding units further comprises the mask image;
and sequentially performing second decoding processing by the plurality of second decoding units in the second decoding structure based on the input data corresponding to the second decoding units, so as to obtain candidate mask maps output by the second decoding units respectively, wherein the levels of the candidate mask maps output by the second decoding units are different.
6. The method according to claim 1, wherein the updating the candidate mask maps of the plurality of different levels according to the edge transition characteristics between the target subject and the background area included in each candidate mask map to obtain the target mask map includes:
enlarging the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level to obtain an enlarged candidate mask map;
extracting an edge transition image between the target subject and the background area in the enlarged candidate mask map, wherein the edge transition image comprises edge transition characteristics;
performing fusion processing based on the edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map;
taking the updated mask map as the candidate mask map of the current level in the next round, returning to the step of enlarging the candidate mask map of the current level to the same scale as the candidate mask map of the next adjacent level, and continuing to execute until the updated mask map corresponding to the candidate mask map of the last level is obtained;
and taking the updated mask map corresponding to the candidate mask map of the last level as the target mask map.
7. The method of claim 6, wherein the fusing the edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map comprises:
performing morphological processing on the edge transition image to obtain a processed edge transition image, wherein the morphological processing comprises at least one of an erosion operation and a dilation operation;
and performing fusion processing on the processed edge transition image and the candidate mask map of the next adjacent level to obtain an updated mask map.
8. The method according to claim 1, wherein the method further comprises:
performing a second decoding process based on the feature maps of the plurality of different scales and the mask image to obtain a foreground color map including the target subject;
the determining the matting result corresponding to the target subject according to the target mask map comprises the following steps:
and performing fusion processing on the foreground color map and the target mask map to obtain a matting result corresponding to the target subject.
9. The method according to any one of claims 1 to 8, wherein the method is performed by an image processing model, the image processing model being obtained by a training step comprising:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
Determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
performing second training on the initial second decoding structure based on the training-completed coding structure and the first decoding structure through the second sample image set until a second stopping condition is met, to obtain a training-completed second decoding structure and thereby a training-completed image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
10. A method of training an image processing model, the method comprising:
acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
Determining an image processing model to be trained, wherein the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
performing first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
performing second training on the initial second decoding structure based on the training-completed coding structure and the first decoding structure through the second sample image set until a second stopping condition is met, to obtain a training-completed second decoding structure and thereby a training-completed image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
11. The method of claim 10, wherein the first training the initial encoding structure and the initial first decoding structure through the first sample image set until a first stop condition is met, obtaining a trained encoding structure and a trained first decoding structure, comprises:
performing feature coding processing on the first sample image through the initial coding structure to obtain a plurality of first feature maps of different scales;
performing first decoding processing on the basis of the first feature maps with the different scales through the initial first decoding structure to obtain a first mask image corresponding to the first sample image;
training the initial coding structure and the initial first decoding structure according to the difference between the segmentation labels corresponding to the first mask image and the first sample image until a first stopping condition is met, and obtaining a trained coding structure and a trained first decoding structure.
12. The method of claim 10, wherein said performing a second training on said initial second decoding structure through said second set of sample images and based on the training-completed encoding structure and the first decoding structure until a second stop condition is met, results in a training-completed second decoding structure, comprising:
performing feature coding processing on the second sample image through the coding structure to obtain a plurality of second feature maps of different scales;
performing first decoding processing on the basis of the second feature maps with the different scales through the first decoding structure to obtain a second mask image corresponding to the second sample image;
Performing a second decoding process based on the second feature maps of the plurality of different scales and the second mask image through the initial second decoding structure to obtain a plurality of sample mask maps of different levels;
updating the sample mask maps of the plurality of different levels based on sample edge transition characteristics between the sample subject and a background area included in each sample mask map through the initial second decoding structure, to obtain a prediction mask map;
and training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure.
13. The method of claim 12, wherein the mask loss comprises a global loss and a scale loss; the training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, comprises:
determining the global loss between the prediction mask map and the matting label corresponding to the second sample image;
Adjusting the prediction mask map to different scales, and determining scale loss of the prediction mask map under the different scales;
constructing a target loss function based on the global loss and the scale loss;
training the initial second decoding structure through the target loss function until the second stopping condition is met, and obtaining a trained second decoding structure.
14. The method according to claim 12, wherein the method further comprises:
determining a segmentation loss between the second mask image and the segmentation label corresponding to the second sample image;
the training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, comprises:
determining the mask loss between the prediction mask map and the matting label corresponding to the second sample image, and constructing a target loss function based on the segmentation loss and the mask loss;
training the initial second decoding structure through the target loss function until the second stopping condition is met, and obtaining a trained second decoding structure.
15. The method of claim 12, wherein the second set of sample images further comprises a foreground color tag corresponding to the second sample image, the method further comprising:
performing a second decoding process based on the second feature maps of the plurality of different scales and the second mask image through the initial second decoding structure, to obtain a predicted foreground color map;
determining a foreground color loss between the predicted foreground color map and a foreground color label corresponding to the second sample image;
the training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, comprises:
determining the mask loss between the prediction mask map and the matting label corresponding to the second sample image;
constructing a target loss function based on the mask loss and the foreground color loss;
training the initial second decoding structure through the target loss function until the second stopping condition is met, and obtaining a trained second decoding structure.
16. The method of claim 12, wherein the second sample image comprises a composite image; the method further comprises the steps of:
determining a channel loss, on the color channels, between the prediction mask map corresponding to the composite image and the corresponding matting label;
the training the initial second decoding structure based on the mask loss between the prediction mask map and the matting label corresponding to the second sample image until a second stopping condition is met, to obtain a trained second decoding structure, comprises:
determining a mask loss between the prediction mask map and the matting label corresponding to the second sample image, and constructing a target loss function based on the mask loss and the channel loss;
training the initial second decoding structure through the target loss function until the second stopping condition is met, and obtaining a trained second decoding structure.
17. An image processing apparatus, characterized in that the apparatus comprises:
the encoding module is used for performing feature encoding processing on the original image to obtain a plurality of feature maps of different scales;
the first decoding module is used for performing first decoding processing based on the feature maps of the different scales to obtain a mask image, the mask image being used for segmenting a target subject from the original image at a semantic level;
the second decoding module is used for performing second decoding processing based on the feature maps of the different scales and the mask image to obtain candidate mask maps of a plurality of different levels;
the updating module is used for updating the candidate mask maps of the plurality of different levels according to the edge transition characteristics between the target subject and the background area included in each candidate mask map to obtain a target mask map;
and the result determining module is used for determining a matting result corresponding to the target subject according to the target mask map.
18. A training apparatus for an image processing model, the apparatus comprising:
the acquisition module is used for acquiring a first sample image set and a second sample image set, wherein the first sample image set comprises a first sample image and a segmentation label corresponding to the first sample image, and the second sample image set comprises a second sample image and a matting label corresponding to the second sample image;
the model determining module is used for determining an image processing model to be trained, and the image processing model to be trained comprises an initial coding structure, an initial first decoding structure and an initial second decoding structure;
The first training module is used for carrying out first training on the initial coding structure and the initial first decoding structure through the first sample image set until a first stopping condition is met, and obtaining a coding structure and a first decoding structure after training is completed;
the second training module is used for performing second training on the initial second decoding structure through the second sample image set based on the training-completed coding structure and the first decoding structure until a second stopping condition is met, so as to obtain a training-completed second decoding structure and obtain a training-completed image processing model; the trained image processing model is used for determining a matting result corresponding to a target subject in an original image.
19. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.