CN111292331A - Image processing method and device
- Publication number: CN111292331A
- Application number: CN202010110152.2A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06T 7/10: Image analysis; Segmentation; Edge detection
- G06T 2207/20081: Indexing scheme for image analysis or image enhancement; Training; Learning
- G06T 2207/20084: Artificial neural networks [ANN]
Abstract
The application provides an image processing method and device, relating to the field of artificial intelligence and in particular to the field of computer vision. The method comprises the following steps: acquiring first spatial feature information based on original feature data of a first image processing task; acquiring second feature data according to original feature data of a second image processing task and the first spatial feature information; and performing second image processing on the second feature data to obtain a processing result of the second image processing task. The first image processing task is one of a target detection task and an instance segmentation task, and the second image processing task is the other of the two. By having one of target detection and instance segmentation provide spatial feature information to the other, the feature data of target detection and/or instance segmentation can be corrected, and the prediction accuracy of the instance segmentation task can be improved.
Description
Technical Field
The present application relates to the field of image processing, and in particular, to a method and an apparatus for image processing.
Background
In recent years, deep neural networks have performed excellently in the automated understanding of visual signals such as images and videos. In order to understand the semantic information contained in each pixel of an image, object detection and semantic segmentation have been proposed. Object detection or semantic segmentation alone can only roughly determine which object's rectangular detection frame, or which semantic class, a pixel belongs to. To achieve a finer image understanding, instance segmentation has been proposed. Building on object detection and semantic segmentation, instance segmentation can further determine to which object of which semantic class each pixel in the image belongs. Instance segmentation may be applied to tasks such as video surveillance or autonomous driving.
In the prior art, an instance segmentation task model based on a multi-task learning framework is adopted to realize instance segmentation. The instance segmentation task model adopts an object detection task model as a prior output, and an additional segmentation mask prediction model then predicts, pixel by pixel within an object detection frame given by the object detection task model, whether each pixel belongs to the object.
It should be understood that both the target detection task and the instance segmentation task determine the position of the same target, but when the two tasks are executed by the existing instance segmentation task model, their prediction results are inconsistent, so the prediction result of the instance segmentation is inaccurate.
Improving the prediction accuracy of the instance segmentation task is an urgent problem to be solved.
Disclosure of Invention
The application provides an image processing method and device, which can effectively improve the prediction accuracy of an instance segmentation task.
In a first aspect, a method of image processing is provided, the method comprising: acquiring first spatial feature information based on original feature data of a first image processing task; acquiring second feature data according to original feature data of a second image processing task and the first spatial feature information; and performing second image processing on the second feature data to obtain a processing result of the second image processing task. The first image processing task is one of a target detection task and an instance segmentation task, and the second image processing task is the other of the two. The original feature data of the first image processing task and the original feature data of the second image processing task are acquired based on image data to be processed.
When the method provides spatial feature information from target detection to instance segmentation, the feature data of the instance segmentation can be corrected by the spatial feature information of the target detection, so that the prediction results of target detection and instance segmentation can be made consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.

When the method provides spatial feature information from instance segmentation to target detection, the feature data of the target detection can be corrected by the spatial feature information of the instance segmentation, so that the prediction results of target detection and instance segmentation can likewise be made consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.

Therefore, in the present application, one of target detection and instance segmentation provides spatial feature information to the other, and the feature data of the receiving task can be corrected by the spatial feature information of the providing task, so that the prediction results of target detection and instance segmentation can be made consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: acquiring second spatial feature information based on the original feature data of the second image processing task; acquiring first feature data according to the original feature data of the first image processing task and the second spatial feature information; and performing first image processing on the first feature data to obtain a processing result of the first image processing task.
Since target detection and instance segmentation provide spatial feature information to each other, the feature data of each task can be corrected by the spatial feature information of the other, so that the prediction results of target detection and instance segmentation can be made consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be improved.

Therefore, by having target detection and instance segmentation provide spatial feature information to each other, the prediction accuracy of the instance segmentation task can be further improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task. The acquiring of the first spatial feature information based on the original feature data of the first image processing task includes: acquiring third spatial feature information based on the original feature data of the target detection task; acquiring horizontal feature information and vertical feature information respectively according to the third spatial feature information; and recombining the horizontal feature information and the vertical feature information to obtain the first spatial feature information.

In this way, the present application first extracts horizontal and vertical features from the spatial feature information of the target detection, recombines the horizontal and vertical features, and then provides the recombined spatial feature information to the instance segmentation, which can improve the accuracy of the spatial feature information of the instance segmentation and therefore the prediction accuracy of the instance segmentation task.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first feature data and the second feature data are obtained by performing the following operation, where an initial value of i is 1, and N is a positive integer.
In step S1, spatial feature information X1 is acquired based on the feature data IF1_i.

In step S2, spatial feature information X2 is acquired based on the feature data IF2_i.

In step S3, feature data OF1_i is obtained according to the feature data IF1_i and the spatial feature information X2.

In step S4, feature data OF2_i is obtained according to the feature data IF2_i and the spatial feature information X1.

In step S5, it is judged whether the value of i is equal to N; if not, go to step S6; if so, go to step S7.

In step S6, 1 is added to the value of i, the feature data OF1_(i-1) is taken as the feature data IF1_i, the feature data OF2_(i-1) is taken as the feature data IF2_i, and the process goes to step S1.

In step S7, the feature data OF1_i is the first feature data, and the feature data OF2_i is the second feature data.

When the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
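The following minimal Python sketch illustrates only the control flow of steps S1 to S7 above; it is not the patented implementation. The helper callables get_spatial_info() and fuse(), the argument names, and the assumption that the same extraction function is used for both tasks are placeholders introduced here.

```python
def exchange_features(raw_feat_task1, raw_feat_task2, get_spatial_info, fuse, n_rounds):
    """n_rounds corresponds to N above and is assumed to be a positive integer."""
    if1, if2 = raw_feat_task1, raw_feat_task2   # i = 1: original feature data of both tasks
    for _ in range(n_rounds):
        x1 = get_spatial_info(if1)              # step S1: spatial feature information X1
        x2 = get_spatial_info(if2)              # step S2: spatial feature information X2
        of1 = fuse(if1, x2)                     # step S3: OF1_i from IF1_i and X2
        of2 = fuse(if2, x1)                     # step S4: OF2_i from IF2_i and X1
        if1, if2 = of1, of2                     # steps S5/S6: next round uses this round's outputs
    return of1, of2                             # step S7: first and second feature data
```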
In this way, spatial feature information is exchanged over multiple rounds between target detection and instance segmentation, so that the feature data of both tasks can be better corrected, the prediction results of target detection and instance segmentation can be made consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be improved.
With reference to the first aspect, in a possible implementation manner of the first aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task. Performing the first image processing on the first feature data includes: processing the first feature data by using a detection frame prediction model to obtain a target detection prediction result of the first feature data. Performing the second image processing on the second feature data to obtain the processing result of the second image processing task includes: processing the second feature data by using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second feature data, where the segmentation mask prediction model is obtained by training with a detection auxiliary loss function, the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information, and the target detection label information is also used for training the detection frame prediction model.
In the present application, in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by constraining the output of the segmentation mask prediction model using the target detection label information.

With reference to the first aspect, in a possible implementation manner of the first aspect, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.

In the present application, the performance of the segmentation mask prediction model can be further improved by separately constraining the horizontal information and the vertical information of the segmentation mask prediction result output by the segmentation mask prediction model using the target detection label information.
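As a concrete illustration of how such a detection auxiliary loss could look, the sketch below projects the predicted segmentation mask onto its rows and columns and compares the projections with row/column occupancy vectors derived from the ground-truth detection frame. The per-axis max projection and the binary cross-entropy form are assumptions made for this example; the passage above only states that the vertical and horizontal information of the mask prediction is constrained by the target detection label information.

```python
import torch
import torch.nn.functional as F

def detection_auxiliary_loss(mask_logits, gt_box):
    """mask_logits: (H, W) logits from the segmentation mask prediction model.
    gt_box: (x1, y1, x2, y2) target detection label in mask-grid coordinates (assumed format).
    """
    h, w = mask_logits.shape
    probs = torch.sigmoid(mask_logits)
    col_proj = probs.max(dim=0).values                  # (W,) horizontal information
    row_proj = probs.max(dim=1).values                  # (H,) vertical information
    x1, y1, x2, y2 = gt_box
    col_target = torch.zeros(w)
    col_target[int(x1):int(x2) + 1] = 1.0               # columns covered by the detection frame
    row_target = torch.zeros(h)
    row_target[int(y1):int(y2) + 1] = 1.0               # rows covered by the detection frame
    horizontal_loss = F.binary_cross_entropy(col_proj, col_target)
    vertical_loss = F.binary_cross_entropy(row_proj, row_target)
    return horizontal_loss + vertical_loss
```

Under this reading, the horizontal term plays the role of the horizontal detection auxiliary loss and the vertical term that of the vertical detection auxiliary loss.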
With reference to the first aspect, in a possible implementation manner of the first aspect, the acquiring of the second feature data according to the original feature data of the second image processing task and the first spatial feature information includes: acquiring the second feature data by processing the original feature data of the second image processing task and the first spatial feature information using a convolutional layer.

With reference to the first aspect, in a possible implementation manner of the first aspect, the acquiring of the third spatial feature information based on the original feature data of the target detection task includes: acquiring the third spatial feature information by processing the original feature data of the target detection task using a convolutional layer. The acquiring of the horizontal feature information and the vertical feature information respectively according to the third spatial feature information includes: processing the third spatial feature information by using a pooling layer to acquire the horizontal feature information and the vertical feature information.
In a second aspect, a method of image processing is provided, the method comprising: inputting image data to be processed into a segmentation mask prediction model; and obtaining a segmentation mask prediction result of the image data to be processed by using the segmentation mask prediction model, where the segmentation mask prediction model is obtained by training with a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information.

In the present application, in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by constraining the output of the segmentation mask prediction model using the target detection label information.

With reference to the second aspect, in a possible implementation manner of the second aspect, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.

In the present application, the performance of the segmentation mask prediction model can be further improved by separately constraining the horizontal information and the vertical information of the segmentation mask prediction result output by the segmentation mask prediction model using the target detection label information.
In a third aspect, a method of image processing is provided, the method comprising: acquiring target detection label information; and training with a detection auxiliary loss function to obtain a segmentation mask prediction model, where the detection auxiliary loss function constrains the output of the segmentation mask prediction model through the target detection label information.

In the present application, in training the segmentation mask prediction model, the model accuracy of the segmentation mask prediction model can be improved by constraining the output of the segmentation mask prediction model using the target detection label information.

With reference to the third aspect, in a possible implementation manner of the third aspect, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.

It is to be understood that the performance of the segmentation mask prediction model can be further improved by separately constraining the horizontal information and the vertical information of the segmentation mask prediction result output by the segmentation mask prediction model using the target detection label information.
In a fourth aspect, an apparatus for image processing is provided, the apparatus comprising the following means.
The first acquisition unit is used for acquiring first spatial feature information based on original feature data of a first image processing task.
And the second acquisition unit is used for acquiring second characteristic data according to the original characteristic data of the second image processing task and the first spatial characteristic information.
And the first processing unit is used for carrying out second image processing on the second characteristic data to obtain a processing result of a second image processing task.
The first image processing task is one of a target detection task and an instance segmentation task, and the second image processing task is the other of the two.
The original feature data of the first image processing task and the original feature data of the second image processing task are acquired based on the image data to be processed.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the apparatus further includes the following units.
And the third acquisition unit is used for acquiring second spatial feature information based on the original feature data of the second image processing task.
And the fourth acquisition unit is used for acquiring the first characteristic data according to the original characteristic data of the first image processing task and the second spatial characteristic information.
And the second processing unit is used for carrying out first image processing on the first characteristic data to obtain a processing result of the first image processing task.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task.
Wherein the first acquisition unit is configured to: acquire third spatial feature information based on the original feature data of the target detection task; acquire horizontal feature information and vertical feature information respectively according to the third spatial feature information; and recombine the horizontal feature information and the vertical feature information to obtain the first spatial feature information.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the apparatus obtains the first feature data and the second feature data by performing the following operations, where an initial value of i is 1, and N is a positive integer:
in step S1, spatial feature information X1 is acquired based on the feature data IF1_i.

In step S2, spatial feature information X2 is acquired based on the feature data IF2_i.

In step S3, feature data OF1_i is obtained according to the feature data IF1_i and the spatial feature information X2.

In step S4, feature data OF2_i is obtained according to the feature data IF2_i and the spatial feature information X1.

In step S5, it is judged whether the value of i is equal to N; if not, go to step S6; if so, go to step S7.

In step S6, 1 is added to the value of i, the feature data OF1_(i-1) is taken as the feature data IF1_i, the feature data OF2_(i-1) is taken as the feature data IF2_i, and the process goes to step S1.

In step S7, the feature data OF1_i is the first feature data, and the feature data OF2_i is the second feature data.

When the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; the second processing unit is used for processing the first feature data by using a detection frame prediction model to obtain a target detection prediction result of the first feature data; the first processing unit is used for processing the second feature data by using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second feature data.

The segmentation mask prediction model is obtained by training with a detection auxiliary loss function, the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information, and the target detection label information is also used for training the detection frame prediction model.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.
With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the second acquisition unit is configured to acquire the second feature data by processing the original feature data of the second image processing task and the first spatial feature information using a convolutional layer.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the first acquisition unit is configured to: acquire the third spatial feature information by processing the original feature data of the target detection task using a convolutional layer; and process the third spatial feature information by using a pooling layer to acquire the horizontal feature information and the vertical feature information.
In a fifth aspect, an apparatus for image processing is provided, the apparatus comprising the following elements.
And the input unit is used for inputting the image data to be processed into the segmentation mask prediction model.
And the processing unit is used for obtaining a segmentation mask prediction result of the image data to be processed by using the segmentation mask prediction model.
The segmentation mask prediction model is obtained by training with a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information.
With reference to the fifth aspect, in a possible implementation manner of the fifth aspect, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function.

The vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.
In a sixth aspect, an apparatus for image processing is provided, the apparatus comprising the following means.
And the acquisition unit is used for acquiring the target detection label information.
And the training unit is used for training with a detection auxiliary loss function to obtain a segmentation mask prediction model, wherein the detection auxiliary loss function constrains the output of the segmentation mask prediction model through the target detection label information.
With reference to the sixth aspect, in a possible implementation manner of the sixth aspect, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function.

The vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.
In a seventh aspect, an apparatus for image processing is provided, the apparatus comprising: a memory for storing a program; a processor for executing the memory-stored program, the processor being adapted to perform the method of the first, second or third aspect described above when the memory-stored program is executed.
In an eighth aspect, there is provided a computer readable medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first, second or third aspect described above.
In a ninth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first, second or third aspect described above.
A tenth aspect provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method of the first, second, or third aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the method in the first aspect, the second aspect, or the third aspect.
In an eleventh aspect, an electronic device is provided, which includes the apparatus provided in the fourth, fifth, sixth, or seventh aspect.
Based on the above description, in the present application, one of target detection and instance segmentation provides spatial feature information to the other, and the feature data of the receiving task can be corrected by the spatial feature information of the providing task, so that the prediction results of target detection and instance segmentation can be made consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.

In addition, in the process of training the segmentation mask prediction model, the output of the segmentation mask prediction model is constrained by using the target detection label information, so that the model accuracy of the segmentation mask prediction model can be improved.
Drawings
FIG. 1 is a conceptual illustration of image classification, object detection, semantic segmentation, and example segmentation.
FIG. 2 is a schematic block diagram of an example segmented task model based on a multi-task learning framework.
Fig. 3 is a schematic flowchart of a method of image processing provided in an embodiment of the present application.
Fig. 4 is another schematic flow chart of a method of image processing provided by an embodiment of the present application.
Fig. 5 is a further schematic flowchart of a method of image processing provided in an embodiment of the present application.
Fig. 6 is a further schematic flowchart of a method of image processing provided in an embodiment of the present application.
Fig. 7 is a schematic flow chart of a method of image processing according to another embodiment of the present application.
Fig. 8 is a schematic flow chart of a method of image processing according to still another embodiment of the present application.
Fig. 9 is a schematic block diagram of an apparatus for image processing provided in an embodiment of the present application.
Fig. 10 is a schematic block diagram of the first spatial feature information acquisition unit 931 of fig. 9.

FIG. 11 is another schematic block diagram of an apparatus for image processing provided in an embodiment of the present application.
Fig. 12 is a further schematic block diagram of an apparatus for image processing provided in an embodiment of the present application.
Fig. 13 is a schematic block diagram of a system for image processing provided in an embodiment of the present application.
Fig. 14 and fig. 15 are schematic views of application scenarios of the present application.
Fig. 16 is a further schematic block diagram of an apparatus for image processing provided in an embodiment of the present application.
FIG. 17 is a further schematic block diagram of an apparatus for image processing provided in an embodiment of the present application.
Fig. 18 is a further schematic block diagram of an apparatus for image processing according to an embodiment of the present application.
Fig. 19 is a further schematic block diagram of an apparatus for image processing according to an embodiment of the present application.
Fig. 20 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
To facilitate understanding of the embodiments of the present application, several concepts related to the embodiments of the present application will be described below.
In recent years, deep neural networks have performed excellently in the automated understanding of visual signals such as images and videos. Current computer vision tasks include image classification (image classification), object detection (object detection), semantic segmentation (semantic segmentation), and instance segmentation (instance segmentation). These concepts are described below in conjunction with fig. 1. In the example of fig. 1, the picture contains 1 person, 2 sheep and 1 dog.

As shown in the upper left corner of fig. 1, image classification refers to judging which categories an image contains. For example, suppose the classification task involves four categories, namely person, sheep, dog, and cat; image classification then outputs which of these categories appear in a given picture. In the example of fig. 1, the output of the image classification task is the set of categories present in the picture: person, sheep, dog.

As shown in the upper right corner of fig. 1, object detection, simply put, finds out what objects are in the picture and where they are located (for example, each object is enclosed by a rectangular frame, which can be called a detection frame). In the example of fig. 1, the output of the target detection task is the bounding boxes of 1 person, 2 sheep, and 1 dog in the picture (e.g., the rectangular boxes in the upper right-hand picture of fig. 1).

As shown in the lower left corner of fig. 1, semantic segmentation means that every pixel in the picture needs to be classified, instead of only framing a target with a rectangular frame; however, different instances of the same object class do not need to be separated. In the example of fig. 1, the output of the semantic segmentation task marks the person, sheep, and dog pixels in the picture, but does not necessarily distinguish sheep 1 from sheep 2. Semantic segmentation is object segmentation in the general sense.

As shown in the lower right-hand corner of fig. 1, instance segmentation is a combination of object detection and semantic segmentation. Compared with the bounding box of object detection, instance segmentation is accurate to the edge of the object; compared with semantic segmentation, instance segmentation needs to label the different instances of the same object class in the picture. In the example of fig. 1, there are 1 person, 2 sheep, and 1 dog, and the instance segmentation task is to label each of these instances.

The prediction result of instance segmentation may be referred to as a segmentation mask. The segmentation mask quality characterizes how good the prediction result of the instance segmentation is.
It should be understood that FIG. 1 is intended to be illustrative only and not limiting.
The present application relates generally to object detection and instance segmentation.
The existing mainstream instance segmentation task model is often based on a multi-task learning framework. The multitask learning framework refers to a model which can be used for simultaneously performing a plurality of tasks, and is divided into a backbone network (such as the backbone network 210 shown in fig. 2) and a branch network (such as the branch networks 221, 222, 223 shown in fig. 2), wherein data is input into the backbone network to obtain a feature map, and then different branch networks perform different task outputs.
Fig. 2 is a schematic structural diagram of a conventional instance segmentation task model 200. The instance segmentation task model 200 includes a backbone network 210, a multi-class branch network 221, a detection branch network 222, and a segmentation branch network 223. The instance segmentation task model 200 takes the detection branch network 222 as a prior output, and then uses an additional segmentation branch network 223 to predict, pixel by pixel within the target detection frame, whether each pixel belongs to the given target. The segmentation branch network 223, the multi-class branch network 221, and the detection branch network 222 all operate on the feature map acquired by the backbone network 210. The multi-class branch network 221 and the detection branch network 222 perform feature processing and task output using shared fully connected layers, while the segmentation branch network 223 performs feature processing and task output using independent convolutional layers.
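The sketch below gives a rough PyTorch rendering of the branch structure described for model 200: a fully connected trunk shared by the multi-classification and detection branches, and an independent convolutional segmentation branch operating on the same backbone features. The layer sizes, channel counts, and class count are placeholders and are not taken from the patent.

```python
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, in_channels=256, roi_size=14, num_classes=80):
        super().__init__()
        flat = in_channels * roi_size * roi_size
        self.shared_fc = nn.Sequential(nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU())
        self.cls_branch = nn.Linear(1024, num_classes)      # multi-class branch network 221
        self.det_branch = nn.Linear(1024, 4 * num_classes)  # detection branch network 222
        self.seg_branch = nn.Sequential(                    # segmentation branch network 223
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, roi_feat):  # roi_feat: (N, C, roi_size, roi_size) features from backbone network 210
        shared = self.shared_fc(roi_feat)
        return self.cls_branch(shared), self.det_branch(shared), self.seg_branch(roi_feat)
```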
It should be understood that both the target detection task and the instance segmentation task make a position determination for the target (a coarse rectangular detection frame position and fine pixel positions, as shown in fig. 1). However, when the target detection task and the instance segmentation task are executed by the instance segmentation task model 200 shown in fig. 2, the prediction results of the two tasks are inconsistent, which indicates that at least one of the two prediction results is inaccurate, thereby reducing the prediction accuracy of the instance segmentation task.
In view of the above problems, the present application provides an image processing method and apparatus, which can effectively improve the prediction accuracy of an instance segmentation task.
Fig. 3 is a schematic diagram of a method 300 for image processing according to an embodiment of the present disclosure. As shown in fig. 3, the method 300 includes the following steps S310, S320 and S330.
S310, acquiring first spatial feature information based on original feature data of a first image processing task.
And S320, acquiring second feature data according to the original feature data and the first spatial feature information of the second image processing task.
The first image processing task is one of a target detection task and an instance segmentation task, and the second image processing task is the other of the two.
The original characteristic data of the first image processing task and the original characteristic data of the second image processing task are obtained based on the image data to be processed.
The image data to be processed represents an image to be subjected to the detection frame prediction and the segmentation mask prediction, such as an image input to the backbone network 210 shown in fig. 2.
For example, the original feature data of the target detection task represents data obtained by processing the image data to be processed by the feature acquisition network of the target detection task.
As another example, the original feature data of the target detection task may represent data obtained after the image data to be processed is processed by the backbone network 210 shown in fig. 2 and the feature acquisition network of the target detection task.

As another example, the original feature data of the target detection task may represent data obtained after the image data to be processed is processed by the backbone network 210 shown in fig. 2, a region proposal network, and the feature acquisition network of the target detection task.
The meaning of the original feature data of the example segmentation task is similar to the description of the original feature data of the target detection task, and is not described herein again.
For example, the original feature data of the first image processing task is obtained by processing the image data to be processed through the feature acquisition network of the first image processing task. The original feature data of the second image processing task is obtained by processing the image data to be processed through the feature acquisition network of the second image processing task.
And S330, performing second image processing on the second characteristic data to obtain a processing result of a second image processing task.
In the case where the first image processing task is a target detection task and the second image processing task is an instance segmentation task, the method 300 provided by the present embodiment includes the following steps.
In step S310, first spatial feature information is acquired based on raw feature data of the target detection task. In step S320, second feature data is obtained according to the original feature data and the first spatial feature information of the instance segmentation task. At S330, an instance segmentation task is performed on the second feature data to obtain a segmentation mask prediction result for the second feature data.
In step S320, multiple methods may be employed to obtain the second feature data according to the original feature data of the instance segmentation task and the first spatial feature information. For example, the first spatial feature information is directly concatenated with the original feature data of the instance segmentation task. As another example, the original feature data of the instance segmentation task is first processed, and the first spatial feature information is then concatenated with the processed feature data.
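As one hedged illustration of the second option (process, then concatenate), the sketch below concatenates the spatial feature information with the feature map along the channel dimension and lets a convolutional layer produce the corrected feature data, consistent with the convolutional-layer fusion mentioned later for step S320. The channel counts and the ReLU are assumptions of this sketch, not details from the patent.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, feat_channels=256, spatial_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels + spatial_channels, feat_channels, 3, padding=1)

    def forward(self, raw_feat, spatial_info):
        # raw_feat and spatial_info: (N, C, H, W) maps of matching spatial size
        fused = torch.cat([raw_feat, spatial_info], dim=1)
        return torch.relu(self.conv(fused))  # corrected (second) feature data
```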
It should be understood that target detection provides spatial feature information to instance segmentation, and the feature data of the instance segmentation can be corrected by the spatial feature information of the target detection, so that target detection can be made consistent with the prediction result of the instance segmentation to a certain extent, and the prediction accuracy of the instance segmentation task can therefore be improved.

Here, "target detection is consistent with the prediction result of the instance segmentation" means that the pixels within the target detection frame predicted by the target detection all belong to the detected object.

Therefore, this embodiment provides spatial feature information to instance segmentation through target detection, which can improve the accuracy of the spatial feature information of the instance segmentation, and thus the prediction accuracy of the instance segmentation task.
In the case where the first image processing task is an instance segmentation task and the second image processing task is an object detection task, the method 300 provided by the present embodiment includes the following steps.
In step S310, first spatial feature information is acquired based on the original feature data of the instance division task. In step S320, second feature data is obtained according to the original feature data and the first spatial feature information of the target detection task. In S330, a target detection task is performed on the second feature data, and a target detection prediction result of the second feature data is obtained.
In step S320, multiple methods may be employed to obtain the second feature data according to the original feature data of the target detection task and the first spatial feature information. For example, the first spatial feature information is directly concatenated with the original feature data of the target detection task. As another example, the original feature data of the target detection task is first processed, and the first spatial feature information is then concatenated with the processed feature data.

It should be understood that instance segmentation provides spatial feature information to target detection, and the feature data of the target detection can be corrected by the spatial feature information of the instance segmentation, so that target detection can be made consistent with the prediction result of the instance segmentation to a certain extent, and the prediction accuracy of the instance segmentation task can therefore be improved.

Here, "target detection is consistent with the prediction result of the instance segmentation" means that the pixels within the target detection frame predicted by the target detection all belong to the detected object.

Therefore, this embodiment provides spatial feature information to target detection through instance segmentation, which can correct the spatial feature information of the target detection, and thus improve the prediction accuracy of the instance segmentation task.

As can be seen from the above, in the embodiment of the present application, one of target detection and instance segmentation provides spatial feature information to the other, and the feature data of the receiving task can be corrected by the spatial feature information of the providing task, so that the prediction results of target detection and instance segmentation can be made consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, as shown in fig. 4, the method 300 further includes steps S340, S350, and S360.
And S340, acquiring second spatial feature information based on the original feature data of the second image processing task.
And S350, acquiring first feature data according to the original feature data and the second spatial feature information of the first image processing task.
And S360, performing first image processing on the first characteristic data to obtain a processing result of the first image processing task.
Taking the case where the first image processing task is a target detection task and the second image processing task is an instance segmentation task as an example, the method 300 provided by this embodiment includes the following steps.
In step S310, first spatial feature information is acquired based on raw feature data of the target detection task. In step S320, second feature data is obtained according to the original feature data and the first spatial feature information of the instance segmentation task. At S330, an instance segmentation task is performed on the second feature data to obtain a segmentation mask prediction result for the second feature data. In step S340, second spatial feature information is acquired based on the original feature data of the instance division task. And S350, acquiring first characteristic data according to the original characteristic data and the second spatial characteristic information of the target detection task. And S360, executing a target detection task on the first characteristic data to obtain a target detection prediction result of the first characteristic data.
It should be understood that, by mutually providing spatial feature information by target detection and instance segmentation, for both target detection and instance segmentation, feature data thereof can be corrected by the spatial feature information of the other, so that the prediction results of target detection and instance segmentation can be made consistent to a greater extent, and therefore the prediction accuracy of the instance segmentation task can be improved.
It should also be understood that, by mutually providing spatial feature information for target detection and instance segmentation, mutual supervision of a target detection task and an instance segmentation task can be realized, so that prediction accuracy of the instance segmentation task can be jointly improved.
Therefore, the embodiment of the application can further improve the prediction accuracy of the instance segmentation task by mutually providing the spatial feature information through the target detection and the instance segmentation.
For convenience of description and understanding, the following conventions are used hereinafter. Spatial feature information acquired based on the original feature data of the target detection task is denoted as first spatial feature information; spatial feature information acquired based on the original feature data of the instance segmentation task is denoted as second spatial feature information; feature data obtained according to the original feature data of the instance segmentation task and the first spatial feature information is denoted as second feature data; and feature data obtained according to the original feature data of the target detection task and the second spatial feature information is denoted as first feature data.
Alternatively, in the embodiment shown in fig. 3 or fig. 4, in the case that the first image processing task is the target detection task and the second image processing task is the instance segmentation task, step S310 includes the following steps S311, S312 and S313, as shown in fig. 5.
S311, acquiring third spatial feature information based on the original feature data of the target detection task.
S312, acquiring horizontal feature information and vertical feature information respectively according to the third spatial feature information.

S313, recombining the horizontal feature information and the vertical feature information to obtain the first spatial feature information.
As an example, assume that, in step S311, the third spatial feature information obtained based on the original feature data of the target detection task is a feature map whose height, width, and number of channels are h × w × c. In step S312, a global maximum pooling operation is performed on this feature map along the horizontal direction and the vertical direction, respectively, to obtain a horizontal feature of dimension w × c and a vertical feature of dimension h × c. In step S313, the horizontal feature of dimension w × c and the vertical feature of dimension h × c are recombined into a feature map of size h × w × c, which is the first spatial feature information provided to the instance segmentation task. In the feature map obtained in step S313, the feature response at each position is the mean of the horizontal feature response of its column and the vertical feature response of its row.
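The example above can be written compactly for a feature map in the (N, C, H, W) layout commonly used in PyTorch; the layout choice and the use of tensor.max for the global maximum pooling are assumptions of this sketch.

```python
import torch

def directional_recombine(feat):
    """feat: (N, C, H, W) third spatial feature information."""
    horizontal = feat.max(dim=2).values   # (N, C, W): max over the height dimension, one value per column
    vertical = feat.max(dim=3).values     # (N, C, H): max over the width dimension, one value per row
    # Recombine into an (N, C, H, W) map whose value at (i, j) is the mean of the
    # vertical response of row i and the horizontal response of column j.
    return 0.5 * (vertical.unsqueeze(3) + horizontal.unsqueeze(2))
```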
It should be appreciated that the detection frame information predicted by target detection is relatively coarse compared with the segmentation mask, that is, the pixel-level information, obtained by instance segmentation. In other words, the coarse detection frame information contains errors relative to the pixel-level information. For example, the target detection branch and the instance segmentation branch both have corresponding h × w × c feature maps, but without special processing, the detection frame branch ultimately only needs to output the top-left and bottom-right vertex coordinates of the frame, whereas the instance segmentation branch predicts, for every pixel, whether it belongs to the detected object; the feature map of the detection frame branch is therefore coarse and error-prone compared with that of the instance segmentation branch.

In the embodiment of the application, horizontal feature acquisition and vertical feature acquisition are first performed on the spatial feature information of the target detection, and the horizontal features and the vertical features are then recombined to obtain the recombined spatial feature information. This is equivalent to replacing the original pixel-level information with horizontal information and vertical information obtained from the spatial feature information of the target detection, where the original pixel-level information refers to the original spatial feature information of the target detection. Sharing the recombined spatial feature information with the instance segmentation reduces the error of the coarse detection frame relative to the precise segmentation mask, so the recombined spatial feature information is more conducive to improving the accuracy of the spatial feature information of the instance segmentation.

Therefore, in the embodiment of the application, horizontal feature acquisition and vertical feature acquisition are performed on the spatial feature information of the target detection, the horizontal features and the vertical features are recombined, and the recombined spatial feature information is provided to the instance segmentation, so that the accuracy of the spatial feature information of the instance segmentation can be improved, and the prediction accuracy of the instance segmentation task can be improved.
In some embodiments, taking the case where the first image processing task is the target detection task and the second image processing task is the instance segmentation task as an example, in step S330 the second feature data is input into the segmentation mask prediction model, and the segmentation mask prediction result of the second feature data is obtained by using the segmentation mask prediction model.

For example, the segmentation mask prediction model is trained using a pixel-by-pixel classification loss function that constrains the output of the segmentation mask prediction model through segmentation mask label information.

Alternatively, the segmentation mask prediction model may be trained by the method 800 of the embodiments below.

In some embodiments, taking the case where the first image processing task is the target detection task and the second image processing task is the instance segmentation task as an example, in step S360 the target detection prediction result of the first feature data may be obtained by using the detection frame prediction model.

For example, the detection frame prediction model is trained using a detection regression loss function that constrains the output of the detection frame prediction model through target detection label information.
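The sketch below shows one way the losses mentioned in this passage could be combined during training; the per-pixel binary cross-entropy, the smooth-L1 regression form, and the weighting are assumptions, not values from the patent.

```python
import torch.nn.functional as F

def training_loss(mask_logits, mask_labels, box_pred, box_labels, aux_loss, aux_weight=1.0):
    # Pixel-by-pixel classification loss constrained by segmentation mask label information.
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_labels)
    # Detection regression loss constrained by target detection label information.
    box_loss = F.smooth_l1_loss(box_pred, box_labels)
    # Detection auxiliary loss, e.g. the detection_auxiliary_loss sketch given earlier.
    return mask_loss + box_loss + aux_weight * aux_loss
```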
For example, in the embodiment shown in fig. 3 or fig. 4, in step 320, the original feature data of the second image processing task and the first spatial feature information are processed using a convolutional layer to obtain the second feature data.
For example, in the embodiment shown in fig. 4, in step 350, the original feature data of the first image processing task and the second spatial feature information are processed using a convolutional layer to obtain the first feature data.
As an example, a first image processing task is taken as a target detection task, and a second image processing task is taken as an example segmentation task. In step 350, the first feature data may be obtained by processing the second spatial feature information and the raw feature data of the target detection task using the convolutional layer as in block 910 of fig. 11. In step 320, the convolution layer in module 920 in fig. 11 may be used to process the first spatial feature information and the original feature data of the example segmentation task, and obtain the second feature data.
For example, in the embodiment shown in fig. 3 or fig. 4, in step 310, first spatial feature information is obtained by processing raw feature data of a first image processing task using a convolutional layer.
For another example, in the embodiment shown in fig. 4, in step 340, the second spatial feature information is obtained by processing the raw feature data of the second image processing task using the convolutional layer.
For another example, in the embodiment shown in fig. 5, in step S311, third spatial feature information is acquired by processing the original feature data of the target detection task using a convolutional layer; in step S312, transverse feature information and longitudinal feature information are acquired by processing the third spatial feature information using a pooling layer; in step S313, the transverse feature information and the longitudinal feature information are processed by a recombination layer to obtain the first spatial feature information.
As an example, the first image processing task is taken as the target detection task, and the second image processing task is taken as the instance segmentation task. In step S311, the third spatial feature information may be acquired by processing the original feature data of the target detection task using the convolutional layer in the unit 931 in fig. 11. In step S312, the third spatial feature information may be processed using the band-direction pooling layer in the unit 931 in fig. 11 to acquire the transverse feature information and the longitudinal feature information. In step S313, the transverse feature information and the longitudinal feature information may be recombined using the band-direction recombination layer in the unit 931 in fig. 11 to acquire the first spatial feature information. In step 340, the original feature data of the instance segmentation task may be processed using the convolutional layer in the unit 932 in fig. 11 to obtain the second spatial feature information.
For example, the method 300 in the above embodiments may be performed by the device 900, the device 1100, or the device 1200 in the following embodiments.
As an example, the method 300 is performed by the apparatus 900 in the embodiments below. Referring to fig. 4 and 9, step S310 may be performed by the first spatial feature information acquisition unit 931, step S320 may be performed by the instance division task feature acquisition module 920, step S340 may be performed by the second spatial feature information acquisition unit 932, and step S350 may be performed by the target detection task feature acquisition module 910. Referring to fig. 5 and 10, step S311 may be performed by the sub-unit 1001 in the first spatial feature information acquisition unit 931, and steps S312 and S313 may be performed by the sub-unit 1002 in the first spatial feature information acquisition unit 931, where step S312 may be implemented by a band-direction pooling layer in the sub-unit 1002, and step S313 may be implemented by a band-direction reorganizing layer in the sub-unit 1002.
Therefore, according to the embodiment of the application, one of target detection and instance segmentation provides spatial feature information to the other; for the receiving side, the feature data can be corrected using the other side's spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Furthermore, when target detection and instance segmentation provide spatial feature information to each other, the feature data of both can be corrected using the other side's spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a greater extent, and the prediction accuracy of the instance segmentation task can be further improved.
As shown in fig. 6, an embodiment of the present application further provides a method 600 for image processing. The method 600 includes the following steps S610 and S620.
S610, acquiring first feature data according to the original feature data of the target detection task, and acquiring second feature data according to the original feature data of the instance segmentation task.
S620, obtaining the target detection prediction result of the first feature data, and obtaining the segmentation mask prediction result of the second feature data.
As shown in fig. 6, in step S610, the first feature data and the second feature data are acquired by performing the following operations, where the initial value of i is 1 and N is a positive integer.
S0, the original feature data of the target detection task is used as feature data IF1_1, and the original feature data of the instance segmentation task is used as feature data IF2_1. In other words, when the value of i is 1, the feature data IF1_i is the original feature data of the target detection task, and the feature data IF2_i is the original feature data of the instance segmentation task.
S1, acquiring spatial feature information X1 based on the feature data IF1_i.
S2, acquiring spatial feature information X2 based on the feature data IF2_i.
S3, obtaining feature data OF1_i according to the feature data IF1_i and the spatial feature information X2.
S4, obtaining feature data OF2_i according to the feature data IF2_i and the spatial feature information X1.
S5, judging whether the value of i is equal to N; if not, going to step S6, and if so, going to step S7.
S6, adding 1 to the value of i, taking the feature data OF1_(i-1) as feature data IF1_i, taking the feature data OF2_(i-1) as feature data IF2_i, and going to step S1.
In step S6, taking the feature data OF1_(i-1) as feature data IF1_i and the feature data OF2_(i-1) as feature data IF2_i can be expressed by the following formulas:
IF1_i = OF1_(i-1)
IF2_i = OF2_(i-1).
S7, taking the feature data OF1_i as the first feature data, and taking the feature data OF2_i as the second feature data.
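The above steps can be summarised by the loop below; get_spatial_info and fuse are illustrative placeholders for the spatial feature acquisition and fusion operations of the respective branches and are not names used by the application.

def interleaved_refinement(raw_det, raw_seg, get_spatial_info, fuse, N):
    # Sketch of steps S0-S7: N rounds (N >= 1) of mutual spatial feature exchange.
    # raw_det / raw_seg: original feature data of the target detection task and
    # of the instance segmentation task (IF1_1, IF2_1).
    if1, if2 = raw_det, raw_seg          # S0: initialise IF1_1 and IF2_1
    for _ in range(N):
        x1 = get_spatial_info(if1)       # S1: spatial feature information X1
        x2 = get_spatial_info(if2)       # S2: spatial feature information X2
        of1 = fuse(if1, x2)              # S3: OF1_i from IF1_i and X2
        of2 = fuse(if2, x1)              # S4: OF2_i from IF2_i and X1
        if1, if2 = of1, of2              # S6: inputs of the next round
    return of1, of2                      # S7: first and second feature data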
Alternatively, step S1 in method 600 may adopt the implementation method of step 310 shown in fig. 5. The description is given above and will not be repeated here.
In the embodiment of the application, the target detection and the example segmentation mutually provide the spatial feature information, and for the target detection and the example segmentation, the feature data can be corrected through the spatial feature information of the other side, so that the prediction results of the target detection and the example segmentation can be consistent to a greater extent, and the prediction accuracy of the example segmentation task can be improved.
In addition, by executing multiple rounds of the operation in which target detection and instance segmentation provide spatial feature information to each other, the feature data of both tasks can be corrected more thoroughly, so that the prediction results of target detection and instance segmentation become more consistent, and the prediction accuracy of the instance segmentation task can be improved.
For example, the method 600 may be performed by the apparatus 1200 of embodiments below.
As shown in fig. 7, an embodiment of the present application further provides a method 700 for image processing, where the method 700 includes the following steps S710 and S720.
S710, inputting the image data to be processed into a segmentation mask prediction model.
S720, obtaining a segmentation mask prediction result of the image data to be processed by using the segmentation mask prediction model.
The segmentation mask prediction model is trained using a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information.
In other words, the detection auxiliary loss function uses the target detection label information to constrain the segmentation mask prediction results output by the segmentation mask prediction model.
It should be appreciated that target detection label information is typically used to train the detection box prediction model, and that the detection regression loss function as shown in fig. 13 uses the target detection label information to constrain the output of the detection box prediction model.
For example, the segmented mask prediction model may be obtained by the method 800 of the embodiments below.
It should be appreciated that, when training the segmentation mask prediction model, using the target detection label information to constrain the segmentation mask prediction results output by the model can improve the accuracy of the segmentation mask prediction model. Therefore, performing the instance segmentation task with this segmentation mask prediction model can improve the prediction accuracy of the instance segmentation task.
It should be appreciated that, in addition to the detection auxiliary loss function, a pixel-wise classification loss function is also used in training the segmentation mask prediction model; this loss function uses the segmentation mask label information to constrain the output of the segmentation mask prediction model.
In addition, since the current instance segmentation task needs to adopt the rectangular detection frame region output by the target detection task as prior information (as shown in fig. 2), when the prediction result of the rectangular detection frame region is not accurate, the prediction accuracy of the instance segmentation task is reduced. In other words, when inaccurate target detection prediction results are used as a priori information of the instance segmentation task, the prediction results of the instance segmentation may be affected, for example, a segmentation mask with worse quality may be predicted.
In the embodiment of the application, in the process of training the segmented mask prediction model, the accuracy of the segmented mask prediction model can be improved by using the target detection label information to restrict the output of the segmented mask prediction model.
Optionally, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a transverse detection auxiliary loss function.
The longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function constrains the transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
It is to be understood that the performance of the segmentation mask prediction model can be further improved by using the target detection label information to separately constrain the transverse information and the longitudinal information of the segmentation mask prediction result output by the model.
Alternatively, in the embodiment shown in fig. 7, the image data to be processed may be the second feature data obtained in the method 300 or the method 600 of the above-described embodiment.
It should be understood that the embodiment of the application can more effectively improve the prediction accuracy of the instance segmentation task.
As shown in fig. 8, an embodiment of the present application further provides a method 800 for neural network training, where the method 800 includes the following steps S810 and S820.
And S810, acquiring target detection label information.
And S820, training by using a detection auxiliary loss function to obtain a segmentation mask prediction model, wherein the detection auxiliary loss function restricts the output of the segmentation mask prediction model through target detection label information.
It should be appreciated that target detection label information is typically used to train the detection box prediction model, and that the detection regression loss function as shown in fig. 13 uses the target detection label information to constrain the output of the detection box prediction model.
In other words, the detection auxiliary loss function uses the target detection label information to constrain the segmentation mask prediction results output by the segmentation mask prediction model.
In the process of training the segmentation mask prediction model, the accuracy of the segmentation mask prediction model can be improved by using the target detection label information to constrain the segmentation mask prediction results that it outputs.
It should be appreciated that, in addition to the detection auxiliary loss function, a pixel-wise classification loss function is also used in training the segmentation mask prediction model; this loss function uses the segmentation mask label information to constrain the output of the segmentation mask prediction model.
It should be appreciated that performing the instance segmentation task using the segmentation mask prediction model obtained by the method 800 shown in FIG. 8 may improve the prediction accuracy of the instance segmentation task.
In addition, since the current instance segmentation task needs to adopt the rectangular detection frame region output by the target detection task as prior information (as shown in fig. 2), when the prediction result of the rectangular detection frame region is not accurate, the prediction accuracy of the instance segmentation task is reduced. In other words, when inaccurate target detection prediction results are used as a priori information of the instance segmentation task, the prediction results of the instance segmentation may be affected, for example, a segmentation mask with worse quality may be predicted.
In the embodiment of the application, in the process of training the segmented mask prediction model, the accuracy of the segmented mask prediction model can be improved by using the target detection label information to restrict the output of the segmented mask prediction model.
Optionally, the detection auxiliary loss function includes a longitudinal detection auxiliary loss function and a transverse detection auxiliary loss function. The longitudinal detection auxiliary loss function restrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function restrains the transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
For example, for a segmentation mask prediction result of size w × h and the corresponding detection frame label information, the prediction result is uniformly divided into n × n blocks, each of size (w/n) × (h/n). A horizontal and a vertical global max pooling operation are then performed on each block to obtain a corresponding horizontal mask and vertical mask, and the horizontal and vertical auxiliary loss functions are computed between the result derived from the detection frame label information and the result derived from the predicted segmentation mask, so as to constrain the output of the segmentation mask prediction model.
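A rough sketch of this block-wise auxiliary loss is shown below; the value of n, the derivation of a binary mask from the detection frame label information, and the binary cross-entropy comparison of the pooled results are illustrative assumptions.

import torch
import torch.nn.functional as F

def detection_auxiliary_loss(mask_prob: torch.Tensor,
                             box_mask: torch.Tensor,
                             n: int = 4) -> torch.Tensor:
    # mask_prob: (N, 1, h, w) predicted segmentation mask probabilities.
    # box_mask:  (N, 1, h, w) binary mask that is 1 inside the labelled
    #            detection frame and 0 outside (from target detection labels).
    _, _, h, w = mask_prob.shape
    # Max pooling along the horizontal direction within each of the n column blocks.
    pred_h = F.adaptive_max_pool2d(mask_prob, (h, n))
    label_h = F.adaptive_max_pool2d(box_mask, (h, n))
    # Max pooling along the vertical direction within each of the n row blocks.
    pred_v = F.adaptive_max_pool2d(mask_prob, (n, w))
    label_v = F.adaptive_max_pool2d(box_mask, (n, w))
    # Transverse and longitudinal auxiliary losses constraining the mask output.
    loss_h = F.binary_cross_entropy(pred_h, label_h)
    loss_v = F.binary_cross_entropy(pred_v, label_v)
    return loss_h + loss_v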
It is to be understood that the performance of the segmentation mask prediction model can be further improved by using the target detection label information to separately constrain the horizontal information and the vertical information of the segmentation mask prediction result output by the model.
Alternatively, the segmentation mask prediction model obtained in the embodiment shown in fig. 8 may be used to process the second feature data in the method 300 or the method 600 to obtain the segmentation mask prediction result of the second feature data.
It should be understood that the embodiment of the application can more effectively improve the prediction accuracy of the instance segmentation task.
The various embodiments described herein may be implemented as stand-alone solutions or combined in accordance with inherent logic and are intended to fall within the scope of the present application.
Embodiments of the methods provided herein are described above, and embodiments of the apparatus provided herein are described below. It should be understood that the description of the apparatus embodiments corresponds to that of the method embodiments; therefore, for brevity, details not described here may be found in the above method embodiments.
As shown in fig. 9, an embodiment of the present application further provides an apparatus 900 for image processing. The apparatus 900 includes a target detection task feature obtaining module 910, an instance segmentation task feature obtaining module 920, and a spatial feature information aligning module 930.
The target detection task feature obtaining module 910 is configured to obtain first feature data based on original feature data of a target detection task. This first feature data is used to perform a target detection task, and as shown in fig. 9, detection frame prediction is performed.
The example segmentation task feature obtaining module 920 is configured to obtain second feature data based on the original feature data of the example segmentation task. This second feature data is used to perform an instance segmentation task, as shown in FIG. 9, to perform segmentation mask prediction.
The spatial feature information alignment module 930 is configured to align spatial feature information of the target detection task and the instance segmentation task.
Optionally, as shown in fig. 9, the spatial feature information alignment module 930 includes a first spatial feature information obtaining unit 931, configured to obtain first spatial feature information according to the original feature data of the target detection task and provide the first spatial feature information to the instance segmentation task feature obtaining module 920. Correspondingly, the instance segmentation task feature obtaining module 920 is configured to fuse the first spatial feature information with the original feature data of the instance segmentation task, and output the second feature data.
Optionally, as shown in fig. 9, the spatial feature information alignment module 930 further includes a second spatial feature information obtaining unit 932, configured to obtain second spatial feature information from the original feature data of the instance segmentation task and provide the second spatial feature information to the target detection task feature obtaining module 910. Correspondingly, the target detection task feature obtaining module 910 is configured to fuse the second spatial feature information with the original feature data of the target detection task, and output the first feature data.
Alternatively, as shown in fig. 10, the first spatial feature information acquisition unit 931 includes a spatial feature information acquisition subunit 1001 and a spatial feature information processing subunit 1002.
The spatial feature information acquiring subunit 1001 is configured to acquire third spatial feature information based on the original feature data of the target detection task.
The spatial feature information processing subunit 1002 includes a band-direction pooling layer and a band-direction recombination layer. The band-direction pooling layer is used to acquire the transverse feature and the longitudinal feature from the third spatial feature information. The band-direction recombination layer is used to recombine the transverse feature and the longitudinal feature and output the first spatial feature information.
For example, the apparatus 900 may be used to perform the method 300 in the above embodiments.
Referring to fig. 4 and 9, the target detection task feature obtaining module 910 is configured to perform step S350 in the foregoing embodiment, the example division task feature obtaining module 920 is configured to perform step S320 in the foregoing embodiment, the spatial feature information aligning module 930 is configured to perform steps S310 and S340 in the foregoing embodiment, the first spatial feature information obtaining unit 931 is configured to perform step S310 in the foregoing embodiment, and the second spatial feature information obtaining unit 932 is configured to perform step S340 in the foregoing embodiment.
Referring to fig. 5 and 10, the sub-unit 1001 in the first spatial feature information acquiring unit 931 is configured to perform step S311 in the foregoing embodiment, and the sub-unit 1002 in the first spatial feature information acquiring unit 931 is configured to perform step S312 and step S313 in the foregoing embodiment, wherein the band-direction pooling layer in the sub-unit 1002 is configured to implement step S312, and the band-direction reorganizing layer in the sub-unit 1002 is configured to implement step S313.
The relevant description is given above in detail, and for brevity, it is not repeated here.
One example of an apparatus 900 is described below with reference to fig. 11.
By way of example, and not limitation, one example of an apparatus 900 is the apparatus 1100 of FIG. 11.
For example, the target detection task feature obtaining module 910 and the instance division task feature obtaining module 920 are respectively formed by stacked convolution layers. It should be appreciated that the target detection task and the example segmentation task are two different computer vision tasks, and thus, the stacked convolutional layers used by the target detection task feature acquisition module 910 are different from the stacked convolutional layers used by the example segmentation task feature acquisition module 920.
As shown in fig. 11, the target detection task feature acquisition module 910 includes 2 convolutional layers with a kernel size of 1 × 1 and 1 convolutional layer with a kernel size of 3 × 3. The instance segmentation task feature acquisition module 920 includes 1 convolutional layer with a kernel size of 1 × 1 and 1 convolutional layer with a kernel size of 3 × 3.
By way of example and not limitation, the internal data flow by which the target detection task feature acquisition module 910 generates the first feature data from its input is as follows. Denote the input of the target detection task feature acquisition module 910 as original feature data A0. A0 first passes through a convolutional layer with a kernel size of 1 × 1 to obtain a feature map A1 with 1024 channels; A1 passes through a convolutional layer with a kernel size of 3 × 3 to obtain a feature map A2 with 256 channels; A2 is concatenated along the channel dimension (the ◎ operation in module 910 in fig. 11) with the second spatial feature information (also referred to as a spatial information feature map) C2 from the spatial feature information alignment module 930, and the result passes through a convolutional layer with a kernel size of 1 × 1 to obtain a feature map A3 with 1024 channels; A3 is added to the original feature data A0 to give the output O1. The output O1 is the first feature data output by the target detection task feature acquisition module 910.
Denote the input of the instance segmentation task feature acquisition module 920 as original feature data B0. B0 first passes through a convolutional layer with a kernel size of 3 × 3 to obtain a feature map B1 with 256 channels; B1 is concatenated along the channel dimension (the ◎ operation in module 920 in fig. 11) with the first spatial feature information (also referred to as a spatial information feature map) C1 from the spatial feature information alignment module 930, and the result passes through a convolutional layer with a kernel size of 1 × 1 to obtain a feature map B2 with 256 channels, which serves as the output O2. The output O2 is the second feature data output by the instance segmentation task feature acquisition module 920.
As shown in fig. 11, the spatial feature information alignment module 930 may be implemented by convolutional layers, or by convolutional layers and pooling layers. The sub-unit 1001 in the first spatial feature information acquisition unit 931 includes 1 convolutional layer with a kernel size of 1 × 1. The sub-unit 1002 in the first spatial feature information acquisition unit 931 includes a band-direction pooling layer and a band-direction recombination layer. The second spatial feature information acquisition unit 932 includes 1 convolutional layer with a kernel size of 1 × 1.
By way of example and not limitation, the internal data flow by which the first spatial feature information acquisition unit 931 acquires the first spatial feature information is as follows. Denote the input of the first spatial feature information acquisition unit 931 as original feature data C10. C10 passes through a convolutional layer with a kernel size of 1 × 1 to obtain third spatial feature information C11; C11 passes through the band-direction pooling layer to obtain the transverse features and longitudinal features; the transverse features and longitudinal features pass through the band-direction recombination layer to obtain the first spatial feature information (also referred to as a spatial information feature map) C1. The first spatial feature information C1 is provided to the instance segmentation task feature acquisition module 920.
By way of example and not limitation, the internal data flow by which the second spatial feature information acquisition unit 932 acquires the second spatial feature information is as follows. Denote the input of the second spatial feature information acquisition unit 932 as original feature data C20. C20 passes through a convolutional layer with a kernel size of 1 × 1 to obtain the second spatial feature information (also referred to as a spatial information feature map) C2. The second spatial feature information C2 is provided to the target detection task feature acquisition module 910.
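The data flow just described might be sketched as follows; the channel counts stated in the text are kept, whereas the channel counts of A0, B0, C1 and C2, the shared spatial resolution of the two inputs, and the omission of normalisation and activation layers are assumptions made only for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InterleavedBranchBlock(nn.Module):
    # Sketch in the spirit of fig. 11 (A0 assumed to have 1024 channels so the
    # residual addition is valid; B0, C1 and C2 assumed to have 256 channels).
    def __init__(self):
        super().__init__()
        # target detection task feature acquisition (module 910)
        self.det_conv1 = nn.Conv2d(1024, 1024, kernel_size=1)             # A0 -> A1
        self.det_conv2 = nn.Conv2d(1024, 256, kernel_size=3, padding=1)   # A1 -> A2
        self.det_conv3 = nn.Conv2d(256 + 256, 1024, kernel_size=1)        # [A2, C2] -> A3
        # instance segmentation task feature acquisition (module 920)
        self.seg_conv1 = nn.Conv2d(256, 256, kernel_size=3, padding=1)    # B0 -> B1
        self.seg_conv2 = nn.Conv2d(256 + 256, 256, kernel_size=1)         # [B1, C1] -> B2
        # spatial feature information alignment (module 930)
        self.det_spatial = nn.Conv2d(1024, 256, kernel_size=1)            # C10 -> C11
        self.seg_spatial = nn.Conv2d(256, 256, kernel_size=1)             # C20 -> C2

    def strip_recombine(self, c11: torch.Tensor) -> torch.Tensor:
        # band-direction pooling and recombination (sub-unit 1002), as sketched earlier
        n, c, h, w = c11.shape
        horizontal = F.adaptive_max_pool2d(c11, (h, 1)).expand(n, c, h, w)
        vertical = F.adaptive_max_pool2d(c11, (1, w)).expand(n, c, h, w)
        return horizontal + vertical                                      # -> C1

    def forward(self, a0: torch.Tensor, b0: torch.Tensor):
        c1 = self.strip_recombine(self.det_spatial(a0))    # spatial info from the detection side
        c2 = self.seg_spatial(b0)                           # spatial info from the segmentation side
        a2 = self.det_conv2(self.det_conv1(a0))
        o1 = self.det_conv3(torch.cat([a2, c2], dim=1)) + a0   # first feature data O1
        b1 = self.seg_conv1(b0)
        o2 = self.seg_conv2(torch.cat([b1, c1], dim=1))          # second feature data O2
        return o1, o2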
It should be noted that fig. 11 is only an example and not a limitation. That is, the apparatus 1100 shown in fig. 11 is but one optional implementation of the apparatus 900.
The apparatus 900 may have various possible alternative configurations, provided that the method 300 of the above embodiments can be implemented.
For example, the target detection task feature acquisition module 910 may use stacked convolutional layers different from those shown in fig. 11, the instance segmentation task feature acquisition module 920 may use stacked convolutional layers different from those shown in fig. 11, and the spatial feature information alignment module 930 may use convolutional layers and pooling layers different from those shown in fig. 11.
In fig. 11, the input (C10) of the first spatial feature information acquisition unit 931 is the same as the input (a0) of the target detection task feature acquisition module 910, and the input (C20) of the second spatial feature information acquisition unit 932 is the same as the input (B0) of the example division task feature acquisition module 920. But the application is not so limited.
Alternatively, the input (C10) of the first spatial feature information acquisition unit 931 may be different from the input (a0) of the target detection task feature acquisition module 910, and the input (C20) of the second spatial feature information acquisition unit 932 may be different from the input (B0) of the instance division task feature acquisition module 920.
Referring to fig. 11, as an example, the input (C10) of the first spatial feature information acquisition unit 931 may be the feature map A1 obtained after the input A0 of the target detection task feature acquisition module 910 passes through a convolutional layer with a kernel size of 1 × 1, or may be the feature map A2 obtained after A0 passes through a convolutional layer with a kernel size of 1 × 1 followed by a convolutional layer with a kernel size of 3 × 3.
Referring to fig. 11, as an example, the input (C20) of the second spatial feature information acquisition unit 932 may be a feature map B1 obtained after the input B0 of the example segmentation task feature acquisition module 920 passes through a convolutional layer having a convolution kernel size of 3 × 3.
In productized form, the apparatus 900 and the apparatus 1100 can provide multi-target precise positioning services for customized scenarios. For example, the apparatus 900 or the apparatus 1100 may be deployed in a computing node of a related device.
As shown in fig. 12, an embodiment of the present application further provides an apparatus 1200 for image processing. The apparatus 1200 comprises n apparatuses 900 of the above embodiments, such as the apparatus 900(1), the apparatus 900(2), …, and the apparatus 900(n) shown in fig. 12.
n is a positive integer. The device 900(i) denotes the i-th device 900 in the device 1200, where i = 1, 2, …, n. In practical applications, the value of n may be determined according to application requirements, which is not limited in the present application.
Each device 900 included in the device 1200 may be referred to as an interleaved branch sub-network 900, and the device 1200 may be referred to as an interleaved branch network 1200.
In the device 1200, the output of each interleaved branch sub-network 900 serves as the input of the next interleaved branch sub-network 900. That is, the output of the target detection task feature acquisition module 910 in each interleaved branch sub-network 900(i) serves as the input of the target detection task feature acquisition module 910 in the next interleaved branch sub-network 900(i+1), and the output of the instance segmentation task feature acquisition module 920 in each interleaved branch sub-network 900(i) serves as the input of the instance segmentation task feature acquisition module 920 in the next interleaved branch sub-network 900(i+1).
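Chaining the sub-networks in this way might look like the sketch below, which reuses the illustrative InterleavedBranchBlock from the previous example; the number of blocks and whether their parameters are shared are configuration choices.

import torch.nn as nn

class InterleavedBranchNetwork(nn.Module):
    # Sketch of device 1200: n interleaved branch sub-networks in series.
    def __init__(self, num_blocks: int):
        super().__init__()
        # Separate instances: the structures are identical, the parameters need not be.
        self.blocks = nn.ModuleList([InterleavedBranchBlock() for _ in range(num_blocks)])

    def forward(self, det_feat, seg_feat):
        # The two outputs of sub-network i feed the corresponding branches of sub-network i+1.
        for block in self.blocks:
            det_feat, seg_feat = block(det_feat, seg_feat)
        return det_feat, seg_feat   # first and second feature data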
Optionally, the n interleaved branch sub-networks 900 in the device 1200 have the same structure and parameters.
For example, each interleaved branch sub-network 900 in the device 1200 is the device 1100 shown in fig. 11.
Optionally, the structures and parameters of the n interleaved branch sub-networks 900 in the device 1200 are not identical.
For example, the architecture of each interleaved branch sub-network 900 in the device 1200 is the architecture of the device 900 shown in fig. 9, but the specific structure of some of the interleaved branch sub-networks 900 is that of the device 1100 in fig. 11, while the specific structure of the other interleaved branch sub-networks 900 differs from fig. 11.
The apparatus 1200 may be used to perform the method 600 of the above embodiments.
In productized form, the device 1200 can provide multi-target precise positioning services for customized scenarios. For example, the apparatus 1200 may be deployed in a computing node of a related device.
As shown in fig. 13, the present application further provides a system 1300 for image processing. The system 1300 includes a backbone network 1310, a region proposal network 1320, a fully connected layer 1330, an interleaved branch network 1340, a multi-class prediction model 1350, a detection box prediction model 1360, and a segmentation mask prediction model 1370. The interleaved branch network 1340 is the apparatus 1200 in the above embodiment.
The system 1300 may be used to perform image classification tasks, object detection tasks, and instance segmentation tasks. For example, the image data to be processed is input into the system 1300, and the system 1300 may output the class, detection box, and segmentation mask prediction results for each object.
By way of example, the operational flow of the system 1300 performing the image classification task, the object detection task, and the instance segmentation task includes the following steps.
Step 1), using the backbone network 1310 to perform feature acquisition on the image data to be processed, so as to obtain the image features of the whole image.
Step 2), candidate region positions of a plurality of targets are generated using the region proposal network 1320, and a feature map of each candidate region is acquired, that is, a candidate region feature as shown in fig. 13 is obtained.
And step 3), processing the candidate region features by using the fully connected layer 1330, that is, processing the feature map of each candidate region generated by the region proposal network 1320, to obtain the classification feature data to be input into the multi-class prediction model 1350.
And 4) processing the classification characteristic data by using a multi-classification prediction model 1350 to obtain a multi-classification prediction result.
Step 5), processing the candidate region features by using the interleaved branch network 1340, that is, processing the feature map of each candidate region generated by the region proposal network 1320, to obtain the detection feature data (corresponding to the first feature data in the above embodiments) to be input into the detection box prediction model 1360 and the segmentation feature data (corresponding to the second feature data in the above embodiments) to be input into the segmentation mask prediction model 1370.
And 6), performing target detection processing on the detection characteristic data by using the detection frame prediction model 1360 to obtain a detection frame prediction result.
And 7) processing the segmentation characteristic data by using the segmentation mask prediction model 1370 to obtain a segmentation mask prediction result.
The execution sequence of the steps 1) to 7) is determined by the inherent logical relationship, and is not limited to the sequence presented in the above text.
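The seven steps can be summarised by the following sketch; every function name is a placeholder standing in for the corresponding component of fig. 13 and is not part of the application.

def system_forward(image, backbone, rpn, fc_layer, multi_class_head,
                   interleaved_branch_network, box_head, mask_head):
    # Sketch of steps 1) to 7) of the system 1300; all arguments are illustrative callables.
    image_features = backbone(image)                          # step 1)
    candidate_features = rpn(image_features)                  # step 2)
    cls_features = fc_layer(candidate_features)               # step 3)
    class_scores = multi_class_head(cls_features)             # step 4)
    det_features, seg_features = interleaved_branch_network(  # step 5)
        candidate_features, candidate_features)
    boxes = box_head(det_features)                            # step 6)
    masks = mask_head(seg_features)                           # step 7)
    return class_scores, boxes, masks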
For example, in the system 1300, the fully connected layer 1330 and the multi-class prediction model 1350 may be collectively referred to as a multi-class branch network; the interleaved branch network 1340 and the detection box prediction model 1360 may be collectively referred to as a target detection branch network; and the interleaved branch network 1340 and the segmentation mask prediction model 1370 may be collectively referred to as an instance segmentation branch network.
In the system 1300, the multi-class branch network uses an independent fully connected layer to obtain its feature map and perform class prediction, while the target detection branch network and the instance segmentation branch network share the interleaved branch network 1340 to obtain their respective feature data.
It should be understood that, in the system 1300, by using the interleaved branch network 1340 (i.e., the apparatus 1200 shown in fig. 12) provided by the embodiment of the present application to obtain the feature data for the target detection task and the instance segmentation task, mutual supervision of the two tasks can be implemented, so that the prediction accuracy of the instance segmentation task can be jointly improved.
Optionally, the system 1300 may also be used to train and deploy an instance segmentation task model. As shown in fig. 13, the system 1300 can build an instance segmentation network model for a general scene by obtaining given image data from an image training data warehouse and given label information from a label data warehouse.
By way of example, the operational flow of training an instance segmentation task model by the system 1300 includes the following steps.
Step (1), obtaining given image data from the image training data warehouse and inputting the given image data into the backbone network 1310.
Step (2), performing the above steps 1) to 7), as detailed above, which is not described herein again.
In step (2), the multi-class prediction model 1350, the detection box prediction model 1360, and the segmentation mask prediction model 1370 are trained by acquiring target class label information, target detection label information, and segmentation mask label information from the label data warehouse.
For example, the multi-class prediction model 1350 is trained by a multi-class loss function that uses target class label information to constrain the output of the multi-class prediction model 1350.
For example, the detection box prediction model 1360 is trained by detecting a regression loss function that uses target detection label information to constrain the output of the detection box prediction model 1360.
For example, the segmentation mask prediction model 1370 is trained by a pixel-by-pixel classification loss function that uses the segmentation mask label information to constrain the output of the segmentation mask prediction model 1370.
Optionally, in step (2), the segmentation mask prediction model 1370 is trained with both the segmentation mask label information and the target detection label information.
For example, the segmentation mask prediction model 1370 is trained using a pixel-wise classification loss function that constrains the output of the segmentation mask prediction model through the segmentation mask label information, together with a detection auxiliary loss function that constrains the output of the segmentation mask prediction model through the target detection label information.
As an example, the method 800 provided in the above embodiment is employed to train the segmentation mask prediction model 1370. The description is given above and will not be repeated here.
After the training is completed by the system 1300, the final parameters of the model are obtained, and then the model and the corresponding parameters are deployed in a test environment, so that an instance segmentation network model of a general scene can be obtained.
In actual deployment, the algorithm takes only image data as input, and the final outputs are the target class, detection frame, and segmentation mask prediction results; the label information and the loss functions are no longer required.
In productized form, the system 1300 can provide multi-target precise positioning services for customized scenarios. For example, the system 1300 may be deployed in a computing node of a related device, and by accessing the visual data input interface of the current scene (e.g., the scenes shown in fig. 14 and 15), a pixel-level accurate positioning solution for objects of specified categories can be generated for the customer.
The method and the device can be applied to the automatic analysis and understanding of the image data in the fields of automatic driving, video monitoring and the like which need accurate analysis of the target position.
The application scene one: pedestrian vehicle segmentation system in automatic driving system.
In the automatic driving task, the vehicle system collects image data through a camera, then identifies various pedestrians, vehicles and other vehicles on the road from the image, and judges the accurate positions of the pedestrians, vehicles and other vehicles to help select a final vehicle control strategy. As shown in fig. 14, with the system 1300 provided by the present application, a pedestrian and vehicle segmentation system suitable for an automatic driving task can be trained by using a pedestrian vehicle data warehouse under a given automatic driving scenario, and then deployed into an automatic driving system, so as to improve the accuracy of the system.
Application scenario two: a target segmentation system in video surveillance.
In the field of video monitoring, people need to pay attention to various targets in a monitored video, and meanwhile, the accurate positions of the targets are automatically judged and tracked and analyzed. As shown in fig. 15, by using the system 1300 provided by the present application, training is performed in a data warehouse under a video monitoring scene, and then the training is deployed in a target scene, so that each target position can be more accurately positioned, and further, the related attributes and other information of each target can be analyzed, thereby realizing automatic and accurate monitoring and behavior analysis.
Table 1 shows the performance of the instance segmentation task model provided in the present application and existing models on the target detection and instance segmentation tasks of a public data set under the same experimental settings. As can be seen from Table 1, the present application achieves superior performance on both the target detection and instance segmentation tasks compared with existing solutions.
Table 1: target detection and instance segmentation performance on the public data set MS COCO of the instance segmentation task model based on the present application and of existing models
Table 2 shows the improvement brought by the interleaved branch network 1200 and the detection auxiliary loss function provided in the present application: the detection auxiliary loss function mainly improves the instance segmentation effect, while the interleaved branch network improves both the target detection effect and the instance segmentation effect.
Table 2: effectiveness analysis of the interleaved branch network 1200 and the detection auxiliary loss function (MS COCO)
As shown in fig. 16, an embodiment of the present application further provides an apparatus 1600 for image processing. The apparatus 1600 includes the following elements.
A first obtaining unit 1610, configured to obtain first spatial feature information based on raw feature data of a first type of image processing task.
The second obtaining unit 1620 is configured to obtain second feature data according to the original feature data of the second type of image processing task and the first spatial feature information.
The first processing unit 1630 is configured to perform second type image processing on the second feature data to obtain a processing result of the second type image processing task.
The first image processing task and the second image processing task are respectively one and the other of a target detection task and an instance segmentation task.
The original feature data of the first image processing task and the original feature data of the second image processing task are acquired based on the image data to be processed.
In one solution, target detection provides spatial feature information to instance segmentation; for instance segmentation, the feature data can be corrected using the spatial feature information of target detection, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
In another solution, instance segmentation provides spatial feature information to target detection; for target detection, the feature data can be corrected using the spatial feature information of instance segmentation, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Therefore, according to the embodiment of the application, one of target detection and instance segmentation provides spatial feature information to the other; for the receiving side, the feature data can be corrected using the other side's spatial feature information, so that the prediction results of target detection and instance segmentation become consistent to a certain extent, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, the apparatus 1600 further comprises the following units.
A third obtaining unit 1640, configured to obtain second spatial feature information based on the raw feature data of the second type of image processing task.
A fourth obtaining unit 1650, configured to obtain the first feature data according to the original feature data of the first image processing task and the second spatial feature information.
The second processing unit 1660 is configured to perform a first image processing on the first feature data to obtain a processing result of the first image processing task.
The target detection and the example segmentation mutually provide spatial characteristic information, and for the target detection and the example segmentation, the characteristic data can be corrected through the spatial characteristic information of the other side, so that the prediction results of the target detection and the example segmentation can be consistent to a greater extent, and the prediction accuracy of the example segmentation task can be improved.
Therefore, the embodiment of the application can further improve the prediction accuracy of the instance segmentation task by mutually providing the spatial feature information through the target detection and the instance segmentation.
Optionally, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; the first obtaining unit 1610 is configured to: acquiring third spatial feature information based on the original feature data of the target detection task; respectively acquiring transverse characteristic information and longitudinal characteristic information according to the third spatial characteristic information; and recombining the transverse characteristic information and the longitudinal characteristic information to obtain the first spatial characteristic information.
Therefore, according to the embodiment of the application, transverse and longitudinal features are acquired from the spatial feature information of target detection and recombined, and the recombined spatial feature information is provided to instance segmentation, so that the accuracy of the spatial feature information used by instance segmentation can be improved, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, the apparatus 1600 obtains the first feature data and the second feature data by performing the following operations, where an initial value of i is 1 and N is a positive integer in the following operations.
In step S1, spatial feature information X1 is acquired based on the feature data IF1_i.
In step S2, spatial feature information X2 is acquired based on the feature data IF2_i.
In step S3, feature data OF1_i is obtained according to the feature data IF1_i and the spatial feature information X2.
In step S4, feature data OF2_i is obtained according to the feature data IF2_i and the spatial feature information X1.
In step S5, it is judged whether the value of i is equal to N; if not, go to step S6, and if so, go to step S7.
In step S6, the value of i is increased by 1, the feature data OF1_(i-1) is taken as feature data IF1_i, the feature data OF2_(i-1) is taken as feature data IF2_i, and the process goes to step S1.
In step S7, the feature data OF1_i is taken as the first feature data, and the feature data OF2_i is taken as the second feature data.
When the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
In the embodiment of the application, by executing multiple rounds of the operation in which target detection and instance segmentation provide spatial feature information to each other, the feature data of both tasks can be corrected more thoroughly, so that the prediction results of target detection and instance segmentation become more consistent, and the prediction accuracy of the instance segmentation task can be improved.
Optionally, the first image processing task is a target detection task, and the second image processing task is an instance segmentation task; the second processing unit 1660 is configured to process the first feature data using a detection frame prediction model to obtain a target detection prediction result of the first feature data; the first processing unit 1630 is configured to process the second feature data by using a segmentation mask prediction model to obtain a segmentation mask prediction result of the second feature data.
The segmented mask prediction model is obtained by utilizing a detection auxiliary loss function for training, the detection auxiliary loss function restrains the output of the segmented mask prediction model through target detection label information, and the target detection label information is used for training the detection frame prediction model.
In the embodiment of the application, in the process of training the segmented mask prediction model, the accuracy of the segmented mask prediction model can be improved by using the target detection label information to restrict the output of the segmented mask prediction model.
Optionally, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.
In the embodiment of the present application, the performance of the segmentation mask prediction model can be further improved by using the target detection label information to separately constrain the horizontal information and the vertical information of the segmentation mask prediction result output by the model.
Optionally, the second obtaining unit 1620 is configured to obtain the second feature data by processing the original feature data and the first spatial feature information of the second image processing task using a convolutional layer.
Optionally, the first obtaining unit 1610 is configured to: acquiring the third spatial feature information by processing the original feature data of the target detection task using the convolutional layer; and processing the third spatial feature information by using a pooling layer to acquire the transverse feature information and the longitudinal feature information.
The apparatus 1600 may be integrated on a terminal device, a network device, or a chip.
The apparatus 1600 may be deployed on a computing node of a related device, and may generate a pixel-level accurate positioning solution for a specified category object for a customer by accessing a visual data input interface of the scene.
As shown in fig. 17, an embodiment of the present application further provides an apparatus 1700 for image processing. The apparatus 1700 includes the following elements.
An input unit 1710 for inputting image data to be processed into the segmentation mask prediction model.
A processing unit 1720 for obtaining a prediction result of the segmentation mask for the image data to be processed using the segmentation mask prediction model.
The segmented mask prediction model is obtained by utilizing a detection auxiliary loss function for training, and the detection auxiliary loss function restrains the output of the segmented mask prediction model through target detection label information.
In the embodiment of the application, in the process of training the segmented mask prediction model, the accuracy of the segmented mask prediction model can be improved by using the target detection label information to restrict the output of the segmented mask prediction model.
Optionally, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.
In the embodiment of the present application, the performance of the segmentation mask prediction model can be further improved by using the target detection label information to separately constrain the horizontal information and the vertical information of the segmentation mask prediction result output by the model.
The apparatus 1700 may be integrated on a terminal device, a network device, or a chip.
The apparatus 1700 may be deployed on a computing node of a related device, and by accessing a visual data input interface of the scene, a pixel-level accurate positioning solution for a specified category object can be generated for a customer.
As shown in fig. 18, an embodiment of the present application further provides an image processing apparatus 1800. The apparatus 1800 includes the following units.
The obtaining unit 1810 is configured to obtain target detection tag information.
A training unit 1820, configured to train to obtain a segmentation mask prediction model by using a detection-aided loss function, where the detection-aided loss function constrains an output of the segmentation mask prediction model according to the target detection label information.
In the embodiment of the application, in the process of training the segmented mask prediction model, the accuracy of the segmented mask prediction model can be improved by using the target detection label information to restrict the output of the segmented mask prediction model.
Optionally, the detection auxiliary loss function includes a vertical detection auxiliary loss function and a horizontal detection auxiliary loss function, where the vertical detection auxiliary loss function constrains the vertical information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the horizontal detection auxiliary loss function constrains the horizontal information of the prediction result output by the segmentation mask prediction model through the target detection label information.
It is to be understood that the performance of the segmentation mask prediction model can be further improved by using the target detection label information to separately constrain the horizontal information and the vertical information of the segmentation mask prediction result output by the model.
The apparatus 1800 may be integrated on a terminal device, a network device, or a chip.
As shown in fig. 19, an embodiment of the present application further provides an apparatus 1900 for image processing. The apparatus 1900 includes a processor 1910, the processor 1910 being coupled with a memory 1920, the memory 1920 being configured to store computer programs or instructions, the processor 1910 being configured to execute the computer programs or instructions stored by the memory 1920, so that the method in the above method embodiments is performed.
Optionally, as shown in fig. 19, the apparatus 1900 may further include a memory 1920.
Optionally, as shown in fig. 19, the apparatus 1900 may further include a data interface 1930, where the data interface 1930 is used for transmitting data with the outside.
Optionally, as one solution, the apparatus 1900 is configured to implement the method 300 in the above embodiments.
Alternatively, the apparatus 1900 is configured to implement the method 600 in the above embodiments.
Alternatively, the apparatus 1900 is configured to implement the method 700 in the above embodiments.
Alternatively, the apparatus 1900 is configured to implement the method 800 in the above embodiments.
Embodiments of the present application also provide a computer-readable medium storing program code for execution by a device, the program code including instructions for performing the method of the above-described embodiments.
Embodiments of the present application also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method of the above embodiments.
The embodiment of the present application further provides a chip, where the chip includes a processor and a data interface, and the processor reads an instruction stored in a memory through the data interface to execute the method of the above embodiment.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in the foregoing embodiment.
An embodiment of the present application also provides an electronic device, which includes any one or more of the apparatus 900, the apparatus 1100, the apparatus 1200, the system 1300, the apparatus 1500, the apparatus 1600, the apparatus 1700 in the foregoing embodiments.
Fig. 20 shows a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 2000. The chip may be provided in any one or more of the following devices or systems:
the apparatus 900 shown in fig. 9, the apparatus 1100 shown in fig. 11, the apparatus 1200 shown in fig. 12, the system 1300 shown in fig. 13, the apparatus 1600 shown in fig. 16, the apparatus 1700 shown in fig. 17, the apparatus 1800 shown in fig. 18, the apparatus 1900 shown in fig. 19.
The methods 300, 600, 700, and 800 in the above method embodiments may all be implemented in the chip shown in fig. 20.
The neural network processor 2000 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks to it. The core of the neural network processor 2000 is the arithmetic circuit 2003; the controller 2004 controls the arithmetic circuit 2003 to fetch data from a memory (the weight memory 2002 or the input memory 2001) and perform operations.
In some implementations, the arithmetic circuit 2003 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuitry 2003 is a two-dimensional systolic array. The arithmetic circuit 2003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2003 is a general purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 2003 fetches the data corresponding to the matrix B from the weight memory 2002 and buffers it on each PE in the arithmetic circuit 2003. The arithmetic circuit 2003 then fetches the matrix A data from the input memory 2001, performs a matrix operation with the matrix B, and stores the obtained partial result or final result of the matrix in an accumulator 2008.
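As a purely illustrative software analogue (not the actual circuit), the snippet below mimics this accumulate-as-you-go behaviour: tiles of B along the inner dimension stand in for the data cached on the PEs, and the partial products are summed into an accumulator array.

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """Toy model of a matrix unit that accumulates partial results."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n))                 # stands in for accumulator 2008
    for k0 in range(0, k, tile):
        a_tile = a[:, k0:k0 + tile]        # slice of input matrix A streamed in
        b_tile = b[k0:k0 + tile, :]        # weights assumed pre-loaded on the PEs
        acc += a_tile @ b_tile             # partial result accumulated
    return acc

a = np.random.rand(8, 16)
b = np.random.rand(16, 8)
assert np.allclose(tiled_matmul(a, b), a @ b)   # matches the full matrix product
```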
The vector calculation unit 2007 may further process the output of the arithmetic circuit 2003, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 2007 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, and the like.
In some implementations, the vector calculation unit 2007 can store the vector of processed outputs to a unified memory (also referred to as a unified buffer) 2006. For example, the vector calculation unit 2007 may apply a non-linear function to the output of the arithmetic circuit 2003, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 2007 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 2003, for example for use in subsequent layers in a neural network.
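A rough functional analogue of this post-processing stage (an assumption for illustration, not the hardware behaviour) is sketched below: the accumulated matrix output receives a non-linear activation followed by a simple max pooling, producing a result that could be written back and reused by a subsequent layer.

```python
import numpy as np

def vector_post_process(acc_out, pool=2):
    """Toy analogue of the vector calculation unit: activation then pooling."""
    activated = np.maximum(acc_out, 0.0)   # ReLU-style non-linear function
    m, n = activated.shape
    n_trim = n - (n % pool)
    # Simple 1-D max pooling along the last axis.
    pooled = activated[:, :n_trim].reshape(m, -1, pool).max(axis=2)
    return pooled

out = vector_post_process(np.random.randn(8, 8))
print(out.shape)   # (8, 4)
```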
The methods 300, 600, 700, and 800 in the above method embodiments may be performed by the arithmetic circuit 2003 or the vector calculation unit 2007.
The unified memory 2006 is used to store input data and output data.
A direct memory access controller (DMAC) 2005 is configured to transfer input data from the external memory to the input memory 2001 and/or the unified memory 2006, to store weight data from the external memory into the weight memory 2002, and to store data from the unified memory 2006 into the external memory.
A Bus Interface Unit (BIU) 2010 is configured to implement interaction among the host CPU, the DMAC, and the instruction fetch memory 2009 through a bus.
An instruction fetch buffer 2009 is connected to the controller 2004 and is configured to store instructions used by the controller 2004.
The controller 2004 is configured to invoke the instructions cached in the instruction fetch buffer 2009 to control the working process of the operation accelerator.
In the embodiment of the present application, the data here may be image data to be processed.
Generally, the unified memory 2006, the input memory 2001, the weight memory 2002, and the instruction fetch memory 2009 are on-chip memories, and the external memory is a memory external to the NPU; the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
It should be noted that the ordinal terms "first", "second", "third", "fourth", and so on are used merely for convenience of description and are not intended to limit the scope of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a universal serial bus flash disk (UFD) (UFD may also be referred to as a U-disk or a flash disk for short), a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (15)
1. A method of image processing, comprising:
acquiring first spatial feature information based on original feature data of a first image processing task;
acquiring second feature data according to original feature data of a second image processing task and the first spatial feature information;
performing second image processing on the second feature data to obtain a processing result of the second image processing task;
wherein the first image processing task and the second image processing task are respectively one of a target detection task and an instance segmentation task and the other of the target detection task and the instance segmentation task;
and acquiring the original feature data of the first image processing task and the original feature data of the second image processing task based on image data to be processed.
2. The method of claim 1, further comprising:
acquiring second spatial feature information based on the original feature data of the second image processing task;
acquiring first feature data according to the original feature data of the first image processing task and the second spatial feature information;
and performing first image processing on the first feature data to obtain a processing result of the first image processing task.
3. The method according to claim 1 or 2, wherein the first image processing task is a target detection task and the second image processing task is an instance segmentation task;
the acquiring of the first spatial feature information based on the original feature data of the first image processing task includes:
acquiring third spatial feature information based on the original feature data of the target detection task;
respectively acquiring transverse feature information and longitudinal feature information according to the third spatial feature information;
and recombining the transverse feature information and the longitudinal feature information to obtain the first spatial feature information.
4. The method of claim 2, wherein the obtaining second feature data and the obtaining first feature data comprise:
acquiring the first feature data and the second feature data by performing the following operations, where an initial value of i is 1, and N is a positive integer:
step S1, acquiring spatial feature information X1 based on feature data IF1_i;
step S2, acquiring spatial feature information X2 based on feature data IF2_i;
step S3, acquiring feature data OF1_i according to the feature data IF1_i and the spatial feature information X2;
step S4, acquiring feature data OF2_i according to the feature data IF2_i and the spatial feature information X1;
step S5, determining whether the value of i is equal to N,
if not, adding 1 to the value of i, using the feature data OF1_(i-1) as feature data IF1_i and the feature data OF2_(i-1) as feature data IF2_i, and going to step S1,
if yes, using the feature data OF1_i as the first feature data and the feature data OF2_i as the second feature data;
wherein, when the value of i is 1, the feature data IF1_i is the original feature data of the first image processing task, and the feature data IF2_i is the original feature data of the second image processing task.
5. The method according to any one of claims 2 to 4, wherein the first image processing task is an object detection task and the second image processing task is an instance segmentation task;
wherein the executing the first image processing task on the first feature data comprises:
processing the first feature data by using a detection frame prediction model to obtain a target detection prediction result of the first feature data;
wherein, the performing the second image processing on the second feature data to obtain the processing result of the second image processing task includes:
processing the second feature data using a segmentation mask prediction model to obtain a segmentation mask prediction result for the second feature data,
wherein the segmentation mask prediction model is trained by using a detection auxiliary loss function, the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information, and the target detection label information is used for training the detection frame prediction model.
6. The method of claim 5, wherein the detection-aided loss functions comprise longitudinal detection-aided loss functions and transverse detection-aided loss functions,
the longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function constrains the transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
7. The method according to any one of claims 1 to 6, wherein the obtaining second feature data according to the original feature data of the second image processing task and the first spatial feature information comprises:
and acquiring the second feature data by processing the original feature data of the second image processing task and the first spatial feature information using a convolutional layer.
8. The method of claim 3, wherein the obtaining third spatial feature information based on the raw feature data of the target detection task comprises:
acquiring the third spatial feature information by processing the original feature data of the target detection task using a convolutional layer;
the obtaining of the transverse feature information and the longitudinal feature information respectively according to the third spatial feature information comprises:
and processing the third spatial feature information by using a pooling layer to acquire the transverse feature information and the longitudinal feature information.
9. A method of image processing, comprising:
inputting image data to be processed into a segmentation mask prediction model;
obtaining a segmentation mask prediction result of the image data to be processed using the segmentation mask prediction model,
wherein the segmentation mask prediction model is trained by using a detection auxiliary loss function, and the detection auxiliary loss function constrains the output of the segmentation mask prediction model through target detection label information.
10. The method of claim 9, wherein the detection-aided loss functions comprise a longitudinal detection-aided loss function and a transverse detection-aided loss function,
the longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function constrains the transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
11. A method of image processing, comprising:
acquiring target detection label information;
and training, by using a detection auxiliary loss function, to obtain a segmentation mask prediction model, wherein the detection auxiliary loss function constrains the output of the segmentation mask prediction model through the target detection label information.
12. The method of claim 11, wherein the detection-aided loss functions comprise a longitudinal detection-aided loss function and a transverse detection-aided loss function,
the longitudinal detection auxiliary loss function constrains the longitudinal information of the prediction result output by the segmentation mask prediction model through the target detection label information, and the transverse detection auxiliary loss function constrains the transverse information of the prediction result output by the segmentation mask prediction model through the target detection label information.
13. An apparatus for image processing, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, the processor for performing the method of any of claims 1 to 12 when the program stored in the memory is executed.
14. A computer-readable storage medium, wherein the computer-readable storage medium stores program code for execution by a device, and the program code, when executed, performs the method of any one of claims 1 to 12.
15. A chip comprising at least one processor and a data interface;
the at least one processor is configured to invoke and run a computer program stored on a memory via the data interface to cause the chip to perform the method of any of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010110152.2A CN111292331B (en) | 2020-02-23 | 2020-02-23 | Image processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111292331A true CN111292331A (en) | 2020-06-16 |
CN111292331B CN111292331B (en) | 2023-09-12 |
Family
ID=71025630
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010110152.2A Active CN111292331B (en) | 2020-02-23 | 2020-02-23 | Image processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111292331B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018121690A1 (en) * | 2016-12-29 | 2018-07-05 | 北京市商汤科技开发有限公司 | Object attribute detection method and device, neural network training method and device, and regional detection method and device |
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
CN108492301A (en) * | 2018-03-21 | 2018-09-04 | 广东欧珀移动通信有限公司 | A kind of Scene Segmentation, terminal and storage medium |
WO2019233394A1 (en) * | 2018-06-08 | 2019-12-12 | Oppo广东移动通信有限公司 | Image processing method and apparatus, storage medium and electronic device |
US20200034627A1 (en) * | 2018-07-27 | 2020-01-30 | Google Llc | Object detection using spatio-temporal feature maps |
US10304193B1 (en) * | 2018-08-17 | 2019-05-28 | 12 Sigma Technologies | Image segmentation and object detection using fully convolutional neural network |
CN109829877A (en) * | 2018-09-20 | 2019-05-31 | 中南大学 | A kind of retinal fundus images cup disc ratio automatic evaluation method |
CN109584248A (en) * | 2018-11-20 | 2019-04-05 | 西安电子科技大学 | Infrared surface object instance dividing method based on Fusion Features and dense connection network |
CN109934177A (en) * | 2019-03-15 | 2019-06-25 | 艾特城信息科技有限公司 | Pedestrian recognition methods, system and computer readable storage medium again |
CN110276378A (en) * | 2019-05-20 | 2019-09-24 | 杭州电子科技大学 | The improved method that example is divided based on unmanned technology |
CN110349138A (en) * | 2019-06-28 | 2019-10-18 | 歌尔股份有限公司 | The detection method and device of the target object of Case-based Reasoning segmentation framework |
CN110349167A (en) * | 2019-07-10 | 2019-10-18 | 北京悉见科技有限公司 | A kind of image instance dividing method and device |
CN110458172A (en) * | 2019-08-16 | 2019-11-15 | 中国农业大学 | A kind of Weakly supervised image, semantic dividing method based on region contrast detection |
CN110532955A (en) * | 2019-08-30 | 2019-12-03 | 中国科学院宁波材料技术与工程研究所 | Example dividing method and device based on feature attention and son up-sampling |
Non-Patent Citations (3)
Title |
---|
MINGYU WANG等: "Combining scenario-based multi-objective optimized decisionmaking with a spatial division strategy", 《ENVIRONMENTAL EARTH SCIENCES》 * |
权宇 (Quan Yu) et al.: "An object detection model fusing a deep dilated network and a lightweight network" (in Chinese), 《电子学报》 (Acta Electronica Sinica) *
田萱 (Tian Xuan) et al.: "A survey of image semantic segmentation methods based on deep learning" (in Chinese), 《软件学报》 (Journal of Software) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023143625A1 (en) * | 2022-01-31 | 2023-08-03 | Conova Medical Technology Limited | Process and system for three-dimensional modelling of tissue of a subject, and surgical planning process and system |
WO2024087574A1 (en) * | 2022-10-27 | 2024-05-02 | 中国科学院空天信息创新研究院 | Panoptic segmentation-based optical remote-sensing image raft mariculture area classification method |
Also Published As
Publication number | Publication date |
---|---|
CN111292331B (en) | 2023-09-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112990211B (en) | Training method, image processing method and device for neural network | |
CN111402130B (en) | Data processing method and data processing device | |
US20220230282A1 (en) | Image processing method, image processing apparatus, electronic device and computer-readable storage medium | |
CN114359851A (en) | Unmanned target detection method, device, equipment and medium | |
CN102693528B (en) | Noise suppressed in low light images | |
CN108876813B (en) | Image processing method, device and equipment for detecting object in video | |
JP2022531639A (en) | How to embed information in video, computer equipment and computer programs | |
CN111814755A (en) | Multi-frame image pedestrian detection method and device for night motion scene | |
CN107944403B (en) | Method and device for detecting pedestrian attribute in image | |
CN111524145A (en) | Intelligent picture clipping method and system, computer equipment and storage medium | |
CN111985458B (en) | Method for detecting multiple targets, electronic equipment and storage medium | |
CN111931764A (en) | Target detection method, target detection framework and related equipment | |
CN111292377B (en) | Target detection method, device, computer equipment and storage medium | |
CN111881849A (en) | Image scene detection method and device, electronic equipment and storage medium | |
CN114155365A (en) | Model training method, image processing method and related device | |
CN117157679A (en) | Perception network, training method of perception network, object recognition method and device | |
CN115909445A (en) | Face image counterfeiting detection method and related equipment | |
CN111292331B (en) | Image processing method and device | |
CN112926461B (en) | Neural network training and driving control method and device | |
CN115578590A (en) | Image identification method and device based on convolutional neural network model and terminal equipment | |
CN115424264A (en) | Panorama segmentation method, related device, electronic equipment and storage medium | |
CN113744280B (en) | Image processing method, device, equipment and medium | |
CN113139419A (en) | Unmanned aerial vehicle detection method and device | |
CN117274740A (en) | Infrared target detection method and device | |
EP4332910A1 (en) | Behavior detection method, electronic device, and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20220211
Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province
Applicant after: Huawei Cloud Computing Technologies Co.,Ltd.
Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen
Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.
GR01 | Patent grant | ||