
CN112053358B - Method, device, equipment and storage medium for determining instance category of pixel in image - Google Patents

Method, device, equipment and storage medium for determining instance category of pixel in image

Info

Publication number
CN112053358B
Authority
CN
China
Prior art keywords: image, features, detected, feature, network
Prior art date
Legal status
Active
Application number
CN202011040874.1A
Other languages
Chinese (zh)
Other versions
CN112053358A (en)
Inventor
单鼎一
梅树起
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011040874.1A priority Critical patent/CN112053358B/en
Publication of CN112053358A publication Critical patent/CN112053358A/en
Application granted granted Critical
Publication of CN112053358B publication Critical patent/CN112053358B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/187 Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/40 Analysis of texture
    • G06T 7/41 Analysis of texture based on statistical description of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30181 Earth observation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for determining the instance category of pixels in an image, wherein the method comprises the following steps: acquiring an image to be detected, wherein the image to be detected comprises a target number of instances; performing downsampling processing on the image to be detected to obtain shared features; performing semantic segmentation processing on the shared features by adopting a connected-domain regional clustering method to obtain semantic features; performing instance analysis processing on the shared features to obtain instance features, wherein the instance features comprise the spatial position feature of each pixel in the image to be detected; fusing the semantic features with the instance features to determine the fusion feature of each pixel in the image to be detected; and determining the instance category of each pixel in the image to be detected according to the fusion feature of each pixel. By adopting the technical scheme of the application, the accuracy of determining the instance category of a pixel is improved.

Description

Method, device, equipment and storage medium for determining instance category of pixel in image
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining an instance class of a pixel in an image.
Background
Industry practice for building detection based on satellite images mostly uses two-stage object detection algorithms of the Mask R-CNN family. The first stage is a coarse detection of target instance position frames: the objects to be detected are roughly located in the picture and classified into positive and negative samples, yielding a series of candidate frame positions (anchor frames). The second stage is a small position-regression classification network together with a semantic segmentation network: the regression classification network is responsible for the fine multi-class classification of positive-sample objects and for regressing the offset of the circumscribed rectangular frame relative to the anchor frame, while the semantic segmentation network is responsible for the pixel-level foreground-background segmentation of a single instance.
In the prior art, taking Mask R-CNN as an example, the following problems exist in the task of detecting small rural houses in dense villages: 1. the instance targets are too small, which greatly affects the design of the first-stage candidate frames and the detection difficulty, leading to a low instance recall rate; 2. the instance edges in the output results are often unclear, which causes a single target to be detected as a plurality of small-area fragments, lowers accuracy, and fails to meet application standards. In addition, many pictures contain too many (thousands of) building instances, and the foreground pixel features can reach one million, making clustering extremely difficult. Buildings in the same picture often share many similarities: two far-apart instances cannot be the same instance, yet their textures are almost identical, which leads to clustering errors.
Therefore, it is necessary to provide a method, an apparatus, a device and a storage medium for determining the instance category of pixels in an image, so as to improve the accuracy of determining the instance category of a pixel, thereby ensuring that instances in different regions are not clustered into one category and accelerating clustering.
Disclosure of Invention
The application provides a method, an apparatus, a device and a storage medium for determining the instance category of pixels in an image, which can improve the accuracy of determining the instance category of a pixel, thereby ensuring that instances in different regions are not clustered into one category and accelerating clustering.
In one aspect, the present application provides a method for determining the instance category of pixels in an image, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
performing downsampling processing on the image to be detected to obtain shared features;
performing semantic segmentation processing on the shared features by adopting a connected-domain regional clustering method to obtain semantic features;
performing instance analysis processing on the shared features to obtain instance features, wherein the instance features comprise the spatial position feature of each pixel in the image to be detected;
fusing the semantic features with the instance features to determine the fusion feature of each pixel in the image to be detected;
and determining the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected.
Another aspect provides an apparatus for determining the instance category of pixels in an image, the apparatus comprising:
an image acquisition module, configured to acquire an image to be detected, wherein the image to be detected comprises a target number of instances;
a shared-feature determining module, configured to perform downsampling processing on the image to be detected to obtain shared features;
a semantic-feature determining module, configured to perform semantic segmentation processing on the shared features by adopting a connected-domain regional clustering method to obtain semantic features;
an instance-feature determining module, configured to perform instance analysis processing on the shared features to obtain instance features, wherein the instance features comprise the spatial position feature of each pixel in the image to be detected;
a fusion-feature determining module, configured to fuse the semantic features with the instance features and determine the fusion feature of each pixel in the image to be detected;
and an instance-category determining module, configured to determine the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected.
Another aspect provides a device for determining the instance category of pixels in an image, the device comprising a processor and a memory, wherein the memory stores at least one instruction or at least one program, and the at least one instruction or program is loaded and executed by the processor to implement the method for determining the instance category of pixels in an image described above.
Another aspect provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for determining the instance category of pixels in an image described above.
Another aspect provides a computer storage medium storing at least one instruction or at least one program, which is loaded and executed by a processor to implement the method for determining the instance category of pixels in an image described above.
The method, apparatus, device and storage medium for determining the instance category of pixels in an image provided by the application have the following technical effects:
shared features are obtained by performing downsampling processing on the image to be detected; semantic segmentation processing and instance analysis processing are then performed on the shared features to obtain semantic features and instance features, the instance features including the spatial position feature of each pixel in the image to be detected; the semantic features are fused with the instance features to determine the fusion feature of each pixel in the image to be detected; and the instance category of each pixel in the image to be detected is determined according to its fusion feature. By fusing the semantic features obtained by the connected-domain regional clustering method with instance features that include spatial position features, a fusion feature is obtained for each pixel, which improves the accuracy of determining the instance category of a pixel, ensures that instances in different regions are not clustered into one category, and accelerates clustering.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a system for determining the instance category of pixels in an image according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for determining the instance category of pixels in an image according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for performing downsampling processing on an image to be detected to obtain shared features according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for determining a semantic branch network and an instance branch network according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for determining the fusion feature of each pixel in the image to be detected according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a U-Net network framework according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for determining the instance category of pixels in an image according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a shared-feature determining module according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics.
Specifically, the scheme provided by the embodiments of the present application relates to the machine learning field of artificial intelligence. Machine Learning (ML) is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. The present application automatically segments a large number of instances in an image through a machine learning model, with high accuracy.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprise", "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, product or apparatus comprising a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, product or apparatus.
Referring to FIG. 1, FIG. 1 is a schematic diagram of a system for determining the instance category of pixels in an image according to an embodiment of the present application; as shown in FIG. 1, the system may at least include a server 01 and a client 02.
Specifically, in the embodiments of this disclosure, the server 01 may be an independently operating server, a distributed server, or a server cluster composed of a plurality of servers, and may also be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms. The server 01 may include a network communication unit, a processor, a memory and the like. In particular, the server 01 may be used to determine the instance category of pixels in an image.
Specifically, in the embodiments of this disclosure, the client 02 may include a smartphone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device or another type of physical device, and may also include software running on such devices, for example a web page or an application provided by a service provider to the user. Specifically, the client 02 may be used to display the image corresponding to each instance in the image to be detected.
A method for determining the instance category of pixels in an image according to the present application is described below. FIG. 2 is a schematic flowchart of such a method according to an embodiment of the present application. The present specification provides the method operation steps as described in the embodiments or the flowchart, but more or fewer operation steps may be included based on conventional or non-inventive labour. The order of steps recited in the embodiments is merely one of many possible step-execution orders and does not represent the only execution order. When an actual system or server product is implemented, the methods illustrated in the embodiments or the figures may be executed sequentially or in parallel (for example, in a parallel-processor or multithreaded environment). As shown in FIG. 2, the method may include:
S201: and obtaining an image to be detected, wherein the image to be detected comprises examples of the number of targets.
In the embodiments of this specification, the image to be detected may include a plurality of instances, and the target number may be greater than 2. The instances in the image have similar structures and the same attributes, and instances in the same image belong to the same category. The instances may be dense, fine objects in the image to be detected, for example buildings or vehicles, and the target number may be set to be greater than a preset number. The structures of the instances in the image to be detected are similar; when a map is drawn from the image to be detected, the instances in the image need to be segmented, and features such as the size and position of each instance acquired.
S203: and carrying out downsampling treatment on the image to be detected to obtain sharing characteristics.
In the embodiments of this disclosure, as shown in FIG. 3, performing downsampling processing on the image to be detected to obtain the shared features may include:
S2031: extracting an edge texture feature set of the image to be detected.
In the embodiments of this specification, the edge texture feature set may include a plurality of edge texture features. Edge texture features are the bottom-level features of a deep learning network, i.e., the feature maps of the front layers, and their visual representations are typically points, lines, surfaces and corners. The edge texture feature set of the image to be detected can be extracted by the bottom-level convolution layers, i.e., through multiple convolution and pooling operations.
S2033: determining edge texture combination features according to the edge texture feature set of the image to be detected.
In the embodiments of this specification, deep learning is a stepwise abstraction process: bottom-level edge texture features are combined and abstracted into middle-level features, namely instance local features, and the instance local information is then abstracted into instance overall features, namely the edge texture combination features, on which instance category learning is performed.
In the embodiments of this specification, the edge texture features of the image to be detected are combined through middle- and high-level convolution layers to obtain the edge texture combination features; the edge texture features can be fused through multiple convolution and pooling operations. For example, after the points, lines, surfaces and corners of the instances in the image to be detected are obtained, the local features of the instances are determined through the middle-level convolution layers; these local features are then further convolved and pooled by the high-level convolution layers to determine the overall features of the instances, namely the edge texture combination features, on which instance category learning is performed.
S2035: performing normalized normal-distribution processing on the edge texture combination features to obtain normalized features.
In this embodiment, the normalized normal-distribution processing is a functional layer in deep learning: the BatchNorm (BN) layer. BN forcibly pulls the distribution of the input values of every neuron in each layer of the neural network back to a standard normal distribution with mean 0 and variance 1; in essence, it pulls distributions that deviate further and further back to the standard distribution, so that a small change in the input values causes a larger change in the loss function. The gradients therefore become larger and the vanishing-gradient problem is avoided; larger gradients also mean faster learning convergence, which greatly accelerates training.
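The effect described above can be observed directly. The following is a minimal PyTorch sketch (illustrative only, not code from the patent; the tensor sizes are assumptions) showing a BatchNorm layer pulling a drifted activation distribution back toward mean 0 and variance 1:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 20, 64, 64) * 5.0 + 3.0  # activations drifted to mean ~3, std ~5
bn = nn.BatchNorm2d(20)                      # BN over the 20 channels
y = bn(x)                                    # training mode: normalize with batch statistics
print(round(x.mean().item(), 2), round(x.std().item(), 2))  # ~3.0, ~5.0
print(round(y.mean().item(), 2), round(y.std().item(), 2))  # ~0.0, ~1.0
```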
S2037: performing nonlinear mapping processing on the normalized features to obtain the shared features.
In the embodiments of this specification, the edge texture feature set of the image to be detected may be extracted by the bottom-level convolution layers; the edge texture features of the image to be detected may be combined through the middle- and high-level convolution layers to obtain the edge texture combination features; normalized normal-distribution processing may be performed on the edge texture combination features through a normalization layer to obtain the normalized features; and nonlinear mapping processing may be performed on the normalized features through an activation layer to obtain the shared features.
In the embodiments of this specification, the shared features of the image to be detected are obtained through the downsampling processing, and the shared features can be used for both the semantic segmentation and the instance analysis processing.
In the embodiments of this specification, downsampling embodies the stepwise abstraction of deep learning: bottom-level, middle-level and high-level information is obtained through repeated convolution, pooling and other operations, where both convolution and pooling have a size-reducing effect. During the multiple convolution and pooling operations from the bottom-level to the middle-level and then the high-level convolution layers, the spatial scale of the deep features is gradually reduced while the number of channels gradually increases; for example, an image changes from 256 (height) × 256 (width) × 3 (channels) to 128 (height) × 128 (width) × 20 (channels): the height and width are both reduced and the channel count is increased, as shown by element 04 in FIG. 6.
In a specific embodiment, the image to be detected may be a building image in which the instances are buildings and which contains thousands of building instances. The features extracted by the bottom-level convolution layers, i.e., the edge texture feature set of the image, are the edges, points, lines and corners of the buildings; the features extracted by the middle-level convolution layers are local features of the buildings, such as roof features and side features; and the features extracted by the high-level convolution layers, i.e., the edge texture combination features, are the overall features of the buildings.
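For illustration, the following is a minimal PyTorch sketch of such a downsampling path (an assumption for exposition, not the patented network; the channel sizes are invented): stacked convolution + BN + ReLU + max-pooling blocks that shrink the spatial size while growing the channel count.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One downsampling stage: convolution, BN, activation, pooling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # extract edge/texture features
            nn.BatchNorm2d(out_ch),   # normalized normal-distribution processing
            nn.ReLU(inplace=True),    # nonlinear mapping (activation layer)
            nn.MaxPool2d(2),          # halve the spatial size
        )

    def forward(self, x):
        return self.block(x)

# bottom -> middle -> high layers: channels grow, spatial size shrinks
encoder = nn.Sequential(DownBlock(3, 20), DownBlock(20, 40), DownBlock(40, 80))
shared = encoder(torch.randn(1, 3, 256, 256))
print(shared.shape)  # torch.Size([1, 80, 32, 32])
```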
S205: and carrying out semantic segmentation processing on the shared features by adopting a connected domain regional clustering method to obtain semantic features.
In the embodiments of this disclosure, the semantic features include the background features and the foreground features of the image to be detected, and performing semantic segmentation processing on the shared features by adopting the connected-domain regional clustering method includes:
performing semantic segmentation processing on the shared features by adopting the connected-domain regional clustering method to obtain the background features and the foreground features of the image to be detected.
In the embodiments of this disclosure, after the step of performing semantic segmentation processing on the shared features by adopting the connected-domain regional clustering method to obtain the background features and the foreground features of the image to be detected, the method further includes:
determining a first mask for the background features and a second mask for the foreground features of the image to be detected.
In the embodiments of this disclosure, the background features and the foreground features of the image to be detected may be determined by the connected-domain regional clustering method, and masks corresponding to the foreground and background features are generated; for example, the foreground mask may be set to 1 and the background mask to 0, so as to distinguish the foreground from the background.
In the embodiments of this disclosure, performing semantic segmentation processing on the shared features by adopting the connected-domain regional clustering method to obtain the semantic features includes:
performing semantic segmentation processing on the shared features through a semantic branch network to obtain the semantic features.
In the embodiments of this disclosure, the semantic branch network may include a plurality of upsampling modules, through which deconvolution-layer operations may be performed on the shared features. Each upsampling module performs a deconvolution operation to enlarge the feature size and provides the feature information needed for higher-level upsampling fusion. The input of each upsampling module comes not only from the output features of the previous upsampling module but also from the shared-feature layer of the same size in the downsampling path; the two kinds of features can therefore be added within the module for better fusion of feature information, and a convolution operation then completes the information fusion, after which the foreground and background of the image are predicted to obtain the semantic features.
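As a sketch of one such upsampling module (an illustrative assumption consistent with the description above, not the patented architecture):

```python
import torch
import torch.nn as nn

class UpModule(nn.Module):
    """Deconvolve to enlarge, add the same-size shared feature, then fuse."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # size x2
        self.fuse = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)       # information fusion

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)
        x = x + skip                 # add the same-size shared feature from downsampling
        return torch.relu(self.fuse(x))

up = UpModule(80, 40)
out = up(torch.randn(1, 80, 32, 32), torch.randn(1, 40, 64, 64))  # -> (1, 40, 64, 64)
```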
In the embodiments of this specification, a binary image has only two luminance values: black (0) and white (255). In practical applications, the analysis of many images is ultimately converted into the analysis of binary images, for example detecting the foreground of an image. The most important method for binary image analysis is connected-region labelling, which is the basis of all binary image analysis: by labelling the white pixels (targets) in a binary image, each individual connected region forms a labelled block, from which further geometric parameters such as contours, circumscribed rectangles, centroids and invariant moments can be obtained. In an image, the smallest unit is a pixel, and each pixel is surrounded by 8 neighbouring pixels; there are two common adjacency relations: 4-adjacency and 8-adjacency. 4-adjacency covers a total of 4 points: up, down, left and right. 8-adjacency covers 8 points, including the diagonally positioned ones. Intuitively, points that are connected to each other form one region, while points that are not connected form different regions; the set of all points connected to each other is called a connected region. The connected-domain regional clustering method can separate the foreground and the background of an image; when the instances are buildings, the foreground is the buildings.
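A brief sketch of connected-region labelling with the two adjacency relations, using SciPy (the patent does not prescribe a particular library; this is purely illustrative):

```python
import numpy as np
from scipy import ndimage

mask = np.array([[255, 0,   0],
                 [0,   255, 0],
                 [0,   0,   255]], dtype=np.uint8)   # three diagonal foreground pixels

four_adj = np.array([[0, 1, 0],
                     [1, 1, 1],
                     [0, 1, 0]])        # 4-adjacency: up, down, left, right
eight_adj = np.ones((3, 3), dtype=int)  # 8-adjacency: also the diagonal neighbours

_, n4 = ndimage.label(mask > 0, structure=four_adj)   # n4 == 3: diagonals not connected
_, n8 = ndimage.label(mask > 0, structure=eight_adj)  # n8 == 1: diagonals connect
```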
S207: performing instance analysis processing on the shared features to obtain instance features; the example features include spatial location features of each pixel in the image to be detected.
In the embodiments of this specification, deconvolution-layer operations can likewise be performed on the shared features through a plurality of upsampling modules, each performing a deconvolution operation to enlarge the feature size and providing the feature information needed for higher-level upsampling fusion. The input of each upsampling module comes not only from the output features of the previous upsampling module but also from the shared-feature layer of the same size in the downsampling path, so the two kinds of features can be added within the module for better fusion of feature information; a convolution operation then completes the information fusion, and the pixel features in the image are learned to obtain the instance features. The instance features include the spatial position feature of each pixel in the image to be detected, and the spatial position feature of a pixel can be characterized by the pixel's two-dimensional spatial coordinates. Spatial position features reflect the spatial-region differences between pixels, which increases the feature similarity of neighbouring pixels and enlarges the spatial difference between pixels that are far apart.
In the embodiments of this disclosure, the instance features further include the texture feature of each pixel in the image to be detected, and performing instance analysis processing on the shared features to obtain the instance features includes:
performing instance analysis processing on the shared features to obtain the texture feature and spatial position feature of each pixel in the image to be detected.
In the embodiments of this specification, the texture feature of a pixel may be characterized by an eight-dimensional feature vector; the instance feature is then a ten-dimensional feature vector whose last two dimensions are the spatial coordinates of the pixel. In this way, both the texture differences and the spatial-region differences between different instances can be distinguished.
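A minimal sketch of assembling such a ten-dimensional feature (the 8-dimensional texture embedding and the coordinate normalization are assumptions for illustration):

```python
import torch

def add_spatial_coords(texture: torch.Tensor) -> torch.Tensor:
    """texture: (B, 8, H, W) -> (B, 10, H, W); last two channels are (x, y)."""
    b, _, h, w = texture.shape
    ys = torch.linspace(0.0, 1.0, h).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(0.0, 1.0, w).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([texture, xs, ys], dim=1)  # encodes spatial-region differences

emb10 = add_spatial_coords(torch.randn(2, 8, 64, 64))
print(emb10.shape)  # torch.Size([2, 10, 64, 64])
```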
In the embodiments of this disclosure, performing instance analysis processing on the shared features to obtain the instance features includes:
performing instance analysis processing on the shared features through an instance branch network to obtain the instance features.
In the embodiments of this specification, the texture feature may be characterized by a feature vector of a fixed dimension (e.g., 8 dimensions).
In the embodiments of this disclosure, as shown in FIG. 4, before the step of performing semantic segmentation processing on the shared features through the semantic branch network to obtain the semantic features, the method further includes:
S2041: constructing a cross-entropy loss function for a first network;
S2043: constructing an intra-class aggregation loss function and an inter-class distinction loss function for a second network;
S2045: determining the sum of the cross-entropy loss function, the intra-class aggregation loss function and the inter-class distinction loss function as a comprehensive loss function;
S2047: adjusting the parameters of the first network and of the second network respectively to obtain a current first network and a current second network;
S2049: calculating the comprehensive loss value corresponding to the current first network and the current second network;
S20411: when the comprehensive loss value is smaller than a preset threshold, determining the current first network as the semantic branch network and the current second network as the instance branch network.
In the embodiments of this specification, the method further includes:
S20413: when the comprehensive loss value is greater than or equal to the preset threshold, repeating the step of adjusting the parameters of the first network and of the second network respectively to obtain a current first network and a current second network.
In the embodiments of this specification, the preset threshold may be set according to the actual situation. During training of the semantic branch network, each pixel in a training image is labelled with a semantic label, the semantic labels comprising foreground labels and background labels; during training of the instance branch network, each pixel in a training image needs to be labelled with a feature label, which may include texture features and spatial position features.
In the embodiments of this specification, the cross-entropy loss function of the first network may be:

$$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $p_i$ is the prediction probability, $y_i$ is the class label (0 or 1), and $N$ is the number of features.

The intra-class aggregation loss function of the second network may be:

$$L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\left[\lVert \mu_c - x_i \rVert - \delta_v\right]_+^{2}$$

where $C$ is the number of instances in the training image, $N_c$ is the number of pixels belonging to instance $c$, $\delta_v$ is the intra-class penalty factor, $\mu_c$ is the mean of the features within a class, and $x_i$ is a pixel feature.

The inter-class distinction loss function of the second network may be:

$$L_{dist} = \frac{1}{C(C-1)}\sum_{c_a=1}^{C}\sum_{\substack{c_b=1 \\ c_b \neq c_a}}^{C}\left[2\delta_d - \lVert \mu_{c_a} - \mu_{c_b} \rVert\right]_+^{2}$$

where $C$ is the number of instances in the training image, $\delta_d$ is the inter-class penalty factor, $\mu_{c_a}$ and $\mu_{c_b}$ are the means of the features within classes $c_a$ and $c_b$, and $[\cdot]_+ = \max(0, \cdot)$ denotes the hinge.
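A hedged PyTorch sketch of these three losses, assuming the standard hinged discriminative-loss formulation that the variable definitions above suggest (the exact patented formulas may differ in details such as weights):

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # p: predicted foreground probability per pixel, y: 0/1 semantic label
    return F.binary_cross_entropy(p, y.float())

def intra_class_loss(feats: torch.Tensor, labels: torch.Tensor, delta_v: float = 0.5):
    # pull each pixel feature to within delta_v of its instance mean
    ids = labels.unique()
    loss = feats.new_zeros(())
    for c in ids:
        x = feats[labels == c]                 # (N_c, D) features of instance c
        mu = x.mean(dim=0)                     # class mean mu_c
        hinge = (x - mu).norm(dim=1) - delta_v
        loss = loss + hinge.clamp(min=0).pow(2).mean()
    return loss / len(ids)

def inter_class_loss(feats: torch.Tensor, labels: torch.Tensor, delta_d: float = 1.5):
    # push instance means at least 2 * delta_d apart
    mus = torch.stack([feats[labels == c].mean(dim=0) for c in labels.unique()])
    loss, pairs = feats.new_zeros(()), 0
    for a in range(len(mus)):
        for b in range(a + 1, len(mus)):
            hinge = 2 * delta_d - (mus[a] - mus[b]).norm()
            loss = loss + hinge.clamp(min=0).pow(2)
            pairs += 1
    return loss / max(pairs, 1)
```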
In the embodiments of this disclosure, the first network and the second network are both parts of the same deep learning network, and the method further includes:
constructing a regularization loss function for the deep learning network;
accordingly, determining the sum of the cross-entropy loss function, the intra-class aggregation loss function and the inter-class distinction loss function as the comprehensive loss function includes:
determining the sum of the cross-entropy loss function, the intra-class aggregation loss function, the inter-class distinction loss function and the regularization loss function as the comprehensive loss function.
In the embodiments of this specification, the regularization loss function may be an L1 or an L2 regularization function; introducing the regularization loss function into the calculation of the comprehensive loss function prevents the model corresponding to the network from overfitting and improves the generalization ability of the model.
In the embodiments of this disclosure, the deep learning network may be a U-Net, a classical fully convolutional network (i.e., a network with no fully connected operations). The input of the network is a picture whose edges have been mirrored. On the left side of the network is a series of downsampling operations composed of convolution and max pooling, referred to as the contracting path. The contracting path consists of 4 blocks; each block uses 3 valid convolutions and 1 max-pooling downsampling, and the number of feature maps is doubled after each downsampling. The right part of the network is the expansive path, likewise consisting of 4 blocks; before each block begins, deconvolution doubles the feature map size while halving the number of channels (slightly different in the last layer), and the result is then merged with the feature map of the symmetric block on the contracting path. Because the sizes of the feature maps on the left (contracting) path and the right (expansive) path differ, U-Net normalizes them by cropping the contracting-path feature map to the same size as the expansive-path feature map. The convolutions of the expansive path still use valid convolution operations.
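The cropping detail mentioned above can be sketched as follows (standard U-Net practice; the sizes are illustrative assumptions):

```python
import torch

def center_crop(feat: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Center-crop a contracting-path feature map to the expansive-path size."""
    _, _, H, W = feat.shape
    top, left = (H - h) // 2, (W - w) // 2
    return feat[:, :, top:top + h, left:left + w]

skip = center_crop(torch.randn(1, 64, 136, 136), 132, 132)
print(skip.shape)  # torch.Size([1, 64, 132, 132]) -- ready to merge
```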
S209: fusing the semantic features with the instance features to determine the fusion feature of each pixel in the image to be detected.
In the embodiments of this disclosure, as shown in FIG. 5, fusing the semantic features with the instance features and determining the fusion feature of each pixel in the image to be detected may include:
S2091: fusing the first mask of the background features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the background in the image to be detected, to obtain a first fusion result;
S2093: fusing the second mask of the foreground features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the foreground in the image to be detected, to obtain a second fusion result;
S2095: determining the fusion feature of each pixel in the image to be detected according to the first fusion result and the second fusion result.
In the embodiments of this specification, the strategy of fusing connected-domain regional clustering with spatial position features ensures, on the one hand, that pixels in different regions are not clustered into one category, and on the other hand accelerates clustering.
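A minimal sketch of this masking-based fusion (the shapes and the multiplicative gating are assumptions for illustration):

```python
import torch

def fuse(instance_feats: torch.Tensor, fg_mask: torch.Tensor):
    """instance_feats: (B, 10, H, W); fg_mask: (B, 1, H, W), 1 = foreground, 0 = background."""
    fg_feats = instance_feats * fg_mask        # second mask keeps foreground pixel features
    bg_feats = instance_feats * (1 - fg_mask)  # first mask keeps background pixel features
    return fg_feats, bg_feats                  # together: per-pixel fusion features
```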
S2011: determining the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected.
In the embodiments of this disclosure, after the step of determining the instance category of each pixel in the image to be detected, the method further includes:
determining the pixel set corresponding to each instance category through a density clustering algorithm;
and in response to a trigger operation on a display interface, displaying the image corresponding to each instance in the image to be detected according to the pixel set corresponding to each instance category.
In the embodiments of this disclosure, density-based clustering methods are based on the density of the data set in its spatial distribution and do not require the number of clusters to be preset, which makes them particularly suitable for clustering data sets of unknown content. Representative algorithms are DBSCAN and OPTICS. Taking DBSCAN as an example: the classical density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density clustering algorithm based on high-density connected regions, and it aims to find the maximal sets of density-connected objects.
The basic DBSCAN flow is as follows: starting from an arbitrary object P, all objects density-reachable from P under the given threshold and parameters are extracted via breadth-first search, yielding one cluster. If P is a core object, the reached objects can be marked as the current class and the class expanded from them. After a complete cluster has been obtained, a new object is selected and the process is repeated. If P is a boundary object, it is marked as noise and discarded.
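An illustrative sketch of this clustering step with scikit-learn's DBSCAN over the fused per-pixel features of one connected foreground region (the eps and min_samples values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(fused: np.ndarray) -> np.ndarray:
    """fused: (N, 10) fusion features of the foreground pixels of one region.
    Returns one instance id per pixel; -1 marks noise (boundary objects)."""
    return DBSCAN(eps=0.5, min_samples=10).fit_predict(fused)

ids = cluster_instances(np.random.rand(1000, 10).astype(np.float32))
print(ids.max() + 1, "instances found")
```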
Specifically, in the embodiments of this specification, the instances obtained after density clustering may be turned, by a suitable vectorization algorithm, into instance polygons composed of a small number of points; the image corresponding to an instance may thus be a polygonal structure. The trigger operation on the display interface may be a slide, click, drag or other operation by the user on the display interface; for example, the user may click "image preview" on the display interface, whereupon the image corresponding to each instance in the image to be detected is constructed and displayed, with the instances in the image separated from one another.
In the embodiments of this disclosure, after the step of determining, through the density clustering algorithm, the pixel set corresponding to each instance category, the method may further include:
sending the pixel set corresponding to each instance category to a terminal, where the terminal, in response to an operation on its display interface, constructs and displays the image corresponding to each instance in the image to be detected.
Specifically, in the embodiments of this specification, the instances in the image may be marked with different colours so that the user can distinguish the different instances in the image. The terminal may include a map application, which, in response to an operation on the display interface, constructs and displays the image corresponding to each instance in the image to be detected, thereby intuitively presenting to the user the map information corresponding to the image to be detected.
In a specific embodiment, as shown in FIG. 6, the network framework corresponding to the method of the application is divided into three parts: feature-extraction downsampling, feature-fusion upsampling, and instance clustering. The model corresponding to this network framework is the instance-segmentation image determination model: in application, the image to be detected is input directly into the model, and the output instance-segmentation image is obtained. Specifically, the downsampling network 04 first processes the image to be detected 03 to obtain the shared features; the shared features are then fed into the instance branch network 05 and the semantic branch network 06 respectively, yielding the instance feature map 07 and the semantic feature map 08; the intra-instance pixel cluster map 09 is obtained from the instance feature map 07 and the semantic feature map 08, and the instance-segmentation image 10 is finally obtained.
In a specific embodiment, the method of the application is used to segment building instances in dense areas and is applied in a map application; the accuracy exceeds 95%, and the efficiency of generating map data is greatly improved.
As can be seen from the technical solutions provided by the embodiments of this specification, shared features are obtained by performing downsampling processing on the image to be detected; semantic segmentation processing and instance analysis processing are then performed on the shared features to obtain semantic features and instance features, the instance features including the spatial position feature of each pixel in the image to be detected; the semantic features are fused with the instance features to determine the fusion feature of each pixel in the image to be detected; and the instance category of each pixel is determined according to its fusion feature. By fusing the semantic features obtained by the connected-domain regional clustering method with instance features that include spatial position features, a fusion feature is obtained for each pixel, which improves the accuracy of determining the instance category of a pixel, ensures that instances in different regions are not clustered into one category, and accelerates clustering.
An embodiment of the application further provides an apparatus for determining the instance category of pixels in an image; as shown in FIG. 7, the apparatus includes:
an image acquisition module 710, configured to acquire an image to be detected, wherein the image to be detected comprises a target number of instances;
a shared-feature determining module 720, configured to perform downsampling processing on the image to be detected to obtain shared features;
a semantic-feature determining module 730, configured to perform semantic segmentation processing on the shared features by adopting a connected-domain regional clustering method to obtain semantic features;
an instance-feature determining module 740, configured to perform instance analysis processing on the shared features to obtain instance features, wherein the instance features comprise the spatial position feature of each pixel in the image to be detected;
a fusion-feature determining module 750, configured to fuse the semantic features with the instance features and determine the fusion feature of each pixel in the image to be detected;
and an instance-category determining module 760, configured to determine the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected.
In some embodiments, the apparatus may further comprise:
a pixel-set determining module, configured to determine the pixel set corresponding to each instance category through a density clustering algorithm;
and an image display module, configured to display, in response to a trigger operation on a display interface, the image corresponding to each instance in the image to be detected according to the pixel set corresponding to each instance category.
In some embodiments, the semantic features include the background features and the foreground features of the image to be detected, and the semantic-feature determining module may include:
a semantic-feature determining unit, configured to perform semantic segmentation processing on the shared features by adopting the connected-domain regional clustering method to obtain the background features and the foreground features of the image to be detected.
In some embodiments, the apparatus may further comprise:
a mask determining module, configured to determine the first mask of the background features and the second mask of the foreground features of the image to be detected.
In some embodiments, the instance features further include the texture feature of each pixel in the image to be detected, and the instance-feature determining module may include:
an instance-feature determining unit, configured to perform instance analysis processing on the shared features to obtain the texture feature and spatial position feature of each pixel in the image to be detected.
In some embodiments, the fusion-feature determining module may include:
a first-fusion-result determining unit, configured to fuse the first mask of the background features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the background, to obtain a first fusion result;
a second-fusion-result determining unit, configured to fuse the second mask of the foreground features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the foreground, to obtain a second fusion result;
and a fusion-feature determining unit, configured to determine the fusion feature of each pixel in the image to be detected according to the first fusion result and the second fusion result.
In some embodiments, as shown in FIG. 8, the shared-feature determining module 720 may include:
an edge-texture-feature-set extraction unit 7210, configured to extract the edge texture feature set of the image to be detected;
an edge-texture-combination-feature determining unit 7220, configured to determine the edge texture combination features according to the edge texture feature set of the image to be detected;
a normalized-feature determining unit 7230, configured to perform normalized normal-distribution processing on the edge texture combination features to obtain the normalized features;
and a shared-feature determining unit 7240, configured to perform nonlinear mapping processing on the normalized features to obtain the shared features.
In some embodiments, the apparatus may further comprise:
a cross-entropy loss function construction module, configured to construct the cross-entropy loss function of the first network;
a second-network loss function construction module, configured to construct the intra-class aggregation loss function and the inter-class distinction loss function of the second network;
a comprehensive loss function determining module, configured to determine the sum of the cross-entropy loss function, the intra-class aggregation loss function and the inter-class distinction loss function as the comprehensive loss function;
a parameter adjustment module, configured to adjust the parameters of the first network and of the second network respectively to obtain a current first network and a current second network;
a comprehensive loss value calculation module, configured to calculate the comprehensive loss value corresponding to the current first network and the current second network;
and a branch network determining module, configured to determine, when the comprehensive loss value is smaller than the preset threshold, the current first network as the semantic branch network and the current second network as the instance branch network.
In some embodiments, the apparatus may further comprise:
a repeating module, configured to repeat, when the comprehensive loss value is greater than or equal to the preset threshold, the step of adjusting the parameters of the first network and of the second network respectively to obtain a current first network and a current second network.
The apparatus embodiments and the method embodiments described above are based on the same inventive concept.
An embodiment of the application provides a device for determining the instance category of pixels in an image, the device comprising a processor and a memory, wherein the memory stores at least one instruction or at least one program, and the at least one instruction or program is loaded and executed by the processor to implement the method for determining the instance category of pixels in an image provided by the method embodiments above.
An embodiment of the application further provides a computer storage medium, which may be provided in a terminal to store at least one instruction or at least one program related to the method for determining the instance category of pixels in an image of the method embodiments, the at least one instruction or program being loaded and executed by the processor to implement the method for determining the instance category of pixels in an image provided by the method embodiments.
Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The memory according to the embodiments of the present disclosure may be used to store software programs and modules; the processor executes the software programs and modules stored in the memory to perform various functional applications and data processing. The memory may mainly include a program storage area and a data storage area: the program storage area may store an operating system, application programs required for functions, and the like, while the data storage area may store data created according to the use of the device. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The method for determining the instance category of pixels in an image provided by the embodiments of the application may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Taking execution on a server as an example, fig. 9 is a block diagram of the hardware structure of a server running the method according to an embodiment of the present application. As shown in fig. 9, the server 900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 910 (a processor 910 may include, but is not limited to, a microprocessor such as an MCU or a programmable logic device such as an FPGA), a memory 930 for storing data, and one or more storage media 920 (e.g., one or more mass storage devices) for storing applications 923 or data 922. The memory 930 and the storage medium 920 may provide transitory or persistent storage. The program stored on the storage medium 920 may include one or more modules, each of which may include a series of instruction operations on the server. Further, the central processing unit 910 may be configured to communicate with the storage medium 920 and execute, on the server 900, the series of instruction operations in the storage medium 920. The server 900 may also include one or more power supplies 960, one or more wired or wireless network interfaces 950, one or more input/output interfaces 940, and/or one or more operating systems 921, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input/output interface 940 may be used to receive or transmit data via a network. Specific examples of such a network include a wireless network provided by a communication provider of the server 900. In one example, the input/output interface 940 includes a network interface controller (NIC) that can be connected to other network devices through a base station to communicate with the Internet. In another example, the input/output interface 940 may be a radio frequency (RF) module configured to communicate with the Internet wirelessly.
It will be appreciated by those skilled in the art that the configuration shown in fig. 9 is merely illustrative and is not intended to limit the configuration of the electronic device. For example, server 900 may also include more or fewer components than shown in fig. 9, or have a different configuration than shown in fig. 9.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations described above.
As can be seen from the embodiments of the method, device, server, and storage medium for determining the instance category of pixels in an image provided by the application, a shared feature is obtained by downsampling the image to be detected, and semantic segmentation processing and instance analysis processing are then performed on the shared feature to obtain semantic features and instance features respectively, where the instance features include the spatial position feature of each pixel in the image to be detected; the semantic features are fused with the instance features to determine the fusion feature of each pixel in the image to be detected; and the instance category of each pixel in the image to be detected is determined according to its fusion feature. By fusing the semantic features obtained by the connected-domain regional clustering method with instance features that include spatial position features, a fusion feature is obtained for each pixel. This improves the accuracy of determining the instance category of a pixel, ensures that instances located in different regions are not clustered into one category, and increases the clustering speed.
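As a concluding illustration of the clustering step (grouping pixels into a pixel set per instance category by a density clustering algorithm, as in the embodiments above), the following Python sketch uses DBSCAN over the per-pixel fusion features. DBSCAN itself, the eps and min_samples values, and the (H, W, D) array layout are assumptions of this sketch; the embodiments do not mandate a specific density clustering algorithm or parameters.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def pixels_per_instance(fusion_features):
        # fusion_features: (H, W, D) array of per-pixel fusion features
        h, w, d = fusion_features.shape
        flat = fusion_features.reshape(-1, d)
        labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(flat)
        # Map each instance category to the set of its pixel coordinates;
        # DBSCAN labels noise points as -1, which are skipped here.
        ys, xs = np.divmod(np.arange(h * w), w)
        return {lbl: list(zip(ys[labels == lbl], xs[labels == lbl]))
                for lbl in set(labels) if lbl != -1}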
It should be noted that the order of the embodiments of the present application is for description only and does not imply any preference among them. The foregoing description has been directed to specific embodiments of this specification; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible and may be advantageous.
In this specification, the embodiments are described in a progressive manner; identical or similar parts among the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for details, refer to the corresponding parts of the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The foregoing is only a description of preferred embodiments of the application and is not intended to limit the application to the precise form disclosed; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the application are intended to be included within its scope.

Claims (15)

1. A method for determining an instance class of pixels in an image, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
Performing downsampling processing on the image to be detected to obtain shared features;
Carrying out semantic segmentation processing on the shared features by adopting a connected domain regional clustering method to obtain semantic features; the semantic features comprise background features and foreground features of the image to be detected;
Determining a first mask of the background features and a second mask of the foreground features of the image to be detected;
Performing instance analysis processing on the shared features to obtain instance features; the instance features comprise the spatial position features of each pixel in the image to be detected and the texture features of each pixel;
Fusing the first mask of the background features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the background in the image to be detected to obtain a first fusion result;
Fusing the second mask of the foreground features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the foreground in the image to be detected to obtain a second fusion result;
Determining the fusion feature of each pixel in the image to be detected according to the first fusion result and the second fusion result;
and determining the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected.
2. The method of claim 1, wherein after the step of determining an instance class for each pixel in the image to be detected, the method further comprises:
Determining a pixel set corresponding to each instance category through a density clustering algorithm;
in response to a triggering operation on a display interface, displaying the image corresponding to each instance in the image to be detected according to the pixel set corresponding to each instance category.
3. The method of claim 1, wherein performing instance analysis processing on the shared feature to obtain an instance feature comprises:
And carrying out instance analysis processing on the shared features to obtain texture features and spatial position features of each pixel in the image to be detected.
4. The method of claim 1, wherein downsampling the image to be detected to obtain a shared feature comprises:
extracting an edge texture feature set of the image to be detected;
determining edge texture combination features according to the edge texture feature set of the image to be detected;
Carrying out normalized normal distribution processing on the edge texture combination features to obtain normalized features;
and carrying out nonlinear mapping processing on the normalized features to obtain the shared features.
5. The method of claim 1, wherein the performing semantic segmentation processing on the shared feature by using a connected domain regional clustering method to obtain a semantic feature comprises:
carrying out semantic segmentation processing on the shared features through a semantic branch network to obtain semantic features;
Performing instance analysis processing on the shared feature to obtain an instance feature, wherein the instance feature comprises:
and carrying out instance analysis processing on the shared features through an instance branch network to obtain instance features.
6. The method of claim 5, wherein prior to the step of semantically partitioning the shared features via a semantic branch network to obtain semantic features, the method further comprises:
Constructing a cross entropy loss function of a first network;
Constructing an intra-class aggregation loss function and an inter-class distinction loss function of a second network;
Determining the sum of the cross entropy loss function, the intra-class aggregation loss function, and the inter-class distinction loss function as a comprehensive loss function;
Respectively adjusting parameters of the first network and the second network to obtain a current first network and a current second network;
Calculating comprehensive loss values corresponding to the current first network and the current second network;
and when the comprehensive loss value is smaller than a preset threshold value, determining the current first network as the semantic branch network, and determining the current second network as the instance branch network.
7. An example class determination device for pixels in an image, the device comprising:
The image acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
the shared feature determining module is used for carrying out downsampling processing on the image to be detected to obtain shared features;
The semantic feature determining module is used for carrying out semantic segmentation processing on the shared features by adopting a connected domain regional clustering method to obtain semantic features; the semantic features comprise background features and foreground features of the image to be detected;
A mask determining module, configured to determine a first mask of the background features and a second mask of the foreground features of the image to be detected;
The instance feature determining module is used for carrying out instance analysis processing on the shared features to obtain instance features; the instance features comprise the spatial position features of each pixel in the image to be detected and the texture features of each pixel;
The fusion feature determining module is used for fusing the semantic features with the instance features to determine the fusion feature of each pixel in the image to be detected;
The instance category determining module is used for determining the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected;
Wherein the fusion feature determining module comprises:
the first fusion result determining unit is used for fusing the first mask of the background features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the background in the image to be detected to obtain a first fusion result;
The second fusion result determining unit is used for fusing the second mask of the foreground features of the image to be detected with the texture features and spatial position features of the pixels corresponding to the foreground in the image to be detected to obtain a second fusion result;
And the fusion feature determining unit is used for determining the fusion feature of each pixel in the image to be detected according to the first fusion result and the second fusion result.
8. The apparatus of claim 7, wherein the apparatus further comprises:
The pixel set determining module is used for determining a pixel set corresponding to each instance category through a density clustering algorithm;
And the image display module is used for displaying, in response to a triggering operation on a display interface, the image corresponding to each instance in the image to be detected according to the pixel set corresponding to each instance category.
9. The apparatus of claim 7, wherein the instance feature determining module comprises:
and the instance feature determining unit is used for carrying out instance analysis processing on the shared features to obtain the texture features and spatial position features of each pixel in the image to be detected.
10. The apparatus of claim 7, wherein the shared feature determination module comprises:
an edge texture feature set extracting unit, configured to extract an edge texture feature set of the image to be detected;
An edge texture combination feature determining unit, configured to determine an edge texture combination feature according to the edge texture feature set of the image to be detected;
The normalized feature determining unit is used for carrying out normalized normal distribution processing on the edge texture combination feature to obtain a normalized feature;
and the shared feature determining unit is used for carrying out nonlinear mapping processing on the normalized feature to obtain the shared features.
11. The apparatus of claim 7, wherein the semantic feature determining module is further configured to perform semantic segmentation processing on the shared feature through a semantic branch network to obtain a semantic feature;
The instance feature determining module is further configured to perform instance analysis processing on the shared features through an instance branch network to obtain instance features.
12. The apparatus of claim 11, wherein the apparatus further comprises:
The cross entropy loss function construction module is used for constructing a cross entropy loss function of the first network;
The loss function construction module of the second network is used for constructing an intra-class aggregation loss function and an inter-class distinction loss function of the second network;
the comprehensive loss function determining module is used for determining the sum of the cross entropy loss function, the intra-class aggregation loss function and the inter-class distinction loss function as a comprehensive loss function;
the parameter adjustment module is used for respectively adjusting parameters of the first network and the second network to obtain a current first network and a current second network;
the comprehensive loss value calculation module is used for calculating the comprehensive loss values corresponding to the current first network and the current second network;
and the branch network determining module is used for determining the current first network as the semantic branch network and determining the current second network as the instance branch network when the comprehensive loss value is smaller than a preset threshold value.
13. The apparatus of claim 12, wherein the apparatus further comprises:
The repeating module is used for repeating the following step when the comprehensive loss value is greater than or equal to the preset threshold value: respectively adjusting parameters of the first network and the second network to obtain a current first network and a current second network.
14. An instance class determination device for pixels in an image, the device comprising a processor and a memory, the memory having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement the instance class determination method for pixels in an image as claimed in any one of claims 1-6.
15. A computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the method of instance class determination of pixels in an image according to any one of claims 1-6.
CN202011040874.1A (priority date 2020-09-28, filing date 2020-09-28), Active, CN112053358B (en): Method, device, equipment and storage medium for determining instance category of pixel in image

Priority Applications (1)

- CN202011040874.1A: CN112053358B (en), Method, device, equipment and storage medium for determining instance category of pixel in image

Publications (2)

- CN112053358A (en), published 2020-12-08
- CN112053358B (en), published 2024-09-13

Family

ID=73605746

Family Applications (1)

- CN202011040874.1A (Active): CN112053358B (en), Method, device, equipment and storage medium for determining instance category of pixel in image

Country Status (1)

- CN: CN112053358B (en)

Families Citing this family (6) (* cited by examiner, † cited by third party)

- CN112819008B* (priority 2021-01-11, published 2022-10-28), 腾讯科技(深圳)有限公司: Method, device, medium and electronic equipment for optimizing instance detection network
- CN113096140B* (priority 2021-04-15, published 2022-11-22), 北京市商汤科技开发有限公司: Instance partitioning method and device, electronic device and storage medium
- CN113361247A* (priority 2021-06-23, published 2021-09-07), 北京百度网讯科技有限公司: Document layout analysis method, model training method, device and equipment
- CN113689373B* (priority 2021-10-21, published 2022-02-11), 深圳市慧鲤科技有限公司: Image processing method, device, equipment and computer readable storage medium
- CN115359261B* (priority 2022-10-21, published 2023-03-24), 阿里巴巴(中国)有限公司: Image recognition method, computer-readable storage medium, and electronic device
- CN115661821B* (priority 2022-12-22, published 2023-04-11), 摩尔线程智能科技(北京)有限责任公司: Loop detection method, loop detection device, electronic apparatus, storage medium, and program product

Citations (2) (* cited by examiner, † cited by third party)

- CN109801307A* (priority 2018-12-17, published 2019-05-24), 中国科学院深圳先进技术研究院: A kind of panorama dividing method, device and equipment
- CN110008808A* (priority 2018-12-29, published 2019-07-12), 北京迈格威科技有限公司: Panorama dividing method, device and system and storage medium

Family Cites Families (4) (* cited by examiner, † cited by third party)

- KR102160224B1* (priority 2018-12-20, published 2020-09-25), 네이버랩스 주식회사: Semantic object region segmentation method and system based on weak map learning object detector
- CN109801297B* (priority 2019-01-14, published 2020-12-11), 浙江大学: Image panorama segmentation prediction optimization method based on convolution
- CN110276765B* (priority 2019-06-21, published 2021-04-23), 北京交通大学: Image panorama segmentation method based on multitask learning deep neural network
- CN111461114B* (priority 2020-03-03, published 2023-05-02), 华南理工大学: Multi-scale feature pyramid text detection method based on segmentation

Non-Patent Citations (1) (* cited by examiner, † cited by third party)

- Daan de Geus et al., "Panoptic Segmentation with a Joint Semantic and Instance Segmentation Network", arXiv (main text, sections 1-5) *

Also Published As

- CN112053358A (en), published 2020-12-08

Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant