
CN115049827A - Target object detection and segmentation method, device, equipment and storage medium

Info

Publication number
CN115049827A
Authority
CN
China
Prior art keywords
target object, dimensional, feature, data, dimensional space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210554832.2A
Other languages
Chinese (zh)
Other versions
CN115049827B (en)
Inventor
郭湘
韩旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Weride Technology Co Ltd
Original Assignee
Guangzhou Weride Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Weride Technology Co Ltd filed Critical Guangzhou Weride Technology Co Ltd
Priority to CN202210554832.2A priority Critical patent/CN115049827B/en
Publication of CN115049827A publication Critical patent/CN115049827A/en
Application granted granted Critical
Publication of CN115049827B publication Critical patent/CN115049827B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of automatic driving, and discloses a target object detection and segmentation method, apparatus, device and storage medium, which are used for improving the accuracy and efficiency of target object detection and segmentation and reducing the code maintenance cost. The target object detection and segmentation method comprises the following steps: acquiring a target object two-dimensional image dataset and target object three-dimensional space image data; performing feature extraction on the target object two-dimensional image dataset through a preset mesh division model to obtain a target object two-dimensional feature dataset; carrying out grid division and feature processing on the target object three-dimensional space image data to obtain target object three-dimensional space feature data; performing feature mapping fusion on the target object two-dimensional feature dataset and the target object three-dimensional space feature data to obtain a feature tensor of a three-dimensional top view; and carrying out detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.

Description

Target object detection and segmentation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of automatic driving, in particular to a target object detection and segmentation method, device, equipment and storage medium.
Background
At present, cameras are technologically mature and stable, low in manufacturing cost and rich in information, which makes them an important sensing element in the field of unmanned driving. A camera itself has no absolute ranging capability, so in theory a target object (e.g., an obstacle, a lane line, etc.) appears on the two-dimensional plane of the camera image without three-dimensional spatial position information.
In the prior art, commonly used three-dimensional space visual target object detection schemes include the monocular three-dimensional space frame detection scheme and the bird's-eye view image detection and segmentation scheme. The three-dimensional space frame detection scheme predicts the actual position of each two-dimensional object in three-dimensional space from the image perspective itself. However, multiple cameras are usually arranged in an unmanned-driving scene, and some target objects can only be observed completely across several cameras; the three-dimensional space frame detection scheme has design defects in processing such cross-camera objects. The bird's-eye view detection and segmentation scheme splits vision-based three-dimensional space detection and segmentation into a pixel-by-pixel depth estimation task and a pseudo-point-cloud-based bird's-eye view detection and segmentation task, so the whole task is no longer learned end to end, which brings large information loss and high code maintenance cost.
Disclosure of Invention
The invention provides a target object detection and segmentation method, apparatus, device and storage medium, which perform feature mapping fusion on a target object two-dimensional image dataset and target object three-dimensional space image data, thereby realizing an end-to-end target object detection and segmentation task, improving the accuracy and efficiency of target object detection and segmentation, and reducing the code maintenance cost.
In order to achieve the above object, a first aspect of the present invention provides a target object detection and segmentation method, including: acquiring a target object two-dimensional image data set and target object three-dimensional space image data; performing feature extraction on the target object two-dimensional image data set through a preset mesh division model to obtain a target object two-dimensional feature data set; carrying out grid division and feature processing on the three-dimensional space image data of the target object to obtain three-dimensional space feature data of the target object; performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional space feature data of the target object to obtain a feature tensor of a three-dimensional top view; and carrying out detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
In a possible embodiment, the performing feature extraction on the target object two-dimensional image dataset through a preset mesh division model to obtain a target object two-dimensional feature dataset includes: performing convolution operation on the target object two-dimensional image data set through a trunk feature extraction network in a preset grid division model to obtain initial two-dimensional feature data corresponding to each target object two-dimensional image data; and respectively carrying out convolution processing on the initial two-dimensional feature data corresponding to the two-dimensional image data of each target object according to two convolution kernels with preset sizes to obtain a two-dimensional feature data set of the target object, wherein the two-dimensional feature data set of the target object comprises a two-dimensional key feature matrix and a two-dimensional value feature matrix.
In a possible implementation manner, the gridding and feature processing the three-dimensional spatial image data of the target object to obtain three-dimensional spatial feature data of the target object includes: carrying out grid division on the three-dimensional space image data of the target object to obtain three-dimensional space grid data of the target object; and expanding the characteristic dimension corresponding to the target object three-dimensional space grid data to a target characteristic dimension according to a preset convolution kernel to obtain target object three-dimensional space characteristic data, wherein the target object three-dimensional space characteristic data is expressed as a query characteristic matrix.
In a possible embodiment, the performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional spatial feature data of the target object to obtain a feature tensor of a three-dimensional top view includes: scattering the two-dimensional key feature matrix and the two-dimensional value feature matrix in the target object two-dimensional feature data set to the target object three-dimensional space feature data through a scattering operator to obtain a three-dimensional key feature matrix and a three-dimensional value feature matrix; and performing feature fusion according to the three-dimensional space feature data of the target object, the three-dimensional key feature matrix and the three-dimensional value feature matrix to obtain a feature tensor of the three-dimensional top view.
In a possible implementation manner, the performing feature fusion according to the three-dimensional spatial feature data of the target object, the three-dimensional key feature matrix, and the three-dimensional value feature matrix to obtain a feature tensor of a three-dimensional top view includes: performing similarity calculation on the three-dimensional space characteristic data of the target object and the three-dimensional key characteristic matrix according to an inner product operator to obtain a similarity calculation result; multiplying the similarity calculation result by the three-dimensional value feature matrix to obtain an initial feature tensor of the three-dimensional space of the target object; and carrying out maximum pooling on the initial feature tensor of the three-dimensional space of the target object to obtain the feature tensor of the three-dimensional top view.
In a possible embodiment, the performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional spatial feature data of the target object to obtain a feature tensor of a three-dimensional top view includes: calculating the depth value and the mean square error corresponding to the two-dimensional feature data of each target object according to the two-dimensional feature data set of the target object; and performing feature projection according to the three-dimensional space feature data of the target object and the depth value and mean square error corresponding to the two-dimensional feature data of each target object to obtain a feature tensor of the three-dimensional top view.
In a possible implementation manner, the performing detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result includes: performing convolution operation on the feature tensor of the three-dimensional top view through a preset detection segmentation model to obtain a target feature map; and carrying out detection segmentation processing on the target feature map to obtain a target object detection segmentation result.
A second aspect of the present invention provides a target object detection and segmentation apparatus, including: an acquisition module, used for acquiring a target object two-dimensional image data set and target object three-dimensional space image data; an extraction module, used for extracting the features of the target object two-dimensional image data set through a preset mesh division model to obtain a target object two-dimensional feature data set; a processing module, used for carrying out grid division and feature processing on the three-dimensional space image data of the target object to obtain three-dimensional space feature data of the target object; a mapping fusion module, used for performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional space feature data of the target object to obtain a feature tensor of a three-dimensional top view; and a detection and segmentation module, used for carrying out detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
In a possible implementation manner, the extraction module is specifically configured to: performing convolution operation on the target object two-dimensional image data set through a trunk feature extraction network in a preset grid division model to obtain initial two-dimensional feature data corresponding to each target object two-dimensional image data; and respectively carrying out convolution processing on the initial two-dimensional feature data corresponding to the two-dimensional image data of each target object according to two convolution kernels with preset sizes to obtain a two-dimensional feature data set of the target object, wherein the two-dimensional feature data set of the target object comprises a two-dimensional key feature matrix and a two-dimensional value feature matrix.
In a possible implementation manner, the processing module is specifically configured to: carrying out grid division on the three-dimensional space image data of the target object to obtain three-dimensional space grid data of the target object; and expanding the characteristic dimension corresponding to the target object three-dimensional space grid data to a target characteristic dimension according to a preset convolution kernel to obtain target object three-dimensional space characteristic data, wherein the target object three-dimensional space characteristic data is expressed as a query characteristic matrix.
In a possible implementation manner, the mapping fusion module further includes: the scattering unit is used for scattering the two-dimensional key feature matrix and the two-dimensional value feature matrix in the target object two-dimensional feature data set to the target object three-dimensional space feature data through a scattering operator to obtain a three-dimensional key feature matrix and a three-dimensional value feature matrix; and the fusion unit is used for performing feature fusion according to the three-dimensional space feature data of the target object, the three-dimensional key feature matrix and the three-dimensional value feature matrix to obtain a feature tensor of the three-dimensional top view.
In a possible embodiment, the fusion unit is specifically configured to: performing similarity calculation on the three-dimensional space characteristic data of the target object and the three-dimensional key characteristic matrix according to an inner product operator to obtain a similarity calculation result; multiplying the similarity calculation result by the three-dimensional value feature matrix to obtain an initial feature tensor of the three-dimensional space of the target object; and carrying out maximum pooling on the initial feature tensor of the three-dimensional space of the target object to obtain the feature tensor of the three-dimensional top view.
In a possible implementation manner, the mapping fusion module further includes: the calculation unit is used for calculating the depth value and the mean square error corresponding to the two-dimensional feature data of each target object according to the two-dimensional feature data set of the target object; and the projection unit is used for performing feature projection according to the three-dimensional space feature data of the target object and the depth values and mean square errors corresponding to the two-dimensional feature data of each target object to obtain a feature tensor of the three-dimensional top view.
In a possible implementation manner, the detection and segmentation module is specifically configured to: performing convolution operation on the feature tensor of the three-dimensional top view through a preset detection segmentation model to obtain a target feature map; and carrying out detection segmentation processing on the target feature map to obtain a target object detection segmentation result.
A third aspect of the present invention provides a target object detection and segmentation device, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a communication line; the at least one processor invokes the instructions in the memory to cause the target object detection and segmentation device to perform the target object detection and segmentation method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-mentioned target object detection segmentation method.
In the technical scheme provided by the invention, a target object two-dimensional image dataset and target object three-dimensional space image data are obtained; feature extraction is performed on the target object two-dimensional image dataset through a preset mesh division model to obtain a target object two-dimensional feature dataset; grid division and feature processing are performed on the target object three-dimensional space image data to obtain target object three-dimensional space feature data; feature mapping fusion is performed on the target object two-dimensional feature dataset and the target object three-dimensional space feature data to obtain a feature tensor of a three-dimensional top view; and detection and segmentation are performed according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result. In the embodiment of the invention, the feature tensor of the three-dimensional top view is obtained by projecting the image features in the target object two-dimensional image dataset into the target object three-dimensional space image data, and the target object is detected and segmented according to the feature tensor of the three-dimensional top view, thereby realizing an end-to-end target object detection and segmentation task, improving the accuracy and efficiency of target object detection and segmentation, and reducing the code maintenance cost.
Drawings
Fig. 1 is a schematic diagram of an embodiment of a target object detection and segmentation method in an embodiment of the present invention;
Fig. 2 is a schematic diagram of another embodiment of the target object detection and segmentation method in an embodiment of the present invention;
Fig. 3 is a schematic diagram of an embodiment of a target object detection and segmentation apparatus in an embodiment of the present invention;
Fig. 4 is a schematic diagram of another embodiment of the target object detection and segmentation apparatus in an embodiment of the present invention;
Fig. 5 is a schematic diagram of an embodiment of a target object detection and segmentation device in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a target object detection and segmentation method, apparatus, device and storage medium, which perform feature mapping fusion on a target object two-dimensional image dataset and target object three-dimensional space image data, thereby realizing an end-to-end target object detection and segmentation task, improving the accuracy and efficiency of target object detection and segmentation, and reducing the code maintenance cost.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a detailed flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a target object detection and segmentation method according to an embodiment of the present invention includes:
101. and acquiring a target object two-dimensional image data set and target object three-dimensional space image data.
The target object may be an obstacle or a lane line, which is not specifically limited here. Specifically, the server receives a target object detection and segmentation request; the server extracts a target object identifier from the target object detection and segmentation request; the server queries image storage path information from a preset image data table according to the target object identifier; and the server reads the target object two-dimensional image dataset and the target object three-dimensional space image data from a preset file directory according to the image storage path information. The target object identifier is used to uniquely identify the target object.
It should be noted that the target object two-dimensional image dataset comprises a plurality of high-resolution RGB images acquired at different viewing angles by a plurality of preset cameras (for example, RGB cameras) in an unmanned-driving scene; the image sizes corresponding to the target object two-dimensional image data may be the same or different, which is not specifically limited here, and the target object three-dimensional space image data is a pre-constructed spatial image. The server presets a target object identifier and stores the mapping among the target object identifier, the target object two-dimensional image dataset and the target object three-dimensional space image data in a preset image data table.
It is to be understood that the executing subject of the present invention may be the target object detection and segmentation apparatus, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
102. Perform feature extraction on the target object two-dimensional image dataset through a preset mesh division model to obtain a target object two-dimensional feature dataset.
The preset mesh division model can be a pre-trained convolutional neural network model; it is used to determine the grid point feature mapping relation between the target object three-dimensional space image data and the target object two-dimensional image dataset according to the camera internal parameters and camera external parameters calibrated for the preset cameras. The camera internal parameters comprise an internal parameter matrix and a distortion parameter matrix, and the camera external parameters comprise a rotation matrix and a translation matrix.
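As a rough illustration of such a grid point feature mapping relation, the following Python sketch projects three-dimensional grid points into pixel coordinates with a standard pinhole model; the function and parameter names are hypothetical and the distortion parameters are ignored, since the patent does not give the formula explicitly.

```python
import torch

def project_grid_points(points_3d, intrinsics, rotation, translation):
    """Map 3D space grid points (..., 3) to pixel coordinates (u, v) in one
    camera, using the calibrated external parameters (rotation, translation)
    and the internal parameter matrix; distortion is omitted for brevity."""
    cam = points_3d @ rotation.T + translation   # world frame -> camera frame
    img = cam @ intrinsics.T                     # camera frame -> image plane
    return img[..., :2] / img[..., 2:3]          # perspective division
```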
Specifically, the server performs image preprocessing on the target object two-dimensional image dataset to obtain a preprocessed target object two-dimensional image dataset, for example, by performing operations such as filtering and denoising, image binarization, and image size adjustment, so that the feature map sizes corresponding to the preprocessed target object two-dimensional image dataset are the same; the server then inputs the preprocessed target object two-dimensional image dataset into the backbone network (namely, the trunk feature extraction network) in the preset mesh division model for multilayer convolution processing to obtain the target object two-dimensional feature dataset, wherein each target object two-dimensional feature data corresponds to a plurality of two-dimensional grid points, the grid image pixels in each two-dimensional grid point are the same, and the visual angle intervals are equal. The input feature map size corresponding to the target object two-dimensional image dataset is N×K×H×W×3 and the output feature map size corresponding to the target object two-dimensional feature dataset is N×K×H′×W′×C′, where N is the batch size of the target object two-dimensional image dataset and is a positive integer, K is the image dimension in each frame (for example, K is 6), H is the height and W the width in each target object two-dimensional image data, H′ is the height and W′ the width in each target object two-dimensional feature data, H, W, H′ and W′ are positive integers, and 3 represents the R, G, B color values in the target object two-dimensional image data.
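A minimal sketch of the shape bookkeeping described above; the ResNet-50 backbone is an assumed stand-in for the trunk feature extraction network, and the concrete sizes (N=2, K=6, H=W=448) are illustrative only. Here H′=H/32; tapping an earlier backbone stage would give the coarser fractions mentioned later in the text.

```python
import torch
import torchvision

N, K, H, W = 2, 6, 448, 448                       # illustrative sizes
images = torch.rand(N, K, H, W, 3)                # input: N x K x H x W x 3

backbone = torchvision.models.resnet50(weights=None)
trunk = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc

x = images.permute(0, 1, 4, 2, 3).reshape(N * K, 3, H, W)  # fold K into batch
feats = trunk(x)                                  # (N*K, C', H', W')
feats = feats.view(N, K, *feats.shape[1:])        # back to N x K x C' x H' x W'
```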
103. Perform grid division and feature processing on the target object three-dimensional space image data to obtain the target object three-dimensional space feature data.
Specifically, the server performs mesh division on the target object three-dimensional space image data through the preset mesh division model to obtain target object three-dimensional space grid data, wherein the target object three-dimensional space grid data comprise N space lattice points and the grid image pixels in the space lattice points are the same; the server then sequentially performs feature extraction and feature dimension expansion on the N space lattice points in the target object three-dimensional space grid data through a preset convolution kernel to obtain the target object three-dimensional space feature data.
It should be noted that the input feature map size corresponding to the target object three-dimensional space grid data is N×L×M×Y×3, where L is the length corresponding to the target object three-dimensional space grid data, M is the width, Y is the height, and 3 indicates that each space grid point in the target object three-dimensional space grid data is represented by three space coordinates (x, y, z); the output feature map size corresponding to the target object three-dimensional space feature data is N×L×M×Y×C″, where L is the length corresponding to the target object three-dimensional space feature data, M is the width, Y is the height, and C″ may be 128 or 256, which is not limited here. The target object three-dimensional space feature data include N space grid points; for example, the feature dimension of each space grid point is raised from 3 to 128 or 256.
104. Perform feature mapping fusion on the target object two-dimensional feature dataset and the target object three-dimensional space feature data to obtain the feature tensor of the three-dimensional top view.
It can be understood that, for each space lattice point in the target object three-dimensional space feature data, a corresponding two-dimensional lattice point on the target object two-dimensional feature data may be matched according to the camera external parameters and the camera internal parameters. The feature tensor of the three-dimensional top view corresponds to a feature map size of N×L×M×C″. In some embodiments, the server performs feature mapping fusion on the target object two-dimensional feature dataset and the target object three-dimensional space feature data based on an attention mechanism to obtain the feature tensor of the three-dimensional top view. Further, the server can also determine the feature tensor of the three-dimensional top view by feature projection. In some embodiments, the server calculates the depth value and mean square error corresponding to each target object two-dimensional feature data according to the target object two-dimensional feature dataset; the server then performs feature projection according to the target object three-dimensional space feature data and the depth value and mean square error corresponding to each target object two-dimensional feature data to obtain the feature tensor of the three-dimensional top view.
Further, the server can set the height in the target object three-dimensional space feature data to unit length 1 to obtain target object three-dimensional top-view feature data; the server then performs feature mapping fusion on the target object two-dimensional feature dataset and the target object three-dimensional top-view feature data to obtain the feature tensor of the three-dimensional top view.
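The patent does not spell the projection branch out; the following sketch is one plausible reading, assuming each two-dimensional feature cell predicts a depth value and a mean square error interpreted as a variance, so that a grid point receives the 2D feature weighted by the Gaussian likelihood of its camera-frame depth. All names are hypothetical.

```python
import torch

def depth_projection_weight(grid_depth, depth_value, depth_mse):
    """Gaussian weight for projecting a 2D feature onto a 3D grid point.
    grid_depth:  camera-frame depth of the grid point;
    depth_value: predicted depth for the 2D feature cell;
    depth_mse:   predicted mean square error, read here as a variance."""
    var = depth_mse.clamp(min=1e-6)               # avoid division by zero
    return torch.exp(-0.5 * (grid_depth - depth_value) ** 2 / var)

# A grid point's feature would then be feat_2d * weight, accumulated over
# all cameras that see it, before pooling down to the top view.
```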
105. Perform detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
It can be understood that the feature tensor of the three-dimensional top view has a feature map height of 1. The server fuses shallow-scale information and deep-scale information based on the feature tensor of the three-dimensional top view, and detects and segments the target object. In some embodiments, the server performs a convolution operation on the feature tensor of the three-dimensional top view through a preset detection segmentation model to obtain a target feature map, where the preset detection segmentation model is a pre-trained model and may be a convolutional neural network model or a deep learning network model, which is not specifically limited here; the server then performs detection and segmentation processing on the target feature map to obtain a target object detection and segmentation result. Specifically, the server performs detection on the target feature map to obtain a target detection result, and then performs feature segmentation processing on the target detection result to obtain the target object detection and segmentation result.
It should be noted that the target object detection and segmentation result has a corresponding target detection box, and the server may also calculate the value of the loss function according to the target detection box and a preset ground-truth detection box; the server adjusts the parameters of the preset detection segmentation model according to the value of the loss function and retrains it until the value of the loss function meets the preset convergence condition, at which point the server completes the iterative optimization training of the preset detection segmentation model, improving the accuracy with which the preset detection segmentation model detects and segments the target object.
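A toy sketch of this detection/segmentation stage, assuming the channel-first layout (N, C″, L, M) used by PyTorch convolutions with C″=128; the two-branch head and the smooth-L1 box loss are illustrative assumptions rather than the patent's stated architecture.

```python
import torch
import torch.nn as nn

class BevHead(nn.Module):
    """Convolutional head over the feature tensor of the three-dimensional
    top view: one branch for segmentation, one for detection boxes."""
    def __init__(self, c_in=128, num_classes=4, box_dims=7):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.ReLU(),
        )
        self.seg = nn.Conv2d(c_in, num_classes, 1)  # per-cell class logits
        self.box = nn.Conv2d(c_in, box_dims, 1)     # per-cell box regression

    def forward(self, bev):
        f = self.trunk(bev)
        return self.seg(f), self.box(f)

head = BevHead()
seg_logits, boxes = head(torch.rand(2, 128, 200, 200))
# Training would compare `boxes` against ground-truth detection boxes, e.g.:
loss = nn.functional.smooth_l1_loss(boxes, torch.zeros_like(boxes))
```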
In the embodiment of the invention, the characteristic tensor of the three-dimensional top view is obtained by projecting the image characteristics in the two-dimensional image data set of the target object into the three-dimensional space image data of the target object, and the target object is detected and segmented according to the characteristic tensor of the three-dimensional top view, so that the end-to-end target object detection and segmentation task is realized, the accuracy and efficiency of the target object detection and segmentation are improved, and the code maintenance cost is reduced.
Referring to fig. 2, another embodiment of the method for detecting and segmenting a target object according to the embodiment of the present invention includes:
201. and acquiring a target object two-dimensional image data set and target object three-dimensional space image data.
The specific execution process of step 201 is similar to the specific execution process of step 101, and details thereof are not repeated here.
202. Perform feature extraction on the target object two-dimensional image dataset through a preset mesh division model to obtain a target object two-dimensional feature dataset.
The input feature map size corresponding to the target object two-dimensional image dataset is N×K×H×W×3, where N is the batch size of the target object two-dimensional image dataset; K is the number of images in each frame (namely, the image dimension), and the images in each frame comprise the upper, lower, left, right, front and rear images of the target object captured by the preset cameras; H is the height in each target object two-dimensional image data; W is the width in each target object two-dimensional image data; and 3 represents the R, G, B color values in the target object two-dimensional image data.
In some embodiments, the server performs a convolution operation on the target object two-dimensional image dataset through the trunk feature extraction network in the preset mesh division model to obtain initial two-dimensional feature data corresponding to each target object two-dimensional image data, where the trunk feature extraction network may be a residual neural network (ResNet), a dense convolution network (DenseNet), or a deep convolution network (VGGNet), which is not specifically limited here. The number of initial two-dimensional feature data corresponding to the target object two-dimensional image data is N, and the feature map size of the initial two-dimensional feature data corresponding to each target object two-dimensional image data is K×H′×W′×C, where the height H′ and the width W′ in the initial two-dimensional feature data may each be a fraction of H and W respectively, e.g., one quarter or one eighth, and the number of channels C may be 128. The server then performs convolution processing on the initial two-dimensional feature data corresponding to each target object two-dimensional image data according to two convolution kernels of preset size to obtain the target object two-dimensional feature dataset, which comprises a two-dimensional key feature matrix and a two-dimensional value feature matrix. The two convolution kernels of preset size may be of the same size, for example 1×1 or 5×5, which is not limited here. That is, the server performs a feature extraction operation on the initial two-dimensional feature data corresponding to each target object two-dimensional image data according to the two convolution kernels of preset size to obtain the target object two-dimensional feature dataset, which comprises a two-dimensional key feature matrix key and a two-dimensional value feature matrix value; the sizes corresponding to key and value are both N×K×H′×W′×C′, where N is consistent with the batch size of the target object two-dimensional image dataset and C′ may be 128.
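The key/value extraction can be sketched as two parallel 1×1 convolutions over the backbone output; the sizes follow the C=C′=128 example above, and everything else is illustrative.

```python
import torch
import torch.nn as nn

C, C_prime = 128, 128                     # channels in/out, per the example
to_key = nn.Conv2d(C, C_prime, kernel_size=1)    # first preset-size kernel
to_value = nn.Conv2d(C, C_prime, kernel_size=1)  # second preset-size kernel

feat = torch.rand(2 * 6, C, 112, 112)     # (N*K, C, H', W') backbone output
key = to_key(feat)                        # two-dimensional key feature matrix
value = to_value(feat)                    # two-dimensional value feature matrix
```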
203. Perform grid division and feature processing on the target object three-dimensional space image data to obtain the target object three-dimensional space feature data.
The target object three-dimensional space feature data comprise a plurality of space lattice points. In some embodiments, the server performs mesh division on the target object three-dimensional space image data to obtain target object three-dimensional space grid data, where the feature map size corresponding to the target object three-dimensional space grid data is N×L×M×Y×3, in which N is the number of grids corresponding to the target object three-dimensional space grid data, L is the length corresponding to the target object three-dimensional space grid data (e.g., L is 800), M is the width (e.g., M is 800), Y is the height (e.g., Y is 4 or 8), and 3 indicates that each space lattice point in the target object three-dimensional space grid data is represented by three space coordinates (x, y, z). The server then expands the feature dimension corresponding to the target object three-dimensional space grid data to the target feature dimension according to a preset convolution kernel to obtain the target object three-dimensional space feature data, which are represented as a query feature matrix, denoted query. The preset convolution kernel may be 1×1 or 3×3, which is not limited here. The feature dimension corresponding to the target object three-dimensional space grid data is 3, and the feature dimension corresponding to the target object three-dimensional space feature data is consistent with that of the target object two-dimensional feature dataset; that is, the feature map size corresponding to the target object three-dimensional space feature data is N×L×M×Y×C″, where, for example, the feature dimension C″ is 128.
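A sketch of the grid construction and dimension expansion, scaled down from the L=M=800 in the text so it runs cheaply; the metric extents are hypothetical, and a 1×1 three-dimensional convolution lifts each grid point's (x, y, z) coordinates to the target feature dimension C″=128.

```python
import torch
import torch.nn as nn

L, M, Y = 200, 200, 4                    # scaled down from L = M = 800
xs = torch.linspace(-40.0, 40.0, L)      # hypothetical metric extents (m)
ys = torch.linspace(-40.0, 40.0, M)
zs = torch.linspace(-1.0, 3.0, Y)
grid = torch.stack(torch.meshgrid(xs, ys, zs, indexing="ij"), -1)  # (L,M,Y,3)

expand = nn.Conv3d(3, 128, kernel_size=1)         # preset 1x1 kernel: 3 -> C''
query = expand(grid.permute(3, 0, 1, 2)[None])    # (1, 128, L, M, Y)
```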
204. Scatter the two-dimensional key feature matrix and the two-dimensional value feature matrix in the target object two-dimensional feature dataset to the target object three-dimensional space feature data through a scattering operator to obtain a three-dimensional key feature matrix and a three-dimensional value feature matrix.
The scattering operator is scatter_nd, which is used to perform interpolation calculation between the two-dimensional key feature matrix and two-dimensional value feature matrix in the target object two-dimensional feature dataset and the target object three-dimensional space feature data. Specifically, the server extracts the two-dimensional key feature matrix key and the two-dimensional value feature matrix value from the target object two-dimensional feature dataset; the server inserts the two-dimensional key feature matrix key into the target object three-dimensional space feature data query through the scattering operator to obtain the three-dimensional key feature matrix key′; and the server inserts the two-dimensional value feature matrix value into the target object three-dimensional space feature data query through the scattering operator to obtain the three-dimensional value feature matrix value′, where the feature map size corresponding to the three-dimensional key feature matrix key′ and the feature map size corresponding to the three-dimensional value feature matrix value′ are both N×K×L×M×Y×C″.
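In PyTorch terms this scatter_nd step can be read as a gather: for every space grid point, take the key or value feature at the pixel it projects to in each camera. The sketch below assumes precomputed integer pixel indices u, v and a validity mask, all of which are hypothetical names.

```python
import torch

def scatter_to_grid(feat_2d, u, v, valid):
    """feat_2d: (K, C, H, W) key or value features from K cameras;
    u, v: (K, P) pixel indices of P grid points; valid: (K, P) bool mask.
    Returns (K, P, C): per-camera features at each grid point, zeroed
    where the grid point is not visible in that camera."""
    K, C, H, W = feat_2d.shape
    flat = feat_2d.flatten(2)                         # (K, C, H*W)
    idx = (v * W + u).clamp(0, H * W - 1)             # flattened pixel index
    out = flat.gather(2, idx.unsqueeze(1).expand(K, C, -1))  # (K, C, P)
    return out.permute(0, 2, 1) * valid.unsqueeze(-1)
```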
205. Perform feature fusion according to the target object three-dimensional space feature data, the three-dimensional key feature matrix and the three-dimensional value feature matrix to obtain the feature tensor of the three-dimensional top view.
It should be noted that, based on the attention mechanism, the server obtains the weight coefficient corresponding to the three-dimensional key feature matrix key′ by calculating the similarity or correlation between the target object three-dimensional space feature data query and the three-dimensional key feature matrix key′, and then performs a weighted calculation on the three-dimensional value feature matrix value′ based on this weight coefficient to obtain the feature tensor of the three-dimensional top view. In some embodiments, the server performs a similarity calculation on the target object three-dimensional space feature data and the three-dimensional key feature matrix according to an inner product operator to obtain a similarity calculation result; specifically, the server performs a dot product between the target object three-dimensional space feature data and the three-dimensional key feature matrix according to the inner product operator to obtain the similarity calculation result (i.e., the weight coefficient corresponding to the three-dimensional key feature matrix key′), where the size corresponding to the similarity calculation result is N×K×L×M×Y×1, i.e., the feature dimension corresponding to the similarity calculation result is 1. The server multiplies the similarity calculation result by the three-dimensional value feature matrix to obtain the target object three-dimensional space initial feature tensor, where the similarity calculation result serves as the weighting coefficient of the three-dimensional value feature matrix and the feature map size corresponding to the target object three-dimensional space initial feature tensor is N×K×L×M×Y×C″. The server then performs maximum pooling on the target object three-dimensional space initial feature tensor to obtain the feature tensor of the three-dimensional top view. Further, the server performs maximum pooling over the picture dimension and the height dimension in the target object three-dimensional space initial feature tensor through a preset pooling convolution kernel to obtain the feature tensor of the three-dimensional top view, where the feature map size corresponding to the feature tensor of the three-dimensional top view is N×L×M×C″, the size of the preset pooling convolution kernel may be 2×2 or 4×4, which is not specifically limited here, and C″ may be 128.
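A compact sketch of this fusion, assuming query has shape (N, L, M, Y, C″) and the scattered key′/value′ have shape (N, K, L, M, Y, C″); the inner product over the feature axis yields the N×K×L×M×Y×1 similarity, and maximum pooling collapses the picture (K) and height (Y) dimensions.

```python
import torch

def fuse_to_topview(query, key3d, value3d):
    """query: (N, L, M, Y, C); key3d, value3d: (N, K, L, M, Y, C).
    Returns the feature tensor of the three-dimensional top view: (N, L, M, C)."""
    sim = (query.unsqueeze(1) * key3d).sum(-1, keepdim=True)  # (N,K,L,M,Y,1)
    weighted = sim * value3d                                  # (N,K,L,M,Y,C)
    return weighted.amax(dim=(1, 4))                          # pool over K and Y

bev = fuse_to_topview(torch.rand(1, 50, 50, 4, 128),
                      torch.rand(1, 6, 50, 50, 4, 128),
                      torch.rand(1, 6, 50, 50, 4, 128))
print(bev.shape)  # torch.Size([1, 50, 50, 128])
```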
206. Perform detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
The specific implementation process of step 206 is similar to that of step 105, and is not described herein again.
In the embodiment of the invention, the characteristic tensor of the three-dimensional top view is obtained by projecting the image characteristics in the two-dimensional image data set of the target object into the three-dimensional space image data of the target object, and the target object is detected and segmented according to the characteristic tensor of the three-dimensional top view, so that the end-to-end target object detection and segmentation task is realized, the accuracy and efficiency of the target object detection and segmentation are improved, and the code maintenance cost is reduced.
In the above description of the target object detection and segmentation method in the embodiment of the present invention, referring to fig. 3, a target object detection and segmentation apparatus in the embodiment of the present invention is described below, and an embodiment of the target object detection and segmentation apparatus in the embodiment of the present invention includes:
an obtaining module 301, configured to obtain a target object two-dimensional image data set and target object three-dimensional space image data;
an extraction module 302, configured to perform feature extraction on the target object two-dimensional image dataset through a preset mesh division model to obtain a target object two-dimensional feature dataset;
the processing module 303 is configured to perform meshing and feature processing on the target object three-dimensional space image data to obtain target object three-dimensional space feature data;
a mapping fusion module 304, configured to perform feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional spatial feature data of the target object, so as to obtain a feature tensor of a three-dimensional top view;
and the detection and segmentation module 305 is configured to perform detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
In the embodiment of the invention, the characteristic tensor of the three-dimensional top view is obtained by projecting the image characteristics in the two-dimensional image data set of the target object into the three-dimensional space image data of the target object, and the target object is detected and segmented according to the characteristic tensor of the three-dimensional top view, so that the end-to-end target object detection and segmentation task is realized, the accuracy and efficiency of the target object detection and segmentation are improved, and the code maintenance cost is reduced.
Referring to fig. 4, another embodiment of the target object detection and segmentation apparatus according to the embodiment of the present invention includes:
an obtaining module 301, configured to obtain a target object two-dimensional image data set and target object three-dimensional space image data;
an extraction module 302, configured to perform feature extraction on the target object two-dimensional image dataset through a preset mesh division model, so as to obtain a target object two-dimensional feature dataset;
the processing module 303 is configured to perform meshing and feature processing on the target object three-dimensional space image data to obtain target object three-dimensional space feature data;
a mapping fusion module 304, configured to perform feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional spatial feature data of the target object, so as to obtain a feature tensor of a three-dimensional top view;
and a detection and segmentation module 305, configured to perform detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
In a possible implementation, the extracting module 302 may be further specifically configured to:
performing convolution operation on the target object two-dimensional image data set through a trunk feature extraction network in a preset grid division model to obtain initial two-dimensional feature data corresponding to each target object two-dimensional image data;
and respectively carrying out convolution processing on the initial two-dimensional feature data corresponding to the two-dimensional image data of each target object according to two convolution kernels with preset sizes to obtain a two-dimensional feature data set of the target object, wherein the two-dimensional feature data set of the target object comprises a two-dimensional key feature matrix and a two-dimensional value feature matrix.
In a possible implementation manner, the processing module 303 may be further specifically configured to:
performing grid division on the three-dimensional space image data of the target object to obtain three-dimensional space grid data of the target object;
and expanding the characteristic dimension corresponding to the target object three-dimensional space grid data to a target characteristic dimension according to a preset convolution kernel to obtain target object three-dimensional space characteristic data, wherein the target object three-dimensional space characteristic data is expressed as a query characteristic matrix.
In a possible implementation, the mapping fusion module 304 may further include:
a scattering unit 3041, configured to scatter the two-dimensional key feature matrix and the two-dimensional value feature matrix in the target object two-dimensional feature data set to the target object three-dimensional space feature data through a scattering operator, respectively, to obtain a three-dimensional key feature matrix and a three-dimensional value feature matrix;
the fusion unit 3042 is configured to perform feature fusion according to the three-dimensional spatial feature data of the target object, the three-dimensional key feature matrix, and the three-dimensional value feature matrix, to obtain a feature tensor of the three-dimensional top view.
In a possible embodiment, the fusion unit 3042 may be further specifically configured to:
performing similarity calculation on the three-dimensional space characteristic data of the target object and the three-dimensional key characteristic matrix according to an inner product operator to obtain a similarity calculation result;
multiplying the similarity calculation result by the three-dimensional value feature matrix to obtain an initial feature tensor of the three-dimensional space of the target object;
and carrying out maximum pooling on the initial feature tensor of the three-dimensional space of the target object to obtain the feature tensor of the three-dimensional top view.
In a possible implementation, the mapping fusion module 304 may further include:
a calculating unit 3043, configured to calculate, according to the target object two-dimensional feature data set, a depth value and a mean square error corresponding to each target object two-dimensional feature data;
the projection unit 3044 is configured to perform feature projection according to the three-dimensional spatial feature data of the target object, and the depth value and the mean square error corresponding to the two-dimensional feature data of each target object, so as to obtain a feature tensor of the three-dimensional top view.
In a possible implementation manner, the detection and segmentation module 305 is specifically configured to:
performing convolution operation on the feature tensor of the three-dimensional top view through a preset detection segmentation model to obtain a target feature map;
and carrying out detection segmentation processing on the target feature map to obtain a target object detection segmentation result.
In the embodiment of the invention, the characteristic tensor of the three-dimensional top view is obtained by projecting the image characteristics in the two-dimensional image data set of the target object into the three-dimensional space image data of the target object, and the target object is detected and segmented according to the characteristic tensor of the three-dimensional top view, so that the end-to-end target object detection and segmentation task is realized, the accuracy and efficiency of the target object detection and segmentation are improved, and the code maintenance cost is reduced.
Figs. 3 and 4 describe the target object detection and segmentation apparatus in the embodiment of the present invention in detail from the perspective of modularization; the target object detection and segmentation device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a target object detection and segmentation device 500 according to an embodiment of the present invention. The target object detection and segmentation device 500 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage medium 530 may be transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the target object detection and segmentation device 500. Still further, the processor 510 may be arranged to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the target object detection and segmentation device 500.
The target object detection and segmentation device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so forth. It will be appreciated by those skilled in the art that the structure of the target object detection and segmentation device shown in fig. 5 does not constitute a limitation of the device, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the target object detection segmentation method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A target object detection segmentation method, characterized by comprising:
acquiring a target object two-dimensional image data set and target object three-dimensional space image data;
performing feature extraction on the target object two-dimensional image data set through a preset grid division model to obtain a target object two-dimensional feature data set;
carrying out grid division and feature processing on the three-dimensional space image data of the target object to obtain three-dimensional space feature data of the target object;
performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional space feature data of the target object to obtain a feature tensor of a three-dimensional top view;
and carrying out detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
2. The method for detecting and segmenting a target object according to claim 1, wherein the performing feature extraction on the target object two-dimensional image data set through a preset grid division model to obtain a target object two-dimensional feature data set comprises:
performing a convolution operation on the target object two-dimensional image data set through a backbone feature extraction network in a preset grid division model to obtain initial two-dimensional feature data corresponding to each piece of target object two-dimensional image data;
and respectively performing convolution processing on the initial two-dimensional feature data corresponding to each piece of target object two-dimensional image data according to two convolution kernels of preset sizes to obtain the target object two-dimensional feature data set, wherein the target object two-dimensional feature data set comprises a two-dimensional key feature matrix and a two-dimensional value feature matrix.
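As an illustration of claim 2, the sketch below shows one way the key/value extraction could look in PyTorch. The backbone, channel counts, and the 1x1 kernel size are assumptions, since the claim only fixes "two convolution kernels of preset sizes".

```python
import torch
from torch import nn

class KeyValueExtractor(nn.Module):
    """Hypothetical sketch of claim 2: a backbone produces initial 2D features,
    then two separate convolutions produce the key and value matrices."""
    def __init__(self, backbone: nn.Module, feat_ch: int = 256, out_ch: int = 128):
        super().__init__()
        self.backbone = backbone                          # backbone feature extraction network
        self.to_key = nn.Conv2d(feat_ch, out_ch, kernel_size=1)
        self.to_value = nn.Conv2d(feat_ch, out_ch, kernel_size=1)

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)                      # initial 2D feature data
        return self.to_key(feats), self.to_value(feats)   # 2D key / value feature matrices
```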
3. The method for detecting and segmenting the target object according to claim 1, wherein the performing grid division and feature processing on the three-dimensional space image data of the target object to obtain three-dimensional space feature data of the target object comprises:
carrying out grid division on the three-dimensional space image data of the target object to obtain three-dimensional space grid data of the target object;
and expanding the feature dimension corresponding to the target object three-dimensional space grid data to a target feature dimension according to a preset convolution kernel to obtain target object three-dimensional space feature data, wherein the target object three-dimensional space feature data is expressed as a query feature matrix.
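A minimal sketch of claim 3 follows, assuming the raw per-cell features are the cell-centre coordinates and the preset convolution kernel is 1x1x1; both are illustrative choices, not fixed by the claim.

```python
import torch
from torch import nn

class VoxelQueryBuilder(nn.Module):
    """Hypothetical sketch of claim 3: grid-divide 3D space, then expand the
    per-cell feature dimension to the target dimension with a convolution."""
    def __init__(self, in_ch: int = 3, target_dim: int = 256):
        super().__init__()
        self.expand = nn.Conv3d(in_ch, target_dim, kernel_size=1)

    def forward(self, grid_coords: torch.Tensor) -> torch.Tensor:
        # grid_coords: (B, 3, X, Y, Z) cell-centre coordinates of the divided grid
        return self.expand(grid_coords)  # (B, target_dim, X, Y, Z) query feature matrix
```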
4. The method for detecting and segmenting a target object according to claim 1, wherein the performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional spatial feature data of the target object to obtain a feature tensor of a three-dimensional top view includes:
scattering the two-dimensional key feature matrix and the two-dimensional value feature matrix in the target object two-dimensional feature data set to the target object three-dimensional space feature data through a scattering operator to obtain a three-dimensional key feature matrix and a three-dimensional value feature matrix;
and performing feature fusion according to the three-dimensional space feature data of the target object, the three-dimensional key feature matrix and the three-dimensional value feature matrix to obtain a feature tensor of the three-dimensional top view.
5. The method for detecting and segmenting a target object according to claim 4, wherein the performing feature fusion according to the three-dimensional space feature data of the target object, the three-dimensional key feature matrix, and the three-dimensional value feature matrix to obtain a feature tensor of the three-dimensional top view comprises:
performing similarity calculation on the three-dimensional space feature data of the target object and the three-dimensional key feature matrix according to an inner product operator to obtain a similarity calculation result;
multiplying the similarity calculation result by the three-dimensional value feature matrix to obtain an initial feature tensor of the three-dimensional space of the target object;
and carrying out maximum pooling on the initial feature tensor of the three-dimensional space of the target object to obtain the feature tensor of the three-dimensional top view.
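The fusion of claims 4 and 5 can be pictured with the single-view sketch below. The projection indices pix_u/pix_v are assumed given by camera geometry (not shown), the scatter operator is realized here as a gather from each voxel's projected pixel, and all shapes are assumptions.

```python
import torch

def fuse_to_bev(queries: torch.Tensor,
                keys_2d: torch.Tensor, values_2d: torch.Tensor,
                pix_u: torch.Tensor, pix_v: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of claims 4-5 for one camera view.
    queries:            (C, X, Y, Z) 3D spatial query features
    keys_2d, values_2d: (C, H, W)    2D key / value feature matrices
    pix_u, pix_v:       (X, Y, Z)    long tensors giving the image pixel
                                     each voxel projects to."""
    # "Scatter" the 2D features onto the grid: each voxel takes the key/value
    # of its projected pixel.
    keys_3d = keys_2d[:, pix_v, pix_u]                   # (C, X, Y, Z)
    values_3d = values_2d[:, pix_v, pix_u]               # (C, X, Y, Z)

    # Inner-product similarity between query and key, per voxel (claim 5).
    sim = (queries * keys_3d).sum(dim=0, keepdim=True)   # (1, X, Y, Z)

    # Weight the values by similarity, then max-pool over the height axis
    # to obtain the feature tensor of the three-dimensional top view.
    return (sim * values_3d).max(dim=3).values           # (C, X, Y)
```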
6. The method for detecting and segmenting a target object according to claim 1, wherein the performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional spatial feature data of the target object to obtain a feature tensor of a three-dimensional top view includes:
calculating a depth value and a mean square error corresponding to each piece of target object two-dimensional feature data according to the target object two-dimensional feature data set;
and performing feature projection according to the three-dimensional space feature data of the target object and the depth value and mean square error corresponding to the two-dimensional feature data of each target object to obtain a feature tensor of the three-dimensional top view.
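Claim 6 leaves the projection formula open. One plausible reading, sketched below, weights each image feature along its camera ray by a Gaussian built from its predicted depth value and mean square error; the Gaussian form and all shapes are assumptions, not stated in the claim.

```python
import torch

def depth_weighted_projection(feat_2d: torch.Tensor,
                              depth_mu: torch.Tensor,
                              depth_var: torch.Tensor,
                              cell_depths: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of claim 6.
    feat_2d:     (C, H, W) image features
    depth_mu:    (H, W)    predicted depth value per pixel
    depth_var:   (H, W)    predicted mean square error of the depth
    cell_depths: (D,)      depths of the grid cells along each camera ray
    Returns (C, D, H, W) frustum features, to be pooled into the top view."""
    d = cell_depths.view(-1, 1, 1)                                   # (D, 1, 1)
    w = torch.exp(-0.5 * (d - depth_mu) ** 2 / depth_var.clamp(min=1e-6))
    return feat_2d.unsqueeze(1) * w.unsqueeze(0)                     # (C, D, H, W)
```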
7. The method for detecting and segmenting the target object according to any one of claims 1 to 6, wherein the detecting and segmenting according to the feature tensor of the three-dimensional top view to obtain the target object detection segmentation result comprises:
performing convolution operation on the feature tensor of the three-dimensional top view through a preset detection segmentation model to obtain a target feature map;
and performing detection and segmentation processing on the target feature map to obtain a target object detection and segmentation result.
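To make claim 7 concrete, the sketch below applies a shared convolution to the top-view feature tensor and branches into detection and segmentation outputs; the channel counts, kernel sizes, and two-branch layout are assumptions.

```python
import torch
from torch import nn

class DetectSegHead(nn.Module):
    """Hypothetical sketch of claim 7: convolve the top-view feature tensor
    into a target feature map, then decode detections and a segmentation map."""
    def __init__(self, in_ch: int = 256, num_classes: int = 10, box_dim: int = 7):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.det = nn.Conv2d(in_ch, num_classes + box_dim, kernel_size=1)  # boxes + scores
        self.seg = nn.Conv2d(in_ch, num_classes, kernel_size=1)            # per-cell classes

    def forward(self, bev: torch.Tensor):
        fmap = self.shared(bev)          # target feature map
        return self.det(fmap), self.seg(fmap)
```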
8. A target object detection segmentation apparatus, characterized by comprising:
the acquisition module is used for acquiring a target object two-dimensional image data set and target object three-dimensional space image data;
the extraction module is used for performing feature extraction on the target object two-dimensional image data set through a preset grid division model to obtain a target object two-dimensional feature data set;
the processing module is used for performing grid division and feature processing on the three-dimensional space image data of the target object to obtain three-dimensional space feature data of the target object;
the mapping fusion module is used for performing feature mapping fusion on the two-dimensional feature data set of the target object and the three-dimensional space feature data of the target object to obtain a feature tensor of a three-dimensional top view;
and the detection and segmentation module is used for carrying out detection and segmentation according to the feature tensor of the three-dimensional top view to obtain a target object detection and segmentation result.
9. A target object detection segmentation apparatus, characterized by comprising: a memory and at least one processor, the memory having a computer program stored therein;
the at least one processor invokes the computer program in the memory to cause the target object detection segmentation apparatus to perform the target object detection segmentation method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a target object detection segmentation method according to any one of claims 1 to 7.
CN202210554832.2A 2022-05-19 2022-05-19 Target object detection segmentation method, device, equipment and storage medium Active CN115049827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210554832.2A CN115049827B (en) 2022-05-19 2022-05-19 Target object detection segmentation method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN115049827A true CN115049827A (en) 2022-09-13
CN115049827B CN115049827B (en) 2024-06-18

Family

ID=83159987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210554832.2A Active CN115049827B (en) 2022-05-19 2022-05-19 Target object detection segmentation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115049827B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043130A1 (en) * 2018-08-04 2020-02-06 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for scan-matching oriented visual slam
CN113593001A (en) * 2021-02-07 2021-11-02 大连理工大学 Target object three-dimensional reconstruction method and device, computer equipment and storage medium
CN113516685A (en) * 2021-07-09 2021-10-19 东软睿驰汽车技术(沈阳)有限公司 Target tracking method, device, equipment and storage medium
CN113780257A (en) * 2021-11-12 2021-12-10 紫东信息科技(苏州)有限公司 Multi-mode fusion weak supervision vehicle target detection method and system
CN114511679A (en) * 2022-01-27 2022-05-17 广州极飞科技股份有限公司 Point cloud data processing method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095137A (en) * 2023-10-20 2023-11-21 深圳市中安视达科技有限公司 Three-dimensional imaging method and system of medical image based on two-way image acquisition
CN117095137B (en) * 2023-10-20 2023-12-22 深圳市中安视达科技有限公司 Three-dimensional imaging method and system of medical image based on two-way image acquisition

Also Published As

Publication number Publication date
CN115049827B (en) 2024-06-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant