
CN116935356A - Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method - Google Patents

Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method

Info

Publication number
CN116935356A
Authority
CN
China
Prior art keywords
point cloud
network
loss function
pseudo
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310939807.0A
Other languages
Chinese (zh)
Inventor
刘军
姜广峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310939807.0A priority Critical patent/CN116935356A/en
Publication of CN116935356A publication Critical patent/CN116935356A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method, which comprises: processing point cloud data through the 2D frame tags of the picture data to obtain rough point cloud pseudo tag data; processing the rough point cloud pseudo tag data through a pseudo tag generator to obtain trained point cloud pseudo tag data; constructing a multi-modal network from BoxInst and an FSD segmentation network; sending the training data set into the multi-modal network for forward propagation and obtaining the pseudo tag instance segmentation results of the multi-modal data in forward propagation; determining the self-supervision loss function of the picture network branch and the cross-supervision loss function of the multi-modal network from the pseudo tag instance segmentation results, and determining the final loss function; back-propagating the final loss function to update the parameters of the neural network and obtain a trained multi-modal neural network; and performing instance segmentation prediction on the prediction data set through the trained multi-modal neural network to determine the instance targets of the picture and the point cloud.

Description

Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
Technical Field
The invention belongs to the field of automatic driving and the field of deep learning, and particularly relates to an automatic driving multi-mode picture and point cloud instance segmentation method based on weak supervision.
Background
In recent years, the application of automatic driving technology in the market has kept expanding, owing to the continuously falling cost of sensors such as cameras, millimeter wave radars and laser radars, and to the rapid development of deep learning and the related hardware computing power. The autopilot task can be subdivided into a number of subtasks such as target detection, instance segmentation, target tracking and decision making. Instance segmentation plays a key role in these tasks: it not only perceives objects in the space around the vehicle but also provides accurate perception of their appearance and shape. This supplies the basic information for subsequent tracking and decision making and is therefore of vital importance.
Among a plurality of sensors, the laser radar can obtain data in a point cloud format, has the advantages of high resolution, accurate identification, high measurement speed, strong anti-interference capability and the like, but lacks accurate semantic information, and the image data just compensates for the defects; multimodal studies, which are spread around lidar and pictures, are therefore of great interest.
The instance segmentation task aims at assigning each pixel in the image and point cloud to a corresponding object instance and providing an accurate boundary for each object instance; unlike the target detection task, the instance segmentation not only requires identifying different objects in the image, but also performs pixel-level segmentation on each object, i.e., designates a label for each pixel or point to identify which object instance it belongs to; in the instance segmentation task, a common approach is to generate a binary mask, where each pixel or point belongs to either a certain object instance or the background; such a mask may be obtained by pixel-level prediction using deep learning techniques and convolutional neural networks; typically, this task requires training of the model using labeled training data to learn the feature and boundary information of the object instance.
However, 2D and 3D mask annotation of images and point clouds requires a great deal of manpower and financial resources; compared with instance mask and 3D frame annotation, 2D frame annotation is simpler and far more cost-effective. Although this direction has important practical value, research on multi-modal weak supervision methods is limited. On the one hand, there is an inherent difference between point clouds and images: the point cloud mainly carries geometric information, while the image contains semantic and texture information.
Thus, existing 2D weakly supervised approaches are difficult to migrate to the 3D instance segmentation task. On the other hand, point clouds have no suitable weak supervision labels to distinguish instances; although 3D frame labels can be used to train neural networks, their annotation cost is still high. It is therefore necessary to design a weakly supervised multi-modal instance segmentation algorithm that can improve both the 2D and the 3D instance segmentation accuracy.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide an automatic driving picture and point cloud instance segmentation method based on weak supervision.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a weak supervision-based automatic driving picture and point cloud instance segmentation method, which comprises the following steps:
processing the point cloud data through the 2D frame tag of the picture data to obtain rough point cloud pseudo tag data;
processing the rough point cloud pseudo tag data through a pseudo tag generator to obtain trained point cloud pseudo tag data;
constructing a multi-modal network from BoxInst and an FSD segmentation network;
sending the training data set into a multi-mode network for forward propagation, and obtaining a pseudo tag instance segmentation result of multi-mode data in forward propagation;
determining a self-supervision loss function of the picture network branch and a cross-supervision loss function of the multi-modal network through the pseudo tag instance segmentation results;
taking the weighted sum of the BoxInst network loss function supervised by the real tags, the FSD segmentation network loss function supervised by the pseudo tags, the self-supervision loss function and the cross-supervision loss function as the final loss function;
the parameters of the neural network are back propagated through the final loss function, and a trained multi-modal neural network is obtained;
and carrying out instance segmentation prediction on the prediction data set through the trained multi-mode neural network to determine instance targets of the picture and the point cloud.
In the above scheme, the point cloud data are processed through the 2D frame tags of the picture data to obtain rough point cloud pseudo tag data, specifically: given a lidar point cloud P_3d, where N_in represents the number of input points and C_in represents the feature dimension of the input point cloud, the 3D point cloud is projected onto the picture via the sensor calibration matrices of the lidar and the camera, P_2d = Proj(P_3d) = M × T_(l→c) × P_3d, to obtain the mapping relation between picture pixels and points; according to this mapping, points that are not on the picture are removed, points falling inside a 2D frame are taken as foreground points and points outside the 2D frame as background points, and the rough pseudo point cloud tag data are obtained.
In the above scheme, the rough point cloud pseudo tag data are processed by the pseudo tag generator to obtain trained point cloud pseudo tag data, specifically: the depth distances dist of each row m of the lidar point cloud data are clustered to obtain a unique cluster label R; the pseudo tag generator uses R to filter the rough pseudo point cloud tag data, and then clusters the points again by spatial distance to obtain the trained point cloud pseudo tag data.
In the above scheme, the multi-modal network is constructed from BoxInst and the FSD segmentation network, specifically: the multi-modal neural network is constructed from a BoxInst neural network, an FSD segmentation neural network, a pseudo tag generator, a dynamic pseudo mask tag generation module, an anti-noise module and a multi-modal cross-supervision module.
In the above scheme, the training data set is sent into the multi-modal network for forward propagation and the pseudo tag instance segmentation results of the multi-modal data are obtained in forward propagation, specifically: the BoxInst network prediction results are processed by the dynamic pseudo mask tag generation module, and the point cloud prediction results are obtained from an exponential moving average (EMA) copy of the FSD segmentation network.
In the above scheme, the self-supervision loss function of the picture network branch and the cross-supervision loss function of the multi-modal network are determined through the pseudo tag instance segmentation results, specifically: the BoxInst network self-supervision loss function L_pseudo (BCE loss and Dice loss) and the multi-modal cross-supervision loss function L_CS (a 2D cross-supervision term and a 3D cross-supervision term) are determined from the pseudo tag instance segmentation results.
In the above scheme, the weighted sum of the BoxInst network loss function supervised by the real tags, the FSD segmentation network loss function supervised by the pseudo tags, the self-supervision loss function and the cross-supervision loss function is taken as the final loss function, specifically: the BoxInst network loss function L_boxinst (Focal loss, GIoU loss, cross-entropy loss, Dice loss and pairwise loss) is determined from the real 2D frame tags; the FSD segmentation network loss function L_FSD (Focal loss and L1 loss) is determined from the pseudo tags; together with the self-supervision loss function L_pseudo (BCE loss and Dice loss) and the cross-supervision loss function L_CS, the weighted sum of these loss functions is taken as the final loss function.
In the above scheme, the parameters of the neural network are updated by back-propagating the final loss function to obtain the trained multi-modal neural network, specifically: the maximum number of training epochs E_max, the training batch size B, the total number of training samples D and the number of training batches per epoch k are defined; the FSD segmentation network is optimized with the AdamW optimizer, setting an initial learning rate and a single-cycle cosine annealing learning rate strategy; the BoxInst network is optimized with the SGD optimizer with a set learning rate; a batch of training data {P, I} is read in, the network forward propagation is run, the loss function is calculated, and the network parameters are updated by gradient back-propagation; it is judged whether the total number of training steps is divisible by the number k of training batches per epoch: if so, the training data are read from the beginning next time; if not, the training data {P, I} are read in order next time; it is judged whether the total number of training steps equals E_max × k: if so, training ends and the trained multi-modal neural network is obtained; otherwise, training of the multi-modal neural network continues.
In the above scheme, instance segmentation prediction is performed on the prediction data set through the trained multi-modal neural network to determine the instance targets of the picture and the point cloud, specifically: the instance mask probability maps and confidence scores of the BoxInst network are obtained, high-quality instance mask probability maps are screened by their confidence scores, and the final instance masks are generated through a mask probability threshold to determine the picture instance targets; the semantic label prediction and shifted position of each point are obtained through the FSD segmentation network, and all points with the same semantic prediction are clustered by spatial distance to obtain the final point cloud instance targets.
Compared with the prior art, the method combines the capabilities of the two modalities of pictures and point clouds, performs cross supervision, and improves the instance segmentation performance of both BoxInst and the FSD segmentation network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
fig. 1 is a flowchart of an automatic driving picture and point cloud example segmentation method based on weak supervision according to an embodiment of the present invention.
Fig. 2 is a block diagram of a pseudo tag generator in an automatic driving picture and point cloud instance segmentation method based on weak supervision according to an embodiment of the present invention.
Fig. 3 is a block diagram of a multi-mode network in an automatic driving picture and point cloud instance segmentation method based on weak supervision according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an anti-noise module in an automatic driving picture and point cloud example segmentation method based on weak supervision according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a cross supervision module in an automatic driving picture and point cloud instance segmentation method based on weak supervision according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, article or apparatus that comprises the element.
The embodiment of the invention provides a method for segmenting an automatic driving multi-mode picture and a point cloud instance based on weak supervision, which is shown in fig. 1, and comprises the following steps:
step 101: processing the point cloud data through the 2D frame tag of the picture data to obtain rough point cloud pseudo tag data
Specifically, given a lidar point cloud P_3d, where N_in represents the number of input points and C_in represents the feature dimension of the input point cloud, the 3D point cloud is projected onto the picture via the sensor calibration matrices of the lidar and the camera, P_2d = Proj(P_3d) = M × T_(l→c) × P_3d, to obtain the mapping relation between picture pixels and points.
Points that are not on the picture are removed; points falling inside a 2D frame are taken as foreground points and points outside the 2D frame as background points, giving the rough pseudo point cloud tag data.
For example, let the input frame of point cloud coordinates be P_3d, the camera extrinsic matrix be T_(l→c) and the camera intrinsic matrix be M; the coordinates P_2d = (u, v) of the points projected onto the picture are obtained by the projection formula above. Points that do not fall on the picture are removed, and the positional relationship between the coordinates P_2d and the 2D frame label (x_min, y_min, x_max, y_max) is then judged: if x_min ≤ u ≤ x_max and y_min ≤ v ≤ y_max, the current point belongs to the foreground (instance) points, otherwise it belongs to the background points, and the coarse pseudo tag point cloud data are obtained.
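A minimal Python sketch of this projection and in-frame labeling step is given below. The helper name coarse_point_labels, the array shapes and the handling of points behind the camera are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def coarse_point_labels(points_3d, T_l2c, M, box_2d, img_hw):
    """Project lidar points into the picture and mark points inside the 2D frame
    as foreground (1), other on-image points as background (0), and points off
    the image as ignored (-1). Assumed shapes: points_3d (N, 3), T_l2c (4, 4)
    lidar-to-camera extrinsics, M (3, 3) camera intrinsics."""
    n = points_3d.shape[0]
    homo = np.concatenate([points_3d, np.ones((n, 1))], axis=1)   # (N, 4)
    cam = (T_l2c @ homo.T).T[:, :3]                               # lidar -> camera frame
    uvw = (M @ cam.T).T                                           # camera -> pixel plane
    u = uvw[:, 0] / np.maximum(uvw[:, 2], 1e-6)
    v = uvw[:, 1] / np.maximum(uvw[:, 2], 1e-6)
    h, w = img_hw
    labels = np.full(n, -1, dtype=np.int64)
    on_img = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    x_min, y_min, x_max, y_max = box_2d
    in_box = (u >= x_min) & (u <= x_max) & (v >= y_min) & (v <= y_max)
    labels[on_img & in_box] = 1      # coarse foreground (instance) points
    labels[on_img & ~in_box] = 0     # background points
    return labels, np.stack([u, v], axis=1)
```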
Step 102: processing the rough point cloud pseudo tag data through the pseudo tag generator to obtain trained point cloud pseudo tag data
Specifically, as shown in fig. 2, for a lidar having m beams with n measurements in one scan period, the return values of one scan form an m × n matrix.
The depth distances dist of each row m of the lidar point cloud data are clustered, obtaining a unique cluster label R.
The pseudo tag generator uses the cluster label R to filter the coarse pseudo point cloud tag data, and then clusters the points again by spatial distance to obtain the trained point cloud pseudo tag data.
For example, the depth distances dist of the m-th row of the point cloud P are clustered; the depth of the n-th point in row m is d_(m,n), the depth of the (n-1)-th point in row m is d_(m,n-1) with cluster label r, and the distance threshold is d_threshold = 0.24 m. If |d_(m,n) - d_(m,n-1)| ≤ d_threshold, the two points share the same cluster label r; otherwise the current point p_(m,n) receives cluster label r+1. Traversing all rows yields the final cluster label R.
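The row-wise depth clustering rule can be sketched as follows; the matrix layout (one row per beam, NaN for missing returns) and the use of globally unique labels across rows are assumptions made for illustration.

```python
import numpy as np

def beam_depth_clusters(depth_matrix, d_threshold=0.24):
    """Cluster consecutive returns along each lidar beam row: a depth jump
    larger than d_threshold starts a new cluster. depth_matrix has shape
    (m_beams, n_returns); NaN marks missing returns (assumption)."""
    m, n = depth_matrix.shape
    labels = np.full((m, n), -1, dtype=np.int64)
    next_label = 0
    for row in range(m):
        prev_depth = None
        for col in range(n):
            d = depth_matrix[row, col]
            if np.isnan(d):
                continue
            if prev_depth is None or abs(d - prev_depth) > d_threshold:
                next_label += 1            # start a new cluster (label r+1)
            labels[row, col] = next_label
            prev_depth = d
    return labels                          # cluster label R for every return
```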
The coarse pseudo labels are filtered with the cluster label R: for each cluster label r, the foreground (instance) points and background points under the current r are counted; if their quantity ratio satisfies the required threshold, these points P_r are foreground points; otherwise they are background points.
CCL spatial clustering is then performed on the filtered foreground points P_fg: if the Euclidean distance ||P_(3d,i) - P_(3d,j)|| between two points is less than the distance threshold d_ccl, they are connected.
The largest connected component in this graph is taken as the instance foreground points, and the remaining points are treated as background points, yielding the final point cloud pseudo tag data.
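A possible implementation of this CCL spatial clustering step, using SciPy's connected-components routine, is sketched below; keeping only the largest component follows the text, while the helper name and the KD-tree-based edge construction are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def ccl_instance_points(fg_points, d_ccl=0.6):
    """Connect foreground points closer than d_ccl and keep the largest
    connected component as the instance; the rest fall back to background.
    fg_points: (n, 3) filtered foreground points; d_ccl is class-dependent."""
    n = fg_points.shape[0]
    pairs = cKDTree(fg_points).query_pairs(r=d_ccl, output_type='ndarray')
    rows = np.concatenate([pairs[:, 0], pairs[:, 1]])
    cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, comp = connected_components(graph, directed=False)
    largest = np.argmax(np.bincount(comp))
    return comp == largest        # boolean mask of instance foreground points
```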
In some embodiments, most current multi-modal algorithms are implemented on the Waymo dataset, so the present invention uses the Waymo dataset as the dataset for weakly supervised multi-modal instance segmentation. The Waymo dataset is a public dataset collected and released by Waymo, the autonomous driving company under Alphabet (Google's parent), covering a wide variety of environments, from dense urban centers to suburban landscapes, with data collected in daytime and at night, at dawn and dusk, and in sunny and rainy weather. The sensors used include five lidars and five RGB cameras; the lidar and camera data from 1000 segments (each segment 20 s long) are widely used for image segmentation, object detection, object tracking and other tasks.
The detection categories in the 2D frame initial tags include four classes: vehicles, pedestrians, riders and signs, each with a unique tracking ID and a labeling-difficulty tag.
Because the labels of the Waymo point cloud segmentation and panoramic segmentation datasets differ, the 2D instance segmentation results are trained and evaluated only on the panoramic segmentation data, while the 3D instance segmentation results are trained and evaluated on the point cloud segmentation data. Specifically, for the 2D instance segmentation task, the training set contains 61480 images and the validation set contains 9405 images; for the 3D segmentation task, the dataset contains 23691 frames for training and 5976 frames for validation. The invention evaluates three categories: vehicle, pedestrian and rider.
The batch size is set to 8 and training data {P, I} are read in. In the Waymo lidar coordinate system, the detection space ranges over [-80, 80) m in the X direction, [-80, 80) m in the Y direction and [-2, 4] m in the Z direction. The feature dimension of the point cloud data used is C_in = 7, comprising the point X, Y and Z coordinates, the signal reflection intensity, the depth distance dist, the beam index m and the measurement index n; each frame of point cloud corresponds to five camera pictures, so the actual picture batch size is 40.
Step 103: constructing a multi-modal network from BoxInst and the FSD segmentation network;
specifically, as shown in fig. 3, the multi-modal neural network is constructed from a BoxInst neural network, an FSD segmentation neural network, a pseudo tag generator, a dynamic pseudo mask tag generation module, an anti-noise module and a multi-modal cross-supervision module.
The picture and point cloud branches use BoxInst and the FSD segmentation network respectively, without changing any network structure or parameters.
A pseudo tag generator is constructed as shown in fig. 2, setting the depth clustering threshold d_threshold = 0.24 m and the CCL clustering threshold d_ccl, where d_ccl differs for instances of different classes;
for example, vehicle class d_ccl = 0.6 m, pedestrian class d_ccl = 0.1 m, rider class d_ccl = 0.15 m.
A dynamic pseudo mask tag generation module is constructed to calculate the IoU between the 2D frames predicted by BoxInst and the ground-truth 2D frames, and then weight the matched instance mask predictions using the IoU and the predicted frame scores. The formula is defined as follows:
where the weighted predictive probability map corresponds to the i-th ground-truth frame, M_(i,j) represents the j-th instance prediction matched to the i-th ground-truth frame, k represents the centrality weight, and s_(i,j) represents the confidence score of the prediction frame of the j-th instance matched to the i-th ground-truth frame. After the weighted probability map is obtained, two thresholds τ_low and τ_high are set to obtain the dynamic pseudo mask corresponding to the i-th instance.
For example, consider one vehicle object O whose corresponding 2D frame ground truth is B, to which BoxInst assigns N predicted instance masks M_(i,j). With k = 1, the weights w_(i,j) of the N instance masks with respect to the vehicle object O are computed, and the masks are weighted and summed to obtain the predictive probability map of the vehicle object O. Then, taking τ_low = 0.3 and τ_high = 0.7, values greater than τ_high are set to 1, values less than τ_low are set to 0, and the rest are ignored, giving the dynamic pseudo mask.
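The dynamic pseudo-mask step could be sketched as below. Because the patent's weighting formula is not reproduced here, the IoU^k · score weighting and its normalisation are assumptions, as are the array shapes and the value 255 used for ignored pixels.

```python
import numpy as np

def dynamic_pseudo_mask(pred_masks, ious, scores, k=1.0, tau_low=0.3, tau_high=0.7):
    """Weight the N predicted instance masks matched to one ground-truth 2D frame,
    sum them into a probability map, then double-threshold into {1, 0, ignore}.
    pred_masks: (N, H, W) probabilities; ious, scores: (N,)."""
    w = (ious ** k) * scores
    w = w / (w.sum() + 1e-6)                           # assumed normalisation
    prob = np.tensordot(w, pred_masks, axes=1)         # (H, W) weighted probability map
    pseudo = np.full(prob.shape, 255, dtype=np.uint8)  # 255 = ignored pixels
    pseudo[prob > tau_high] = 1                        # confident foreground
    pseudo[prob < tau_low] = 0                         # confident background
    return prob, pseudo
```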
An anti-noise module is constructed, as shown in fig. 4: an EMA history prediction matrix H is created with dimensions N_frame × N_his × N_points. When the number of training epochs reaches the set epoch E_g, the currently input point cloud queries its corresponding historical prediction results in H and performs voting; the voting result is then used to modify the labels produced by the pseudo tag generator, yielding the final anti-noise pseudo tags.
For example, set E_g = 12 and N_his = 4, with the current training epoch being 13. For the currently input point cloud pseudo tag data, the historical prediction results of the previous four epochs (namely 9, 10, 11 and 12) are voted: if the most frequent class occurs at least 3 times, the voting result is that class, otherwise it is the ignored class -1. The voting result is then used to modify the pseudo tags, giving the final anti-noise pseudo tags.
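A sketch of this history-voting rule is given below, assuming the per-point class predictions of the last N_his epochs are stored row-wise; the function name and array layout are illustrative.

```python
import numpy as np

def anti_noise_vote(history, min_agree=3, ignore_label=-1):
    """history: (N_his, N_points) class predictions from the last N_his epochs.
    A point keeps a class only if it was predicted at least min_agree times;
    otherwise it becomes the ignored class. The result is then used to
    overwrite the labels from the pseudo tag generator."""
    n_his, n_points = history.shape
    voted = np.full(n_points, ignore_label, dtype=np.int64)
    for p in range(n_points):
        classes, counts = np.unique(history[:, p], return_counts=True)
        best = np.argmax(counts)
        if classes[best] != ignore_label and counts[best] >= min_agree:
            voted[p] = classes[best]
    return voted
```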
A multi-modal cross-supervision module is constructed, as shown in fig. 5, using the predictive probability map M_ema obtained by the dynamic pseudo mask tag generation module to supervise the instance masks predicted by the FSD segmentation network, and using the instance pseudo masks of the 3D EMA predictions to supervise the instance masks predicted by BoxInst.
Step 104: the training data set is sent into the multi-modal network for forward propagation, obtaining the pseudo tag instance segmentation results of the multi-modal data in forward propagation: the 2D instance masks, the 2D predictive probability map M_ema and the 3D instance masks.
Step 105: determining the self-supervision loss function of the picture network branch and the cross-supervision loss function of the multi-modal network from the pseudo tag instance segmentation results, and taking the weighted sum of the BoxInst network loss function supervised by the real tags, the FSD segmentation network loss function supervised by the pseudo tags, the self-supervision loss function and the cross-supervision loss function as the final loss function;
specifically: the BoxInst network loss function L_boxinst (Focal loss, GIoU loss, cross-entropy loss, Dice loss and pairwise loss) is determined from the real 2D frame tags, and the FSD segmentation network loss function L_FSD (Focal loss and L1 loss) is determined from the pseudo tags; together with the self-supervision loss function L_pseudo (BCE loss and Dice loss) and the cross-supervision loss function L_CS, the weighted sum of these loss functions is taken as the final loss function.
for example, an instance mask probability map for BoxInst network predictionDynamic pseudo-mask 2D instance maskDynamic 2D predictive probability map M ema An instance mask for 3D network prediction is +.>3D EMA prediction instance mask is
Defining a self-supervising loss function as:
the cross-overseeing loss function is defined as:
The BoxInst network loss function L_boxinst and the FSD segmentation network loss function L_FSD remain unchanged;
the final loss function is obtained by weighted summation:
L_total = L_2d + L_3d    (7)
where α_1 to α_6 are set to 1.0, 0.5, 100.0, 1.0 and 2.0 to balance the loss terms.
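A minimal sketch of how the terms of equation (7) might be assembled is given below; the grouping of the listed weights over the individual loss terms is an assumption, since the elided formulas are not reproduced in the text above.

```python
def total_loss(l_boxinst, l_pseudo, l_cs_2d, l_fsd, l_cs_3d,
               alphas=(1.0, 0.5, 100.0, 1.0, 2.0)):
    """Weighted sum of the picture-branch and point-cloud-branch losses;
    the assignment of the alphas to the terms is illustrative only."""
    a1, a2, a3, a4, a5 = alphas
    l_2d = a1 * l_boxinst + a2 * l_pseudo + a3 * l_cs_2d   # picture branch L_2d
    l_3d = a4 * l_fsd + a5 * l_cs_3d                       # point cloud branch L_3d
    return l_2d + l_3d                                     # L_total = L_2d + L_3d
```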
Step 106: the parameters of the neural network are back propagated through the final loss function, and a trained multi-modal neural network is obtained;
specifically, the maximum number of training epochs E_max, the training batch size B, the total number of training samples D and the number of training batches per epoch k are defined. The FSD segmentation network is optimized with the AdamW optimizer, setting an initial learning rate and a single-cycle cosine annealing learning rate strategy; the BoxInst network is optimized with the SGD optimizer with a set learning rate. A batch of training data {P, I} is read in, the network forward propagation is run, the loss function is calculated, and the network parameters are updated by gradient back-propagation. It is judged whether the total number of training steps is divisible by the number k of training batches per epoch: if so, the training data are read from the beginning next time; if not, the training data {P, I} are read in order next time. It is judged whether the total number of training steps equals E_max × k: if so, training ends and the trained multi-modal neural network is obtained; otherwise, training of the multi-modal neural network continues.
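A rough PyTorch sketch of this schedule follows; the learning-rate values, the branch attribute names (fsd_branch, boxinst_branch) and the assumption that the network directly returns the final loss are all illustrative, not taken from the patent.

```python
import torch

def train(multimodal_net, loader, e_max, lr_fsd=1e-3, lr_boxinst=1e-2):
    """AdamW + single-cycle cosine annealing for the FSD segmentation branch,
    SGD with a fixed learning rate for the BoxInst branch."""
    opt_fsd = torch.optim.AdamW(multimodal_net.fsd_branch.parameters(), lr=lr_fsd)
    opt_box = torch.optim.SGD(multimodal_net.boxinst_branch.parameters(), lr=lr_boxinst)
    k = len(loader)                                            # batches per epoch
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt_fsd, T_max=e_max * k)
    for epoch in range(e_max):
        for points, images, targets in loader:                 # {P, I} plus labels
            loss = multimodal_net(points, images, targets)     # forward pass -> final loss
            opt_fsd.zero_grad()
            opt_box.zero_grad()
            loss.backward()                                    # gradient back-propagation
            opt_fsd.step()
            opt_box.step()
            sched.step()
    return multimodal_net
```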
Step 107: and carrying out instance segmentation prediction on the prediction data set through the trained multi-mode neural network to determine instance targets of the picture and the point cloud.
Specifically, for the BoxInst neural network, the predicted 2D frames are obtained first, then the corresponding mask probability maps; with the probability threshold set to 0.5, values greater than the threshold become 1 and the rest become 0, giving the corresponding 2D instance mask prediction results. For the FSD segmentation network, the class with the maximum prediction probability is taken as the class of each point; each point is then shifted by its predicted offset and CCL clustering is performed to obtain the final 3D instance targets.
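The inference post-processing can be sketched as follows. The handling of the background class, the single d_ccl value and the exact offset usage are simplifying assumptions; in the patent d_ccl is class-dependent.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def postprocess_2d(mask_probs, prob_thresh=0.5):
    """Binarise each predicted 2D instance mask probability map."""
    return (mask_probs > prob_thresh).astype(np.uint8)

def postprocess_3d(point_logits, point_offsets, points_xyz, d_ccl=0.6):
    """Arg-max class per point, shift points by the predicted offsets, then
    CCL-cluster same-class points into instances (background filtering omitted)."""
    sem = point_logits.argmax(axis=1)           # semantic label per point
    shifted = points_xyz + point_offsets        # predicted shifted positions
    instances = np.full(sem.shape[0], -1, dtype=np.int64)
    next_id = 0
    for cls in np.unique(sem):
        idx = np.where(sem == cls)[0]
        pairs = cKDTree(shifted[idx]).query_pairs(r=d_ccl, output_type='ndarray')
        rows = np.concatenate([pairs[:, 0], pairs[:, 1]])
        cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
        graph = csr_matrix((np.ones(len(rows)), (rows, cols)),
                           shape=(len(idx), len(idx)))
        _, comp = connected_components(graph, directed=False)
        instances[idx] = comp + next_id
        next_id += comp.max() + 1
    return sem, instances
```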
According to the method, 2D frames are introduced into the point cloud segmentation task; the depth and geometric prior information of the point cloud is mined and used to continuously refine the labels of the points inside the 2D frames, reducing the labeling burden.
The dynamic pseudo mask label generation module and the anti-noise module respectively introduce the current prediction information and the historical prediction information to correct the label, so that the anti-noise capability of the neural network is improved.
The multi-mode cross supervision module provided by the invention realizes multi-mode independent training and joint optimization, and further improves the segmentation performance of each mode network.
This scheme was finally tested for instance segmentation on the Waymo validation set: the 2D instance segmentation AP for vehicles, pedestrians and riders is 49.72%, 30.18% and 33.32%, respectively; the 3D instance segmentation AP for vehicles, pedestrians and riders is 62.09%, 48.86% and 36.27%, respectively; and the 3D semantic segmentation IoU for vehicles, pedestrians and riders is 89.94%, 82.31% and 64.12%, respectively.
Experimental data:
determining a verification data set and an evaluation index;
the invention uses the verification set in the Waymo dataset as verification data, and adopts COCO official picture instance segmentation evaluation index Average Precision (AP) and standard point cloud semantic segmentation index IoU. The above-mentioned APs and IoU show detection accuracy of a certain class, and are used for evaluating the detection performance of a model on a single class, and the higher the values of the APs and IoU are, the higher the detection accuracy is, and the higher the practical value is.
For the AP index, it is defined as the area under the precision-recall curve, AP = ∫ P(R) dR over R from 0 to 1, where P is the precision and R is the recall. TP denotes a positive sample predicted as positive, FP a negative sample predicted as positive, and FN a positive sample predicted as negative; the mathematical definitions of precision and recall are P = TP / (TP + FP) and R = TP / (TP + FN).
for each sample, the Precision and the Recall of the sample can be obtained, and the AP result can be calculated by plotting the results of a plurality of samples into a Precision-Recall curve after the results are arranged in descending order of confidence. In order to reduce the calculation amount, the interval [0,1] can be divided into L-1 parts by using L equally dividing points, the AP value is calculated in a similar infinitesimal way, and L=11 is selected to calculate the final AP index.
In particular, AP50 and AP75 in COCO count a prediction as a TP when its IoU exceeds 0.5 or 0.75, respectively; AP refers to [0.50:0.05:0.95], i.e., the IoU threshold is set to 0.50, 0.55, 0.60, …, 0.95, ten APs are calculated, and they are averaged to obtain the AP.
For each class, the IoU index is defined as IoU = TP / (TP + FP + FN).
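For reference, the per-class IoU and the 11-point interpolated AP described above can be computed as in the following sketch of the standard definitions (not patent-specific code).

```python
import numpy as np

def iou_per_class(pred, gt, cls):
    """Semantic-segmentation IoU for one class: TP / (TP + FP + FN)."""
    tp = np.sum((pred == cls) & (gt == cls))
    fp = np.sum((pred == cls) & (gt != cls))
    fn = np.sum((pred != cls) & (gt == cls))
    return tp / (tp + fp + fn + 1e-9)

def ap_11_point(precisions, recalls):
    """11-point interpolated AP: mean of the maximum precision at recall
    levels 0.0, 0.1, ..., 1.0 (predictions already sorted by confidence)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0
```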
to illustrate the utility of the present invention, three comparative experiments were set up for 2D instance segmentation. Experimental group a was trained and tested using the condlnst algorithm; the experimental group B is trained and tested by using a BoxInst algorithm; experiment group C was trained and tested using the inventive algorithm described above. All experiments were performed on the same machine, training 24Epoch. The test subjects selected three categories, vehicle, pedestrian and rider, gave the following results, as shown in table 1:
TABLE 1 comparative test results
Compared with experimental group B, which is likewise supervised with 2D boxes, experimental group C improves mAP by 3.13%, vehicle AP by 1.24%, pedestrian AP by 2.45% and rider AP by 5.63%. Compared with group A, under the AP50 metric the weakly supervised group C reaches 96.57% of the fully supervised mAP, 95.10% for vehicles, 115.36% for pedestrians (a 9.07% improvement) and 106.18% for riders (a 3.1% improvement).
Comparative experiments were also set up for 3D instance segmentation and semantic segmentation. Experimental group A was trained and tested using the fully supervised FSD segmentation algorithm; experimental group B was trained and tested using the FSD segmentation algorithm supervised with 3D frames; experimental group C was trained and tested using the algorithm of the invention described above; experimental group D was trained and tested on the entire Waymo training set using the algorithm of the invention described above. All experiments were performed on the same machine and trained for 24 epochs. Three categories (vehicle, pedestrian and rider) were evaluated, with the results shown in Table 2:
TABLE 2 comparative test results
Compared with experimental group B, experimental group D improves instance segmentation mAP by 0.78% and semantic segmentation mIoU by 6.09%; compared with the fully supervised experimental group, rider instance segmentation and semantic segmentation of group D are improved by 0.98% and 2.61%, respectively. Compared with experimental group C, group D raises mAP and mIoU by 3.31% and 2.49%, respectively.
The above results indicate that experimental group C in Table 1 and experimental groups C and D in Table 2 achieve comparable or higher performance on small targets (pedestrians and riders) than algorithms supervised with the more labeling-intensive instance masks and 3D frames, demonstrating the advantage of the invention under weak-label (2D frame) supervision.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention.

Claims (9)

1. An automatic driving multi-mode picture and point cloud instance segmentation method based on weak supervision is characterized by comprising the following steps:
processing the point cloud data through the 2D frame tag of the picture data to obtain rough point cloud pseudo tag data;
processing the rough point cloud pseudo tag data through a pseudo tag generator to obtain trained point cloud pseudo tag data;
constructing a multi-modal network from BoxInst and an FSD segmentation network;
sending the training data set into a multi-mode network for forward propagation, and obtaining a pseudo tag instance segmentation result of multi-mode data in forward propagation;
determining a self-supervision loss function of the picture network branch and a cross-supervision loss function of the multi-modal network through the pseudo tag instance segmentation results;
taking the weighted sum of the BoxInst network loss function supervised by the real tags, the FSD segmentation network loss function supervised by the pseudo tags, the self-supervision loss function and the cross-supervision loss function as the final loss function;
the parameters of the neural network are back propagated through the final loss function, and a trained multi-modal neural network is obtained;
and carrying out instance segmentation prediction on the prediction data set through the trained multi-mode neural network to determine instance targets of the picture and the point cloud.
2. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 1, wherein processing the point cloud data through the 2D frame tags of the picture data to obtain rough point cloud pseudo tag data specifically comprises: given a lidar point cloud P_3d, where N_in represents the number of input points and C_in represents the feature dimension of the input point cloud, projecting the 3D point cloud onto the picture via the sensor calibration matrices of the lidar and the camera, P_2d = Proj(P_3d) = M × T_(l→c) × P_3d, to obtain the mapping relation between picture pixels and points; removing points that are not on the picture according to the mapping relation, taking points falling inside a 2D frame as foreground points and points outside the 2D frame as background points, and obtaining the rough pseudo point cloud tag data.
3. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 1 or 2, wherein processing the rough point cloud pseudo tag data through the pseudo tag generator to obtain trained point cloud pseudo tag data specifically comprises: clustering the depth distances dist of each row m of the lidar point cloud data to obtain a unique cluster label R; the pseudo tag generator uses R to filter the rough pseudo point cloud tag data and then clusters the points again by spatial distance to obtain the trained point cloud pseudo tag data.
4. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 3, wherein constructing the multi-modal network from BoxInst and the FSD segmentation network specifically comprises: constructing the multi-modal neural network from a BoxInst neural network, an FSD segmentation neural network, a pseudo tag generator, a dynamic pseudo mask tag generation module, an anti-noise module and a multi-modal cross-supervision module.
5. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 4, wherein sending the training data set into the multi-modal network for forward propagation and obtaining the pseudo tag instance segmentation results of the multi-modal data in forward propagation specifically comprises: processing the BoxInst network prediction results through the dynamic pseudo mask tag generation module, and obtaining the point cloud prediction results from an exponential moving average (EMA) copy of the FSD segmentation network.
6. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 5, wherein determining the self-supervision loss function of the picture network branch and the cross-supervision loss function of the multi-modal network through the pseudo tag instance segmentation results specifically comprises: determining the BoxInst network self-supervision loss function L_pseudo (BCE loss and Dice loss) and the multi-modal cross-supervision loss function L_CS (a 2D cross-supervision term and a 3D cross-supervision term) from the pseudo tag instance segmentation results.
7. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 6, wherein taking the weighted sum of the BoxInst network loss function supervised by the real tags, the FSD segmentation network loss function supervised by the pseudo tags, the self-supervision loss function and the cross-supervision loss function as the final loss function specifically comprises: determining the BoxInst network loss function L_boxinst (Focal loss, GIoU loss, cross-entropy loss, Dice loss and pairwise loss) from the real 2D frame tags; determining the FSD segmentation network loss function L_FSD (Focal loss and L1 loss) from the pseudo tags; and, together with the self-supervision loss function L_pseudo (BCE loss and Dice loss) and the cross-supervision loss function L_CS, taking the weighted sum of these loss functions as the final loss function.
8. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 7, wherein back-propagating the final loss function to update the parameters of the neural network and obtain the trained multi-modal neural network specifically comprises: defining the maximum number of training epochs E_max, the training batch size B, the total number of training samples D and the number of training batches per epoch k; optimizing the FSD segmentation network with the AdamW optimizer, setting an initial learning rate and a single-cycle cosine annealing learning rate strategy; optimizing the BoxInst network with the SGD optimizer and setting its learning rate; reading in a batch of training data {P, I}, running the network forward propagation, calculating the loss function, and updating the network parameters by gradient back-propagation; judging whether the total number of training steps is divisible by the number k of training batches per epoch: if so, reading the training data from the beginning next time; if not, reading the training data {P, I} in order next time; judging whether the total number of training steps equals E_max × k: if so, ending training to obtain the trained multi-modal neural network; otherwise, continuing to train the multi-modal neural network.
9. The weak supervision-based automatic driving multi-modal picture and point cloud instance segmentation method according to claim 8, wherein performing instance segmentation prediction on the prediction data set through the trained multi-modal neural network to determine the instance targets of the picture and the point cloud specifically comprises: obtaining the instance mask probability maps and confidence scores of the BoxInst network, screening high-quality instance mask probability maps by confidence score, and generating the final instance masks through a mask probability threshold to determine the picture instance targets; and obtaining the semantic label prediction and shifted position of each point through the FSD segmentation network, and clustering all points with the same semantic prediction by spatial distance to obtain the final point cloud instance targets.
CN202310939807.0A 2023-07-28 2023-07-28 Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method Pending CN116935356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310939807.0A CN116935356A (en) 2023-07-28 2023-07-28 Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310939807.0A CN116935356A (en) 2023-07-28 2023-07-28 Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method

Publications (1)

Publication Number Publication Date
CN116935356A true CN116935356A (en) 2023-10-24

Family

ID=88386013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310939807.0A Pending CN116935356A (en) 2023-07-28 2023-07-28 Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method

Country Status (1)

Country Link
CN (1) CN116935356A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953224A (en) * 2024-03-27 2024-04-30 暗物智能科技(广州)有限公司 Open vocabulary 3D panorama segmentation method and system
CN117953224B (en) * 2024-03-27 2024-07-05 暗物智能科技(广州)有限公司 Open vocabulary 3D panorama segmentation method and system

Similar Documents

Publication Publication Date Title
CN110675418B (en) Target track optimization method based on DS evidence theory
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN111079602A (en) Vehicle fine granularity identification method and device based on multi-scale regional feature constraint
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN111292366B (en) Visual driving ranging algorithm based on deep learning and edge calculation
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN106815323A (en) A kind of cross-domain vision search method based on conspicuousness detection
CN111913177A (en) Method and device for detecting target object and storage medium
CN115909245A (en) Visual multi-task processing method based on deep learning
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
CN117611911A (en) Single-frame infrared dim target detection method based on improved YOLOv7
CN118351410A (en) Multi-mode three-dimensional detection method based on sparse agent attention
CN116503760A (en) Unmanned aerial vehicle cruising detection method based on self-adaptive edge feature semantic segmentation
CN115100741B (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN116580322A (en) Unmanned aerial vehicle infrared small target detection method under ground background
CN111339967A (en) Pedestrian detection method based on multi-view graph convolution network
CN112069997B (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
Nacir et al. YOLO V5 for traffic sign recognition and detection using transfer learning
CN114495050A (en) Multitask integrated detection method for automatic driving forward vision detection
CN109063543B (en) Video vehicle weight recognition method, system and device considering local deformation
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN117132910A (en) Vehicle detection method and device for unmanned aerial vehicle and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination