CN115457257A - Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud - Google Patents

Info

Publication number
CN115457257A
Authority
CN
China
Prior art keywords
point cloud
sampling
point
candidate
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211033470.9A
Other languages
Chinese (zh)
Inventor
王博思
孙棣华
赵敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202211033470.9A
Publication of CN115457257A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08: Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud, comprising the following steps. Step 1: acquire point cloud data. Step 2: determine first sampling points from the point cloud data by down-sampling, and record them as candidate points. Step 3: taking each candidate point as a center, determine its K nearest neighbor points using a KNN algorithm with an abstract feature constraint; denote the set consisting of each candidate point and its K nearest neighbor points as a local region; and extract features from each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector. Step 4: determine whether the number of sampling points has stopped decreasing; if it has not, execute step 5; if it has, execute step 7. Step 5: determine second sampling points from the candidate points by down-sampling. Step 6: take the second sampling points as candidate points and repeat steps 3 and 4. Step 7: determine the local features of each point cloud.

Description

Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud.
Background
As the level of vehicle automation rises, the traditional approaches of using data from a single sensor or simply combining data from multiple sensors can no longer meet detection requirements, and cross-modal data fusion has emerged as a solution. A modality is a form of data. Visual information and point clouds are data in two different modalities; although they exist in different forms, both serve to improve the accuracy of target detection in automatic driving. With the help of technologies such as neural networks and deep learning, emerging cross-modal data fusion can combine not only sensor data of different modalities but also 3D point cloud data of different modalities, which opens new possibilities for the field of target detection.
In practice, according to the level at which sensors of different modalities are fused, current cross-modal data fusion algorithms for automatic driving fall into three categories: data-level fusion, feature-level fusion, and decision-level fusion. Data-level fusion is the lowest-level fusion mode. It greatly enriches the data available about a detection target, and the complementary information from sensors of different modalities compensates for the limited information of any single modality. Feature-level fusion first extracts features from the data of each modality separately, then fuses the extracted features, and finally obtains a detection result from the fused output. Many deep-learning-based fusion methods concatenate or weight the features that a neural network extracts from different sensors, for example AVOD, MV3D, RoarNet, PointFusion, and F-PointNet. Decision-level fusion extracts features from the acquired information and outputs decision information, which each sensor can produce independently; all the decision information is then fused, and the fused result is analyzed to make the final decision.
To address the unordered and unstructured nature of point clouds, PointNet pioneered a scheme that applies deep learning directly to the point cloud to extract features. The PointNet architecture is divided into a classification network and a segmentation network. Although PointNet is simple and efficient, its structure shows that it maps each point from a low dimension to a high dimension with an MLP and then aggregates the features of all the mapped points through a max pooling layer; that is, the whole process always operates either on a single point or on all points. It therefore cannot extract the fine local features of an object well, which degrades segmentation performance.
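The limitation described above can be seen in a minimal NumPy sketch of a PointNet-style encoder. This is an illustration only, not the PointNet reference implementation: the layer sizes and the random weights are hypothetical stand-ins for learned parameters. Every point passes through the same MLP, and a single max over all points produces one global vector, so no intermediate neighborhood structure is ever formed.

```python
import numpy as np

def pointnet_global_feature(points, weights, biases):
    """Shared per-point MLP followed by max pooling over all points.

    points:  (N, 3) array of xyz coordinates.
    weights: list of (d_in, d_out) matrices (hypothetical; normally learned).
    biases:  list of (d_out,) vectors.
    """
    h = points
    for w, b in zip(weights, biases):
        # The same weights are applied to every point independently.
        h = np.maximum(h @ w + b, 0.0)  # ReLU
    # Max pooling is symmetric, so the result does not depend on point order.
    return h.max(axis=0)

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))
ws = [rng.normal(size=(3, 16)), rng.normal(size=(16, 32))]
bs = [np.zeros(16), np.zeros(32)]

feat = pointnet_global_feature(pts, ws, bs)
shuffled = pointnet_global_feature(pts[rng.permutation(128)], ws, bs)
```

Shuffling the input leaves the output unchanged, which is why this design copes with unordered point sets, but the only aggregation step spans the whole cloud.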
Therefore, an accurate feature extraction method is required to extract local features of the point cloud.
Disclosure of Invention
In view of the above, the present invention provides a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud.
The purpose of the invention is realized by the following technical scheme:
the invention provides a cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud, which comprises the following steps:
step 1: acquiring point cloud data of a view cone candidate region, wherein the number of points in the point cloud data is denoted N;
step 2: determining N1 first sampling points from the point cloud data by a down-sampling method, and recording the first sampling points as candidate points;
step 3: taking each candidate point as a center, determining K nearest neighbor points by a KNN algorithm with an abstract feature constraint, denoting the set consisting of each candidate point and its K nearest neighbor points as a local region, and performing feature extraction on each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector;
step 4: judging whether the number of sampling points has stopped decreasing; if not, executing step 5, and if so, executing step 7;
step 5: determining N2 second sampling points from the candidate points by the down-sampling method;
step 6: taking the second sampling points as candidate points, and repeating steps 3 and 4;
step 7: determining the local features of each point cloud in the point cloud data.
Further, the down-sampling method is farthest point sampling.
Further, determining the K nearest neighbor points by the KNN algorithm with the abstract feature constraint comprises selecting the K points that are closest to the candidate point in both three-dimensional space distance and feature space distance.
Further, acquiring the point cloud data of the view cone candidate region includes:
acquiring multi-modal data of a region to be detected, wherein the multi-modal data comprises RGB image data, binocular vision point cloud data, and lidar point cloud data;
extracting image features from the RGB image data with an image detection network, and generating a two-dimensional candidate region with a region candidate network;
generating the view cone candidate region from the binocular vision point cloud data and the two-dimensional candidate region;
and acquiring the point cloud data of the view cone candidate region from the binocular vision point cloud data and the lidar point cloud data.
Further, the image detection network includes Faster R-CNN.
Further, the cross-modal vehicle detection method based on the multilayer local sampling and the three-dimensional view cone point cloud further comprises the following steps: fusing the image features and the local features of each point cloud to generate a three-dimensional detection frame; and carrying out vehicle detection on the area to be detected by utilizing the three-dimensional detection frame.
Further, the loss function for fusing the image features and the local features of each point cloud is:
[equation image BDA0003818365870000031]
wherein N is the number of input fused point clouds, [symbol image BDA0003818365870000032] represents the offset between the corner positions of the real box and the corner positions of the prediction box obtained from the i-th input fused point cloud, [symbol image BDA0003818365870000033] represents the predicted offset between the real box and the anchor box, L_score represents the score function loss, and L_stn represents the spatial transformation regularization loss.
The invention has the beneficial effects that:
according to the invention, the characteristic learning of the network on the point cloud local area is realized through multilayer local sampling and multilayer characteristic extraction, so that the extracted local characteristics of the point cloud are more accurate, the classification precision is improved, and the subsequent target detection is more accurate.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point clouds, according to an embodiment of the application;
FIG. 2 is a schematic diagram of a partial sampling shown in accordance with one embodiment of the present application;
FIG. 3 is a schematic diagram illustrating feature extraction according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a multi-layer local sampling shown in accordance with one embodiment of the present application;
FIG. 5 is a schematic diagram illustrating multi-level feature learning according to one embodiment of the present application;
FIG. 6 is a schematic diagram of local region feature extraction according to one embodiment of the present application;
FIG. 7 is a diagram illustrating a sample selection within a restricted local area according to one embodiment of the present application;
FIG. 8 is a schematic diagram of a cross-modal data feature level fusion network architecture according to an embodiment of the present application;
FIG. 9 is a schematic view of a view cone candidate region according to an embodiment of the present application.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be understood that the preferred embodiments are illustrative of the invention only and are not limiting upon the scope of the invention.
The application provides a cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud. The method extracts the features of each point cloud of the vehicle to be detected through multilayer local sampling and multi-level feature learning, so that the point cloud features are extracted more accurately and the classification accuracy is improved.
FIG. 8 is a schematic diagram of a cross-modal data feature-level fusion network architecture according to an embodiment of the present application. As shown in FIG. 8, the cross-modal feature-level fusion network proposed by the present application takes data of three different modalities as input: RGB images, binocular vision point clouds, and lidar point clouds. The binocular vision point cloud and the lidar point cloud are both point clouds but come from different sensors: the former is generated by a binocular camera and the latter by a lidar, so the two are of different modalities. The RGB image and the binocular vision point cloud come from the same sensor.
In one embodiment, the cross-modal data feature-level fusion network comprises three modules: an image feature extraction module, a view cone region candidate module, and a fused bounding box regression module.
The image feature extraction module may use Faster R-CNN as the image detection network to extract image features, and use a region candidate network (RPN) to generate a two-dimensional candidate region in preparation for the subsequent view cone candidate extraction. Most 3D sensors, particularly real-time depth sensors such as lidar, produce data at a much lower resolution than the RGB images acquired by a camera. Therefore, a widely used two-dimensional target detection network can be applied to extract the two-dimensional target region of the image and to classify the target.
After the two-dimensional candidate region is obtained, it is first lifted to a three-dimensional view cone candidate region using depth information, as shown in FIG. 9, so that the subsequent three-dimensional detection of the target is more accurate. The view cone region candidate module may generate the view cone candidate region by combining the two-dimensional candidate region generated by the RPN with the binocular vision point cloud. The view cone candidate region narrows the detection range in three-dimensional space, which makes the detection result more accurate. The point cloud features of the vehicle in the view cone candidate region are then extracted by a multi-stage feature extraction method, which can be summarized as follows: first, the input point cloud is divided into a number of small, partially overlapping local regions; next, as in a CNN, finer features are extracted from each local region; then the local features are aggregated to obtain higher-level features; finally, the process is repeated until the features of all points in the point cloud are obtained.
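One common way to collect the points that fall inside a view cone is to project each point into the image and keep those that land inside the two-dimensional candidate region. The sketch below assumes a pinhole camera with a known intrinsic matrix K and points already expressed in the camera frame; neither assumption is stated in the application, and the function name is hypothetical.

```python
import numpy as np

def points_in_view_cone(points_cam, K, box2d):
    """Return the points whose image projection lies inside a 2D box.

    points_cam: (N, 3) points in the camera frame, z pointing forward.
    K:          (3, 3) pinhole intrinsic matrix (assumed known).
    box2d:      (u_min, v_min, u_max, v_max) in pixels.
    """
    u_min, v_min, u_max, v_max = box2d
    uvw = points_cam @ K.T
    w = uvw[:, 2]
    in_front = w > 0                      # discard points behind the camera
    safe_w = np.where(in_front, w, 1.0)   # avoid dividing by zero or negatives
    u = uvw[:, 0] / safe_w
    v = uvw[:, 1] / safe_w
    mask = in_front & (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return points_cam[mask]

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 10.0],    # projects to the box center
                [5.0, 0.0, 10.0],    # projects outside the box
                [0.0, 0.0, -5.0]])   # behind the camera
kept = points_in_view_cone(pts, K, (40, 40, 60, 60))
```

Only the first point survives here. The same mask could be applied to both the binocular and the lidar points once they share a coordinate frame.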
The fused bounding box regression module may fuse the extracted image features and point cloud features (e.g., local features of the point cloud) and generate an accurate three-dimensional detection box from coordinate transformations of the view cone point cloud. And then, the three-dimensional detection frame can be utilized to detect the vehicle in the area to be detected.
When vehicle detection is carried out, multi-modal data of the region to be detected is first acquired, the multi-modal data comprising RGB image data, binocular vision point cloud data, and lidar point cloud data. Image features are then extracted from the RGB image data with the image detection network, and a two-dimensional candidate region is generated with the region candidate network. The view cone candidate region is generated from the binocular vision point cloud data and the two-dimensional candidate region. Finally, the point cloud data of the view cone candidate region is acquired from the binocular vision point cloud data and the lidar point cloud data. Once the point cloud data of the view cone candidate region is determined, a multi-stage extraction network may be used to extract the features of the point cloud in the view cone candidate region.
Fig. 1 is a flowchart of a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud according to an embodiment of the present application.
Step 1: Acquire point cloud data of the view cone candidate region in the region to be detected, the number of points in the point cloud data being denoted N.
Step 2: Determine N1 first sampling points from the point cloud data by a down-sampling method, and record the first sampling points as candidate points. In some embodiments, the down-sampling method may include farthest point sampling.
Step 3: Taking each candidate point as a center, determine K nearest neighbor points using a KNN algorithm with an abstract feature constraint, denote the set consisting of each candidate point and its K nearest neighbor points as a local region, and perform feature extraction on each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector.
Step 4: Determine whether the number of sampling points has stopped decreasing; if it has not, execute step 5; if it has, execute step 7.
Step 5: Determine N2 second sampling points from the candidate points by the down-sampling method. In some embodiments, N2 may be equal to N1.
Step 6: Take the second sampling points as candidate points, and repeat steps 3 and 4.
Step 7: Determine the local features of each point cloud in the point cloud data.
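The control flow of steps 1 to 7 can be sketched as a loop that samples, groups, and encodes until the sample count stops shrinking. In the sketch below the halving ratio, the minimum point count, and the three stand-in functions (stride sampling, plain Euclidean grouping, max pooling instead of PointNet) are all assumptions made so the example runs on its own; drawing neighbors from the previous level's points is likewise an assumption of this sketch.

```python
import numpy as np

def multilayer_local_sampling(points, downsample, group, encode,
                              ratio=2, min_points=4):
    """Repeat sample -> group -> encode until the sample count stops shrinking.

    downsample(xyz, m) -> indices of m sampled points            (steps 2, 5)
    group(xyz, feats, centers) -> (M, K) neighbor indices        (step 3)
    encode(xyz, feats, centers, nbrs) -> (M, C) feature vectors  (step 3)
    ratio and min_points are hypothetical settings for this sketch.
    """
    xyz, feats = points, points.copy()
    levels = []
    while True:
        m = max(min_points, len(xyz) // ratio)
        if m >= len(xyz):            # step 4: count no longer decreases
            break
        centers = downsample(xyz, m)
        nbrs = group(xyz, feats, centers)
        feats = encode(xyz, feats, centers, nbrs)
        xyz = xyz[centers]           # step 6: samples become the candidates
        levels.append((xyz, feats))
    return levels                    # step 7: features at every level

# Simple stand-ins so the loop can be exercised end to end.
def stride_downsample(xyz, m):
    return np.linspace(0, len(xyz) - 1, m).astype(int)

def euclidean_group(xyz, feats, centers, k=4):
    d = np.linalg.norm(xyz[centers][:, None, :] - xyz[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def maxpool_encode(xyz, feats, centers, nbrs):
    return feats[nbrs].max(axis=1)   # placeholder for a PointNet encoder

cloud = np.random.default_rng(0).normal(size=(64, 3))
levels = multilayer_local_sampling(cloud, stride_downsample,
                                   euclidean_group, maxpool_encode)
```

With 64 input points the loop produces levels of 32, 16, 8, and 4 points and then stops, because a further halving would fall below the assumed minimum.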
In step 2, the input point set may be down-sampled by iteratively applying farthest point sampling (FPS). For each sampled point, the local region point set is defined as the K points nearest in Euclidean distance to the center o within the sphere centered at o with radius r, as shown in FIG. 2. After a local region has been selected, feature extraction is performed with PointNet in the region's new coordinate system, as shown in FIG. 3.
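Farthest point sampling can be written in a few lines. The following is a minimal NumPy sketch of the standard greedy procedure; the starting index is an arbitrary choice of this example.

```python
import numpy as np

def farthest_point_sampling(xyz, m, start=0):
    """Pick m indices so that each new point is farthest from those chosen."""
    chosen = np.empty(m, dtype=int)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(xyz - xyz[start], axis=1)
    for i in range(1, m):
        chosen[i] = int(np.argmax(min_dist))
        new_dist = np.linalg.norm(xyz - xyz[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return chosen

pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [10.0, 0.0, 0.0],
                [0.5, 0.0, 0.0]])
picked = farthest_point_sampling(pts, 3)
```

Starting from index 0, the procedure first picks the point at x = 10 and then the point at x = 1, so the samples spread over the whole extent of the set rather than clustering.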
In step 3, after sampling and feature extraction are performed on a single local region, a new sampling point is obtained that carries not only the position of the original point in the global point cloud but also a feature vector. The sampling point thus holds the geometric and feature information of the K points around it in the local region. Sampling and feature extraction are then performed on all the divided local regions, which yields a new set of sampling points, as shown in FIG. 4. To further strengthen the multi-level feature extraction capability of the network, the whole multilayer local sampling is repeated several times to realize multi-level feature learning, as shown in FIG. 5. However, if only the Euclidean distance between points is considered when selecting a local region, the situation in the left diagram of FIG. 6 easily arises. In FIG. 6, points marked "a" are class-a points and points marked "b" are class-b points. In the ideal case, a class-a point is surrounded by many nearby points of the same class; because the selection rule takes the K nearest points, the network extracts class-a features and the segmentation and detection result is class a, as shown in the right diagram of FIG. 6. In practice, however, a class-a point may be surrounded by a large number of nearby class-b points that outnumber the class-a points, and under this selection rule the class-a point is misjudged as class b, as shown in the left diagram of FIG. 6.
In practice, the more similar two points are, the closer the abstract features extracted for them by multilayer local sampling, that is, the smaller the distance between their feature vectors. In the three-dimensional coordinate space shown in FIG. 7(a) and the feature space shown in FIG. 7(b), the points marked "b" are class-b points and the unmarked points are class-a points. As shown in FIG. 7(a), in a fixed three-dimensional space the coordinates of a point are (x, y, z), and for the class-a point at the coordinate origin, the points nearest in Euclidean distance are class-b points, which disturbs feature extraction. In the n-dimensional feature space shown in FIG. 7(b), however, points with more similar features cluster together and their abstract feature vectors are closer. The abstract feature vector is therefore used as an additional constraint when selecting sampling points. This guides the network to learn from points that are close to the center point in both three-dimensional distance and feature-space distance, so that feature extraction is not disturbed by irrelevant points and the network is more robust. Accordingly, when step 3 determines the K nearest neighbor points using the KNN algorithm with the abstract feature constraint, it considers both the Euclidean distance from each point to the center point and the distance between their abstract feature vectors. Using the KNN algorithm with the abstract feature constraint can therefore make target detection more accurate.
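The application does not state how the spatial distance and the feature distance are combined. One possible reading, offered purely as an assumption, is to rank neighbors by a weighted sum of the two normalized distances; the weight alpha and the function name below are hypothetical.

```python
import numpy as np

def feature_constrained_knn(xyz, feats, center_idx, k, alpha=0.5):
    """K neighbors ranked by a blend of spatial and feature distance.

    alpha is a hypothetical weight: 0 gives plain Euclidean KNN and
    larger values give the feature distance more influence.
    """
    d_xyz = np.linalg.norm(xyz - xyz[center_idx], axis=1)
    d_feat = np.linalg.norm(feats - feats[center_idx], axis=1)
    # Scale both terms to [0, 1] so neither dominates through units alone.
    d_xyz = d_xyz / (d_xyz.max() + 1e-12)
    d_feat = d_feat / (d_feat.max() + 1e-12)
    score = (1.0 - alpha) * d_xyz + alpha * d_feat
    score[center_idx] = np.inf       # exclude the center itself
    return np.argsort(score)[:k]

# Index 0 is a class-a center; 1 and 2 are nearby class-b points;
# 3 and 4 are class-a points that lie slightly farther away.
xyz = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                [0.3, 0.0, 0.0], [0.0, 0.3, 0.0]])
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                  [1.0, 0.0], [1.0, 0.0]])

plain = sorted(feature_constrained_knn(xyz, feats, 0, 2, alpha=0.0).tolist())
constrained = sorted(feature_constrained_knn(xyz, feats, 0, 2, alpha=0.5).tolist())
```

In this toy layout, plain KNN selects the two class-b points, while the constrained variant selects the two class-a points, which mirrors the contrast between the left and right diagrams of FIG. 6.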
On the basis of three-dimensional instance segmentation, three-dimensional target localization is realized based on residuals. Three-dimensional target localization does not regress the absolute three-dimensional position of the target object, because the object's offset from the sensor position may vary over a wide range. For example, in the KITTI data set the offset ranges from 5 meters to possibly more than 50 meters. To address this, the center of the target object's point cloud is predicted in a mask coordinate system. Specifically, after three-dimensional instance segmentation, the points in the view cone whose category is the target of interest are extracted, and their coordinates are normalized to improve the overall translation invariance of the algorithm.
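A simple form of this normalization is to subtract the centroid of the segmented points, so that the network only has to regress a small residual. The sketch below assumes centroid subtraction; the application does not specify the exact normalization, and the function name is hypothetical.

```python
import numpy as np

def to_mask_coordinates(points, mask):
    """Keep the segmented points and shift them so their centroid is the origin."""
    obj = points[mask]
    centroid = obj.mean(axis=0)
    return obj - centroid, centroid

rng = np.random.default_rng(0)
shape = rng.normal(size=(20, 3))
near = shape + np.array([0.0, 0.0, 5.0])    # same object 5 m away
far = shape + np.array([0.0, 0.0, 50.0])    # same object 50 m away
mask = np.ones(20, dtype=bool)

local_near, c_near = to_mask_coordinates(near, mask)
local_far, c_far = to_mask_coordinates(far, mask)
```

The two local point sets coincide even though the objects are 45 m apart, and the returned centroid lets a predicted residual be mapped back to the sensor frame.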
The present application treats the input three-dimensional points as anchor points densely distributed in the view cone region, and fuses the global features of the vehicle target (for example, the global features of the image and the global point cloud features from the binocular camera) with all the per-point features extracted by the network. After fusion, each point is assigned a three-dimensional anchor box whose category and scale match those of the corresponding vehicle, and the new features generated by the fusion are used to classify and refine the anchor box. The point cloud multi-level feature extraction network outputs the new features of each fused point (that is, the features obtained by fusing the global point cloud features of the binocular camera with the local point cloud features of the lidar). The loss function of the whole process is:
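A minimal sketch of such a fusion, assuming it is done by concatenation (the application does not fix the operator), tiles each global vector across all points and appends it to the per-point features. The feature sizes below are hypothetical.

```python
import numpy as np

def fuse_features(point_feats, global_feats):
    """Append the same global vectors to every per-point feature row."""
    n = point_feats.shape[0]
    tiled = [np.tile(g, (n, 1)) for g in global_feats]
    return np.concatenate([point_feats] + tiled, axis=1)

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(5, 8))   # per-point lidar features
img_global = rng.normal(size=16)        # global image feature
stereo_global = rng.normal(size=4)      # global binocular point cloud feature

fused = fuse_features(point_feats, [img_global, stereo_global])
```

Each of the 5 rows of `fused` then holds 8 + 16 + 4 = 28 values, so every point acts as an anchor that sees both its own local description and the shared global context.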
[equation image BDA0003818365870000061]
In the above formula, N is the number of input fused points (that is, fused features), [symbol image BDA0003818365870000062] represents the offset between the corner positions of the real box and the corner positions of the prediction box obtained from the i-th input fused point, [symbol image BDA0003818365870000063] represents the predicted offset between the real box and the anchor box, L_score represents the score function loss, and L_stn represents the spatial transformation regularization loss. The coordinate regression function of the fusion network is:
[equation image BDA0003818365870000071]
wherein x represents the difference between the prediction box and the real box. The regressed quantity consists of the seven values of the three-dimensional bounding box: the three coordinate values of the vehicle target, the length, width, and height of the bounding box, and the orientation angle.
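The equation images are not available in this text. One form that is consistent with the symbol definitions given here, offered only as an assumption and not as the formula of the application, is a mean smooth-L1 offset loss plus the two auxiliary terms, with the smooth-L1 function as the coordinate regression function. The symbols x_offset^{i*} (target offset) and x_offset^{i} (predicted offset) are names chosen for this sketch.

```latex
L = \frac{1}{N}\sum_{i=1}^{N}
    \operatorname{smooth}_{L1}\!\left(x_{\text{offset}}^{i*} - x_{\text{offset}}^{i}\right)
    + L_{\text{score}} + L_{\text{stn}},
\qquad
\operatorname{smooth}_{L1}(x) =
\begin{cases}
0.5\,x^{2}, & |x| < 1,\\
|x| - 0.5, & \text{otherwise.}
\end{cases}
```

Under this reading, small regression errors are penalized quadratically and large ones linearly, which limits the influence of outlier boxes.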
Table 1 shows the results of 3D AP comparisons for the KITTI validation set vehicle (Car) class using different feature extraction methods. Wherein, "v0" represents a detection algorithm using PointNet as a point cloud feature extraction network, and "v1" represents a detection algorithm using the multi-stage feature extraction point cloud segmentation method provided by the application as the point cloud feature extraction network.
TABLE 1 [table image BDA0003818365870000072]
As can be seen from Table 1, the detection method that uses the multi-stage feature extraction point cloud segmentation method provided by the present application as its point cloud feature extraction network achieves improved accuracy.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud is characterized by comprising the following steps:
step 1: acquiring point cloud data of a view cone candidate region in a region to be detected, wherein the number of points in the point cloud data is denoted N;
step 2: determining N1 first sampling points from the point cloud data by a down-sampling method, and recording the first sampling points as candidate points;
step 3: taking each candidate point as a center, determining K nearest neighbor points by a KNN algorithm with an abstract feature constraint, denoting the set consisting of each candidate point and its K nearest neighbor points as a local region, and performing feature extraction on each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector;
step 4: judging whether the number of sampling points has stopped decreasing; if not, executing step 5, and if so, executing step 7;
step 5: determining N2 second sampling points from the candidate points by the down-sampling method;
step 6: taking the second sampling points as candidate points, and repeating steps 3 and 4;
step 7: determining the local features of each point cloud in the point cloud data.
2. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 1, wherein the down-sampling method is a farthest point sampling.
3. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 1, wherein determining the K nearest neighbor points using the KNN algorithm with abstract feature constraint comprises obtaining the K points that are closest to the candidate point in both three-dimensional spatial distance and feature-space distance.
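One possible reading of the neighbor search in claim 3 is sketched below. The claim does not state how the spatial distance and the feature-space distance are combined, so the weighted sum and the weight `alpha` are assumptions, as are the function and parameter names.

```python
import math

def knn_with_feature_constraint(center, points, features, k, alpha=0.5):
    """Return indices of the k points nearest to points[center] under a combined distance.

    The combined distance is (1 - alpha) * spatial + alpha * feature, which is
    an assumed weighting and not a formula taken from the claims.
    """
    def combined(i):
        spatial = math.dist(points[i], points[center])
        feature = math.dist(features[i], features[center])
        return (1.0 - alpha) * spatial + alpha * feature

    candidates = [i for i in range(len(points)) if i != center]
    return sorted(candidates, key=combined)[:k]

xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
feat = [(0.0,), (10.0,), (0.0,), (0.0,)]
neighbors = knn_with_feature_constraint(0, xyz, feat, k=2)
spatial_only = knn_with_feature_constraint(0, xyz, feat, k=2, alpha=0.0)
```

In this example, point 1 is the spatially closest point but has a very different feature, so the combined distance excludes it, while the purely spatial search (`alpha=0.0`) keeps it.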
4. The cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud of claim 1, wherein obtaining point cloud data of a view cone candidate region comprises:
acquiring multi-modal data of a region to be detected, wherein the multi-modal data comprises RGB image data, binocular vision point cloud data and laser radar point cloud data;
extracting image characteristics by using an image detection network according to the RGB image data, and generating a two-dimensional candidate region by using a region candidate network;
generating the view cone candidate area by using the binocular vision point cloud data and the two-dimensional candidate region;
and acquiring point cloud data of the view cone candidate area from the binocular vision point cloud data and the laser radar point cloud data.
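The selection of points inside a view cone can be illustrated as follows. This sketch assumes a pinhole camera model with the points already expressed in the camera coordinate frame; the function name and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are hypothetical, and the claims do not state how the view cone is constructed from the two-dimensional candidate region.

```python
def points_in_view_cone(points, box, fx, fy, cx, cy):
    """Keep the 3D points whose image projection falls inside a 2D candidate box.

    points: iterable of (x, y, z) in the camera frame, with z pointing forward.
    box: (u_min, v_min, u_max, v_max) in pixels.
    """
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera, cannot lie in the view cone
        u = fx * x / z + cx  # pinhole projection to pixel coordinates
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept

cloud = [(0.0, 0.0, 10.0), (5.0, 0.0, 10.0), (0.0, 0.0, -1.0), (0.5, 0.5, 10.0)]
inside = points_in_view_cone(cloud, box=(40, 40, 60, 60),
                             fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

The same filter could be applied to the binocular vision points and to the laser radar points once both are transformed into the camera frame, which is a step this sketch leaves out.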
5. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 4, wherein the image detection network comprises fast-CNN.
6. The cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud of claim 4, further comprising:
fusing the image features and the local features of each point cloud to generate a three-dimensional detection frame;
and carrying out vehicle detection on the area to be detected by utilizing the three-dimensional detection frame.
7. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point clouds of claim 6, wherein the loss function for fusing the image features and the local features of each point cloud is:
Figure FDA0003818365860000021
wherein N is the number of input fused point clouds,
Figure FDA0003818365860000022
represents the offset between the actual frame corner position and the predicted frame corner position obtained after the i-th input fused point cloud,
Figure FDA0003818365860000023
represents the offset between the predicted real frame and the anchor frame, L_score represents the score loss of the function, and L_stn represents the spatial transform regularization loss.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211033470.9A CN115457257A (en) 2022-08-26 2022-08-26 Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud

Publications (1)

Publication Number Publication Date
CN115457257A true CN115457257A (en) 2022-12-09

Family

ID=84300491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211033470.9A Pending CN115457257A (en) 2022-08-26 2022-08-26 Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud

Country Status (1)

Country Link
CN (1) CN115457257A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination