
CN110674829B - A 3D Object Detection Method Based on Graph Convolutional Attention Network - Google Patents


Info

Publication number
CN110674829B
CN110674829B (application CN201910918980.6A)
Authority
CN
China
Prior art keywords
convolution
layer
point
feature map
voxel
Prior art date
Legal status
Active
Application number
CN201910918980.6A
Other languages
Chinese (zh)
Other versions
CN110674829A (en)
Inventor
夏桂华
何芸倩
苏丽
朱齐丹
张智
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201910918980.6A priority Critical patent/CN110674829B/en
Publication of CN110674829A publication Critical patent/CN110674829A/en
Application granted granted Critical
Publication of CN110674829B publication Critical patent/CN110674829B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2136 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on sparsity criteria, e.g. with an overcomplete basis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a three-dimensional object detection method based on a graph convolutional attention network: (1) the point cloud is voxelized and randomly downsampled; (2) local features are extracted within each grid voxel; (3) intermediate convolution layers extract a high-order feature map; (4) a region proposal network predicts the target's bounding box, category, and orientation. To strengthen the relationships between each point and its neighbors, the invention provides a feature extraction module that is based on edge convolution and introduces an attention mechanism; an attention module built on the same principle is also added after the intermediate convolution layers, so that the features of each channel of the feature map are reselected and a more reasonable high-order feature map is obtained. The invention improves point cloud object detection accuracy and performs well even under severe occlusion.

Description

A 3D Object Detection Method Based on Graph Convolutional Attention Network

Technical Field

The present invention relates to a computer-vision method for processing three-dimensional point clouds, and specifically to a three-dimensional object detection method.

Background Art

Object detection is a classical vision task that simultaneously recognizes and localizes objects, a prerequisite for intelligent scene understanding. Two-dimensional detection has reached unprecedented maturity, but in fields such as mapping, indoor robotics, and augmented reality, three-dimensional detection is clearly superior: it provides richer position and pose information, and it is one of the fundamental tasks of environment perception for autonomous driving. RGB images used to be the mainstream data form for object detection, but with the development of 3D sensors, LiDAR has become an increasingly popular detection tool in recent years.

Some methods based on LiDAR and cameras now fuse point cloud data and image data to achieve higher accuracy. Fusion, however, carries a high computational cost, so single-sensor methods remain competitive. Many studies have shown that the point cloud is a more appropriate data form for describing object shape: it directly represents Euclidean distances and does not suffer from multi-scale problems. However, point clouds are sparse, which makes two-dimensional methods difficult to apply directly.

When extracting features, most methods process points one by one and use a symmetric function to extract global features, an approach that ignores the connections and relationships between points. Compared with image data, a point cloud is a natural graph structure in which links are easy to build. Some studies have drawn on the idea of graph networks, observing that the relationships between adjacent points and edges help strengthen the expression of local features, and have proposed edge convolution methods. For three-dimensional convolution, many voxels within the defined voxel range are empty because of point sparsity; sparse convolution therefore increases computation speed and reduces GPU memory consumption without affecting the convolution result.

Summary of the Invention

The purpose of the present invention is to provide a three-dimensional object detection method based on a graph convolutional attention network that improves point cloud object detection accuracy and maintains good performance under severe occlusion.

The object of the present invention is achieved as follows:

(1) Voxelize the point cloud and randomly downsample it;

(2) extract local features within each grid voxel;

(3) extract a high-order feature map with intermediate convolution layers;

(4) predict the target's bounding box, category, and orientation with a region proposal network.

The present invention may further include:

1. The voxelized partitioning and random downsampling of the point cloud specifically includes: partitioning the original point cloud with a voxel grid structure, discarding outliers outside the specified range, assigning the points to grid cells, randomly downsampling within each voxel cell, and then numbering and storing each cell.

The storage uses a hash table.

2. The local feature extraction in each grid voxel specifically includes: within each voxel cell, extracting features of the corresponding points with a graph attention network module.

The feature extraction with the graph attention network module is specifically as follows: first, each point is connected by edges to its surrounding neighbors, forming a graph structure judged by Euclidean distance, and each point is also connected by an edge to itself; information such as the coordinates of the two endpoints of each edge is extracted as the edge's initial features; a convolution operation is then performed on the edges; finally, voxel-level features are obtained through a symmetric selection function.

Before the edge convolution operation, an attention mechanism is used to select the initial features.

3. The extraction of the high-order feature map by intermediate convolution specifically includes: using sparse convolution, the feature map is compressed into a dense structure, convolved, and then mapped back to its original sparse spatial representation; after convolutional abstraction, an attention mechanism redistributes the weights of the different channels to obtain an attention map corresponding to the feature map, and the attention map is superimposed on the convolved high-order feature map to obtain the final three-dimensional feature map.

4. The region proposal network predicting the target's bounding box, category, and orientation specifically includes: after feature extraction from the high-order feature map produced by the multi-layer convolution, three separate fully connected layers compute the predicted bounding box, category, and orientation for each anchor.

The three-dimensional object detection method based on a graph convolutional attention network of the present invention is characterized by strengthening the expression of local relationships in the point cloud and optimizing the feature selection process. The invention applies the edge convolution method, which can express the relationships between adjacent points, to feature extraction for object detection; in the feature selection stage for the initial points, an attention mechanism selects the initial physical features that matter most for feature expression, yielding better extracted features. The intermediate convolution layers likewise produce multi-channel feature data; the invention uses the idea of the attention mechanism to optimize the convolution results, strengthening the weight of the most influential channels and obtaining a more expressive feature map.

The point cloud of a typical scene contains more than 100k points, so a specific data structure, voxelization, is used to preprocess it. The original points are first divided into voxels and point-wise features are extracted; the downsampled voxel signal then passes through the convolution and region proposal stages to obtain three-dimensional bounding boxes.

The present invention strengthens the representation of relationships between the underlying original points during feature extraction by drawing on the idea of graph networks. At the same time, to further strengthen feature expression, an attention mechanism imitating human cognitive acuity is considered, making the multi-channel selection of features more intelligent. The invention applies the attention mechanism both before the initial feature selection of the graph-network edge convolution and after the sparse-convolution feature map processing, improving the expressive power of the neural network modules while making the feature expression at each stage more interpretable.

The present invention has the following advantages:

1. The present invention uses a graph convolution method with an attention mechanism in the feature representation of each voxel, which better describes the relationships between the points of the point cloud and extracts more expressive features.

2. After the intermediate convolution layers, the present invention uses an attention mechanism to redistribute the weights of the resulting high-order feature map, obtaining a more reasonable high-order feature map.

3. With the two improvements above acting together, the present invention improves the accuracy of three-dimensional object detection for vehicle detection.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1: Feature extraction module based on the graph network attention structure, where e denotes an edge, x a point, and i and j point indices;

Figure 2: Voxel feature extraction;

Figure 3: Intermediate-layer sparse convolution with the attention mechanism;

Figure 4: Overall pipeline.

DETAILED DESCRIPTION

The present invention is described in more detail below with reference to examples.

Step 1: Voxelized partitioning and clustering of the point cloud

The original point cloud data of more than 100k points is structured and downsampled by voxelization. First, points outside a certain range are cropped away, keeping only points within D, H, W along the x, y, z axes. Because a point cloud contains too many points, the whole cloud within the extraction range is partitioned with small voxel grids of size $v_d$, $v_h$, $v_w$.

To address the uneven distribution of points across voxels, this embodiment uses random downsampling so that each voxel contains at most T points. Finally, the processed voxel structures are numbered and stored in a hash table, which eliminates voxels with no interior points.
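As a concrete illustration of this step, the following Python sketch crops, voxelizes, randomly downsamples, and hash-stores a point cloud. The detection range, voxel size, and the cap T = 35 are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def voxelize(points, pc_range=(0, -40, -3, 70.4, 40, 1),
             voxel_size=(0.2, 0.2, 0.4), max_points_T=35):
    """Crop points to pc_range, bucket them into voxels, and randomly
    downsample each voxel to at most max_points_T points.
    points: (N, 3+) array of x, y, z (+ extras such as intensity)."""
    x0, y0, z0, x1, y1, z1 = pc_range
    mask = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1) &
            (points[:, 2] >= z0) & (points[:, 2] < z1))
    points = points[mask]                       # discard out-of-range outliers

    # Integer voxel coordinates; a Python dict plays the role of the hash
    # table, so voxels with no interior points are simply never stored.
    coords = ((points[:, :3] - np.array([x0, y0, z0])) /
              np.array(voxel_size)).astype(np.int32)
    voxels = {}
    for pt, c in zip(points, coords):
        voxels.setdefault(tuple(c), []).append(pt)

    # Random downsampling: keep at most T points per voxel.
    rng = np.random.default_rng()
    for key, pts in voxels.items():
        if len(pts) > max_points_T:
            keep = rng.choice(len(pts), max_points_T, replace=False)
            pts = [pts[i] for i in keep]
        voxels[key] = np.stack(pts)
    return voxels  # {voxel index -> (<=T, 3+) point array}
```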

Step 2: Point cloud feature extraction within voxels

After voxelizing the original point cloud, this embodiment extracts features from each voxel with a graph attention network module in order to obtain voxel-level features.

A point cloud is a natural graph structure. In conventional point-cloud feature extraction, each point is considered separately and the connections between points are ignored. Define

$$\mathcal{G} = (\mathcal{V}, \mathcal{E})$$

as a graph consisting of a point set $\mathcal{V} = \{x_1, \dots, x_n\}$ of $n$ points and the edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ between the points. For example, the present invention defines a d-dimensional neighborhood graph in which, for each point $x_i$, $\mathcal{E}$ contains edges of the form $(i, j_{i1}), \dots, (i, j_{ik})$, where $i$ and $j$ are both point indices. The edge feature is then defined as

$$e_{ij} = h_\theta(x_i, x_j),$$

where $h_\theta$, like $H$ in the following formula, is a symmetric function:

$$x'_i = H_{j:(i,j)\in\mathcal{E}}\, h_\theta(x_i, x_j).$$

Generally, a point cloud has three dimensions representing its real-world coordinates. In this embodiment, when describing the edge between two points, the information of the center point $x_i$ and of the neighboring point $x_j$ connected to it by the $h$ operation is combined as the initial feature selection. Each channel of the edge feature contributes differently to the overall feature representation, so an attention mechanism is added. After the multi-layer perceptron operation of the edge convolution, a symmetric operation $H$ extracts the edge-level features into the corresponding point-level features. The final voxel-level feature is then obtained by applying another symmetric operation to the point-level features $X = \{x'_1, \dots, x'_n\}$.
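The module described above can be sketched as follows in PyTorch. This is a minimal reading of the text, not the patented implementation: the k-nearest-neighbour graph construction, the use of max as both symmetric operations, the concatenated [x_i, x_j] edge features, and all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EdgeConvAttention(nn.Module):
    """Edge convolution over the points of one voxel, with a channel
    attention that reweights the initial edge features before the MLP."""
    def __init__(self, in_dim=3, out_dim=64, k=8):
        super().__init__()
        self.k = k
        edge_dim = 2 * in_dim               # [x_i, x_j] concatenated
        self.attn = nn.Sequential(          # per-channel attention weights
            nn.Linear(edge_dim, edge_dim), nn.Sigmoid())
        self.mlp = nn.Sequential(           # h_theta
            nn.Linear(edge_dim, out_dim), nn.ReLU())

    def forward(self, x):                   # x: (n, in_dim) points of a voxel
        n = x.size(0)
        # kNN graph by Euclidean distance; distance 0 to itself means each
        # point is automatically one of its own neighbours (the self edge).
        dist = torch.cdist(x, x)
        idx = dist.topk(min(self.k, n), largest=False).indices   # (n, k)
        neighbours = x[idx]                                      # (n, k, d)
        centers = x.unsqueeze(1).expand_as(neighbours)
        edges = torch.cat([centers, neighbours], dim=-1)  # initial edge features
        edges = edges * self.attn(edges)    # attention-based feature selection
        edges = self.mlp(edges)             # edge-level features
        point_feat = edges.max(dim=1).values        # symmetric op H -> point level
        voxel_feat = point_feat.max(dim=0).values   # second symmetric op -> voxel level
        return voxel_feat                   # (out_dim,)
```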

Step 3: Intermediate-layer sparse convolution

This embodiment uses three-dimensional sparse convolution as the intermediate convolution layers. Let ConvMD(c_in, c_out, k, s, p) be a convolution operator, where c_in and c_out are the numbers of input and output channels and k, s, p are the kernel size, stride, and padding, respectively. Each convolution operation consists of a 3D convolution, a BatchNorm layer, and a ReLU layer. Finally, after the sparse map is converted to a dense map, a high-level feature map is obtained, and an attention module is added here.
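Read concretely, a ConvMD block could look like the sketch below. nn.Conv3d is used as a dense stand-in for the sparse convolution of the embodiment (a real implementation would use a sparse-convolution library); the example channel counts and strides are assumptions:

```python
import torch.nn as nn

def ConvMD(c_in, c_out, k, s, p):
    """One middle-layer block: 3D convolution + BatchNorm + ReLU.
    Note: the embodiment uses *sparse* 3D convolution; nn.Conv3d is a
    dense stand-in with the same interface (kernel k, stride s, padding p)."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )

# Example middle layer: two blocks, the first downsampling along depth.
middle = nn.Sequential(
    ConvMD(64, 64, 3, (2, 1, 1), (1, 1, 1)),
    ConvMD(64, 64, 3, (1, 1, 1), (0, 1, 1)),
)
```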

The convolution operation produces feature maps at many different scales, and the features of each channel clearly contribute to the overall feature with different importance. To improve the description of the feature map and make it more reasonable, the present invention adds an attention map onto the original feature map:

$$\tilde{U} = U + F_{scale}(U, s),$$

with the dense feature map $U$, the attention map $s$, and the scaling function $F_{scale}$ as defined below.

This embodiment uses an SE (squeeze-and-excitation) attention module to generate the attention feature map. First, let the dense feature map input be

$$U \in \mathbb{R}^{H \times W \times C},$$

where $H$ is the feature map height, $W$ the feature map width, and $C$ the number of channels. An avg-pooling operation then extracts each channel into a single statistic, giving the channel weights

$$z \in \mathbb{R}^{C}.$$

A multi-layer perceptron is then used to obtain higher-level features for each dimension, so the final attention map is $s_c = F_e(z_c, W)$, where $F_e$ is the extraction function.

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c.$$

After the scaling function $F_{scale}$, the attention feature map is added to the original map to obtain the final output comprehensive feature map $\tilde{U} = U + \tilde{X}$, where $\tilde{X}$ collects the channel-scaled maps $\tilde{x}_c$.

This attention operation, added after the intermediate layers, aggregates high-level information into the final intermediate-layer feature map, providing more information for the subsequent region proposal.
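A compact sketch of such an SE-style module, with the channel-scaled map added back onto the input as described above (the reduction ratio r and the layer sizes are illustrative assumptions):

```python
import torch.nn as nn

class SEAttention3D(nn.Module):
    """Channel attention on the dense feature map U of shape (B, C, D, H, W):
    squeeze by average pooling, excite with a two-layer MLP, scale the
    channels, then add the result back onto the original map."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)          # z_c: one statistic per channel
        self.fc = nn.Sequential(                     # s_c = F_e(z_c, W)
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, u):
        b, c = u.shape[:2]
        z = self.pool(u).view(b, c)                  # squeeze
        s = self.fc(z).view(b, c, 1, 1, 1)           # attention map
        return u + u * s                             # F_scale, then add to original
```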

Step 4: Region proposal network

The region proposal network (RPN) has become a typical embedded module in many detection frameworks. This embodiment uses an end-to-end, SSD-like form as the region proposal architecture. The input to the region proposal layers is the feature map extracted by the intermediate layers; each region proposal layer consists of a convolution layer, a BatchNorm layer, and a ReLU layer. After each individual RPN layer, the feature maps are upsampled to the same fixed size and concatenated together. Finally, three 1×1 convolutions generate the predictions for the bounding box, class, and direction.
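A simplified sketch of such a head operating on a 2D (bird's-eye-view) feature map follows; the block depths, channel widths, number of anchors, and the 7-parameter box encoding (x, y, z, w, l, h, yaw, common practice in 3D detection) are illustrative assumptions rather than values given in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rpn_block(c_in, c_out, stride):
    """One RPN layer: convolution + BatchNorm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU())

class RPNHead(nn.Module):
    """SSD-like region proposal head: multi-scale conv blocks, upsample to
    a common size, concatenate, then three 1x1 convs for box / class / dir."""
    def __init__(self, c_in=128, anchors=2, num_classes=1):
        super().__init__()
        self.block1 = rpn_block(c_in, 128, 1)
        self.block2 = rpn_block(128, 128, 2)
        self.block3 = rpn_block(128, 256, 2)
        fused = 128 + 128 + 256
        self.box = nn.Conv2d(fused, anchors * 7, 1)           # 7 box parameters
        self.cls = nn.Conv2d(fused, anchors * num_classes, 1)
        self.dir = nn.Conv2d(fused, anchors * 2, 1)           # direction bins

    def forward(self, x):                     # x: BEV feature map (B, c_in, H, W)
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        size = f1.shape[-2:]                  # upsample all maps to a fixed size
        f2 = F.interpolate(f2, size=size, mode='bilinear', align_corners=False)
        f3 = F.interpolate(f3, size=size, mode='bilinear', align_corners=False)
        fused = torch.cat([f1, f2, f3], dim=1)
        return self.box(fused), self.cls(fused), self.dir(fused)
```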

Claims (1)

1. A three-dimensional target detection method based on a graph convolution attention network is characterized by comprising the following steps of:
step one: voxelized partitioning and clustering of the point cloud
Structuring and downsampling original point cloud data in a voxelized mode, discarding outliers outside a specified range, dividing the point cloud into grids, randomly downsampling in each voxel grid, numbering each grid, and storing; using a method of random downsampling, so that the number of points in each voxel is not more than T; finally numbering the processed voxel structure, and storing the voxel structure in a hash table mode, so that voxels with empty internal points are eliminated;
step two: point cloud feature extraction in voxels
After voxelization of the original point cloud, extracting features of each voxel by using a graph attention network module in order to obtain voxel-level features;
the point cloud is a natural graph structure, and in conventional point-cloud feature extraction each point is considered independently and the point-to-point relationships are neglected; define $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ as a graph comprising a set of $n$ points $\mathcal{V} = \{x_1, \dots, x_n\}$ and the edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ between the points; the point cloud has three dimensions to represent its real-world coordinates, and after the multi-layer perceptron operation of the edge convolution, a symmetric operation $H$ is used to extract the edge-level features into the corresponding point-level features; the final voxel-level features are obtained by performing another symmetric operation on the point-level features $X = \{x'_1, \dots, x'_n\}$;
step three: middle layer sparse convolution
Using three-dimensional sparse convolution as the intermediate convolution layers; let ConvMD(c_in, c_out, k, s, p) be a convolution operator, where c_in and c_out are the numbers of input and output channels and k, s, p correspond to the kernel size, stride, and padding, respectively; each convolution operation includes a 3D convolution, a BatchNorm layer and a ReLU layer; after the sparse map is converted into a dense map, a high-level feature map is obtained, and an attention module is added;
using an SE attention module to generate the attention feature map: first, let the dense feature map input be $U \in \mathbb{R}^{H \times W \times C}$, wherein $H$ is the height of the feature map, $W$ is the width of the feature map, and $C$ is the number of channels; each channel is then extracted using an avg-pooling operation to obtain an extracted feature, thus obtaining the statistically derived channel weights $z \in \mathbb{R}^{C}$; a multi-layer perceptron is then used to obtain higher-level features for each dimension, the final attention map being $s_c = F_e(z_c, W)$, where $F_e$ is an extraction function;
$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c;$$
after the scaling function $F_{scale}$, the attention feature map is added to the original map to obtain the final output comprehensive feature map $\tilde{U} = U + \tilde{X}$;
Step four: regional advice network
Using an end-to-end, SSD-like form as the region proposal architecture, wherein the input of a region proposal layer is the feature map extracted by the intermediate layers, and each region proposal layer comprises a convolution layer, a BatchNorm layer and a ReLU layer; after each individual RPN layer, upsampling the feature maps to the same fixed size and concatenating the maps together; finally, three 1×1 convolutions are used to generate the predicted values for the bounding boxes, classes and directions.
CN201910918980.6A 2019-09-26 2019-09-26 A 3D Object Detection Method Based on Graph Convolutional Attention Network Active CN110674829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918980.6A CN110674829B (en) 2019-09-26 2019-09-26 A 3D Object Detection Method Based on Graph Convolutional Attention Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918980.6A CN110674829B (en) 2019-09-26 2019-09-26 A 3D Object Detection Method Based on Graph Convolutional Attention Network

Publications (2)

Publication Number Publication Date
CN110674829A CN110674829A (en) 2020-01-10
CN110674829B true CN110674829B (en) 2023-06-02

Family

ID=69079355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918980.6A Active CN110674829B (en) 2019-09-26 2019-09-26 A 3D Object Detection Method Based on Graph Convolutional Attention Network

Country Status (1)

Country Link
CN (1) CN110674829B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340189B (en) * 2020-02-21 2023-11-24 之江实验室 An implementation method of spatial pyramid graph convolution network
CN111401190A (en) * 2020-03-10 2020-07-10 上海眼控科技股份有限公司 Vehicle detection method, device, computer equipment and storage medium
CN111583263B (en) * 2020-04-30 2022-09-23 北京工业大学 A point cloud segmentation method based on joint dynamic graph convolution
CN111476843B (en) * 2020-05-08 2023-03-24 中国科学院合肥物质科学研究院 Chinese wolfberry branch recognition and positioning method based on attention mechanism and improved PV-RCNN network
CN111539949B (en) * 2020-05-12 2022-05-13 河北工业大学 Point cloud data-based lithium battery pole piece surface defect detection method
EP4145338A4 (en) 2020-05-13 2023-06-21 Huawei Technologies Co., Ltd. Target detection method and apparatus
CN113766228B (en) * 2020-06-05 2023-01-13 Oppo广东移动通信有限公司 Point cloud compression method, encoder, decoder, and storage medium
CN113971694A (en) * 2020-07-22 2022-01-25 商汤集团有限公司 Method and device for processing point cloud data
CN113971712A (en) * 2020-07-22 2022-01-25 上海商汤临港智能科技有限公司 A method, device, electronic device and storage medium for processing point cloud data
CN112270289A (en) * 2020-07-31 2021-01-26 广西科学院 An intelligent monitoring method based on graph convolutional attention network
CN112184867B (en) * 2020-09-23 2024-09-17 中国第一汽车股份有限公司 Point cloud feature extraction method, device, equipment and storage medium
CN112115954B (en) * 2020-09-30 2022-03-29 广州云从人工智能技术有限公司 Feature extraction method and device, machine readable medium and equipment
CN112257852B (en) * 2020-11-04 2023-05-19 清华大学深圳国际研究生院 Method for classifying and dividing point cloud
CN112633376A (en) * 2020-12-24 2021-04-09 南京信息工程大学 Point cloud data ground feature classification method and system based on deep learning and storage medium
CN112446385B (en) * 2021-01-29 2021-04-30 清华大学 A scene semantic segmentation method, device and electronic device
CN112862719B (en) * 2021-02-23 2022-02-22 清华大学 Laser radar point cloud cell feature enhancement method based on graph convolution
CN113900119B (en) * 2021-09-29 2024-01-30 苏州浪潮智能科技有限公司 Method, system, storage medium and equipment for laser radar vehicle detection
CN114266992B (en) * 2021-12-13 2024-10-15 北京超星未来科技有限公司 Target detection method and device and electronic equipment
CN115273645B (en) * 2022-08-09 2024-04-09 南京大学 Map making method for automatically clustering indoor surface elements

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573731A (en) * 2015-02-06 2015-04-29 厦门大学 Rapid target detection method based on convolutional neural network
CN109685813A (en) * 2018-12-27 2019-04-26 江西理工大学 A kind of U-shaped Segmentation Method of Retinal Blood Vessels of adaptive scale information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520275A (en) * 2017-06-28 2018-09-11 浙江大学 A connection information regularization system, graph feature extraction system, graph classification system and method based on adjacency matrix
US11556777B2 (en) * 2017-11-15 2023-01-17 Uatc, Llc Continuous convolution and fusion in neural networks
CN109255791A (en) * 2018-07-19 2019-01-22 杭州电子科技大学 A kind of shape collaboration dividing method based on figure convolutional neural networks
CN109934826B (en) * 2019-02-28 2023-05-12 东南大学 An Image Feature Segmentation Method Based on Graph Convolutional Network
CN110222653B (en) * 2019-06-11 2020-06-16 中国矿业大学(北京) A Behavior Recognition Method of Skeleton Data Based on Graph Convolutional Neural Network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573731A (en) * 2015-02-06 2015-04-29 厦门大学 Rapid target detection method based on convolutional neural network
CN109685813A (en) * 2018-12-27 2019-04-26 江西理工大学 A kind of U-shaped Segmentation Method of Retinal Blood Vessels of adaptive scale information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Can Chen et al. "GAPNet: Graph Attention based Point Neural Network for Exploiting Local Feature of Point Cloud." arXiv, 2019, pp. 1-11. *
Zongji Wang et al. "VoxSegNet: Volumetric CNNs for Semantic Part Segmentation of 3D Shapes." IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 9, 2019, pp. 2919-2930. *

Also Published As

Publication number Publication date
CN110674829A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN110674829B (en) A 3D Object Detection Method Based on Graph Convolutional Attention Network
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN111160214B (en) 3D target detection method based on data fusion
CN110543858A (en) 3D Object Detection Method Based on Multimodal Adaptive Fusion
WO2021174904A1 (en) Image processing method, path planning method, apparatus, device, and storage medium
CN106713923A (en) Compression of a three-dimensional modeled object
CN111898172A (en) Experiential Learning in the Virtual World
CN113034581B (en) Relative pose estimation method of space targets based on deep learning
CN111898173A (en) Experiential Learning in the Virtual World
CN110659664B (en) A method for recognizing small objects with high precision based on SSD
CN110910452B (en) A Pose Estimation Method for Low-Texture Industrial Parts Based on Deep Learning
CN110827295A (en) 3D Semantic Segmentation Method Based on Coupling of Voxel Model and Color Information
EP3872761A2 (en) Analysing objects in a set of frames
CN110909615B (en) Target detection method based on multi-scale input mixed perception neural network
CN112750201A (en) Three-dimensional reconstruction method and related device and equipment
CN115115805A (en) Three-dimensional reconstruction model training method, device, equipment and storage medium
CN116563488A (en) A 3D Object Detection Method Based on Point Cloud Columnarization
CN113160382B (en) Single-view vehicle reconstruction method and device based on implicit template mapping
CN104796624B (en) A kind of light field editor transmission method
CN116310368A (en) A LiDAR 3D target detection method
CN116363329B (en) Three-dimensional image generation method and system based on CGAN and LeNet-5
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
TWI728791B (en) Image semantic segmentation method, device and storage medium thereof
CN114820344A (en) Depth map enhancement method and device
US20230177722A1 (en) Apparatus and method with object posture estimating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant