CN111462135A - Semantic Mapping Method Based on Visual SLAM and 2D Semantic Segmentation - Google Patents
Semantic Mapping Method Based on Visual SLAM and 2D Semantic Segmentation
- Publication number
- CN111462135A CN111462135A CN202010246158.2A CN202010246158A CN111462135A CN 111462135 A CN111462135 A CN 111462135A CN 202010246158 A CN202010246158 A CN 202010246158A CN 111462135 A CN111462135 A CN 111462135A
- Authority
- CN
- China
- Prior art keywords
- semantic
- camera
- image
- map
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/11—Region-based segmentation
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06T7/50—Depth or shape recovery
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
- G06T7/90—Determination of colour characteristics
- G06T2207/10024—Color image
- G06T2207/20084—Artificial neural networks [ANN]
- Y02T10/40—Engine management systems
Abstract
The invention relates to the intersection of computer vision and deep learning, and more specifically to a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation. The method of the invention comprises the following steps: S1, calibrate the camera parameters and correct camera distortion; S2, acquire the image frame sequence; S3, image preprocessing; S4, judge whether the current image frame is a keyframe, and if so, go to step S6, otherwise go to step S5; S5, motion-blur compensation; S6, semantic segmentation, in which ORB feature points are extracted from the image frame and semantic segmentation is performed with a Mask R-CNN (mask region convolutional neural network) model; S7, pose calculation, in which the camera pose is computed with a sparse SLAM model; S8, use the semantic information to assist dense semantic map construction, realizing three-dimensional semantic mapping of the global point-cloud map. The invention can improve the performance of a UAV semantic mapping system and significantly improve the robustness of feature-point extraction and matching in dynamic scenes.
Description
Technical Field
The invention relates to the intersection of computer vision and deep learning, and more specifically to a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation.
Background
A UAV generally consists of three modules: intelligent decision-making, environmental perception, and motion control, of which environmental perception is the foundation of everything.
To perceive its surroundings, a UAV needs a stable, high-performance sensor suite to act as its "eyes", together with suitable algorithms and a powerful processing unit to "understand" what it sees.
Within the environmental-perception module of a UAV, the visual sensor is indispensable. The visual sensor can be a camera; compared with lidar and millimeter-wave radar, a camera offers higher resolution and captures richer environmental detail, for example describing the appearance and shape of objects or reading signs.
Although the Global Positioning System (GPS) assists localization, interference from tall trees, buildings, tunnels, and the like can make GPS unreliable, so visual sensors cannot be replaced by GPS.
Simultaneous Localization and Mapping (SLAM) refers to an agent carrying specific sensors that, without prior information, estimates the trajectory of its own motion from the image frames those sensors acquire while building a map of the surrounding environment. SLAM is widely used in robotics, UAVs, autonomous driving, augmented reality, virtual reality, and other applications.
SLAM can be divided into two categories: laser SLAM and visual SLAM.
Owing to its early start, laser SLAM is relatively mature in both theory and engineering practice, but it has a fatal drawback in robotic applications: the structural information a lidar perceives is two-dimensional and carries little information, so a great deal of environmental information is lost. Its high cost, large size, and lack of semantic information further limit it in certain application scenarios.
The perceptual information source of visual SLAM is the camera image.
By camera type, visual SLAM can be divided into three kinds: monocular, stereo, and depth (RGB-D) SLAM. Similar to lidar, a depth camera can directly compute distances to obstacles by collecting point clouds. Depth cameras are structurally simple, easy to install and operate, inexpensive, and usable in a wide range of scenarios.
With the rise of deep learning, visual SLAM has also made great strides in recent years.
Most visual SLAM schemes operate at the feature-point or pixel level; to complete a specific task, or to interact intelligently with its surroundings, a UAV needs semantic information.
A visual SLAM system should be able to select useful information and discard invalid information.
With the development of deep learning, many mature object detection and semantic segmentation methods now make accurate semantic mapping possible. Semantic maps help improve the autonomy and robustness of UAVs, enable more complex tasks, and shift the problem from path planning to mission planning.
With increasing hardware computing power and better algorithm structures, deep learning has achieved ever more remarkable results.
Computer vision in particular has taken a huge leap; RGB image segmentation can be roughly divided into object detection and semantic segmentation.
Early work focused on proposing object detection frameworks, achieving increasingly accurate detection.
Mainstream deep-learning detection frameworks are mostly based on CNNs (Convolutional Neural Networks); among the more efficient are the YOLO (You Only Look Once) series and the R-CNN (Region-CNN, region convolutional neural network) series.
Object perception in 3D imagery is maturing, and the need for 3D understanding is becoming more pressing. Because point clouds are irregular, most researchers convert the points into regular voxel or mesh models and use deep neural networks for prediction.
Direct semantic segmentation of the point-cloud space consumes enormous computing resources, and the relationships between spatial points are weakened.
PointNet, proposed in 2017, was the first deep neural network able to process raw 3D point clouds directly.
The dense mapping methods adopted by most existing visual SLAM systems lack semantic information and cannot meet the demands of intelligent operation.
A typical assumption of visual SLAM algorithms is a static scene; the appearance of dynamic objects not only corrupts camera pose estimation but also leaves ghosting artifacts in the map, degrading map quality.
Images captured by a camera moving at high speed are prone to motion blur, which severely affects feature-point extraction and matching.
Summary of the Invention
The purpose of the present invention is to provide a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation, solving the technical problem that fast-moving dynamic objects degrade the quality of the constructed map.
To achieve the above purpose, the present invention provides a semantic mapping method based on visual SLAM and two-dimensional semantic segmentation, comprising the following steps:
S1. Calibrate the camera parameters and correct camera distortion;
S2. Acquire the image frame sequence, which includes RGB images and depth images;
S3. Image preprocessing: using the pinhole camera model, obtain the coordinates of the real three-dimensional point corresponding to each pixel of the RGB image;
S4. Judge whether the current image frame is a keyframe; if so, go to step S6, otherwise go to step S5;
S5. Motion-blur compensation: compute the image-block centroids of the current frame as semantic feature points, supplementing the ORB feature points;
S6. Semantic segmentation: extract ORB feature points from the image frame and perform semantic segmentation with a Mask R-CNN (mask region convolutional neural network) model, obtaining the semantic information of every pixel of the frame;
S7. Pose calculation: compute the camera pose with the sparse SLAM model;
S8. Feed the semantic information into the sparse SLAM model to assist dense semantic map construction; completing the traversal of the keyframes realizes the three-dimensional semantic mapping of the global point-cloud map.
In one embodiment, the camera distortion correction of step S1 further comprises the following steps:
S11. Project the three-dimensional point P(X, Y, Z) in the camera coordinate system onto the normalized image plane, giving the normalized coordinates [x, y]^T of the point;
S12. Correct the point [x, y]^T on the normalized plane for radial and tangential distortion, via the following formulas:

$$x_{corrected} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2p_1 xy + p_2(r^2 + 2x^2)$$
$$y_{corrected} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1(r^2 + 2y^2) + 2p_2 xy$$

where [x_corrected, y_corrected]^T are the corrected point coordinates, p_1 and p_2 are the tangential distortion coefficients of the camera, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system (r^2 = x^2 + y^2);
S13. Project the corrected point [x_corrected, y_corrected]^T through the intrinsic parameter matrix onto the pixel plane to obtain its correct position [u, v]^T in the image, via the following formulas:

$$u = f_x\, x_{corrected} + c_x, \qquad v = f_y\, y_{corrected} + c_y$$

where f_x, f_y, c_x, c_y are the intrinsic parameters of the camera.
In one embodiment, the image preprocessing of step S3 further comprises: the mapping from a pixel [u, v]^T to the real three-dimensional point P(X, Y, Z) satisfies the following formula:

$$Z\begin{bmatrix}u\\v\\1\end{bmatrix} = \begin{bmatrix}f_x & 0 & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}X\\Y\\Z\end{bmatrix} = KP$$

where K is called the intrinsic parameter matrix, f_x, f_y, c_x, c_y are the intrinsic parameters of the camera, P is the real three-dimensional point, and [u, v]^T are the pixel coordinates.
In one embodiment, the keyframes of step S4 are screened with the sparse SLAM model.
In one embodiment, the image-block centroid of step S5 is obtained through the following steps:
Label every object in the frame as a specific class;
Each segmented object has a corresponding labeled region, and the segmented image is called an image block;
Calculate the moments of the image block B for p, q ∈ {0, 1}:

$$m_{pq} = \sum_{x,y\in B} x^p y^q I(x, y)$$

Calculate the corresponding centroid C as a semantic feature point to supplement the ORB feature points, where

$$C = \left(\frac{m_{10}}{m_{00}},\ \frac{m_{01}}{m_{00}}\right)$$
In one embodiment, step S6 further comprises:
the semantic information of each pixel includes a semantic class label, bounding-box coordinates, and the confidence score of that class;
based on the semantic segmentation result, the ORB feature points extracted from regions whose class is designated as a dynamic object are removed.
In one embodiment, the semantic segmentation with the Mask R-CNN model in step S6 further comprises:
extracting features at different levels of the input image with a feature pyramid network;
proposing regions of interest with a region proposal network;
aligning the proposal regions with region-of-interest alignment;
performing mask segmentation with a fully convolutional network;
determining the region coordinates and the class labels with fully connected layers.
In one embodiment, the sparse SLAM model further comprises a tracking thread, a local mapping thread, and a loop-closure detection thread:
the tracking thread localizes the camera for every frame by matching features against the local map and minimizing the reprojection error with motion-only bundle adjustment;
the local mapping thread manages and optimizes the local map by running local bundle adjustment, maintains the covisibility relationships between keyframes through map points, and optimizes the poses of the covisible keyframes and the map points with local bundle adjustment;
the loop-closure detection thread detects large loops and corrects the drift error by performing pose-graph optimization, accelerates the screening of loop-closing candidate frames, optimizes the scale, and optimizes the essential graph and the map points with global bundle adjustment.
In one embodiment, the sparse SLAM model further comprises a global bundle adjustment optimization thread, triggered after the loop-closure detection thread confirms a loop; after the pose-graph optimization it computes the optimal structure and motion result for the whole system.
In one embodiment, the pose calculation of step S7 further comprises: computing an initial camera pose by solving PnP, refining the camera pose with back-end pose-graph optimization, and constructing the minimized reprojection error of camera pose estimation:

$$\xi^{*} = \arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\| u_i - \frac{1}{s_i} K \exp(\xi^{\wedge}) P_i \right\|_2^2$$

where u_i are the pixel coordinates, P_i are the camera coordinates, ξ^ is the Lie algebra element corresponding to the camera pose, s_i is the depth of the feature point, K is the camera intrinsic parameter matrix, and n is the number of points.
The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation provided by the present invention builds a dense semantic map with dynamic objects removed, based on a Mask R-CNN model over ORB feature points and a sparse SLAM model; it uses inter-frame information together with the semantic information on the image frames to improve the performance of a UAV semantic mapping system and the robustness of feature-point extraction and matching in dynamic scenes.
Brief Description of the Drawings
The above and other features, properties, and advantages of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings and embodiments, in which like reference numerals denote like features throughout:
FIG. 1 shows a flowchart of the method according to an embodiment of the present invention;
FIG. 2 shows a calibration board for camera calibration according to an embodiment of the present invention;
FIG. 3a shows the pinhole imaging model of a pinhole camera according to an embodiment of the present invention;
FIG. 3b shows the similar-triangles principle of a pinhole camera according to an embodiment of the present invention;
FIG. 4 shows a system flowchart of Mask R-CNN according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention, not to limit it.
A semantic map is a map containing rich semantic information: it abstracts the spatial geometric relationships of the environment and semantic information such as the types and locations of the objects present. A semantic map contains both the spatial information and the semantic information of the environment, so that a mobile robot, like a human, knows both that there are objects in the environment and what those objects are.
Addressing the problems and shortcomings of the prior art, the present invention proposes a semantic mapping system based on visual SLAM and two-dimensional semantic segmentation, performing semantic segmentation over ORB (Oriented FAST and Rotated BRIEF) feature points and combining it with a sparse SLAM model, so that semantic mapping is completed while localization is carried out.
FIG. 1 shows a flowchart of the semantic mapping method based on visual SLAM and two-dimensional semantic segmentation according to an embodiment of the present invention. In the embodiment shown in FIG. 1, the specific steps of the method are as follows:
S1. Calibrate the camera parameters and correct camera distortion;
S2. Acquire the image frame sequence, which includes RGB images and depth images;
S3. Image preprocessing: using the pinhole camera model, obtain the coordinates of the real three-dimensional point corresponding to each pixel of the RGB image;
S4. Judge whether the current image frame is a keyframe; if so, go to step S6, otherwise go to step S5;
S5. Motion-blur compensation: compute the image-block centroids of the current frame as semantic feature points, supplementing the ORB feature points;
S6. Semantic segmentation: extract ORB feature points from the image frame and perform semantic segmentation with the Mask R-CNN model, obtaining the semantic information of every pixel of the frame;
S7. Pose calculation: compute the camera pose with the sparse SLAM model;
S8. Feed the semantic information into the sparse SLAM model to assist dense semantic map construction; completing the traversal of the keyframes realizes the three-dimensional semantic mapping of the global point-cloud map.
Each step is described in detail below.
Step S1: calibrate the camera parameters and correct camera distortion.
In image measurement and machine-vision applications, determining the relationship between the three-dimensional geometric position of a point on the surface of an object in space and its corresponding point in the image requires a geometric model of camera imaging; the parameters of this geometric model are the camera parameters.
The distortion coefficients are one kind of camera parameter, corresponding to the phenomenon of camera distortion. Under most conditions these parameters can only be obtained through experiment and computation, and this parameter-solving process is called camera calibration.
Camera distortion includes radial distortion and tangential distortion.
Radial distortion is caused by the shape of the lens.
More specifically, in the pinhole model a straight line projects onto the pixel plane as a straight line.
In actual photographs, however, the camera lens often turns a straight line in the real environment into a curve in the picture; this distortion is called radial distortion.
Tangential distortion arises during camera assembly because the lens cannot be made strictly parallel to the imaging plane.
Because of the way light is projected, the actual object is inconsistent with its image projected onto the 2D plane. This inconsistency is stable, so distortion correction of subsequent images can be achieved by calibrating the camera and computing the distortion parameters.
For radial distortion, correction uses quadratic and higher-order polynomial functions of the distance from the center:

$$x_{corrected} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad y_{corrected} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

where [x, y]^T are the coordinates of the uncorrected point, [x_corrected, y_corrected]^T are the corrected point coordinates, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system.
For tangential distortion, two further parameters p_1, p_2 are used for correction:

$$x_{corrected} = x + 2p_1 xy + p_2(r^2 + 2x^2), \qquad y_{corrected} = y + p_1(r^2 + 2y^2) + 2p_2 xy$$

where [x, y]^T are the coordinates of the uncorrected point, [x_corrected, y_corrected]^T are the corrected point coordinates, p_1, p_2 are the tangential distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system.
Before the camera is used, its radial and tangential distortion coefficients are calibrated so that three-dimensional information can be recovered from two-dimensional images, enabling distortion correction, object measurement, three-dimensional reconstruction, and so on.
FIG. 2 shows a calibration board for camera calibration according to an embodiment of the present invention. The board shown in FIG. 2 is placed within the field of view of the camera; for each photograph taken, the board is moved to a new position and orientation. The feature points in the images are detected, the intrinsic and extrinsic parameters of the camera are solved, and the distortion coefficients are then obtained.
Preferably, the Camera Calibrator toolbox in MATLAB is used to solve for the camera parameters.
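The same calibration can also be scripted with OpenCV; the following is a minimal sketch, in which the chessboard size and the image file pattern are assumptions rather than values from the patent:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)  # assumed inner-corner count of the chessboard
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points, img_size = [], [], None
for path in glob.glob("calib/*.png"):      # hypothetical image location
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        # refine corner locations to sub-pixel accuracy
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)
        img_size = gray.shape[::-1]

# K is the 3x3 intrinsic matrix; dist holds the distortion coefficients
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img_size, None, None)
```

Note that OpenCV orders the distortion vector as (k1, k2, p1, p2, k3).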
For a point P(X, Y, Z) in the camera coordinate system, step S1 of the present invention performs camera distortion correction with the five distortion coefficients, finding the correct position of this point on the pixel plane.
The camera distortion is corrected as follows:
S11. Project the three-dimensional point onto the normalized image plane; let its normalized coordinates be [x, y]^T.
S12. Correct the point on the normalized plane for radial and tangential distortion:

$$x_{corrected} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2p_1 xy + p_2(r^2 + 2x^2)$$
$$y_{corrected} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1(r^2 + 2y^2) + 2p_2 xy$$

where [x_corrected, y_corrected]^T are the corrected point coordinates, p_1, p_2 are the tangential distortion coefficients of the camera, k_1, k_2, k_3 are the radial distortion coefficients of the camera, and r is the distance of the point P from the origin of the coordinate system.
S13. Project the corrected point [x_corrected, y_corrected]^T through the intrinsic parameter matrix onto the pixel plane, obtaining the point's correct position coordinates [u, v]^T in the image:

$$u = f_x\, x_{corrected} + c_x, \qquad v = f_y\, y_{corrected} + c_y$$

where f_x, f_y, c_x, c_y are the intrinsic parameters of the camera.
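A minimal sketch of steps S11 through S13 applied to one normalized point follows; OpenCV's cv2.projectPoints applies the same distortion model internally:

```python
def distort_and_project(x, y, k1, k2, k3, p1, p2, fx, fy, cx, cy):
    """Steps S11-S13: apply the radial + tangential distortion model to
    a normalized point [x, y]^T, then project it with the intrinsics."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_c = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_c = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return fx * x_c + cx, fy * y_c + cy   # pixel coordinates [u, v]^T
```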
Step S2: acquire the image frame sequence.
A Kinect camera is used to acquire the RGB-D image frame sequence, which includes RGB images and depth images.
Step S3: image preprocessing.
In one embodiment, an RGB-D camera serves as the main sensor, acquiring RGB and depth images simultaneously; the pinhole camera model maps the pixels of the RGB image into real three-dimensional space.
FIG. 3a shows the pinhole imaging model of a pinhole camera according to an embodiment of the present invention, and FIG. 3b shows its similar-triangles principle. As shown in FIGS. 3a and 3b, a camera coordinate system O-x-y-z is established, with the optical center of the camera as the origin O of the coordinate system; the arrow directions are taken as positive.
Through the similar-triangle mapping shown in FIG. 3b, a coordinate system O'-x'-y'-z' is established on the imaging plane of the camera, again with the arrow directions taken as positive.
Suppose the coordinates of point P are [X, Y, Z]^T and the focal length of the camera lens is f, the focal length being the distance from the optical center of the camera to the physical imaging plane.
Point P projects through the optical center to the point P' on the imaging plane; the coordinates of the pixel P' are [u, v]^T.
From the corresponding relationships, the mapping amounts to a scaling and a translation, and derivation yields:

$$Z\begin{bmatrix}u\\v\\1\end{bmatrix} = \begin{bmatrix}f_x & 0 & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix}X\\Y\\Z\end{bmatrix} = KP$$

where K is called the camera intrinsic parameter matrix, an inherent parameter already calibrated in step S1; f_x, f_y, c_x, c_y are the intrinsic parameters of the camera; P is the real three-dimensional point; and [u, v]^T are the pixel coordinates.
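Inverting this model gives the pixel-to-3D mapping used in step S3; a minimal sketch, assuming the depth image is registered to the RGB frame and supplies the metric Z of each pixel:

```python
import numpy as np

def pixel_to_point(u, v, depth, K):
    """Invert Z [u, v, 1]^T = K P for one pixel: given the metric depth
    Z from the depth image and the intrinsic matrix K, recover P."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx,
                     (v - cy) * depth / fy,
                     depth])
```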
Step S4: judge whether the current frame is a keyframe; if so, go to step S6, otherwise go to step S5.
Using every frame for the visual SLAM and semantic segmentation computations would be far too expensive, so the high-quality frames among them are selected as keyframes.
In the present invention, a sparse SLAM model based on ORB (Oriented FAST and Rotated BRIEF) feature points is used to screen the keyframes.
Each keyframe contains one RGB image and one depth image.
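The patent delegates the keyframe test to the sparse SLAM model; as an illustration only, ORB-SLAM2-style systems typically gate keyframe insertion on how well the current frame tracks relative to its reference keyframe. A sketch of such a heuristic, with assumed thresholds:

```python
def is_keyframe(n_tracked, n_ref_tracked, frames_since_kf,
                min_ratio=0.9, max_gap=30, min_tracked=50):
    """Illustrative keyframe test: insert a keyframe when tracking has
    weakened relative to the reference keyframe, or when too many
    frames have passed since the last one. Thresholds are assumptions."""
    if n_tracked < min_tracked:
        return False          # tracking too weak to anchor a new keyframe
    weak = n_tracked < min_ratio * n_ref_tracked
    stale = frames_since_kf >= max_gap
    return weak or stale
```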
Step S5: motion-blur compensation.
Since dynamic objects may appear in any frame, certain target classes are designated as dynamic each time the semantic mapping task is executed. If such a dynamic target is recognized in a frame of the image sequence, the present invention removes the corresponding point cloud during the conversion from two-dimensional pixels to three-dimensional coordinates, preventing dynamic objects from leaving ghosting artifacts in the map and degrading the mapping quality.
In step S5 of the present invention, if the frame is not a keyframe, ORB feature extraction is insufficient because of motion blur, and the following operations are performed as a supplement before the semantic segmentation of step S6:
Every object in the frame is labeled as a specific class; each segmented object has a corresponding labeled region, and the segmented image is called an image block. The moment m_pq of an image block B is computed:

$$m_{pq} = \sum_{x,y\in B} x^p y^q I(x, y), \quad p, q \in \{0, 1\}$$

The centroid position C is:

$$C = \left(\frac{m_{10}}{m_{00}},\ \frac{m_{01}}{m_{00}}\right)$$

This centroid serves as a semantic feature point, compensating for the shortage of ORB feature points.
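A minimal sketch of this centroid computation using OpenCV's image moments, assuming the input is one binary instance mask from the segmentation:

```python
import cv2
import numpy as np

def patch_centroid(mask):
    """Centroid C = (m10/m00, m01/m00) of one binary instance mask (an
    'image block'), used as a semantic feature point when motion blur
    starves the ORB extractor."""
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    if m["m00"] == 0:
        return None           # empty mask: no centroid to report
    return m["m10"] / m["m00"], m["m01"] / m["m00"]
```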
Supplementing semantic feature points for blurred images whose ORB feature points are badly depleted suppresses the tracking algorithm's use of matches belonging to dynamic objects; keyframes are then screened comprehensively and the camera pose is estimated, preventing the mapping algorithm from including moving objects as part of the 3D map.
Step S6: semantic segmentation.
ORB feature points are extracted from each image frame, and the Mask R-CNN (Mask Region-CNN, mask region convolutional neural network) model performs semantic segmentation, obtaining the semantic information of every pixel of the frame.
Based on the semantic segmentation result, if a dynamic target is recognized, the ORB feature points extracted from the regions whose class is designated as dynamic are removed.
This prevents the visual SLAM algorithm from including moving objects as part of the 3D map during mapping.
In step S6 of the present invention, the Mask R-CNN model is trained on the COCO dataset.
COCO (Common Objects in COntext) is a dataset provided by the Microsoft team for image recognition, from which classification information for 80 categories can be obtained.
FIG. 4 shows a system flowchart of Mask R-CNN according to an embodiment of the present invention. As shown in FIG. 4, RGB-image semantic segmentation of an image frame is realized with the Mask R-CNN model; the steps of its convolutional-neural-network framework are as follows:
extract features at different levels of the input image with an FPN (Feature Pyramid Network);
propose regions of interest with an RPN (Region Proposal Network);
align the proposal regions with RoI Align (Region of Interest Align);
perform mask segmentation with an FCN (Fully Convolutional Network);
determine the region coordinates and the class labels with FC (fully connected) layers.
Processed by the Mask R-CNN model, the frame yields pixel-level semantic classification results, i.e. a semantic class label for every pixel, together with the bounding-box coordinates and the confidence score of each classification.
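A minimal sketch of the dynamic-point culling this enables, assuming the Mask R-CNN outputs are available as HxW boolean instance masks with COCO class names; the dynamic-class list is an assumed, task-specific example:

```python
import cv2
import numpy as np

DYNAMIC_CLASSES = {"person", "car", "bicycle"}   # assumed per-task choice

def filter_dynamic_keypoints(gray, masks, labels):
    """Extract ORB keypoints and drop those falling inside any instance
    mask whose class is designated dynamic; masks are HxW booleans and
    labels their COCO class names, as produced by Mask R-CNN."""
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints = orb.detect(gray, None)
    dynamic = np.zeros(gray.shape[:2], dtype=bool)
    for mask, label in zip(masks, labels):
        if label in DYNAMIC_CLASSES:
            dynamic |= mask
    kept = [kp for kp in keypoints
            if not dynamic[int(round(kp.pt[1])), int(round(kp.pt[0]))]]
    return orb.compute(gray, kept)   # (keypoints, descriptors) of static points
```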
The present invention uses ORB feature points for the tracking, mapping, and place-recognition tasks. ORB feature points have the advantages of rotation invariance and scale invariance, can be extracted and matched quickly enough to meet the demands of real-time operation, and show good precision in bag-of-words place recognition.
Step S7: pose calculation.
Visual-odometry pose estimation is defined over two adjacent image frames; it is easy to see that accumulating many such inter-frame pose estimates yields the motion trajectory of the camera.
The camera pose is computed with the sparse SLAM model based on ORB (Oriented FAST and Rotated BRIEF) feature points.
After the feature points of an image frame are extracted, PnP is used on the keyframes to estimate the camera pose.
PnP is short for Perspective-n-Point, a method for solving the motion between 3D and 2D point pairs: given n points in 3D space and their projected positions, solve for the pose of the camera.
Suppose that at time k the position of the camera is x_k, the camera input data is u_k, and w_k is noise; the motion equation is constructed:
x_k = f(x_{k-1}, u_k, w_k).
A landmark y_j is observed at position x_k, producing a series of observations z_{k,j}, with v_{k,j} the observation noise; the observation equation is constructed:
z_{k,j} = h(y_j, x_k, v_{k,j}).
In step S7 of the present invention, solving the PnP problem gives a preliminary camera pose, and back-end pose-graph optimization is then used to compute a more accurate camera pose.
In step S7 of the present invention, the PnP problem of camera pose estimation is formulated as a nonlinear least-squares problem whose domain is a Lie algebra.
Further, the camera pose estimation of step S7 is formulated as a BA (Bundle Adjustment) problem, constructing the minimized reprojection error of camera pose estimation:

$$\xi^{*} = \arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\| u_i - \frac{1}{s_i} K \exp(\xi^{\wedge}) P_i \right\|_2^2$$

where u_i are the pixel coordinates, P_i are the camera coordinates, ξ^ is the Lie algebra element corresponding to the camera pose, s_i is the depth of the feature point, K is the camera intrinsic parameter matrix, and n is the number of points.
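A minimal sketch of the initial PnP step with OpenCV's RANSAC solver; the reprojection threshold is an assumed value, and the back-end refinement the patent describes (pose-graph / bundle adjustment, e.g. with g2o) is not shown:

```python
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, K):
    """Seed pose from 3D-2D correspondences via RANSAC PnP; the result
    would then be refined by the back-end pose-graph / BA optimization."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float32),
        np.asarray(points_2d, dtype=np.float32),
        K.astype(np.float64), distCoeffs=None,
        reprojectionError=3.0)        # assumed inlier threshold, in pixels
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # axis-angle -> rotation matrix
    return R, tvec
```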
Further, step S7 of the present invention includes relocalization with an embedded place-recognition model based on DBoW (a bag-of-words model), to guard against tracking failure, to reinitialize in a known map scene, to detect loop closures, and so on.
The present invention screens keyframes and computes the camera pose with a sparse SLAM model obtained by improving on ORB-SLAM2 (Oriented FAST and Rotated BRIEF Simultaneous Localization and Mapping 2).
The SLAM model consists of four parallel threads: a tracking thread, a local mapping thread, a loop-closure detection thread, and a global BA optimization thread.
Further, the global BA optimization thread executes only after confirmation by the loop-closure detection thread.
The first three are concurrent threads, defined as follows:
1) Tracking thread.
It localizes the camera for every frame by matching features against the local map and minimizing the reprojection error with motion-only BA.
Preferably, a constant-velocity model is used for matching.
2) Local mapping thread.
It manages and optimizes the local map by running local BA, maintains the covisibility relationships between keyframes through MapPoints, and optimizes the poses of the covisible keyframes and the MapPoints with local BA.
3) Loop-closure detection thread.
It detects large loops and corrects the drift error by performing pose-graph optimization; loop-closing candidate frames are screened faster with BoW, the scale is optimized with Sim3, and the Essential Graph and MapPoints are optimized with global BA. The Sim3 transformation is a similarity transformation.
The loop-closure detection thread triggers the global BA optimization thread.
The global BA thread, after the pose-graph optimization, computes the optimal structure and motion result for the whole system.
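The four-thread layout can be pictured with the following illustrative Python skeleton; it is a structural sketch only (ORB-SLAM2 itself is C++), with placeholder bodies and assumed queue plumbing:

```python
import queue
import threading

frames, keyframes, loop_events = queue.Queue(), queue.Queue(), queue.Queue()

def tracking():
    while True:                 # locate every frame against the local map
        frame = frames.get()
        # ... match local-map features, motion-only BA, keyframe decision ...
        keyframes.put(frame)

def local_mapping():
    while True:                 # maintain covisibility graph, run local BA
        kf = keyframes.get()
        # ... insert keyframe, cull/triangulate MapPoints, local BA ...

def loop_closing():
    while True:                 # detect loops, then trigger global BA
        candidate = loop_events.get()
        # ... BoW screening, Sim3 check, pose-graph optimization, global BA ...

for worker in (tracking, local_mapping, loop_closing):
    threading.Thread(target=worker, daemon=True).start()
```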
Compared with prior-art dense SLAM models, the sparse SLAM model of the present invention fuses semantic information, so that the final mapping process is enriched with the semantic segmentation information of the images.
Step S8: three-dimensional semantic mapping.
Using the semantic segmentation results of step S6, together with the inter-frame pose information obtained in step S7 and the real three-dimensional coordinates of the frame's pixels, the semantic information is fed into the sparse SLAM model, and objects of the same semantic class in a frame are projected into the three-dimensional point-cloud map in the same label color, assisting dense semantic map construction; completing the traversal of the keyframes realizes the three-dimensional semantic mapping of the global point-cloud map.
Step S8 of the present invention further includes:
S81. Project the three-dimensional pixels generated from the first keyframe into an initial point cloud;
S82. From the three-dimensional coordinates of every pixel of the current keyframe, computed with the pinhole model, generate a point-cloud map;
S83. Compute the pose change between the current keyframe and the previous keyframe;
S84. Superimpose and fuse the three-dimensional points of the two point-cloud maps through the pose transformation matrix, generating a point-cloud map carrying more information;
S85. Iterate the above steps; when the traversal of all keyframes is complete, the construction of the global point-cloud map is realized (a sketch of this fusion loop follows below).
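A minimal sketch of the fusion of steps S83 through S85, assuming each keyframe's points arrive as an Nx3 NumPy array in the camera frame and T_wc is the 4x4 camera-to-world pose obtained in step S83 (both names are assumptions):

```python
import numpy as np

def fuse_keyframe(global_cloud, kf_points, T_wc):
    """Steps S84-S85: transform one keyframe's camera-frame points into
    the world frame with the 4x4 pose T_wc and append them to the
    accumulating global cloud."""
    homog = np.hstack([kf_points, np.ones((kf_points.shape[0], 1))])
    world = (T_wc @ homog.T).T[:, :3]
    if global_cloud is None or len(global_cloud) == 0:
        return world
    return np.vstack([global_cloud, world])
```

In practice a voxel-grid downsampling filter (such as the VoxelGrid filter of the Point Cloud Library used in the experiments) would keep the accumulated cloud tractable.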
The experimental results of the UAV semantic mapping method based on visual SLAM and two-dimensional semantic segmentation of the present invention are described in further detail below in conjunction with a concrete experiment.
The experiment ran on the Ubuntu 16.04 operating system with an Nvidia GeForce GTX 1050 graphics card, using software tools such as TensorFlow, OpenCV, g2o, and the Point Cloud Library, taking real scenes as the experimental conditions and using data captured with a Kinect V1 camera.
For the evaluation of the three-dimensional semantic mapping, Q_1 denotes the number of correctly detected objects; Q_2 denotes the number of objects that were detected but misclassified plus the objects actually present but not detected; Q_3 denotes the number of detections where no object was present; and P denotes the correct detection rate of three-dimensional objects, computed as:
P = Q_1 / (Q_1 + Q_2 + Q_3)
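As a worked example with hypothetical counts: a run that correctly detects Q_1 = 10 objects, with Q_2 = 8 misclassified or missed objects and Q_3 = 3 spurious detections, scores P = 10/21 ≈ 47.6%, close to the average reported below.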
Experimental records were taken over nine dense semantic mapping runs, and the average correct detection rate of three-dimensional objects in the map was computed as 48.1086%. The detailed experimental results are shown in the following table:
Table 1
The semantic mapping method based on visual SLAM and two-dimensional semantic segmentation provided by the present invention builds a dense semantic map with dynamic objects removed, based on a Mask R-CNN model over ORB feature points and a sparse SLAM model; it uses inter-frame information together with the semantic information on the image frames to improve the performance of a UAV semantic mapping system and the robustness of feature-point extraction and matching in dynamic scenes.
Although the above methods are, for simplicity of explanation, illustrated and described as a series of acts, it should be understood and appreciated that these methods are not limited by the order of the acts, since in accordance with one or more embodiments some acts may occur in a different order and/or concurrently with other acts illustrated and described herein, or not illustrated and described herein but understandable to those skilled in the art.
As used in this application and the claims, unless the context clearly indicates an exception, the words "a", "an", "one", and/or "the" do not specifically denote the singular and may include the plural. In general, the terms "comprising" and "including" only indicate that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also include other steps or elements.
The above embodiments are provided to enable those familiar with the art to implement or use the present invention, and those skilled in the art may make various modifications or changes to the above embodiments without departing from the inventive concept of the present invention; the protection scope of the present invention is therefore not limited by the above embodiments but should be the broadest scope consistent with the innovative features mentioned in the claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010246158.2A CN111462135B (en) | 2020-03-31 | 2020-03-31 | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111462135A true CN111462135A (en) | 2020-07-28 |
CN111462135B CN111462135B (en) | 2023-04-21 |
Family
ID=71680957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010246158.2A Active CN111462135B (en) | 2020-03-31 | 2020-03-31 | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111462135B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019169540A1 (en) * | 2018-03-06 | 2019-09-12 | 斯坦德机器人(深圳)有限公司 | Method for tightly-coupling visual slam, terminal and computer readable storage medium |
CN110097553A (en) * | 2019-04-10 | 2019-08-06 | Semantic mapping system based on simultaneous localization and mapping and three-dimensional semantic segmentation |
Non-Patent Citations (2)
Title |
---|
BIAN Xianzhang et al., "Augmented Reality Image Registration Technology Based on Semantic Segmentation", Electronic Technology & Software Engineering *
WANG Tingyin et al., "Emergency Communication Method for Nuclear Radiation Monitoring Based on BeiDou RDSS", Computer Systems & Applications *
Cited By (87)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109559320B (en) * | 2018-09-18 | 2022-11-18 | 华东理工大学 | Method and system for implementing visual SLAM semantic mapping function based on dilated convolutional deep neural network |
CN109559320A (en) * | 2018-09-18 | 2019-04-02 | Method and system for implementing visual SLAM semantic mapping function based on dilated convolutional deep neural network |
WO2022021739A1 (en) * | 2020-07-30 | 2022-02-03 | 国网智能科技股份有限公司 | Humanoid inspection operation method and system for semantic intelligent substation robot |
CN111950561A (en) * | 2020-08-25 | 2020-11-17 | 桂林电子科技大学 | A Semantic Segmentation-Based Method for Eliminating Semantic SLAM Dynamic Points |
CN112132893A (en) * | 2020-08-31 | 2020-12-25 | 同济人工智能研究院(苏州)有限公司 | Visual SLAM method suitable for indoor dynamic environment |
CN112132893B (en) * | 2020-08-31 | 2024-01-09 | 同济人工智能研究院(苏州)有限公司 | Visual SLAM method suitable for indoor dynamic environment |
CN112017188A (en) * | 2020-09-09 | 2020-12-01 | 上海航天控制技术研究所 | Space non-cooperative target semantic identification and reconstruction method |
CN112017188B (en) * | 2020-09-09 | 2024-04-09 | 上海航天控制技术研究所 | Space non-cooperative target semantic recognition and reconstruction method |
CN112258575A (en) * | 2020-10-13 | 2021-01-22 | Method for rapid object identification during simultaneous localization and mapping |
CN112344922A (en) * | 2020-10-26 | 2021-02-09 | 中国科学院自动化研究所 | Monocular vision odometer positioning method and system |
CN112183476A (en) * | 2020-10-28 | 2021-01-05 | 深圳市商汤科技有限公司 | Obstacle detection method and device, electronic equipment and storage medium |
CN112446882A (en) * | 2020-10-28 | 2021-03-05 | 北京工业大学 | Robust visual SLAM method based on deep learning in dynamic scene |
CN112446882B (en) * | 2020-10-28 | 2025-01-10 | 北京工业大学 | A robust visual SLAM method based on deep learning in dynamic scenes |
CN112348868A (en) * | 2020-11-06 | 2021-02-09 | 养哇(南京)科技有限公司 | Method and system for recovering monocular SLAM scale through detection and calibration |
CN112308921B (en) * | 2020-11-09 | 2024-01-12 | 重庆大学 | Combined optimization dynamic SLAM method based on semantics and geometry |
CN112308921A (en) * | 2020-11-09 | 2021-02-02 | 重庆大学 | A dynamic SLAM method for joint optimization based on semantics and geometry |
CN112396595A (en) * | 2020-11-27 | 2021-02-23 | 广东电网有限责任公司肇庆供电局 | Semantic SLAM method based on point-line characteristics in dynamic environment |
CN112465021B (en) * | 2020-11-27 | 2022-08-05 | 南京邮电大学 | Pose track estimation method based on image frame interpolation method |
CN112396595B (en) * | 2020-11-27 | 2023-01-24 | 广东电网有限责任公司肇庆供电局 | Semantic SLAM method based on point-line characteristics in dynamic environment |
CN112381841A (en) * | 2020-11-27 | 2021-02-19 | 广东电网有限责任公司肇庆供电局 | Semantic SLAM method based on GMS feature matching in dynamic scene |
CN112465021A (en) * | 2020-11-27 | 2021-03-09 | 南京邮电大学 | Pose track estimation method based on image frame interpolation method |
CN112571415A (en) * | 2020-12-03 | 2021-03-30 | 哈尔滨工业大学(深圳) | Robot autonomous door opening method and system based on visual guidance |
CN112571415B (en) * | 2020-12-03 | 2022-03-01 | 哈尔滨工业大学(深圳) | A method and system for autonomous robot door opening based on vision guidance |
CN112465858A (en) * | 2020-12-10 | 2021-03-09 | 武汉工程大学 | Semantic vision SLAM method based on probability grid filtering |
CN114627300A (en) * | 2020-12-11 | 2022-06-14 | Lifelong semantic SLAM system and method based on generative adversarial networks |
CN112509051A (en) * | 2020-12-21 | 2021-03-16 | 华南理工大学 | Bionic-based autonomous mobile platform environment sensing and mapping method |
CN112507056A (en) * | 2020-12-21 | 2021-03-16 | 华南理工大学 | Map construction method based on visual semantic information |
CN112734845A (en) * | 2021-01-08 | 2021-04-30 | 浙江大学 | Outdoor monocular synchronous mapping and positioning method fusing scene semantics |
CN112990195A (en) * | 2021-03-04 | 2021-06-18 | 佛山科学技术学院 | SLAM loop detection method for integrating semantic information in complex environment |
CN112991436B (en) * | 2021-03-25 | 2022-09-06 | 中国科学技术大学 | Monocular vision SLAM method based on object size prior information |
CN112991436A (en) * | 2021-03-25 | 2021-06-18 | 中国科学技术大学 | Monocular vision SLAM method based on object size prior information |
CN113034584A (en) * | 2021-04-16 | 2021-06-25 | 广东工业大学 | Mobile robot visual positioning method based on object semantic road sign |
CN113192200A (en) * | 2021-04-26 | 2021-07-30 | 泰瑞数创科技(北京)有限公司 | Method for constructing urban real scene three-dimensional model based on space-three parallel computing algorithm |
CN113516692A (en) * | 2021-05-18 | 2021-10-19 | 上海汽车集团股份有限公司 | Multi-sensor fusion SLAM method and device |
CN113537208B (en) * | 2021-05-18 | 2024-06-11 | 杭州电子科技大学 | Visual positioning method and system based on semantic ORB-SLAM technology |
CN113537208A (en) * | 2021-05-18 | 2021-10-22 | 杭州电子科技大学 | Visual positioning method and system based on semantic ORB-SLAM technology |
CN113269831A (en) * | 2021-05-19 | 2021-08-17 | 北京能创科技有限公司 | Visual repositioning method, system and device based on scene coordinate regression network |
CN115409910A (en) * | 2021-05-28 | 2022-11-29 | 阿里巴巴新加坡控股有限公司 | A semantic map construction method, visual positioning method and related equipment |
CN113674340A (en) * | 2021-07-05 | 2021-11-19 | 北京物资学院 | A method and device for binocular vision navigation based on landmarks |
CN113610763A (en) * | 2021-07-09 | 2021-11-05 | 北京航天计量测试技术研究所 | Rocket engine structural member pose motion compensation method in vibration environment |
CN113628334B (en) * | 2021-07-16 | 2024-11-15 | 中国科学院深圳先进技术研究院 | Visual SLAM method, device, terminal equipment and storage medium |
CN113628334A (en) * | 2021-07-16 | 2021-11-09 | 中国科学院深圳先进技术研究院 | Visual SLAM method, device, terminal equipment and storage medium |
CN113808251A (en) * | 2021-08-09 | 2021-12-17 | 杭州易现先进科技有限公司 | Dense reconstruction method, system, device and medium based on semantic segmentation |
CN113808251B (en) * | 2021-08-09 | 2024-04-12 | 杭州易现先进科技有限公司 | Dense reconstruction method, system, device and medium based on semantic segmentation |
WO2023015566A1 (en) * | 2021-08-13 | 2023-02-16 | 深圳市大疆创新科技有限公司 | Control method, control device, movable platform, and storage medium |
CN113658257A (en) * | 2021-08-17 | 2021-11-16 | 广州文远知行科技有限公司 | Unmanned equipment positioning method, device, equipment and storage medium |
CN113674416A (en) * | 2021-08-26 | 2021-11-19 | 中国电子科技集团公司信息科学研究院 | Three-dimensional map construction method and device, electronic equipment and storage medium |
CN113674416B (en) * | 2021-08-26 | 2024-04-26 | 中国电子科技集团公司信息科学研究院 | Three-dimensional map construction method and device, electronic equipment and storage medium |
CN113903011B (en) * | 2021-10-26 | 2024-06-11 | 江苏大学 | Semantic map construction and positioning method suitable for indoor parking lot |
CN113903011A (en) * | 2021-10-26 | 2022-01-07 | 江苏大学 | Semantic map construction and positioning method suitable for indoor parking lot |
CN114202579B (en) * | 2021-11-01 | 2024-07-16 | 东北大学 | Dynamic scene-oriented real-time multi-body SLAM system |
CN114202579A (en) * | 2021-11-01 | 2022-03-18 | 东北大学 | A real-time multi-body SLAM system for dynamic scenes |
CN114132360A (en) * | 2021-11-08 | 2022-03-04 | Method, device and storage medium for preventing turnout trailing based on image discrimination of switch state |
CN114132360B (en) * | 2021-11-08 | 2023-09-08 | Method, device and storage medium for preventing turnout trailing based on image discrimination of switch state |
CN114359493A (en) * | 2021-12-20 | 2022-04-15 | 中国船舶重工集团公司第七0九研究所 | Method and system for generating three-dimensional semantic map for unmanned ship |
CN114529800A (en) * | 2022-01-12 | 2022-05-24 | 华南理工大学 | Obstacle avoidance method, system, device and medium for rotor unmanned aerial vehicle |
CN114708321A (en) * | 2022-01-12 | 2022-07-05 | 北京航空航天大学 | Semantic-based camera pose estimation method and system |
CN114581616A (en) * | 2022-01-28 | 2022-06-03 | Visual-inertial SLAM system based on a multi-task feature extraction network |
CN114612525A (en) * | 2022-02-09 | 2022-06-10 | 浙江工业大学 | Robot RGB-D SLAM method based on grid segmentation and double-map coupling |
CN114612525B (en) * | 2022-02-09 | 2025-06-20 | 浙江工业大学 | Robot RGB-D SLAM method based on grid segmentation and dual map coupling |
CN114488244A (en) * | 2022-02-16 | 2022-05-13 | 东南大学 | Wearable blind-person aided navigation device and method based on semantic VISLAM and GNSS positioning |
CN116977189A (en) * | 2022-04-15 | 2023-10-31 | Simultaneous localization and mapping method, device and storage medium |
CN114550186A (en) * | 2022-04-21 | 2022-05-27 | 北京世纪好未来教育科技有限公司 | Method and device for correcting document image, electronic equipment and storage medium |
CN115128628A (en) * | 2022-06-01 | 2022-09-30 | 北京理工大学 | Construction method of road grid map based on laser SLAM and monocular vision |
CN114972470A (en) * | 2022-07-22 | 2022-08-30 | 北京中科慧眼科技有限公司 | Road surface environment obtaining method and system based on binocular vision |
CN115451939A (en) * | 2022-08-19 | 2022-12-09 | 中国人民解放军国防科技大学 | A Parallel SLAM Method Based on Detection Segmentation in Dynamic Scenes |
CN115451939B (en) * | 2022-08-19 | 2024-05-07 | 中国人民解放军国防科技大学 | Parallel SLAM method under dynamic scene based on detection segmentation |
CN115164918A (en) * | 2022-09-06 | 2022-10-11 | 联友智连科技有限公司 | Semantic point cloud map construction method and device and electronic equipment |
CN115564731A (en) * | 2022-09-30 | 2023-01-03 | 华东理工大学 | A method and system for manipulating deformable objects based on visual feedback |
CN115731385A (en) * | 2022-11-22 | 2023-03-03 | 中国电子科技南湖研究院 | Image Feature Extraction Method, Device and SLAM System Based on Semantic Segmentation |
CN116681755A (en) * | 2022-12-29 | 2023-09-01 | 广东美的白色家电技术创新中心有限公司 | Pose prediction method and device |
CN116681755B (en) * | 2022-12-29 | 2024-02-09 | 广东美的白色家电技术创新中心有限公司 | Pose prediction method and device |
CN116342800A (en) * | 2023-02-21 | 2023-06-27 | 中国航天员科研训练中心 | Semantic three-dimensional reconstruction method and system for multi-mode pose optimization |
CN116342800B (en) * | 2023-02-21 | 2023-10-24 | 中国航天员科研训练中心 | Semantic three-dimensional reconstruction method and system for multi-mode pose optimization |
CN116339336A (en) * | 2023-03-29 | 2023-06-27 | 北京信息科技大学 | Method, device and system for collaborative operation of electric agricultural machinery cluster |
CN116817887B (en) * | 2023-06-28 | 2024-03-08 | 哈尔滨师范大学 | Semantic visual SLAM map construction method, electronic equipment and storage medium |
CN116817887A (en) * | 2023-06-28 | 2023-09-29 | 哈尔滨师范大学 | Semantic visual SLAM map construction method, electronic equipment and storage medium |
CN117392347B (en) * | 2023-10-13 | 2024-04-30 | 苏州煋海图科技有限公司 | Map construction method, device, computer equipment and readable storage medium |
CN117392347A (en) * | 2023-10-13 | 2024-01-12 | 苏州煋海图科技有限公司 | Map construction method, device, computer equipment and readable storage medium |
CN117611762B (en) * | 2024-01-23 | 2024-04-30 | 常熟理工学院 | Multi-level map construction method, system and electronic equipment |
CN117611762A (en) * | 2024-01-23 | 2024-02-27 | 常熟理工学院 | A multi-level map construction method, system and electronic device |
CN118447320A (en) * | 2024-05-13 | 2024-08-06 | 华智清创(苏州)农业科技有限公司 | Visual multitasking mounted agricultural inspection method and device based on deep learning |
CN118447320B (en) * | 2024-05-13 | 2024-09-27 | 华智清创(苏州)农业科技有限公司 | Visual multitasking mounted agricultural inspection method and device based on deep learning |
CN118710609A (en) * | 2024-06-20 | 2024-09-27 | 广东省科学院智能制造研究所 | Capsule endoscope gastrointestinal tract positioning method based on visual SLAM |
CN118887288A (en) * | 2024-07-16 | 2024-11-01 | 中山大学 | A multi-feature indoor visual positioning method and system based on ground segmentation network |
CN119066885A (en) * | 2024-11-04 | 2024-12-03 | 中国科学院长春光学精密机械与物理研究所 | A method for processing measured surface data suitable for optical modeling |
CN119323772A (en) * | 2024-12-19 | 2025-01-17 | 杭州智元研究院有限公司 | Semantic map construction method based on RGBD visual segmentation algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN111462135B (en) | 2023-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111462135B (en) | Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation | |
CN109166149B (en) | Positioning and three-dimensional wireframe structure reconstruction method and system integrating a binocular camera and an IMU | |
CN110070615B (en) | Multi-camera cooperation-based panoramic vision SLAM method | |
CN112902953B (en) | An autonomous pose measurement method based on SLAM technology | |
CN107292965B (en) | Virtual-real occlusion handling method based on depth image data streams | |
CN113327296B (en) | Laser radar and camera online combined calibration method based on depth weighting | |
CN103247075B (en) | Indoor environment three-dimensional reconstruction method based on a variational mechanism | |
Ma et al. | CRLF: Automatic calibration and refinement based on line feature for LiDAR and camera in road scenes | |
CN107610175A (en) | Monocular visual SLAM algorithm optimized based on the semi-direct method and a sliding window | |
CN110097553A (en) | Semantic mapping system based on simultaneous localization and mapping and three-dimensional semantic segmentation | |
CN111724439A (en) | A visual positioning method and device in a dynamic scene | |
CN111998862B (en) | BNN-based dense binocular SLAM method | |
CN112085790A (en) | Point-line combined multi-camera visual SLAM method, equipment and storage medium | |
CN111882602B (en) | Visual odometry implementation method based on ORB feature points and GMS matching filter | |
CN113744315B (en) | Semi-direct visual odometry based on binocular vision | |
CN114140527A (en) | Dynamic environment binocular vision SLAM method based on semantic segmentation | |
CN111161334B (en) | Semantic map construction method based on deep learning | |
CN114332394B (en) | A dynamic scene 3D reconstruction method based on semantic information | |
CN115187737A (en) | A Semantic Map Construction Method Based on Laser and Vision Fusion | |
CN113920191B (en) | 6D data set construction method based on depth camera | |
US20240062415A1 (en) | Terminal device localization method and related device therefor | |
CN110533716A (en) | A Semantic SLAM System and Method Based on 3D Constraints | |
CN114419259B (en) | A visual positioning method and system based on physical model imaging simulation | |
CN116128966A (en) | A Semantic Localization Method Based on Environmental Objects | |
CN117036484A (en) | Visual positioning and mapping method, system, equipment and medium based on geometry and semantics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||