
CN113916245A - Semantic map construction method based on instance segmentation and VSLAM - Google Patents


Info

Publication number: CN113916245A (application); CN113916245B (grant)
Authority: CN (China)
Prior art keywords: instance, RGB, point cloud, map, plane
Prior art date: 2021-10-09
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202111176088.9A
Other languages: Chinese (zh)
Other versions: CN113916245B
Inventors: 陈建驱, 徐昱琳, 李铭扬, 李璇, 杨傲雷
Current Assignee: University of Shanghai for Science and Technology (the listed assignees may be inaccurate)
Original Assignee: University of Shanghai for Science and Technology
Application filed by: University of Shanghai for Science and Technology
Priority: CN202111176088.9A, filed 2021-10-09
Publication of CN113916245A: 2022-01-11
Publication of CN113916245B (grant): 2024-07-19

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30 - Map- or contour-matching
    • G01C21/32 - Structuring or formatting of map data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of mobile robot positioning and navigation, and discloses a semantic map construction method based on instance segmentation and VSLAM, comprising the following steps. S1: matching the acquired RGB images with the depth maps to obtain RGB-D frames; S2: performing pose estimation on all RGB-D frames frame by frame and finding the key frames among them; S3: performing instance segmentation on the RGB image of every RGB-D key frame to obtain an instance segmentation result for each key frame; S4: converting the RGB-D key frames processed in step S3 into point clouds; S5: creating an instance set in each RGB-D key frame point cloud and fusing the instance sets of all RGB-D key frame point clouds into a blank initial instance map to obtain the instance map; S6: fusing the background point clouds of all RGB-D key frame point clouds to obtain a global background point cloud; S7: optimizing and fusing the instance map and the global background point cloud to obtain the semantic map. The semantic map constructed by this method is highly accurate and can be built in real time.

Description

Semantic map construction method based on instance segmentation and VSLAM
Technical Field
The invention relates to the technical field of mobile robot positioning and navigation, in particular to a semantic map construction method based on instance segmentation and VSLAM.
Background
With the development of society, the human demand for robots keeps increasing. An important characteristic of a mobile robot is that it can sense and understand its environment through sensors, execute tasks autonomously or semi-autonomously, and learn about the environment to a certain extent. SLAM aims to reconstruct the three-dimensional structure of an unknown environment in real time while simultaneously localizing the mobile robot, and can be classified into visual SLAM (VSLAM) and laser SLAM. The map constructed by traditional VSLAM carries only basic information such as the color and texture of the environment and cannot provide higher-level semantic information, such as the semantic category of a given spatial point in the point cloud. The robot therefore cannot understand the environment and cannot perform more advanced tasks. For example, for a mobile robot to perform an automatic grasping task, the robot must know what the target is and where it is.
Therefore, semantic information needs to be added to the map; such a map is called a semantic map. A semantic map contains both the geometric and the semantic information of the environment. In addition to raising the intelligence level of the robot, a semantic map also allows the user to interact with the robot directly; for example, the user can ask how many cups there are in the map.
Traditional semantic maps are mainly constructed with machine-learning algorithms such as support vector machines and conditional random fields; the semantic maps built by these methods have low accuracy and are difficult to run in real time. Since 2012 the field of deep learning has developed by leaps and bounds, and several methods for constructing semantic maps with deep learning have appeared. However, the semantic maps built by these methods are either at a low semantic level and difficult to use directly, or the computation is too heavy to be deployed on a mobile robot. Therefore, a new deep-learning-based semantic map construction algorithm is needed.
Disclosure of Invention
Aiming at the problems and the defects in the prior art, the invention provides a semantic map construction method based on example segmentation and VSLAM.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows:
a semantic map construction method based on instance segmentation and VSLAM comprises the following steps:
s1: matching the RGB image and the depth map acquired from the RGB-D camera according to the time stamp to obtain RGB-D frames, wherein each RGB-D frame consists of the RGB image and the depth map with the same time stamp;
s2: performing pose estimation on all RGB-D frames frame by frame using a VSLAM algorithm, finding the key frames among all RGB-D frames according to the pose estimation result, and recording them as RGB-D key frames;
s3: carrying out example segmentation on the RGB images in all the RGB-D key frames frame by adopting a trained example segmentation model to obtain an example segmentation result of each RGB-D key frame, wherein the example segmentation result comprises semantic categories and sequence numbers of each example in the RGB images;
s4: converting the RGB-D key frame processed in the step S3 into a point cloud;
s5: creating an example set in each RGB-D key frame point cloud, inserting the example set in the first RGB-D key frame point cloud into a blank initial example map according to the sequence of time stamps, and then sequentially fusing the example sets in the remaining RGB-D key frame point clouds into the initial example map to obtain an example map;
s6: creating background point clouds in each RGB-D key frame point cloud, and then fusing the background point clouds in all the RGB-D key frame point clouds to obtain a global background point cloud;
s7: and respectively optimizing the example map and the global background point cloud, and then fusing the optimized example map and the optimized global background point cloud to obtain the semantic map.
According to the semantic mapping method based on example segmentation and VSLAM, preferably, the specific operation of creating the example set in each RGB-D keyframe point cloud in step S5 is as follows:
s51: traversing each three-dimensional point in the point cloud aiming at an RGB-D key frame point cloud, and extracting each instance point cloud according to instance semantic categories and instance sequence numbers corresponding to the three-dimensional points;
s52: for an instance point cloud, performing cluster segmentation on it with a Euclidean distance clustering segmentation algorithm, and then calculating the length, width, height and center coordinates of the clustered instance point cloud to obtain its three-dimensional bounding box, thereby completing the construction of the instance corresponding to that instance point cloud;
s53: and combining all the examples constructed by the RGB-D key frame point cloud to obtain an example set of the RGB-D key frame.
According to the above semantic map construction method based on instance segmentation and VSLAM, the attributes used to represent an instance in step S52 are preferably: the instance's semantic category, sequence number, geometric information, point cloud and observation dictionary. The geometric information comprises the length, width, height and center coordinates of the instance. The key-value pairs of the observation dictionary map the instance's sequence number to its observation count, where the observation count is the number of times the instance appears in all RGB-D key frames.
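For illustration, the following is a minimal Python sketch of such an instance record; the field and property names are assumptions for this sketch and are not taken from the patent.

from dataclasses import dataclass, field
from typing import Dict
import numpy as np

@dataclass
class Instance:
    """Minimal sketch of the instance attributes described above (names are illustrative)."""
    category: int                      # semantic category predicted by the segmentation model
    seq_id: int                        # instance sequence number
    # geometric information: axis-aligned 3D bounding box (min/max corners imply length, width, height, center)
    min_xyz: np.ndarray = field(default_factory=lambda: np.zeros(3))
    max_xyz: np.ndarray = field(default_factory=lambda: np.zeros(3))
    points: np.ndarray = field(default_factory=lambda: np.empty((0, 3)))  # instance point cloud (N x 3)
    observations: Dict[int, int] = field(default_factory=dict)            # sequence number -> times observed

    @property
    def center(self) -> np.ndarray:
        return (self.min_xyz + self.max_xyz) / 2.0

    @property
    def size(self) -> np.ndarray:
        """Length, width and height of the bounding box."""
        return self.max_xyz - self.min_xyz

    def volume(self) -> float:
        return float(np.prod(np.maximum(self.size, 0.0)))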
According to the above semantic map construction method based on instance segmentation and VSLAM, the specific operation of sequentially fusing the instance sets of the remaining RGB-D key frame point clouds into the initial instance map in step S5 is preferably as follows: for any instance A in the instance set of an RGB-D key frame, find the instance B in the initial instance map that has the largest IoU with instance A, and judge whether instance A and instance B need to be fused; if they need to be fused, fuse instance A and instance B into an instance C and insert instance C into the initial instance map; if they do not need to be fused, insert instance A directly into the initial instance map. More preferably, the condition for judging whether instance A and instance B need to be fused is: if the IoU of instance A and instance B is greater than 0.2 and their semantic categories are the same, or the IoU of instance A and instance B is greater than 0.5, instance A and instance B are judged to need fusing. Here IoU equals the volume of the intersection of instance A and instance B divided by the volume of their union, i.e.

IoU = \frac{V_{ab}}{V_a + V_b - V_{ab}}

where V_{ab} is the volume of the intersection of instance A and instance B, V_a is the volume of instance A, and V_b is the volume of instance B.
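A minimal sketch of this fusion criterion for axis-aligned 3D bounding boxes, reusing the assumed Instance record sketched above; the 0.2 and 0.5 thresholds are the ones stated in the text.

import numpy as np

def box_iou_3d(a: Instance, b: Instance) -> float:
    """Volume IoU of two axis-aligned 3D boxes: intersection volume over union volume."""
    inter_min = np.maximum(a.min_xyz, b.min_xyz)
    inter_max = np.minimum(a.max_xyz, b.max_xyz)
    v_ab = float(np.prod(np.maximum(inter_max - inter_min, 0.0)))  # intersection volume Vab
    v_union = a.volume() + b.volume() - v_ab                       # Va + Vb - Vab
    return v_ab / v_union if v_union > 0 else 0.0

def should_fuse(a: Instance, b: Instance) -> bool:
    """Fusion rule from the text: IoU > 0.2 with the same category, or IoU > 0.5 regardless of category."""
    iou = box_iou_3d(a, b)
    return (iou > 0.2 and a.category == b.category) or iou > 0.5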
According to the above semantic map construction method based on instance segmentation and VSLAM, the semantic category and sequence number of instance C are preferably taken from whichever of instance A and instance B has the larger observation count; the geometric information of instance C is the extremum of the geometric information of instance A and instance B (the extremum may be a minimum or a maximum; for example, if minX is the minimum X coordinate of an instance in the world coordinate system, then the minX of instance C is the smaller of the minX of instance A and the minX of instance B, i.e. C.minX = MIN(A.minX, B.minX); likewise, if maxZ is the maximum Z coordinate of an instance in the world coordinate system, then the maxZ of instance C is the larger of the maxZ of instance A and the maxZ of instance B, i.e. C.maxZ = MAX(A.maxZ, B.maxZ)); the point cloud of instance C is the union of the point clouds of instance A and instance B.
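A sketch of the corresponding merge into instance C, continuing the same assumed Instance record; the category and sequence number follow the rule above and are taken from the more frequently observed of the two instances.

def fuse_instances(a: Instance, b: Instance) -> Instance:
    """Merge a key-frame instance A and a map instance B into a new instance C."""
    # category and sequence number come from whichever instance was observed more often
    donor = a if sum(a.observations.values()) >= sum(b.observations.values()) else b
    c = Instance(category=donor.category, seq_id=donor.seq_id)
    # geometric information takes the extremes of both boxes, e.g. C.minX = MIN(A.minX, B.minX)
    c.min_xyz = np.minimum(a.min_xyz, b.min_xyz)
    c.max_xyz = np.maximum(a.max_xyz, b.max_xyz)
    # the point cloud of C is the union (concatenation) of both point clouds
    c.points = np.vstack([a.points, b.points])
    # accumulate the observation counts of both instances
    for k in set(a.observations) | set(b.observations):
        c.observations[k] = a.observations.get(k, 0) + b.observations.get(k, 0)
    return c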
According to the above semantic map construction method based on instance segmentation and VSLAM, the method for optimizing the instance map in step S7 is preferably: first perform non-maximum suppression on the instance map, and then perform point cloud alignment to obtain the optimized instance map. The method for optimizing the global background point cloud is: process the global background point cloud with a voxel filtering algorithm, and then perform point cloud alignment to obtain the optimized global background point cloud.
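Voxel filtering is a standard downsampling step; a minimal numpy sketch follows, in which the 5 cm leaf size is an assumed value rather than one given in the patent.

import numpy as np

def voxel_filter(points: np.ndarray, leaf_size: float = 0.05) -> np.ndarray:
    """Downsample an N x 3 point cloud by averaging all points that fall into the same voxel."""
    voxel_idx = np.floor(points / leaf_size).astype(np.int64)
    # group points by voxel and keep one averaged point per occupied voxel
    _, inverse, counts = np.unique(voxel_idx, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]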
According to the above semantic map construction method based on instance segmentation and VSLAM, the specific steps of the non-maximum suppression are preferably: for the instances in the instance map, compute the PIoU value of every pair of instances, and fuse the two instances if their PIoU is greater than a preset threshold Th. The PIoU is computed as follows: compute the volumes Vm and Vn of instances M and N, compute the volume Vmn of their intersection, compute the ratios Pm and Pn of Vmn to Vm and Vn respectively, and take the larger of Pm and Pn as the PIoU of instances M and N:

PIoU = \max\left(\frac{V_{mn}}{V_m}, \frac{V_{mn}}{V_n}\right)
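A sketch of the PIoU test and a greedy merging pass, continuing the assumed Instance record and fuse_instances sketch above; the default threshold value is an assumption, since the text only names a preset threshold Th.

def piou(m: Instance, n: Instance) -> float:
    """PIoU = max(Vmn / Vm, Vmn / Vn), the larger of the two overlap ratios."""
    inter_min = np.maximum(m.min_xyz, n.min_xyz)
    inter_max = np.minimum(m.max_xyz, n.max_xyz)
    v_mn = float(np.prod(np.maximum(inter_max - inter_min, 0.0)))
    v_m, v_n = m.volume(), n.volume()
    if v_m <= 0 or v_n <= 0:
        return 0.0
    return max(v_mn / v_m, v_mn / v_n)

def nms_merge(instances: list, th: float = 0.5) -> list:
    """Greedy non-maximum suppression: fuse any pair whose PIoU exceeds the threshold Th."""
    merged = list(instances)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if piou(merged[i], merged[j]) > th:
                    merged[j] = fuse_instances(merged[i], merged[j])
                    del merged[i]
                    changed = True
                    break
            if changed:
                break
    return merged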
According to the above semantic map construction method based on instance segmentation and VSLAM, the specific operations of the point cloud alignment are preferably: fit the largest plane in the point cloud with a random sample consensus (RANSAC) algorithm to obtain a fitted plane; construct a rotation matrix from the normal vector of the fitted plane, rotate the point cloud by this rotation matrix until the normal vector is aligned with the z axis, and translate the fitted plane along the z axis onto the xy plane of the world coordinate system; then, taking the xy plane as a partition plane, count the points on each side of the partition plane and decide whether to flip the point cloud: if there are fewer points above the partition plane than below it, flip the point cloud about the xy plane, which completes the point cloud alignment.
According to the above semantic map construction method based on example segmentation and VSLAM, preferably, the construction method of the rotation matrix is as follows:
(1) fit the largest plane in the point cloud with a random sample consensus algorithm to obtain a fitted plane whose equation is ax + by + cz + d = 0; denote the normal vector of the fitted plane as p1 = (x1, y1, z1), the unit vector of the Z axis as p2 = (x2, y2, z2) = (0, 0, 1), and the origin as p3 = (x3, y3, z3) = (0, 0, 0); the three points p1, p2, p3 define a plane p1-p2-p3, which is the plane of the rotation from p1 to p2, and the normal vector n of plane p1-p2-p3 is the axis of the rotation from p1 to p2; since n is perpendicular to both p1p2 and p1p3, n can be obtained from the cross product of p1p2 and p1p3, as shown below:

n = \begin{vmatrix} i & j & k \\ x_2 - x_1 & y_2 - y_1 & z_2 - z_1 \\ x_3 - x_1 & y_3 - y_1 & z_3 - z_1 \end{vmatrix} = (a, b, c)

a = (y_2 - y_1)(z_3 - z_1) - (y_3 - y_1)(z_2 - z_1)
b = (z_2 - z_1)(x_3 - x_1) - (z_3 - z_1)(x_2 - x_1)
c = (x_2 - x_1)(y_3 - y_1) - (x_3 - x_1)(y_2 - y_1)
wherein i represents a direction vector on the X-axis, j represents a direction vector on the Y-axis, and k represents a direction vector on the Z-axis;
(2) compute the rotation angle θ of the rotation from p1 to p2 about the axis n; θ is obtained from the dot product as follows:

p_1 \cdot p_2 = |p_1|\,|p_2| \cos\theta

\theta = \arccos\left(\frac{p_1 \cdot p_2}{|p_1|\,|p_2|}\right)
(3) compute the rotation matrix from Rodrigues' formula:

R = \cos\theta \, I + (1 - \cos\theta)\, n n^{T} + \sin\theta \, n^{\wedge}

where R is the rotation matrix, θ is the rotation angle, I is the identity matrix, T denotes the transpose, and the symbol ^ denotes the operator that converts a vector into its antisymmetric (skew-symmetric) matrix.
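For illustration, a numpy sketch of this construction, assuming the plane normal (a, b, c) has already been estimated (e.g. by RANSAC); it follows the cross-product and Rodrigues steps above and adds a guard for the degenerate case where the normal is already (anti)parallel to the Z axis.

import numpy as np

def rotation_normal_to_z(plane_normal: np.ndarray) -> np.ndarray:
    """Rotation matrix R such that R @ p1 aligns the fitted plane normal p1 with the Z axis."""
    p1 = plane_normal / np.linalg.norm(plane_normal)
    p2 = np.array([0.0, 0.0, 1.0])                # unit vector of the Z axis
    axis = np.cross(p1, p2)                       # rotation axis n (cross product of the construction above)
    s = np.linalg.norm(axis)                      # equals sin(theta) for unit vectors
    if s < 1e-12:                                 # p1 already (anti)parallel to Z
        return np.eye(3) if p1[2] > 0 else np.diag([1.0, -1.0, -1.0])
    n = axis / s
    cos_t = float(np.clip(np.dot(p1, p2), -1.0, 1.0))
    n_hat = np.array([[0.0, -n[2], n[1]],
                      [n[2], 0.0, -n[0]],
                      [-n[1], n[0], 0.0]])        # skew-symmetric matrix n^
    return cos_t * np.eye(3) + (1.0 - cos_t) * np.outer(n, n) + s * n_hat

# usage sketch: rotate a point cloud so the fitted floor normal points along +Z
# R = rotation_normal_to_z(np.array([a, b, c]))   # (a, b, c) from the fitted plane ax + by + cz + d = 0
# aligned = points @ R.T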
According to the semantic mapping method based on example segmentation and VSLAM, preferably, the specific operation of creating the background point cloud in each RGB-D keyframe point cloud in step S6 is as follows: and removing all example point clouds in the point cloud aiming at one RGB-D key frame point cloud to obtain the background point cloud in the RGB-D key frame point cloud.
According to the above semantic map construction method based on instance segmentation and VSLAM, the VSLAM algorithm in step S2 is preferably ORB-SLAM2.
According to the semantic mapping method based on example segmentation and VSLAM, the example segmentation model in step S3 is preferably a YOLACT model.
According to the above semantic map construction method based on instance segmentation and VSLAM, the specific operation of converting an RGB-D key frame into a point cloud in step S4 is preferably: for an RGB-D key frame, traverse each of its pixels, judge whether the pixel is valid according to its depth measurement, and, if it is valid, convert the pixel from the pixel coordinate system into a three-dimensional point in the world coordinate system; the three-dimensional points of all valid pixels in the world coordinate system are combined to form the point cloud of the RGB-D key frame.
According to the semantic map construction method based on example segmentation and VSLAM, preferably, the specific operation of converting the pixel point from the pixel coordinate system to the three-dimensional point in the world coordinate system is as follows:
(A) first convert the pixel into a three-dimensional point in the camera coordinate system according to the pinhole camera model. Let the pixel coordinate of pixel P be P_uv and its coordinate in the camera coordinate system be P_c; the pinhole camera model then gives:

Z P_{uv} = Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} P_c = K P_c

where f_x, f_y are the focal lengths of the camera, c_x, c_y are the offsets of the origin of the physical imaging coordinate system relative to the pixel coordinate system, Z is the depth of P, and K is the camera intrinsic matrix;

(B) let the coordinate of pixel P in the world coordinate system be P_w; P_c is the result of transforming P_w from the world coordinate system into the camera coordinate system according to the current camera pose. The camera pose is described by the rotation matrix R and the translation vector t of the camera relative to the world coordinate system, so that

Z P_{uv} = K (R P_w + t) = K T P_w

This formula describes the projection from the world coordinate of pixel P to its pixel coordinate, where R and t are the camera extrinsics, T is the homogeneous transformation matrix composed of R and t, K is the camera intrinsic matrix, and P_w is the world coordinate; multiplying P_w first by T and then by K yields the pixel coordinate of P. Conversely, reversing this process gives the transformation of pixel P from the pixel coordinate system to the world coordinate system:

P_w = T^{-1} K^{-1} Z P_{uv}
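For illustration, a numpy sketch of both directions of this transform, assuming the intrinsic matrix K and the camera pose (R, t) are known; the function names are illustrative.

import numpy as np

def pixel_to_world(u: float, v: float, depth: float,
                   K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with depth Z into the world frame: Pw = R^-1 (Z K^-1 Puv - t)."""
    p_uv = np.array([u, v, 1.0])
    p_c = depth * np.linalg.inv(K) @ p_uv        # point in the camera frame, Pc = Z * K^-1 * Puv
    return np.linalg.inv(R) @ (p_c - t)          # invert Pc = R Pw + t to get the world point

def world_to_pixel(p_w: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Forward projection Z * Puv = K (R Pw + t); returns (u, v, Z)."""
    p_c = R @ p_w + t
    uvz = K @ p_c
    return uvz[0] / uvz[2], uvz[1] / uvz[2], uvz[2]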
compared with the prior art, the invention has the following positive beneficial effects:
(1) according to the method, semantic information is introduced through an end-to-end instance segmentation model, and instance segmentation can directly segment instances in the image.
(2) The method constructs the instance-level semantic map, is more practical compared with the pixel-level semantic map constructed by the existing method, and can be directly used on high-level tasks.
(3) The invention adopts a distributed architecture: modules with low computational cost are deployed on the mobile robot, while computation-heavy modules such as the deep learning model are deployed on a remote server. This greatly increases the generality of the invention; for example, the program can run in real time even on a mobile robot with limited computing resources.
(4) According to the method, when an example corresponding to the example point cloud is constructed, an observation dictionary is introduced into the attribute of the example, the observation dictionary is used for recording the class of the example judged by the example segmentation model and the occurrence frequency of the example in all RGB-D key frames, and the observation dictionary represents the confidence degree of whether the example is correctly modeled, so that the observation dictionary is used as auxiliary information when a high-level task (such as a grabbing task) of the robot is executed. Due to the introduction of the observation dictionary, the information of the semantic map is richer.
(5) Due to the precision problem of the example segmentation algorithm, the segmentation results of the same three-dimensional object predicted by different frames in the example segmentation may have larger difference, so that repeated examples cannot be completely fused when the example sets of different RGB-D key frames are fused. According to the method, the non-maximum value inhibition optimization processing is carried out on the example map, the redundant modeling example in the example map can be eliminated, and the construction quality of the example map can be improved; moreover, the purpose and effect of the invention to perform point cloud alignment processing on the example map is to align the constructed map to the world coordinate system.
(6) The invention can reduce the number of point clouds in the map by carrying out voxel filtering processing on the global background point clouds, thereby reducing the size of the occupied space of the map in the memory and reducing the calculation amount required when the map is processed.
Drawings
FIG. 1 is a schematic diagram of a semantic map construction method based on example segmentation and VSLAM in accordance with the present invention;
FIG. 2 is a pseudo code for fusing the sample sets in all RGB-D keyframe point clouds to the initial sample map in embodiment 1 of the present invention;
FIG. 3 is pseudo code for non-maxima suppression of an example map in embodiment 1 of the present invention;
FIG. 4 is a semantic mapping system based on example segmentation and VSLAM of the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples, which are not intended to limit the scope of the invention.
Example 1:
a semantic mapping method based on instance segmentation and VSLAM, as shown in fig. 1, includes the following steps:
s1: and matching the RGB image and the depth map acquired from the RGB-D camera according to the time stamp to obtain RGB-D frames, wherein each RGB-D frame consists of the RGB image and the depth map with the same time stamp.
S2: and performing pose estimation on all RGB-D frames frame by adopting a VSLAM algorithm, finding out key frames in all RGB-D frames according to the pose estimation result, and recording the key frames as RGB-D key frames. Wherein, the VSLAM algorithm is preferably ORB-SLAM 2.
S3: and carrying out example segmentation on the RGB images in all the RGB-D key frames frame by adopting a trained example segmentation model to obtain an example segmentation result of each RGB-D key frame, wherein the example segmentation result comprises the semantic category and the sequence number of each example in the RGB images. Wherein the example segmentation model is preferably a YOLACT model.
S4: and converting the RGB-D key frame processed in the step S3 into a point cloud.
S5: creating an instance set in each RGB-D key frame point cloud, inserting the instance set in the first RGB-D key frame point cloud into a blank initial instance map according to the sequence of time stamps, and then sequentially fusing the instance sets in the remaining RGB-D key frame point clouds into the initial instance map to obtain the instance map (the pseudo code for fusing the instance sets in all the RGB-D key frame point clouds into the initial instance map is shown in FIG. 2).
S6: and creating background point clouds in each RGB-D key frame point cloud, and then fusing the background point clouds in all the RGB-D key frame point clouds to obtain a global background point cloud. Preferably, the specific operation of creating the background point cloud in each RGB-D keyframe point cloud is: and removing all example point clouds in the point cloud aiming at one RGB-D key frame point cloud to obtain the background point cloud in the RGB-D key frame point cloud.
S7: and respectively optimizing the example map and the global background point cloud, and then fusing the optimized example map and the optimized global background point cloud to obtain the semantic map. Preferably, the example map optimization method comprises the following steps: the non-maximum suppression processing is performed on the example map (the pseudo code for non-maximum suppression is shown in fig. 3), and then the point cloud alignment processing is performed to obtain the optimized example map. The optimization method of the global background point cloud comprises the following steps: processing the global background point cloud by adopting a voxel filtering algorithm (the voxel filtering algorithm is an algorithm known in the field), and then performing point cloud alignment processing to obtain the optimized global background point cloud.
The specific operation of converting the RGB-D key frame into a point cloud in step S4 is as follows:
traversing each pixel point of the RGB-D key frame aiming at the RGB-D key frame, judging whether the pixel point is effective or not according to the depth measurement value of each pixel point, if the pixel point is effective, converting the pixel point from a pixel coordinate system into three-dimensional points under a world coordinate system, and combining the three-dimensional points of all effective pixel points under the world coordinate system together to obtain the point cloud of the RGB-D key frame.
The specific operation of converting the pixel point from the pixel coordinate system to the three-dimensional point in the world coordinate system is as follows:
(A) first convert the pixel into a three-dimensional point in the camera coordinate system according to the pinhole camera model. Let the pixel coordinate of pixel P be P_uv and its coordinate in the camera coordinate system be P_c; the pinhole camera model then gives formula I:

Z P_{uv} = Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} P_c = K P_c    (formula I)

where f_x, f_y are the focal lengths of the camera, c_x, c_y are the offsets of the origin of the physical imaging coordinate system relative to the pixel coordinate system, Z is the depth of P, and K is the camera intrinsic matrix;

(B) let the coordinate of pixel P in the world coordinate system be P_w; P_c is the result of transforming P_w from the world coordinate system into the camera coordinate system according to the current camera pose. The camera pose is described by the rotation matrix R and the translation vector t of the camera relative to the world coordinate system, so that

Z P_{uv} = K (R P_w + t) = K T P_w    (formula II)

Formula II describes the projection from the world coordinate of pixel P to its pixel coordinate, where R and t are the camera extrinsics, T is the homogeneous transformation matrix composed of R and t, K is the camera intrinsic matrix, and P_w is the world coordinate; multiplying P_w first by T and then by K yields the pixel coordinate of P. Conversely, reversing this process gives the transformation of pixel P from the pixel coordinate system to the world coordinate system, shown as formula III:

P_w = T^{-1} K^{-1} Z P_{uv}    (formula III)
The specific operation of creating an instance set in each RGB-D keyframe point cloud in step S5 is:
s51: traversing each three-dimensional point in the point cloud aiming at an RGB-D key frame point cloud, and extracting each instance point cloud according to instance semantic categories and instance sequence numbers corresponding to the three-dimensional points. The semantic information of the three-dimensional points is represented by the colors of the three-dimensional points; the second channel value of the color of each three-dimensional point is the semantic category and the third channel value is the instance number.
S52: for an example point cloud, performing clustering segmentation processing on the example point cloud by using an Euclidean distance clustering segmentation algorithm (the algorithm is a known algorithm in the field), and then calculating the length, width, height and center coordinates of the example point cloud after clustering segmentation to obtain a three-dimensional surrounding frame of the example point cloud, namely completing the example construction corresponding to the example point cloud. The representation attributes of the constructed instance are: semantic type, sequence number, geometric information, point cloud and observation dictionary of the instance; the geometric information includes length, width, height, center coordinates of the instance; the key value pair of the observation dictionary is the serial number and the observation times of the example, and the observation times are the times of the example appearing in all RGB-D key frames; the observation dictionary is used for recording the types and the occurrence times of the instances judged by the instance segmentation model, and is an important reference for the fusion and the use of the subsequent instances. The example segmentation model can perform discontinuous segmentation on the same example, so that the obtained example point cloud is divided into a plurality of blocks, and discrete small block-shaped point clouds exist around the object due to example segmentation errors; the discrete and multiple pieces of example point clouds can greatly influence the construction of the example map, so that the method for clustering and partitioning the example point clouds by adopting the Euclidean distance clustering and partitioning algorithm is favorable for obtaining accurate examples.
S53: and combining all the examples constructed by the RGB-D key frame point cloud to obtain an example set of the RGB-D key frame.
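For illustration, a Python sketch of the extraction and clustering in steps S51-S52 under the color encoding described above (semantic category in the second channel, instance number in the third); the Euclidean clustering here is a simple KD-tree region growing used as a stand-in for the standard algorithm, and the tolerance and minimum cluster size are assumed values.

import numpy as np
from scipy.spatial import cKDTree

def extract_instance_point_clouds(points: np.ndarray, colors: np.ndarray):
    """Group 3D points by the (semantic category, instance number) stored in color channels 2 and 3."""
    instances = {}
    for p, c in zip(points, colors):
        key = (int(c[1]), int(c[2]))          # (semantic category, instance sequence number)
        instances.setdefault(key, []).append(p)
    return {k: np.asarray(v) for k, v in instances.items()}

def euclidean_cluster(points: np.ndarray, tolerance: float = 0.05, min_size: int = 50) -> np.ndarray:
    """Euclidean-distance clustering via KD-tree region growing; returns the largest cluster."""
    tree = cKDTree(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            idx = frontier.pop()
            for nb in tree.query_ball_point(points[idx], tolerance):
                if nb in unvisited:
                    unvisited.remove(nb)
                    cluster.append(nb)
                    frontier.append(nb)
        if len(cluster) >= min_size:
            clusters.append(points[cluster])
    # keep the largest cluster as the instance body; small clusters are treated as segmentation noise
    return max(clusters, key=len) if clusters else np.empty((0, 3))

def bounding_box(points: np.ndarray):
    """Axis-aligned 3D bounding box: (min corner, max corner, center)."""
    mn, mx = points.min(axis=0), points.max(axis=0)
    return mn, mx, (mn + mx) / 2.0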
The specific operation of sequentially fusing the instance sets of the remaining RGB-D key frame point clouds into the initial instance map in step S5 is as follows:

For any instance A in the instance set of an RGB-D key frame, find the instance B in the initial instance map that has the largest IoU with instance A, and judge whether instance A and instance B need to be fused; if they need to be fused, fuse instance A and instance B into an instance C and insert instance C into the initial instance map; if they do not need to be fused, insert instance A directly into the initial instance map. The semantic category and sequence number of instance C are taken from whichever of instance A and instance B has the larger observation count; the geometric information of instance C is the extremum of the geometric information of instance A and instance B (the extremum may be a minimum or a maximum; for example, if minX is the minimum X coordinate of an instance in the world coordinate system, then the minX of instance C is the smaller of the minX of instance A and the minX of instance B, i.e. C.minX = MIN(A.minX, B.minX); likewise, if maxZ is the maximum Z coordinate of an instance in the world coordinate system, then the maxZ of instance C is the larger of the maxZ of instance A and the maxZ of instance B, i.e. C.maxZ = MAX(A.maxZ, B.maxZ)); the point cloud of instance C is the union of the point clouds of instance A and instance B.

More preferably, the condition for judging whether instance A and instance B need to be fused is: if the IoU of instance A and instance B is greater than 0.2 and their semantic categories are the same, or the IoU of instance A and instance B is greater than 0.5, instance A and instance B are judged to need fusing. Here IoU equals the volume of the intersection of instance A and instance B divided by the volume of their union, i.e.

IoU = \frac{V_{ab}}{V_a + V_b - V_{ab}}

where V_{ab} is the volume of the intersection of instance A and instance B, V_a is the volume of instance A, and V_b is the volume of instance B.
The specific operation of performing the non-maximum suppression processing on the example map in step S7 is:
For the instances in the instance map, compute the PIoU value of every pair of instances, and fuse the two instances if their PIoU is greater than a preset threshold Th. The PIoU is computed as follows: compute the volumes Vm and Vn of instances M and N, compute the volume Vmn of their intersection, compute the ratios Pm and Pn of Vmn to Vm and Vn respectively, and take the larger of Pm and Pn as the PIoU of instances M and N, as shown in formula IV:

PIoU = \max\left(\frac{V_{mn}}{V_m}, \frac{V_{mn}}{V_n}\right)    (formula IV)
the specific operation of the point cloud alignment processing in step S7 is:
Fit the largest plane in the point cloud with a random sample consensus (RANSAC) algorithm to obtain a fitted plane; construct a rotation matrix from the normal vector of the fitted plane, rotate the point cloud by this rotation matrix until the normal vector is aligned with the z axis, and translate the fitted plane along the z axis onto the xy plane of the world coordinate system; then, taking the xy plane as a partition plane, count the points on each side of the partition plane and decide whether to flip the point cloud: if there are fewer points above the partition plane than below it, flip the point cloud about the xy plane, which completes the point cloud alignment.
Preferably, the method for constructing the rotation matrix comprises the following steps:
(1) fit the largest plane in the point cloud with a random sample consensus algorithm to obtain a fitted plane whose equation is ax + by + cz + d = 0; denote the normal vector of the fitted plane as p1 = (x1, y1, z1), the unit vector of the Z axis as p2 = (x2, y2, z2) = (0, 0, 1), and the origin as p3 = (x3, y3, z3) = (0, 0, 0); the three points p1, p2, p3 define a plane p1-p2-p3, which is the plane of the rotation from p1 to p2, and the normal vector n of plane p1-p2-p3 is the axis of the rotation from p1 to p2; since n is perpendicular to both p1p2 and p1p3, n can be obtained from the cross product of p1p2 and p1p3, as shown in formula V:

n = \begin{vmatrix} i & j & k \\ x_2 - x_1 & y_2 - y_1 & z_2 - z_1 \\ x_3 - x_1 & y_3 - y_1 & z_3 - z_1 \end{vmatrix} = (a, b, c)

a = (y_2 - y_1)(z_3 - z_1) - (y_3 - y_1)(z_2 - z_1)
b = (z_2 - z_1)(x_3 - x_1) - (z_3 - z_1)(x_2 - x_1)
c = (x_2 - x_1)(y_3 - y_1) - (x_3 - x_1)(y_2 - y_1)    (formula V)
Wherein i represents a direction vector on the X-axis, j represents a direction vector on the Y-axis, and k represents a direction vector on the Z-axis;
(2) compute the rotation angle θ of the rotation from p1 to p2 about the axis n; θ is obtained from the dot product, as shown in formula VI:

p_1 \cdot p_2 = |p_1|\,|p_2| \cos\theta

\theta = \arccos\left(\frac{p_1 \cdot p_2}{|p_1|\,|p_2|}\right)    (formula VI)
(3) compute the rotation matrix from Rodrigues' formula, as shown in formula VII:

R = \cos\theta \, I + (1 - \cos\theta)\, n n^{T} + \sin\theta \, n^{\wedge}    (formula VII)

where R is the rotation matrix, θ is the rotation angle, I is the identity matrix, T denotes the transpose, and the symbol ^ denotes the operator that converts a vector into its antisymmetric (skew-symmetric) matrix.
Example 2:
the embodiment provides a system for implementing the semantic mapping method based on instance segmentation and VSLAM described in embodiment 1, and as shown in fig. 4, the system adopts ROS as an implementation framework. The system is realized by a plurality of nodes, including a data source node, a SLAM node, an instance segmentation node, a graph building node and an interaction node. The data source node is used for reading and acquiring an RGB image and a depth map from the RGB-D camera or the data set, and matching the acquired RGB image and the depth map according to the time stamp to obtain an RGB-D frame; the SLAM node is used for carrying out pose estimation on all RGB-D frames frame by frame and finding out key frames in all the RGB-D frames according to a pose estimation result; the example segmentation node is used for carrying out example segmentation on the RGB images in all the RGB-D key frames frame by frame to obtain an example segmentation result of each RGB-D key frame; the map building node is used for extracting the examples and the background point clouds in all RGB-D key frames according to the example segmentation result and the estimated pose to obtain an example map and a global background point cloud, and then fusing the example map and the optimized global background point cloud to obtain a semantic map; the interactive node is responsible for the interaction of the user with the system.
The above description only illustrates the preferred embodiments of the present invention and is not to be construed as limiting the invention; all modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered by it.

Claims (10)

1. A semantic map construction method based on instance segmentation and VSLAM is characterized by comprising the following steps:
s1: matching the RGB image and the depth map acquired from the RGB-D camera according to the time stamp to obtain RGB-D frames, wherein each RGB-D frame consists of the RGB image and the depth map with the same time stamp;
s2: performing pose estimation on all RGB-D frames frame by adopting a VSLAM algorithm, selecting a key frame in the RGB-D frames according to a pose estimation result, and recording the key frame as an RGB-D key frame;
s3: carrying out example segmentation on the RGB images in all the RGB-D key frames frame by adopting a trained example segmentation model to obtain an example segmentation result of each RGB-D key frame, wherein the example segmentation result comprises semantic categories and sequence numbers of each example in the RGB images;
s4: converting the RGB-D key frame processed in the step S3 into a point cloud;
s5: creating an example set in each RGB-D key frame point cloud, inserting the example set in the first RGB-D key frame point cloud into a blank initial example map according to the sequence of time stamps, and then sequentially fusing the example sets in the remaining RGB-D key frame point clouds into the initial example map to obtain an example map;
s6: creating a background point cloud in each RGB-D key frame point cloud, and then fusing the background point clouds in all the RGB-D key frame point clouds to obtain a global background point cloud;
s7: and respectively optimizing the example map and the global background point cloud, and then fusing the optimized example map and the optimized global background point cloud to obtain the semantic map.
2. The method for semantic mapping according to claim 1, wherein the specific operations of creating the instance set in each RGB-D key frame point cloud in step S5 are as follows:
s51: traversing each three-dimensional point in the point cloud aiming at an RGB-D key frame point cloud, and extracting each instance point cloud according to instance semantic categories and instance sequence numbers corresponding to the three-dimensional points;
s52: for an example point cloud, carrying out clustering segmentation processing on the example point cloud by adopting an Euclidean distance clustering segmentation algorithm, and then calculating the length, width, height and center coordinates of the example point cloud after clustering segmentation to obtain a three-dimensional surrounding frame of the example point cloud, namely completing the construction of an example corresponding to the example point cloud;
s53: and combining all the examples constructed by the RGB-D key frame point cloud to obtain an example set of the RGB-D key frame.
3. The method for semantic mapping according to example segmentation and VSLAM as claimed in claim 2, wherein the representation attributes of the example in step S52 are: semantic type, sequence number, geometric information, point cloud and observation dictionary of the instance; the geometric information includes length, width, height, center coordinates of the instance; and the key value pair of the observation dictionary is the serial number of the example and the observation times, and the observation times are the times of the example appearing in all RGB-D key frames.
4. The semantic map construction method based on instance segmentation and VSLAM as claimed in claim 3, wherein the specific operation of sequentially fusing the instance sets in the remaining RGB-D key frame point clouds into the initial instance map in step S5 is as follows:
aiming at any instance A in an RGB-D key frame instance set, finding an instance B with the maximum IoU of the instance A in an initial instance map, and judging whether the instance A and the instance B need to be fused or not; judging that if the instance A and the instance B need to be fused, fusing the instance A and the instance B to obtain an instance C, and inserting the instance C into the initial instance map; judging that the instance A is directly inserted into the initial instance map if the instance A and the instance B do not need to be fused; the condition for judging whether the instance A and the instance B need to be fused is as follows: if IoU >0.2 and the semantic category is the same for instance A and instance B, or IoU >0.5 for instance A and instance B, then it is determined that instance A and instance B need to be fused.
5. The method for semantic mapping according to claim 4, wherein the semantic category and the sequence number of the instance C are consistent with those of the instances A and B with the largest observation times; the geometric information of the example C is an extremum of the geometric information of the example A and the example B; the point cloud of the example C is the sum of the point clouds of the example A and the example B.
6. The semantic mapping method according to claim 1, wherein the example map is optimized in step S7 by: firstly, carrying out non-maximum value inhibition processing on an example map, and then carrying out point cloud alignment processing to obtain an optimized example map; the method for optimizing the global background point cloud comprises the following steps: and processing the global background point cloud by adopting a voxel filtering algorithm, and then performing point cloud alignment processing to obtain the optimized global background point cloud.
7. The method of claim 6, wherein the non-maximum suppression process comprises the following specific steps: calculating the PIoU value of any two instances in the instance map, and fusing the two instances if the PIoU value is greater than a preset threshold Th; the PIoU is calculated as follows: calculating the volumes Vm and Vn of instances M and N, calculating the volume Vmn of the intersection of instances M and N, calculating the ratios Pm and Pn of Vmn relative to Vm and Vn, and selecting the maximum of Pm and Pn, which is the PIoU value of instances M and N.
8. The method for semantic mapping based on instance segmentation and VSLAM according to claim 6, wherein the point cloud alignment process is specifically operated as follows: fitting the maximum plane in the point cloud by adopting a random sampling consistency algorithm to obtain a fitted plane; constructing a rotation matrix according to the normal vector of the fitting plane, rotating the point cloud according to the rotation matrix until the normal vector is aligned with the z axis, and translating the fitting plane to an xy plane in a world coordinate system along the z axis; and then, taking the xy plane as a partition plane, calculating the number of the point clouds on two sides of the partition plane, judging whether to turn the point clouds or not according to the number of the point clouds on two sides of the partition plane, and if the number of the point clouds above the partition plane is smaller than that below the partition plane, turning the point clouds along the xy plane, namely finishing the point cloud alignment treatment.
9. The method of claim 8, wherein the rotation matrix is constructed by:
(1) fitting the largest plane in the point cloud with a random sample consensus algorithm to obtain a fitted plane whose equation is ax + by + cz + d = 0; denoting the normal vector of the fitted plane as p1 = (x1, y1, z1), the unit vector of the Z axis as p2 = (x2, y2, z2) = (0, 0, 1), and the origin as p3 = (x3, y3, z3) = (0, 0, 0); the three points p1, p2, p3 define a plane p1-p2-p3, which is the plane of the rotation from p1 to p2, and the normal vector n of plane p1-p2-p3 is the axis of the rotation from p1 to p2; since n is perpendicular to both p1p2 and p1p3, n is obtained from the cross product of p1p2 and p1p3, as shown below:

n = \begin{vmatrix} i & j & k \\ x_2 - x_1 & y_2 - y_1 & z_2 - z_1 \\ x_3 - x_1 & y_3 - y_1 & z_3 - z_1 \end{vmatrix} = (a, b, c)

a = (y_2 - y_1)(z_3 - z_1) - (y_3 - y_1)(z_2 - z_1)
b = (z_2 - z_1)(x_3 - x_1) - (z_3 - z_1)(x_2 - x_1)
c = (x_2 - x_1)(y_3 - y_1) - (x_3 - x_1)(y_2 - y_1)
wherein i represents a direction vector on the X-axis, j represents a direction vector on the Y-axis, and k represents a direction vector on the Z-axis;
(2) calculating the rotation angle θ of the rotation from p1 to p2 about the axis n; θ is obtained from the dot product as follows:

p_1 \cdot p_2 = |p_1|\,|p_2| \cos\theta

\theta = \arccos\left(\frac{p_1 \cdot p_2}{|p_1|\,|p_2|}\right)
(3) calculating the rotation matrix from Rodrigues' formula:

R = \cos\theta \, I + (1 - \cos\theta)\, n n^{T} + \sin\theta \, n^{\wedge}

where R is the rotation matrix, θ is the rotation angle, I is the identity matrix, T denotes the transpose, and the symbol ^ denotes the operator that converts a vector into its antisymmetric (skew-symmetric) matrix.
10. The semantic mapping method based on example segmentation and VSLAM according to claim 1, wherein the specific operation of creating the background point cloud in each RGB-D keyframe frame point cloud in step S6 is: and removing all example point clouds in the frame point cloud aiming at one RGB-D key frame point cloud to obtain the background point cloud in the RGB-D key frame point cloud.
Application CN202111176088.9A (priority date 2021-10-09, filing date 2021-10-09): Semantic map construction method based on instance segmentation and VSLAM. Status: Active; granted as CN113916245B (en).

Priority Applications (1)

Application Number: CN202111176088.9A; Priority Date: 2021-10-09; Filing Date: 2021-10-09; Title: Semantic map construction method based on instance segmentation and VSLAM (granted as CN113916245B)

Applications Claiming Priority (1)

Application Number: CN202111176088.9A; Priority Date: 2021-10-09; Filing Date: 2021-10-09; Title: Semantic map construction method based on instance segmentation and VSLAM

Publications (2)

Publication Number: CN113916245A (en); Publication Date: 2022-01-11
Publication Number: CN113916245B (en); Publication Date: 2024-07-19

Family

ID=79238663

Family Applications (1)

Application Number: CN202111176088.9A; Title: Semantic map construction method based on instance segmentation and VSLAM; Priority Date: 2021-10-09; Filing Date: 2021-10-09; Status: Active, granted as CN113916245B (en)

Country Status (1)

Country: CN; Publication: CN113916245B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130314403A1 (en) * 2012-05-22 2013-11-28 Hon Hai Precision Industry Co., Ltd. Method for splicing point clouds together
US20190226852A1 (en) * 2016-09-09 2019-07-25 Nanyang Technological University Simultaneous localization and mapping methods and apparatus
US20190370983A1 (en) * 2017-02-10 2019-12-05 SZ DJI Technology Co., Ltd. System and method for real-time location tracking of a drone
WO2020125495A1 (en) * 2018-12-17 2020-06-25 中国科学院深圳先进技术研究院 Panoramic segmentation method, apparatus and device
CN109724603A (en) * 2019-01-08 2019-05-07 北京航空航天大学 An Indoor Robot Navigation Method Based on Environmental Feature Detection
CN109816686A (en) * 2019-01-15 2019-05-28 山东大学 Robot semantic SLAM method, processor and robot based on object instance matching
CN110243370A (en) * 2019-05-16 2019-09-17 西安理工大学 A 3D Semantic Map Construction Method for Indoor Environment Based on Deep Learning
CN110728751A (en) * 2019-06-19 2020-01-24 武汉科技大学 A Construction Method of Indoor 3D Point Cloud Semantic Map
CN111179426A (en) * 2019-12-23 2020-05-19 南京理工大学 Construction method of 3D semantic map of robot indoor environment based on deep learning
CN111060888A (en) * 2019-12-31 2020-04-24 芜湖哈特机器人产业技术研究院有限公司 Mobile robot repositioning method fusing ICP and likelihood domain model
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘志超;顾广杰;: "多站激光扫描点云配准算法研究", 海洋测绘, no. 05, 25 September 2018 (2018-09-25) *
刘忠泽;陈慧岩;崔星;熊光明;王羽纯;陶溢;: "无人平台越野环境下同步定位与地图创建", 兵工学报, no. 12, 15 December 2019 (2019-12-15) *
张东;黄腾;: "基于平面特征的地面雷达点云配准算法", 测绘科学, vol. 40, no. 11, 30 November 2015 (2015-11-30) *
赵矿军;: "基于RGB-D摄像机的室内三维彩色点云地图构建", 哈尔滨商业大学学报(自然科学版), no. 01, 15 February 2018 (2018-02-15) *
黄疆坪;丛杨;高宏伟;唐延东;于海斌;: "基于字典选择的机器人在线场景语义浓缩", 科学通报, no. 2, 20 December 2013 (2013-12-20), pages 112 - 120 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612494A (en) * 2022-03-11 2022-06-10 南京理工大学 Design method of mobile robot vision odometer in dynamic scene
CN114882182A (en) * 2022-04-22 2022-08-09 东南大学 Semantic map construction method based on vehicle-road cooperative sensing system

Also Published As

Publication Number: CN113916245B (en); Publication Date: 2024-07-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant