
CN109341703B - Visual SLAM algorithm adopting CNNs characteristic detection in full period - Google Patents

Visual SLAM algorithm adopting CNNs characteristic detection in full period

Info

Publication number
CN109341703B
CN109341703B (application CN201811087509.9A)
Authority
CN
China
Prior art keywords
training, layer, data set, CNNs, visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811087509.9A
Other languages
Chinese (zh)
Other versions
CN109341703A (en)
Inventor
赵永嘉
张宁
雷小永
戴树岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201811087509.9A
Publication of CN109341703A
Application granted
Publication of CN109341703B
Active legal status (current)
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30 Map- or contour-matching
    • G01C21/32 Structuring or formatting of map data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual SLAM algorithm that uses CNN feature detection over the full SLAM cycle. At the front end, the raw image data are first pre-trained with an unsupervised model; the pre-trained data are then passed through a CNN architecture that associates the joint representation of motion and depth with local changes in velocity and direction, thereby performing visual odometry; finally, path prediction is carried out. The invention also uses an OverFeat neural network model for the loop-closure detection stage, eliminating the accumulated error introduced by the front end and building a deep-learning-based visual SLAM framework. In addition, temporal and spatial continuity filters are constructed to verify the matching results, improving matching accuracy and eliminating mismatches. The invention offers clear advantages and potential for improving the accuracy of both visual odometry and loop-closure detection.

Description

Visual SLAM algorithm adopting CNNs characteristic detection in full period
Technical Field
The invention belongs to the technical field of simultaneous localization and mapping (SLAM) algorithms in computer vision, and in particular relates to a visual SLAM algorithm that employs CNN feature detection over the full period.
Background
SLAM stands for Simultaneous Localization and Mapping. It is an attractive research area and has found widespread use in robotics, navigation, and many other applications. Visual SLAM estimates camera motion from visual sensor information, e.g. sequential frames from one or more cameras, and attempts to construct a map of the surrounding environment. Current SLAM research mainly mounts several types of sensors on the robot body to estimate the motion of the robot and the features of the unknown environment, and fuses this information to accurately estimate the robot pose and model the scene in space. Although SLAM uses many types of sensors, including laser and vision, its processing generally comprises three parts: front-end visual odometry, back-end optimization and loop-closure detection.
A typical visual SLAM algorithm takes estimating the camera pose as its main goal and reconstructs a 3D map through multi-view geometry. To improve data-processing speed, some visual SLAM algorithms first extract sparse image features and realize visual odometry and loop-closure detection through matching between feature points, such as visual SLAM based on SIFT (scale-invariant feature transform) features [13] and visual SLAM based on ORB (oriented FAST and rotated BRIEF) features. SIFT and ORB features are widely used in visual SLAM owing to their robustness, discriminative power and fast processing speed. However, manually designed sparse image features currently have many limitations: on the one hand, how to design sparse features that optimally represent image information remains an open problem in computer vision; on the other hand, sparse features still struggle with illumination changes, dynamic target motion, camera parameter changes, and environments that lack texture or have repetitive texture. Conventional visual odometry (VO) estimates motion from visual information, such as sequential frames from one or more cameras. A common trait of most of these approaches is that they rely on keypoint detection and tracking, together with camera geometry, to estimate the visual range.
In recent years, learning-based methods have shown promising results in many areas of computer vision and can overcome the shortcomings of conventional visual SLAM algorithms (sparse image features struggle with illumination changes, dynamic target motion, camera parameter changes, and texture-less or single-texture environments). Models such as convolutional neural networks (CNNs) have proven very effective in various visual tasks, such as classification, localization and depth estimation. Unsupervised feature-learning models have demonstrated the ability to learn representations of local transformations in data through multiplicative interactions. Research shows that feeding data pre-trained with an unsupervised model into a CNN filters noise well and helps prevent overfitting.
Visual odometry (VO) is responsible for estimating the initial values of the trajectory and the map. By design, VO considers only the relation between adjacent frames. Unavoidable errors therefore accumulate over time, so the whole system drifts, long-term estimates become unreliable, and globally consistent trajectories and maps cannot be constructed. The loop-closure detection module provides constraints over longer time intervals beyond adjacent frames. The key to loop-closure detection is how to effectively detect that the camera has passed through the same place, which determines the correctness of the estimated trajectory and map over long periods. Consequently, loop-closure detection significantly improves the accuracy and robustness of the whole SLAM system.
Appearance-based loop-closure detection is essentially an image-similarity matching task: features of two images are matched to verify whether they were taken at the same place. The traditional loop-detection approach uses a "bag-of-words model" to generate a dictionary for feature matching. CNNs now show state-of-the-art performance in various classification tasks. Existing landmark test results show that deep features from different CNN layers consistently outperform SIFT in descriptor matching, indicating that SIFT or SURF may no longer be the preferred descriptors for matching tasks. The invention is therefore inspired by the excellent performance of CNNs in image classification and by their demonstrated feasibility for feature matching. The traditional bag-of-words method is abandoned, and loop-closure detection is carried out with a hierarchical image-feature extraction method based on CNNs, representative of deep learning. Deep learning is the mainstream recognition approach in the current computer vision field; it relies on a multi-layer neural network to learn hierarchical feature representations of an image and achieves higher accuracy in feature extraction and place recognition than traditional recognition methods.
Disclosure of Invention
In view of the above problems, the invention provides a visual SLAM algorithm that employs CNN feature detection over the full period: a SLAM system in which convolutional neural networks handle both the front end (VO) and the loop-closure detection of the SLAM pipeline, realizing a full-period deep-learning algorithm.
The invention discloses a visual SLAM algorithm adopting CNNs characteristic detection in a full period, which comprises the following steps:
Step 1: scan the surrounding environment with a binocular camera; use part of the collected video stream as a training data set and part as a test data set.
Step 2: pre-train the training data set from the video stream acquired in step 1 using a synchrony detection method.
Step 3: perform visual odometry using convolutional neural network training to obtain local changes in velocity and local changes in direction.
Step 4: recover the motion path of the camera using the local velocity change and local direction change information obtained in step 3.
Step 5: perform loop-closure detection with a convolutional neural network to eliminate the accumulated error of the path prediction.
The invention has the following advantages:
1. The full-period CNN-feature-detection visual SLAM algorithm uses a convolutional neural network at the front end. Compared with traditional front-end algorithms, it replaces complicated formula-based computation with learning, needs no manual feature extraction and matching, is concise and intuitive, and runs fast online.
2. The algorithm abandons the traditional bag-of-words approach to loop-closure detection and obtains better place-recognition results through accurate feature matching.
3. The algorithm learns deep-level features of the image through a neural network, so the recognition rate can reach a high level. Compared with traditional visual SLAM algorithms it improves loop-closure detection accuracy, expresses image information more fully, is more robust to environmental changes such as illumination and season, and can compute the similarity between two frames of images, thereby realizing a more concise visual odometer; moreover, by pre-training the neural network on a database, the design of the matching classifier is completed together with the feature design.
Drawings
FIG. 1 is an overall flow chart of the visual SLAM algorithm with full-period CNN feature detection according to the present invention;
FIG. 2 is a flow chart of the loop-closure detection stage in the visual SLAM algorithm with full-period CNN feature detection according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the visual SLAM algorithm with full-period CNN feature detection comprises the following steps:
Step 1: scan the surrounding environment.
Move the binocular camera within the square area, acquire image information of the real scene, and transmit the resulting video stream to a host computer in real time. The camera is moved for 1-2 loops so that a closed loop is formed, which facilitates compensation of the accumulated error in the subsequent loop-closure detection stage. This process is repeated; part of the collected video streams is used as a training data set and part as a test data set.
Step 2: pre-train the training data set from the video stream acquired in step 1 using a synchrony detection method.
To obtain a joint representation of camera motion and image depth information, the training data set is pre-trained with an unsupervised learning model (SAE-D) using stochastic gradient descent. The synchrony-based SAE-D is a single-layer model that extracts features from the training data set through local, Hebbian-type learning.
The SAE-D model is trained on randomly cropped local binocular patches from the training data set, each of size 16 × 5 (space × time), yielding feature information that jointly represents motion and depth. The jointly represented motion and depth features are then de-whitened and mapped back to image space, which completes the pre-training of the training data set. The pre-trained training data set is used to initialize the first layer of the CNN (convolutional neural network).
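As a rough illustration of this pre-training step, the sketch below implements a single-layer synchrony-style autoencoder trained with stochastic gradient descent. It is not the patent's exact implementation: the hidden size, the 16 × 16 × 5 patch shape (one reading of "16 × 5 (space × time)"), and the plain reconstruction loss are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SynchronyAutoencoder(nn.Module):
    """Single-layer synchrony-style autoencoder (SAE-D-like sketch).

    Left and right spatio-temporal patches are encoded with linear filters,
    combined multiplicatively (the "synchrony" interaction that couples
    motion and depth), and decoded back to the input space.
    """

    def __init__(self, patch_dim=16 * 16 * 5, hidden_dim=256):
        # patch_dim assumes a 16 x 16 spatial window over 5 frames;
        # hidden_dim is an arbitrary choice for this sketch.
        super().__init__()
        self.enc_left = nn.Linear(patch_dim, hidden_dim, bias=False)
        self.enc_right = nn.Linear(patch_dim, hidden_dim, bias=False)
        self.dec_left = nn.Linear(hidden_dim, patch_dim, bias=False)
        self.dec_right = nn.Linear(hidden_dim, patch_dim, bias=False)

    def forward(self, x_left, x_right):
        # Multiplicative interaction: the hidden code responds to how the
        # left/right filter responses co-vary, i.e. a joint motion/depth code.
        h = self.enc_left(x_left) * self.enc_right(x_right)
        return self.dec_left(h), self.dec_right(h), h

def pretrain(model, patch_loader, epochs=10, lr=1e-3):
    """Pre-train on randomly cropped, whitened stereo patches with SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x_l, x_r in patch_loader:        # tensors of shape (batch, patch_dim)
            rec_l, rec_r, _ = model(x_l, x_r)
            loss = mse(rec_l, x_l) + mse(rec_r, x_r)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

The learned encoder filters are what would be used to initialize the first CNN layer in step 3.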
Step 3: perform visual odometry using convolutional neural network (CNN) training.
A convolutional neural network (CNN) is a supervised learning model. The CNN is trained to associate the local depth and motion representations with local changes in speed and direction, thereby learning to perform visual odometry. Through the CNN architecture, the acquired joint representation of motion and depth is trained to correlate with the desired labels (direction and velocity changes).
The features obtained from SAE-D training in step 2 are fed into the first layers of two CNNs with identical architectures to initialize those first layers; the two CNNs output local changes in velocity and direction, respectively, and the remaining layers of the two CNNs associate the local velocity and direction changes with the desired labels.
Each CNN has 6 layers in total. The first layer is a 5 × 5 convolutional layer that learns features from the left and right images. The second layer multiplies the features extracted from the left and right convolutional layers element-wise. The third layer is a 1 × 1 convolutional layer, the fourth a pooling layer, the fifth a fully connected layer, and the final output layer is a Softmax layer.
The input to both CNNs is a 5-frame subsequence, and the target output is a vector representation of the local velocity and direction changes. The effectiveness and accuracy of the local velocity and direction change estimates can be evaluated on the binocular data set KITTI.
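The sketch below mirrors the 6-layer structure just described (5 × 5 convolutions on the left and right views, element-wise multiplication, 1 × 1 convolution, pooling, fully connected layer, Softmax). The channel counts, input resolution and number of output bins are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class VOBranch(nn.Module):
    """One of the two 6-layer CNNs: predicts either velocity or direction change."""

    def __init__(self, in_frames=5, feat_ch=64, n_bins=20):
        super().__init__()
        # Layer 1: 5x5 convolutions over the left and right 5-frame subsequences
        # (these filters can be initialized from the SAE-D features of step 2).
        self.conv_left = nn.Conv2d(in_frames, feat_ch, kernel_size=5, padding=2)
        self.conv_right = nn.Conv2d(in_frames, feat_ch, kernel_size=5, padding=2)
        # Layer 3: 1x1 convolution; layer 4: pooling.
        self.conv1x1 = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(4)
        # Layer 5: fully connected; layer 6: Softmax over discretized changes.
        self.fc = nn.Linear(feat_ch * 4 * 4, n_bins)

    def forward(self, left_seq, right_seq):
        # Layer 2: element-wise product of left and right feature maps.
        f = self.conv_left(left_seq) * self.conv_right(right_seq)
        f = self.pool(self.conv1x1(f))
        return torch.softmax(self.fc(f.flatten(1)), dim=1)

# Two independent branches, trained against discretized velocity-change and
# direction-change labels respectively.
velocity_net = VOBranch()
direction_net = VOBranch()
```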
And 4, step 4: camera motion path prediction
And for the whole video stream, discretely recovering the motion path of the camera by using the speed and direction change information of each 5-frame subsequence obtained in the step 3.
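A minimal dead-reckoning sketch of this path recovery is shown below. The patent only states that the path is recovered discretely from the per-subsequence changes, so the planar motion model, the initial state and the unit time step are assumptions of this example.

```python
import numpy as np

def recover_path(velocity_changes, direction_changes, v0=0.0, theta0=0.0, dt=1.0):
    """Discretely integrate per-subsequence changes into a 2-D camera path.

    velocity_changes, direction_changes: one value per 5-frame subsequence.
    """
    x, y, v, theta = 0.0, 0.0, v0, theta0
    path = [(x, y)]
    for dv, dtheta in zip(velocity_changes, direction_changes):
        v += dv            # apply the predicted local velocity change
        theta += dtheta    # apply the predicted local direction change
        x += v * dt * np.cos(theta)
        y += v * dt * np.sin(theta)
        path.append((x, y))
    return np.array(path)
```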
Step 5: perform loop-closure detection with a CNN to eliminate the accumulated error of the path prediction.
The local velocity and direction change information used in step 4 cannot be completely accurate and contains some error. As the error accumulates, the difference between the predicted path and the real path from the start point to the end point grows. A subsequent loop-closure detection stage is therefore needed to eliminate the accumulated error and reduce the gap between the predicted path and the real path. This stage consists of two parts: feature extraction with a convolutional neural network, and spatio-temporal filtering of place-matching hypotheses by comparing feature responses.
In the loop-closure detection stage a CNN-based algorithm is likewise used, but the CNN model applied here differs from the one above; the two are applied independently to visual odometry and to loop-closure detection. The goal is to eliminate the accumulated error of the camera motion path prediction obtained in step 4 and to achieve autonomous loop closure. The specific method is as follows.
and extracting image features by using a pre-trained convolutional neural network. The invention adopts overfeat convolution neural network to extract image characteristics.
The overfeat convolutional neural network was pre-trained on the ImageNet 2012 dataset, which consisted of 120 million images and 1000 classes. The overfeat convolutional neural network includes five convolution stages and three fully connected stages. The first two convolution stages consist of a convolution layer, a max-pooling layer and a rectifying (ReLU) nonlinear layer. The third and fourth convolution stages consist of convolutional layers, zero-padding layers, and ReLU nonlinear layers. The fifth stage comprises a convolutional layer, a zero-padding layer, a ReLU layer and a Maxpooling layer. Finally, the sixth and seventh fully connected stages contain one fully connected layer and one ReLU layer, while the eighth stage is an output layer containing only fully connected layers. The whole convolutional neural network has 21 layers.
When an image I is input into the network, it produces a series of hierarchical activations. In the invention, L_k(I), k = 1, …, 21, denotes the output of the k-th layer of the network for a given input image I. The output feature vector L_k(I) of each layer is a deep-learned representation of the image I; place recognition is performed by comparing the corresponding feature vectors of different images. The network can process any image of size 231 × 231 pixels or larger, so the input to the OverFeat convolutional neural network is an image resized to 256 × 256 pixels.
Therefore, the training data set and the test data set collected in step 1 are used as input, and features are extracted with the pre-trained OverFeat convolutional neural network.
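The sketch below illustrates collecting the per-layer responses L_k(I) of a pretrained network. OverFeat weights are not distributed with torchvision, so a pretrained AlexNet is used here purely as a stand-in; the network choice and the helper name layer_features are assumptions for illustration, not part of the patent.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in for the OverFeat network: a pretrained ImageNet classifier whose
# intermediate activations play the role of L_k(I).
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize((256, 256)),                      # images resized to 256 x 256 as in step 5
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def layer_features(image):
    """Return one flattened activation vector per layer, i.e. L_k(I), for a PIL image I."""
    layers = list(net.features) + [net.avgpool, torch.nn.Flatten(1)] + list(net.classifier)
    feats = []
    x = preprocess(image).unsqueeze(0)         # add batch dimension
    with torch.no_grad():
        for layer in layers:
            x = layer(x)
            feats.append(x.flatten().clone())  # deep representation of the image at this layer
    return feats
```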
Step 6: generate a mixing matrix by feature matching and carry out spatio-temporal continuity detection.
As shown in FIG. 2, the features extracted by the OverFeat convolutional neural network from the pictures in each test data set are matched against the features extracted from each training data set.
To compare how well the image features from each layer of the OverFeat convolutional neural network perform on scene recognition, a mixing matrix is further constructed from the features of each layer:
M_k(i, j) = d(L_k(I_i), L_k(I_j)),  i = 1, …, R,  j = 1, …, T
where I_i denotes the i-th frame image input from the training data set, I_j denotes the j-th frame image input from the test data set, L_k(I_i) denotes the k-th layer output corresponding to I_i, and M_k(i, j) denotes the Euclidean distance between the k-th layer features of training sample i and test sample j, i.e. it describes the degree of matching between the two; R and T denote the number of training images and the number of test images, respectively. Each column of the mixing matrix stores the feature vector differences between the j-th frame test image and all the training images.
To find the strongest place-matching hypothesis, the element with the lowest feature vector difference in each column of the mixing matrix is searched:
î(j) = arg min_{i = 1, …, R} M_k(i, j),  j = 1, …, T
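For concreteness, the mixing-matrix construction and the per-column minimum search given above might be realized as in the NumPy sketch below; the array shapes and function names are assumptions for illustration.

```python
import numpy as np

def mixing_matrix(train_feats, test_feats):
    """M_k(i, j): Euclidean distance between layer-k features of training
    image i and test image j.

    train_feats: (R, D) array, layer-k features of the R training images.
    test_feats:  (T, D) array, layer-k features of the T test images.
    """
    diff = train_feats[:, None, :] - test_feats[None, :, :]
    return np.linalg.norm(diff, axis=2)        # shape (R, T)

def best_matches(M):
    """For each test frame j, return the training frame index with the
    smallest feature-vector difference (strongest place-matching hypothesis)."""
    return np.argmin(M, axis=0)
```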
For the candidate place matches in the mixing matrix, spatial and temporal continuity filters are further constructed for joint verification, which improves the matching accuracy. At the same time, the performance of the features trained at each network layer is explored; the feature descriptions of the middle layers are found to work well for matching images with similar viewpoints, while the middle and later layers are more adaptive and robust to changes of scene viewpoint.
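The patent does not spell out the exact form of these filters; the snippet below is only one plausible reading of the temporal-continuity check, in which consecutive test frames are required to match nearby training frames (the threshold max_jump is an assumed parameter).

```python
def temporally_consistent(matches, max_jump=3):
    """Keep only test frames whose best match moves smoothly along the
    training sequence; max_jump is an assumed threshold."""
    keep = [0]
    for j in range(1, len(matches)):
        if abs(int(matches[j]) - int(matches[j - 1])) <= max_jump:
            keep.append(j)
    return keep
```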
With accurate place matching, the accumulated error caused by loop-free visual odometry can be compensated and a globally consistent trajectory can be constructed.

Claims (3)

1. A visual SLAM method with full-period CNN feature detection, comprising the following steps:
step 1: scanning surrounding environment information with a binocular camera; using part of the collected video streams as a training data set and part as a test data set;
step 2: pre-training the training data set from the video stream acquired in step 1 by a synchrony detection method; pre-training the training data set with an unsupervised learning model and stochastic gradient descent to obtain feature information jointly representing motion and depth, and then de-whitening the jointly represented motion and depth features and mapping them back to image space;
step 3: performing visual odometry using convolutional neural network training to obtain local changes in velocity and local changes in direction; inputting the features obtained from the unsupervised learning model training in step 2 into the first layers of two CNNs with identical architectures to initialize those first layers; the two CNNs output local changes in velocity and direction, respectively, and the remaining layers of the two CNNs associate the local velocity and direction changes with the desired labels;
each CNN has 6 layers in total: the first layer is a 5 × 5 convolutional layer that learns features from the left and right images separately; the second layer multiplies the features extracted from the left and right convolutional layers element-wise; the third layer is a 1 × 1 convolutional layer, the fourth a pooling layer, the fifth a fully connected layer, and the final output layer is a Softmax layer;
the input to both CNNs is a 5-frame subsequence, and the target output is a vector representation of the local velocity and direction changes; the effectiveness and accuracy of the local velocity and direction change information are evaluated on the binocular data set KITTI;
step 4: recovering the motion path of the camera using the local velocity change and local direction change information obtained in step 3;
step 5: performing loop-closure detection with a convolutional neural network to eliminate the accumulated error of the path prediction;
first, using the training data set and the test data set acquired in step 1 as input, extracting features with an OverFeat convolutional neural network pre-trained on the ImageNet data set; subsequently, matching the features extracted from the pictures in each test data set with the features extracted from each training data set;
constructing a mixing matrix using the features of each layer of the OverFeat convolutional neural network:
M_k(i, j) = d(L_k(I_i), L_k(I_j)),  i = 1, …, R,  j = 1, …, T
where I_i denotes the i-th frame image input from the training data set, I_j denotes the j-th frame image input from the test data set, L_k(I_i) denotes the k-th layer output corresponding to I_i, and M_k(i, j) denotes the Euclidean distance between the k-th layer features of training sample i and test sample j, i.e. it describes the degree of matching between the two; R and T denote the number of training images and the number of test images, respectively; each column of the mixing matrix stores the feature vector differences between the j-th frame test image and all the training images;
searching the element with the lowest feature vector difference in each column of the mixing matrix;
î(j) = arg min_{i = 1, …, R} M_k(i, j),  j = 1, …, T.
2. The visual SLAM method with full-period CNN feature detection according to claim 1, wherein in step 1 the binocular camera is moved along an annular area for 1-2 loops, forming a closed loop.
3. The visual SLAM method with full-period CNN feature detection according to claim 1, wherein in step 3 the features obtained from the training in step 2 are input into the first CNN layers with the same structure, and the obtained joint representation of motion and depth is trained through the CNN architecture to be associated with the desired labels.
CN201811087509.9A 2018-09-18 2018-09-18 Visual SLAM algorithm adopting CNNs characteristic detection in full period Active CN109341703B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811087509.9A CN109341703B (en) 2018-09-18 2018-09-18 Visual SLAM algorithm adopting CNNs characteristic detection in full period

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811087509.9A CN109341703B (en) 2018-09-18 2018-09-18 Visual SLAM algorithm adopting CNNs characteristic detection in full period

Publications (2)

Publication Number Publication Date
CN109341703A CN109341703A (en) 2019-02-15
CN109341703B true CN109341703B (en) 2022-07-01

Family

ID=65305452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811087509.9A Active CN109341703B (en) 2018-09-18 2018-09-18 Visual SLAM algorithm adopting CNNs characteristic detection in full period

Country Status (1)

Country Link
CN (1) CN109341703B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840598B (en) * 2019-04-29 2019-08-09 深兰人工智能芯片研究院(江苏)有限公司 A kind of method for building up and device of deep learning network model
CN110146099B (en) * 2019-05-31 2020-08-11 西安工程大学 Synchronous positioning and map construction method based on deep learning
CN110296705B (en) * 2019-06-28 2022-01-25 苏州瑞久智能科技有限公司 Visual SLAM loop detection method based on distance metric learning
CN110399821B (en) * 2019-07-17 2023-05-30 上海师范大学 Customer satisfaction acquisition method based on facial expression recognition
CN110487274B (en) * 2019-07-30 2021-01-29 中国科学院空间应用工程与技术中心 SLAM method and system for weak texture scene, navigation vehicle and storage medium
CN110738128A (en) * 2019-09-19 2020-01-31 天津大学 repeated video detection method based on deep learning
CN110659619A (en) * 2019-09-27 2020-01-07 昆明理工大学 Depth space-time information-based correlation filtering tracking method
CN111144550A (en) * 2019-12-27 2020-05-12 中国科学院半导体研究所 Simplex deep neural network model based on homologous continuity and construction method
CN111243021A (en) * 2020-01-06 2020-06-05 武汉理工大学 Vehicle-mounted visual positioning method and system based on multiple combined cameras and storage medium
CN111241986B (en) * 2020-01-08 2021-03-30 电子科技大学 Visual SLAM closed loop detection method based on end-to-end relationship network
CN111753789A (en) * 2020-07-01 2020-10-09 重庆邮电大学 Robot vision SLAM closed loop detection method based on stack type combined self-encoder
CN113066152B (en) * 2021-03-18 2022-05-27 内蒙古工业大学 AGV map construction method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106651830A (en) * 2016-09-28 2017-05-10 华南理工大学 Image quality test method based on parallel convolutional neural network
CN106780631B (en) * 2017-01-11 2020-01-03 山东大学 Robot closed-loop detection method based on deep learning
CN107563430A (en) * 2017-08-28 2018-01-09 昆明理工大学 A kind of convolutional neural networks algorithm optimization method based on sparse autocoder and gray scale correlation fractal dimension
CN107808132A (en) * 2017-10-23 2018-03-16 重庆邮电大学 A kind of scene image classification method for merging topic model
CN107944386B (en) * 2017-11-22 2019-11-22 天津大学 Visual scene recognition methods based on convolutional neural networks

Also Published As

Publication number Publication date
CN109341703A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN109341703B (en) Visual SLAM algorithm adopting CNNs characteristic detection in full period
Zhou et al. To learn or not to learn: Visual localization from essential matrices
Schönberger et al. Semantic visual localization
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
CN110781262B (en) Semantic map construction method based on visual SLAM
CN110135249B (en) Human behavior identification method based on time attention mechanism and LSTM (least Square TM)
CN113313763B (en) Monocular camera pose optimization method and device based on neural network
Vaquero et al. Dual-branch CNNs for vehicle detection and tracking on LiDAR data
Tinchev et al. Skd: Keypoint detection for point clouds using saliency estimation
Saleem et al. Neural network-based recent research developments in SLAM for autonomous ground vehicles: A review
CN113781563B (en) Mobile robot loop detection method based on deep learning
Getahun et al. A deep learning approach for lane detection
CN112767546B (en) Binocular image-based visual map generation method for mobile robot
Feng et al. Localization and mapping using instance-specific mesh models
Tsintotas et al. The revisiting problem in simultaneous localization and mapping
Felton et al. Deep metric learning for visual servoing: when pose and image meet in latent space
Jo et al. Mixture density-PoseNet and its application to monocular camera-based global localization
Xi et al. Multi-motion segmentation: Combining geometric model-fitting and optical flow for RGB sensors
CN111862147B (en) Tracking method for multiple vehicles and multiple lines of human targets in video
Esfahani et al. From local understanding to global regression in monocular visual odometry
Tsintotas et al. Online Appearance-Based Place Recognition and Mapping: Their Role in Autonomous Navigation
CN111578956A (en) Visual SLAM positioning method based on deep learning
CN116958057A (en) Strategy-guided visual loop detection method
Han et al. BASL-AD SLAM: A Robust Deep-Learning Feature-Based Visual SLAM System With Adaptive Motion Model
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant