
CN108491880B - Object classification and pose estimation method based on neural network - Google Patents

Object classification and pose estimation method based on neural network

Info

Publication number
CN108491880B
CN108491880B
Authority
CN
China
Prior art keywords
layer
pixels
size
neural network
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810243399.4A
Other languages
Chinese (zh)
Other versions
CN108491880A (en)
Inventor
张向东 (Zhang Xiangdong)
张泽宇 (Zhang Zeyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810243399.4A priority Critical patent/CN108491880B/en
Publication of CN108491880A publication Critical patent/CN108491880A/en
Application granted granted Critical
Publication of CN108491880B publication Critical patent/CN108491880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object classification and pose estimation method based on a neural network, which mainly addresses the low accuracy of prior-art approaches that use a convolutional neural network for object detection and pose estimation. The implementation scheme is as follows: 1) obtain multi-view images of each CAD model in the data set; 2) construct a joint detection mathematical model from the multi-view images of the CAD models; 3) construct a convolutional neural network and train it with the multi-view images of the CAD models; 4) input the multi-view images of each CAD model in the test set into the trained network and output the class labels and pose labels it predicts. The invention fuses the shallow and deep feature maps of the neural network, so that the combined feature map retains both the rich pose information of the shallow layers and the strong classification information of the deep layers, improving the accuracy of classification and pose estimation. The method can be used for grasping by intelligent mechanical arms and robots.

Description

Object classification and pose estimation method based on neural network
Technical Field
The invention belongs to the field of artificial intelligence and relates to an object classification and pose estimation method which can be used for grasping by intelligent mechanical arms and robots.
Background
The convolutional neural network (CNN) is a feedforward neural network composed of convolutional layers, fully-connected layers, pooling layers and activation layers. Compared with a traditional fully-connected neural network, a CNN applies local connectivity and weight sharing, so that neurons on the same feature map share the same weights; this greatly reduces the number of network parameters and the complexity of the network. The activation function has also evolved from the sigmoid to the unilaterally suppressed ReLU, and such improvements allow artificial neurons to better approximate the activation characteristics of biological neurons. In addition, a CNN avoids complex image pre-processing such as hand-crafted feature extraction and data reconstruction, and can take the original image directly as input. Gradient descent and the chain rule of differentiation allow the network to iterate between forward propagation and backward propagation, continuously improving detection accuracy. Among the many deep learning frameworks, Caffe is a common one and is widely applied to video and image processing. Its modular design, separation of representation and implementation, convenient switching between GPU and CPU, and the Python and Matlab interfaces it provides make it convenient to adjust the network structure and train the network with Caffe.
In recent years, deep learning has made remarkable progress in image classification, object detection, semantic segmentation, instance segmentation and related tasks. A general vision system needs to solve two problems: object classification and object pose estimation, where pose estimation refers to estimating the pose of an object relative to the camera. Object pose estimation is crucial in many applications such as robotic grasping. However, object classification and pose estimation place contradictory demands on the learned features: a classification system must classify objects correctly regardless of their pose and therefore learns features that are invariant to the viewpoint, whereas pose estimation requires features that preserve the geometric and visual cues that distinguish different poses. For a convolutional neural network, the shallow feature maps tend to contain generic, class-agnostic features but preserve more differences between poses, while the deep feature maps are more abstract, with more pronounced class features, but their high level of abstraction obscures pose-specific information. Existing detection methods generally select the features of an intermediate layer, whose classification and pose estimation performance are both acceptable; this is a compromise and cannot make the accuracy of object detection and pose estimation optimal at the same time.
In 2015, the MVCNN method proposed by Hang Su et al. converts 3D sample data into 2D multi-view pictures, reducing the data dimensionality while preserving detection accuracy and simplifying processing; it extracts features from the pictures of all views of an object and merges the information of all views. In a real scene, however, the target object may be occluded or truncated, which makes it difficult to collect multi-view images of the object from all predefined viewpoints, so the method does not meet the requirements of practical scenes.
Disclosure of Invention
The invention aims to provide an object classification and pose estimation method based on a neural network that addresses the above defects of the prior art, so as to improve the accuracy of object detection and pose estimation, accelerate detection and meet the requirements of practical scenes.
The technical idea of the invention is as follows: shallow features and deep features in the convolutional neural network are fused to improve the accuracy of object detection and pose estimation, and detection is accelerated by iterating over the images of only part of the viewpoints of the detected object. The implementation scheme comprises the following steps:
(1) obtaining a training set and a test set, and generating the images corresponding to each CAD model:
3429 CAD models are taken from the ModelNet10 data set as the training set, and 1469 CAD models as the test set;
for the CAD model of each sample in the ModelNet10 dataset, two strategies are performed in sequence: the first sets 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located and collects an image of the CAD model at each of the 12 predefined viewpoints; the second places the CAD model at the center of a regular dodecahedron, takes the 20 vertices of the regular dodecahedron as predefined viewpoints, and collects an image of the CAD model at each of the 20 predefined viewpoints;
(2) constructing a mathematical model of joint detection according to a multi-view image obtained by preprocessing each CAD model in the data set:
(2a) taking the pose label of each view image of a CAD model as a hidden variable, denoted {v_i};
(2b) defining the M different-view images {x_1, ..., x_M} of a CAD model, together with the class label y ∈ {1, ..., N} of the CAD model, as one training sample, where N is the total number of CAD model classes and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to this definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P(ŷ_i = y | x_i, v_i; R)

where R is the neural network weight parameter, ŷ_i is the class label predicted by the neural network, and P(ŷ_i = y | x_i, v_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN equals y;
(3) constructing and training a convolutional neural network CNN:
(3a) on the basis of the existing AlexNet network, an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer are added to obtain a 16-layer convolutional neural network CNN, wherein:
the Eltwise1 layer fuses, position by position, the feature maps of the Conv3 layer and the Conv4 layer of the AlexNet network;
the fc_a1 layer maps the Eltwise1 feature maps into a feature vector;
the fc_a2 layer maps the Pool5 features of the AlexNet network into a feature vector;
the Eltwise2 layer fuses, position by position, the feature vectors of the fc_a1 layer, the fc_a2 layer and the fc7 layer;
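The element-wise fusion performed by the two Eltwise layers above can be sketched as follows. This is a minimal illustration assuming element-wise summation (the default operation of Caffe's Eltwise layer); the patent itself only states that corresponding positions of the feature maps are fused:

```python
import numpy as np

def eltwise_sum(*feature_maps):
    """Fuse feature maps of identical shape by element-wise (position-wise) summation."""
    assert all(fm.shape == feature_maps[0].shape for fm in feature_maps)
    return np.sum(feature_maps, axis=0)

# Example: fusing the Conv3 and Conv4 outputs (both 384 maps of 13 x 13 pixels).
conv3_out = np.random.randn(384, 13, 13).astype(np.float32)
conv4_out = np.random.randn(384, 13, 13).astype(np.float32)
eltwise1_out = eltwise_sum(conv3_out, conv4_out)  # still 384 x 13 x 13
```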
(3b) inputting the multi-view images {x_1, ..., x_M} of each CAD model in the training set into the convolutional network, iterating the forward computation and backward propagation of the convolutional neural network CNN to train the network, and optimizing the network parameter R until the loss function J(θ) of the network is less than or equal to 0.0001, to obtain the trained convolutional neural network CNN;
(4) testing the network:
inputting the multi-view images {x_1, ..., x_M} of each CAD model in the ModelNet10 test set into the trained neural network, and counting the accuracy of object classification and pose estimation.
Compared with the prior art, the invention has the following advantages:
1. Because the invention fuses, position by position, the elements of feature maps at different depths of the convolutional neural network, the fused feature map contains both the rich pose information of the shallow feature maps and the abstract, well-defined classification information of the deep feature maps, which improves detection accuracy.
2. The method generates corresponding multi-view images for each 3D CAD model in the data set, i.e. it converts the 3D sample data into 2D multi-view images; this dimensionality reduction lowers the complexity of the data, reduces the computation required for feature extraction and accelerates detection.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram illustrating two predefined viewpoint strategies in the present invention;
FIG. 3 is a structural diagram of the convolutional neural network CNN constructed in the present invention.
Detailed Description
The following describes examples and effects of the present invention in further detail with reference to the accompanying drawings.
Referring to FIG. 1, the implementation steps of the invention are as follows:
Step 1, obtaining the multi-view images of the CAD models.
The CAD model of each sample in the ModelNet10 dataset is preprocessed with the two strategies in turn.
As shown in FIG. 2(a), the first preprocessing strategy sets 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located: an axis is fixed as the rotation axis and an observation point is placed every 30 degrees on the viewing circle around the object, so that 12 images of different views of each CAD model are obtained over the 360-degree circle;
as shown in fig. 2(b), the second preprocessing strategy is to place the CAD model in the center of the regular dodecahedron, set 20 vertices of the regular dodecahedron to predefined viewpoints, and acquire images corresponding to the CAD model at each of the 20 predefined viewpoints.
Step 2, constructing the joint detection mathematical model from the multi-view images obtained by preprocessing each CAD model in the data set.
(2a) Taking the pose label of each view image of a CAD model as a hidden variable, denoted {v_i};
(2b) Defining the M different-view images {x_1, ..., x_M} of a CAD model, together with the class label y ∈ {1, ..., N} of the CAD model, as one training sample, where N is the total number of CAD model classes, the x_i are the view images, and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) According to this definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P(ŷ_i = y | x_i, v_i; R)

where R is the neural network weight parameter, ŷ_i is the class label predicted by the neural network, and P(ŷ_i = y | x_i, v_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN equals y;
will be provided with
Figure BDA0001605901570000054
Is marked as
Figure BDA0001605901570000055
The optimization problem is expressed in the form:
Figure BDA0001605901570000056
wherein (i) represents the input image xiAnd k denotes an image xiJ represents the image xiIs observed from the jth predefined viewpoint.
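As a toy illustration of this formulation (the numbers below are purely hypothetical; in the method the probabilities come from the Softmax layer of the CNN), the hidden view label of an image is the viewpoint hypothesis under which its ground-truth class receives the highest probability:

```python
import numpy as np

# Hypothetical Softmax outputs P_k^(i,j) for one image x_i: rows are the M = 3 viewpoint
# hypotheses j, columns are the N = 4 classes k.
P_ij = np.array([[0.10, 0.60, 0.20, 0.10],
                 [0.05, 0.85, 0.05, 0.05],
                 [0.30, 0.40, 0.20, 0.10]])
y = 1  # ground-truth class label of the CAD model

# Choose the hidden view label v_i that maximizes log P_y^(i, v_i).
v_i = int(np.argmax(np.log(P_ij[:, y])))
print(v_i, P_ij[v_i, y])  # 1 0.85
```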
Step 3, constructing the convolutional neural network CNN.
(3a) Constructing a convolutional neural network CNN containing 16 layers as shown in FIG. 3, wherein the 16 layers are a first convolutional layer Conv1, a first pooling layer Pool1, a second convolutional layer Conv2, a second pooling layer Pool2, a third convolutional layer Conv3, a fourth convolutional layer Conv4, a first feature fusion layer Eltwise1, a fifth convolutional layer Conv5, a fifth pooling layer Pool5, a first fully-connected layer fc _ a1, a second fully-connected layer fc _ a2, a third fully-connected layer fc6, a fourth fully-connected layer fc7, a second feature fusion layer Eltwise2, a fifth fully-connected layer fc8 and a classification layer Softmax in sequence, and feature extraction details of each layer are as follows:
(3a1) The image of size 227 × 227 pixels is input to the first convolutional layer Conv1 and convolved with kernels of size 11 × 11 pixels at a stride of 4 pixels; 96 kernels are used in total, giving 96 feature maps of size 55 × 55 pixels;
(3a2) The 96 feature maps output by the first convolutional layer Conv1 are input to the first pooling layer Pool1 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 96 feature maps of size 27 × 27 pixels;
(3a3) The 96 feature maps output by the first pooling layer Pool1 are input to the second convolutional layer Conv2 and convolved with kernels of size 5 × 5 pixels at a stride of 1 pixel; 256 kernels are used in total, giving 256 feature maps of size 27 × 27 pixels;
(3a4) The 256 feature maps output by the second convolutional layer Conv2 are input to the second pooling layer Pool2 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 13 × 13 pixels;
(3a5) The 256 feature maps output by the second pooling layer Pool2 are input to the third convolutional layer Conv3 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a6) The 384 feature maps output by the third convolutional layer Conv3 are input to the fourth convolutional layer Conv4 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a7) The feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are input to the first fusion layer Eltwise1 for feature map fusion, giving 384 feature maps of size 13 × 13 pixels;
(3a8) The 384 feature maps output by the fourth convolutional layer Conv4 are input to the fifth convolutional layer Conv5 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 256 kernels are used, giving 256 feature maps of size 13 × 13 pixels;
(3a9) The 256 feature maps output by the fifth convolutional layer Conv5 are input to the fifth pooling layer Pool5 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 6 × 6 pixels;
(3a10) The 384 feature maps output by the first fusion layer Eltwise1 are input to the first fully-connected layer fc_a1 and mapped into a feature vector of size 1 × 4096;
(3a11) The 256 feature maps output by the fifth pooling layer Pool5 are input to the second fully-connected layer fc_a2 and mapped into a feature vector of size 1 × 4096;
(3a12) The 256 feature maps output by the fifth pooling layer Pool5 are input to the third fully-connected layer fc6 and mapped into a feature vector of size 1 × 4096;
(3a13) The feature vector of size 1 × 4096 output by the third fully-connected layer fc6 is input to the fourth fully-connected layer fc7 for further feature extraction, giving a feature vector of size 1 × 4096;
(3a14) The feature vectors of the first fully-connected layer fc_a1, the second fully-connected layer fc_a2 and the fourth fully-connected layer fc7 are input to the second fusion layer Eltwise2 and fused, giving a feature vector of size 1 × 4096;
(3a15) The feature vector of size 1 × 4096 output by the second fusion layer Eltwise2 is input to the fifth fully-connected layer fc8 and mapped into a feature vector of size 1 × (11 × M), where M is the number of multi-view images and the symbol "×" denotes multiplication;
(3a16) The feature vector of size 1 × (11 × M) is input to the classification layer Softmax to obtain the class probabilities of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label;
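The layer-by-layer construction above can be summarized in code. Below is a minimal PyTorch sketch of the 16-layer structure; it is an illustrative approximation rather than the patent's Caffe network definition, and the padding values, the use of element-wise summation for the two fusion layers, the ReLU placement and the view-major ordering of the fc8 output are all assumptions chosen so that the feature-map sizes stated above work out:

```python
import torch
import torch.nn as nn

class MultiViewPoseNet(nn.Module):
    """Illustrative 16-layer network: AlexNet-style trunk plus Eltwise1, fc_a1, fc_a2, Eltwise2."""
    def __init__(self, num_classes=11, num_views=12):   # 11 outputs per view, as in (3a15)
        super().__init__()
        self.num_classes, self.num_views = num_classes, num_views
        self.conv1 = nn.Conv2d(3, 96, 11, stride=4)      # 227 -> 55
        self.pool1 = nn.MaxPool2d(3, stride=2)           # 55 -> 27
        self.conv2 = nn.Conv2d(96, 256, 5, padding=2)    # 27 -> 27
        self.pool2 = nn.MaxPool2d(3, stride=2)           # 27 -> 13
        self.conv3 = nn.Conv2d(256, 384, 3, padding=1)   # 13 -> 13
        self.conv4 = nn.Conv2d(384, 384, 3, padding=1)   # 13 -> 13
        self.conv5 = nn.Conv2d(384, 256, 3, padding=1)   # 13 -> 13
        self.pool5 = nn.MaxPool2d(3, stride=2)           # 13 -> 6
        self.fc_a1 = nn.Linear(384 * 13 * 13, 4096)      # from Eltwise1
        self.fc_a2 = nn.Linear(256 * 6 * 6, 4096)        # from Pool5
        self.fc6 = nn.Linear(256 * 6 * 6, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes * num_views)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: (B, 3, 227, 227)
        x = self.pool1(self.relu(self.conv1(x)))         # (3a1)-(3a2)
        x = self.pool2(self.relu(self.conv2(x)))         # (3a3)-(3a4)
        c3 = self.relu(self.conv3(x))                    # (3a5)
        c4 = self.relu(self.conv4(c3))                   # (3a6)
        eltwise1 = c3 + c4                               # (3a7) element-wise fusion
        p5 = self.pool5(self.relu(self.conv5(c4)))       # (3a8)-(3a9)
        fa1 = self.relu(self.fc_a1(eltwise1.flatten(1))) # (3a10)
        fa2 = self.relu(self.fc_a2(p5.flatten(1)))       # (3a11)
        f7 = self.relu(self.fc7(self.relu(self.fc6(p5.flatten(1)))))  # (3a12)-(3a13)
        eltwise2 = fa1 + fa2 + f7                        # (3a14) element-wise fusion
        scores = self.fc8(eltwise2)                      # (3a15) 11 * M outputs
        # (3a16) softmax over the 11 classes for each of the M viewpoint hypotheses
        return torch.softmax(scores.view(-1, self.num_views, self.num_classes), dim=2)

net = MultiViewPoseNet()
print(net(torch.randn(2, 3, 227, 227)).shape)  # torch.Size([2, 12, 11])
```

The Softmax output for one image is then an M × 11 table of class probabilities, from which the viewpoint hypothesis that gives the correct class its highest probability provides the pose label, as in step (3a16).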
and 4, training the convolutional neural network CNN.
(3b1) In the forward propagation stage, a training sample is taken from the training set and its multi-view images {x_1, ..., x_M} are input to the input layer of the convolutional neural network CNN; after feature extraction and feature mapping, the Softmax layer outputs the final result;
(3b2) In the back propagation stage, the difference between the actual output of the CNN and the ideal output for the training sample is computed, and the weight parameter R of the CNN is adjusted by back propagation so as to minimize the error;
(3b3) The operations of (3b1) and (3b2) are repeated until the loss function J(θ) of the convolutional neural network CNN is less than or equal to 0.0001, giving the trained neural network.
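A compact sketch of this train-until-threshold loop is given below. It is illustrative only: the patent does not specify the loss function, the optimizer or the learning rate, so a negative log-likelihood loss on the correct class under the labelled viewpoint hypothesis, SGD with momentum, and train_loader (a hypothetical data loader yielding batches of image, class-label and view-label tensors) are all assumptions:

```python
import torch
import torch.nn as nn

def train(net, train_loader, threshold=1e-4, lr=0.01):
    """Iterate forward and backward passes until the loss J(theta) drops to the threshold."""
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    criterion = nn.NLLLoss()
    loss = torch.tensor(float("inf"))
    while loss.item() > threshold:
        for images, class_labels, view_labels in train_loader:
            probs = net(images)                                    # forward pass: (B, M, 11)
            # probability of each sample's class under its labelled viewpoint hypothesis
            p = probs[torch.arange(len(images)), view_labels]      # (B, 11)
            loss = criterion(torch.log(p + 1e-12), class_labels)   # J(theta)
            optimizer.zero_grad()
            loss.backward()                                        # backward pass
            optimizer.step()                                       # adjust weight parameter R
            if loss.item() <= threshold:
                break
    return net
```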
Step 5, testing the network.
The multi-view images {x_1, ..., x_M} of each CAD model in the ModelNet10 test set are input into the trained neural network, which outputs the class label and the pose label it predicts;
the percentage of CAD models in the test set whose class label is predicted wrongly, and the percentage whose pose label is predicted wrongly, are counted relative to all CAD models in the test set; subtracting these error rates from 100% gives the object classification and pose estimation accuracies.
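A small sketch of this accuracy computation (illustrative; predictions and ground_truth are hypothetical lists of (class label, pose label) pairs for the test set):

```python
def classification_and_pose_accuracy(predictions, ground_truth):
    """Accuracy = 100% minus the percentage of test models whose label was predicted wrongly."""
    total = len(ground_truth)
    wrong_class = sum(p[0] != g[0] for p, g in zip(predictions, ground_truth))
    wrong_pose = sum(p[1] != g[1] for p, g in zip(predictions, ground_truth))
    return 100.0 * (1 - wrong_class / total), 100.0 * (1 - wrong_pose / total)

# With the error counts reported below (77 wrong class labels and 609 wrong pose labels
# out of 1469 test models) this gives 94.76% classification accuracy and roughly 58.5%
# pose estimation accuracy.
```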
The effects of the present invention are further described below in combination with simulation results:
1. Simulation conditions
The operating system used in the simulation experiment is a 64-bit Ubuntu system, the CPU is an Intel Core i3 at 4.2 GHz, the memory is 16.00 GB, the GPU is a GeForce GTX 1070, and the deep learning framework used is Caffe2.
2. Contents and results of the experiments
In the experiment, the ModelNet10 data set is used to train and test the network. The ModelNet10 data set contains 4898 CAD models of 10 categories, of which 3429 form the training set and 1469 form the test set; multi-view images are generated for each CAD model in the data set.
The multi-view images of the test-set samples are input into the trained convolutional network; the neural network predicts a wrong class label for 77 CAD models and a wrong pose label for 609 CAD models. The classification and pose estimation accuracies of the network are obtained from these counts and compared with several existing detection methods, as shown in the following table:
TABLE 1
Method          Classification accuracy (%)    Pose estimation accuracy (%)
The invention   94.76                          58.52
RotationNet     94.38                          58.33
MVCNN           92.10                          -
FusionNet       90.80                          -
Here RotationNet is a rotation-iteration algorithm, MVCNN is a multi-view merging algorithm, and FusionNet is a feature fusion algorithm; these are advanced object recognition and pose estimation methods in the prior art.
As can be seen from Table 1, the method provided by the present invention, which fuses feature maps from layers of different depths of the network, improves the accuracy of both classification and pose estimation.

Claims (5)

1. A method for object classification and pose estimation based on a neural network comprises the following steps:
(1) obtaining a training set and a test set, and generating the images corresponding to each CAD model:
3429 CAD models are taken from the ModelNet10 data set as the training set, and 1469 CAD models as the test set;
for the CAD model of each sample in the ModelNet10 dataset, two strategies are performed in sequence: the first sets 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located and collects an image of the CAD model at each of the 12 predefined viewpoints; the second places the CAD model at the center of a regular dodecahedron, takes the 20 vertices of the regular dodecahedron as predefined viewpoints, and collects an image of the CAD model at each of the 20 predefined viewpoints;
(2) constructing a mathematical model of joint detection according to a multi-view image obtained by preprocessing each CAD model in the data set:
(2a) taking the view label of each image of a CAD model as a hidden variable, denoted {v_i};
(2b) defining the M different-view images {x_1, ..., x_M} of a CAD model, together with the class label y ∈ {1, ..., N} of the CAD model, as one training sample, where N is the total number of CAD model classes and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to this definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P(ŷ_i = y | x_i, v_i; R)

where R is the neural network weight parameter, ŷ_i is the class label predicted by the neural network, and P(ŷ_i = y | x_i, v_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN equals y;
(3) constructing and training a convolutional neural network CNN:
(3a) on the basis of the existing AlexNet network, an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer are added to obtain a 16-layer convolutional neural network CNN, wherein:
the Eltwise1 layer fuses, position by position, the feature maps of the Conv3 layer and the Conv4 layer of the AlexNet network;
the fc_a1 layer maps the Eltwise1 feature maps into a feature vector;
the fc_a2 layer maps the Pool5 features of the AlexNet network into a feature vector;
the Eltwise2 layer fuses, position by position, the feature vectors of the fc_a1 layer, the fc_a2 layer and the fc7 layer;
(3b) inputting the multi-view images {x_1, ..., x_M} of each CAD model in the training set into the convolutional network, iterating the forward computation and backward propagation of the convolutional neural network CNN to train the network, and optimizing the network parameter R until the loss function J of the network is less than or equal to 0.0001, to obtain the trained convolutional neural network CNN;
(4) testing the network:
inputting the multi-view images {x_1, ..., x_M} of each CAD model in the ModelNet10 test set into the trained neural network, and counting the accuracy of object classification and pose estimation.
2. The method of claim 1, wherein the first preprocessing strategy in step (1) sets the 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located by fixing an axis as the rotation axis and placing an observation point every 30 degrees on the viewing circle around the object, so that images of 12 different views of each CAD model are obtained over the 360-degree circle.
3. The method of claim 1, wherein the optimization problem in step (2c) is rewritten as follows:
denoting by P_k^{(i,j)} the probability that image x_i, observed from the j-th predefined viewpoint, is assigned class k, the optimization problem is expressed in the form:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P_y^{(i, v_i)}

where the index i identifies the input image x_i, k denotes the class of image x_i, j indicates that image x_i is observed from the j-th predefined viewpoint, and R is the neural network weight parameter.
4. The method according to claim 1, wherein the convolutional neural network CNN comprising 16 layers is constructed in step (3a) by the following steps:
(3a1) The image of size 227 × 227 pixels is input to the first convolutional layer Conv1 and convolved with kernels of size 11 × 11 pixels at a stride of 4 pixels; 96 kernels are used in total, giving 96 feature maps of size 55 × 55 pixels;
(3a2) The 96 feature maps output by the first convolutional layer Conv1 are input to the first pooling layer Pool1 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 96 feature maps of size 27 × 27 pixels;
(3a3) The 96 feature maps output by the first pooling layer Pool1 are input to the second convolutional layer Conv2 and convolved with kernels of size 5 × 5 pixels at a stride of 1 pixel; 256 kernels are used in total, giving 256 feature maps of size 27 × 27 pixels;
(3a4) The 256 feature maps output by the second convolutional layer Conv2 are input to the second pooling layer Pool2 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 13 × 13 pixels;
(3a5) The 256 feature maps output by the second pooling layer Pool2 are input to the third convolutional layer Conv3 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a6) The 384 feature maps output by the third convolutional layer Conv3 are input to the fourth convolutional layer Conv4 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a7) The feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are input to the first fusion layer Eltwise1 for feature map fusion, giving 384 feature maps of size 13 × 13 pixels;
(3a8) The 384 feature maps output by the fourth convolutional layer Conv4 are input to the fifth convolutional layer Conv5 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 256 kernels are used, giving 256 feature maps of size 13 × 13 pixels;
(3a9) The 256 feature maps output by the fifth convolutional layer Conv5 are input to the fifth pooling layer Pool5 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 6 × 6 pixels;
(3a10) The 384 feature maps output by the first fusion layer Eltwise1 are input to the first fully-connected layer fc_a1 and mapped into a feature vector of size 1 × 4096;
(3a11) The 256 feature maps output by the fifth pooling layer Pool5 are input to the second fully-connected layer fc_a2 and mapped into a feature vector of size 1 × 4096;
(3a12) The 256 feature maps output by the fifth pooling layer Pool5 are input to the third fully-connected layer fc6 and mapped into a feature vector of size 1 × 4096;
(3a13) The feature vector of size 1 × 4096 output by the third fully-connected layer fc6 is input to the fourth fully-connected layer fc7 for further feature extraction, giving a feature vector of size 1 × 4096;
(3a14) The feature vectors of the first fully-connected layer fc_a1, the second fully-connected layer fc_a2 and the fourth fully-connected layer fc7 are input to the second fusion layer Eltwise2 and fused, giving a feature vector of size 1 × 4096;
(3a15) The feature vector of size 1 × 4096 output by the second fusion layer Eltwise2 is input to the fifth fully-connected layer fc8 and mapped into a feature vector of size 1 × (11 × M), where M is the number of multi-view images and the symbol "×" denotes multiplication;
(3a16) The feature vector of size 1 × (11 × M) is input to the classification layer Softmax to obtain the class probabilities of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label.
5. The method of claim 1, wherein the Convolutional Neural Network (CNN) is trained in step (3b) as follows:
(3b1) in the forward propagation stage, a training sample is taken from the training set and its multi-view images {x_1, ..., x_M} are input to the input layer of the convolutional neural network CNN; after feature extraction and feature mapping, the Softmax layer outputs the final result;
(3b2) in the back propagation stage, the difference between the actual output of the CNN and the ideal output for the training sample is computed, and the weight parameter R of the CNN is adjusted by back propagation so as to minimize the error;
(3b3) repeating the operations of (3b1) and (3b2) until the loss function J of the convolutional neural network CNN is less than or equal to 0.0001.
CN201810243399.4A 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network Active CN108491880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810243399.4A CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810243399.4A CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Publications (2)

Publication Number Publication Date
CN108491880A CN108491880A (en) 2018-09-04
CN108491880B true CN108491880B (en) 2021-09-03

Family

ID=63319473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810243399.4A Active CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Country Status (1)

Country Link
CN (1) CN108491880B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902675B (en) * 2018-09-17 2021-05-04 华为技术有限公司 Object pose acquisition method and scene reconstruction method and device
CN109493417B (en) * 2018-10-31 2023-04-07 深圳大学 Three-dimensional object reconstruction method, device, equipment and storage medium
CN111191492B (en) * 2018-11-15 2024-07-02 北京三星通信技术研究有限公司 Information estimation, model retrieval and model alignment methods and devices
CN109598339A (en) * 2018-12-07 2019-04-09 电子科技大学 A kind of vehicle attitude detection method based on grid convolutional network
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN109934864B (en) * 2019-03-14 2023-01-20 东北大学 Residual error network deep learning method for mechanical arm grabbing pose estimation
CN109978907A (en) * 2019-03-22 2019-07-05 南京邮电大学 A kind of sitting posture of student detection method towards household scene
CN111860039B (en) * 2019-04-26 2022-08-02 四川大学 Cross-connection CNN + SVR-based street space quality quantification method
CN110322510B (en) * 2019-06-27 2021-08-27 电子科技大学 6D pose estimation method using contour information
CN112396077B (en) * 2019-08-15 2024-08-02 瑞昱半导体股份有限公司 Full-connection convolutional neural network image processing method and circuit system
CN110728187B (en) * 2019-09-09 2022-03-04 武汉大学 Remote sensing image scene classification method based on fault tolerance deep learning
CN110728192B (en) * 2019-09-16 2022-08-19 河海大学 High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN110728222B (en) * 2019-09-30 2022-03-25 清华大学深圳国际研究生院 Pose estimation method for target object in mechanical arm grabbing system
CN111126441B (en) * 2019-11-25 2023-04-07 西安工程大学 Construction method of classification detection network model
CN111259735B (en) * 2020-01-08 2023-04-07 西安电子科技大学 Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN111325166B (en) * 2020-02-26 2023-07-07 南京工业大学 Sitting posture identification method based on projection reconstruction and MIMO neural network
EP3885970A1 (en) * 2020-03-23 2021-09-29 Toyota Jidosha Kabushiki Kaisha System for processing an image having a neural network with at least one static feature map
CN111738220B (en) * 2020-07-27 2023-09-15 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN112163477B (en) * 2020-09-16 2023-09-22 厦门市特种设备检验检测院 Escalator pedestrian pose target detection method and system based on Faster R-CNN
CN112381879B (en) * 2020-11-16 2024-09-06 跨维(深圳)智能数字科技有限公司 Object posture estimation method, system and medium based on image and three-dimensional model
CN112528941B (en) * 2020-12-23 2021-11-19 芜湖神图驭器智能科技有限公司 Automatic parameter setting system based on neural network
CN112634367A (en) * 2020-12-25 2021-04-09 天津大学 Anti-occlusion object pose estimation method based on deep neural network
CN112857215B (en) * 2021-01-08 2022-02-08 河北工业大学 Monocular 6D pose estimation method based on regular icosahedron
CN113129370B (en) * 2021-03-04 2022-08-19 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113705480B (en) * 2021-08-31 2024-08-02 新东方教育科技集团有限公司 Gesture recognition method, device and medium based on gesture recognition neural network
CN114742212A (en) * 2022-06-13 2022-07-12 南昌大学 Electronic digital information resampling rate estimation method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375831B (en) * 2010-08-13 2014-09-10 富士通株式会社 Three-dimensional model search device and method thereof and model base generation device and method thereof
US20160327653A1 (en) * 2014-02-03 2016-11-10 Board Of Regents, The University Of Texas System System and method for fusion of camera and global navigation satellite system (gnss) carrier-phase measurements for globally-referenced mobile device pose determination
WO2017015390A1 (en) * 2015-07-20 2017-01-26 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN106372648B (en) * 2016-10-20 2020-03-13 中国海洋大学 Plankton image classification method based on multi-feature fusion convolutional neural network
CN106845510B (en) * 2016-11-07 2020-04-07 中国传媒大学 Chinese traditional visual culture symbol recognition method based on depth level feature fusion
CN106845515B (en) * 2016-12-06 2020-07-28 上海交通大学 Robot target identification and pose reconstruction method based on virtual sample deep learning
CN107169421B (en) * 2017-04-20 2020-04-28 华南理工大学 Automobile driving scene target detection method based on deep convolutional neural network
CN107330463B (en) * 2017-06-29 2020-12-08 南京信息工程大学 Vehicle type identification method based on CNN multi-feature union and multi-kernel sparse representation
CN107527068B (en) * 2017-08-07 2020-12-25 南京信息工程大学 Vehicle type identification method based on CNN and domain adaptive learning
CN107657249A (en) * 2017-10-26 2018-02-02 珠海习悦信息技术有限公司 Method, apparatus, storage medium and the processor that Analysis On Multi-scale Features pedestrian identifies again
CN107808146B (en) * 2017-11-17 2020-05-05 北京师范大学 Multi-mode emotion recognition and classification method

Also Published As

Publication number Publication date
CN108491880A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491880B (en) Object classification and pose estimation method based on neural network
CN110837778B (en) Traffic police command gesture recognition method based on skeleton joint point sequence
Cheng et al. Jointly network: a network based on CNN and RBM for gesture recognition
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN108062569B (en) Unmanned vehicle driving decision method based on infrared and radar
CN106951923B (en) Robot three-dimensional shape recognition method based on multi-view information fusion
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN107705322A (en) Motion estimate tracking and system
CN113436227A (en) Twin network target tracking method based on inverted residual error
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
Naseer et al. CNN-based Object Detection via Segmentation capabilities in Outdoor Natural Scenes
CN110827304A (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolutional network and level set method
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN113870160B (en) Point cloud data processing method based on transformer neural network
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN109508686A (en) A kind of Human bodys' response method based on the study of stratification proper subspace
Wu et al. A cascaded CNN-based method for monocular vision robotic grasping
CN114187506B (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
Lin et al. Robot grasping based on object shape approximation and LightGBM
CN111428555A (en) Joint-divided hand posture estimation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant