CN108491880B - Object classification and pose estimation method based on neural network - Google Patents
Object classification and pose estimation method based on neural network
- Publication number: CN108491880B (application CN201810243399.4A)
- Authority: China (CN)
- Prior art keywords: layer, pixels, size, neural network, inputting
- Prior art date: 2018-03-23
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
Abstract
The invention discloses an object classification and pose estimation method based on a neural network, which mainly addresses the low accuracy of prior-art convolutional neural networks in object detection and pose estimation. The implementation scheme is as follows: 1) obtain multi-view images of each CAD model in the data set; 2) construct a joint-detection mathematical model from the multi-view images of the CAD models; 3) construct a convolutional neural network and train it with the multi-view images of the CAD models; 4) input the multi-view images of each CAD model in the test set into the trained network and output the class labels and pose labels it predicts. By fusing a shallow feature map of the network with a deep feature map, the combined feature map retains both rich pose information and good classification information, which improves the accuracy of classification and pose estimation. The method can be used for grasping by intelligent robotic arms and robots.
Description
Technical Field
The invention belongs to the field of artificial intelligence and relates to an object classification and pose estimation method that can be used for grasping by intelligent robotic arms and robots.
Background
The convolutional neural network (CNN) is a feed-forward neural network composed of convolutional layers, fully-connected layers, pooling layers, and activation layers. Compared with a traditional fully-connected neural network, a CNN applies local connectivity and weight sharing, so that neurons on the same feature map share the same weights; this greatly reduces the number of network parameters and the complexity of the network. Activation functions have also evolved from the sigmoid to the unilaterally suppressed ReLU, bringing neurons ever closer to the activation characteristics of biological neurons. In addition, a CNN avoids complex image pre-processing, including hand-crafted feature extraction and data reconstruction, and can take the raw image directly as input. Gradient descent together with the chain rule of differentiation lets the network iterate between forward propagation and backward propagation, continuously improving detection accuracy. Among the many deep-learning frameworks, Caffe is a common one, widely applied to video and image processing. Its modular design, separation of representation and implementation, convenient switching between GPU and CPU, and Python and Matlab interfaces make it easy to adjust the network structure and train the network.
In recent years, deep learning has made remarkable progress in image classification, object detection, semantic segmentation, instance segmentation, and related tasks. A general vision system needs to solve two problems: object classification and object pose estimation, where pose estimation refers to the pose of an object relative to the camera. Object pose estimation is crucial in many applications, such as robotic grasping. However, object classification and pose estimation are contradictory requirements: a classification system must classify objects correctly regardless of their pose, so it learns features that are invariant to the viewpoint, whereas pose estimation requires the system to learn features that preserve object geometry and appearance so that poses can be distinguished. In a convolutional neural network, shallow feature maps tend to capture generic, class-agnostic features but carry more information that distinguishes different poses; deep feature maps are more abstract and more class-discriminative, but their high level of abstraction obscures pose-specific information. Existing detection methods generally select features from a middle layer, whose classification and pose-estimation performance are both merely acceptable; this is a compromise and cannot make object detection and pose estimation optimal at the same time.
In 2015, the MVCNN method proposed by Hang Su et al. converted 3D sample data into 2D multi-view pictures, reducing the data dimensionality while preserving detection accuracy and simplifying processing; it extracts features from the pictures of all views of an object and combines the information of all views. In real scenes, however, the target object is often occluded or truncated, which makes it difficult to collect multi-view images of the object from all predefined viewpoints, so the method does not meet the requirements of practical scenes.
Disclosure of Invention
The aim of the invention is to provide an object classification and pose estimation method based on a neural network that overcomes the above defects of the prior art, so as to improve the accuracy of object detection and pose estimation, accelerate detection, and meet the requirements of practical scenes.
The technical idea of the invention is as follows: shallow features and deep features of the convolutional neural network are fused to improve the accuracy of object detection and pose estimation, and detection is accelerated by iterating over images from only part of the viewpoints of the detected object. The implementation scheme comprises the following steps:
(1) obtaining a training set and a testing set, and setting images corresponding to the CAD model:
3429 CAD models are taken out from a ModelNet10 data set to be used as a training set, and 1469 CAD models are taken out to be used as a test set;
for the CAD model of each sample in the ModelNet10 dataset, two strategies are performed in sequence: in the first, 12 predefined viewpoints are uniformly arranged on the viewing circle of the CAD model, and an image of the CAD model is collected at each of the 12 predefined viewpoints; in the second, the CAD model is placed at the center of a regular dodecahedron, the 20 vertices of the dodecahedron are set as predefined viewpoints, and an image of the CAD model is collected at each of the 20 predefined viewpoints;
(2) constructing a mathematical model of joint detection according to a multi-view image obtained by preprocessing each CAD model in the data set:
(2a) take the pose label of each CAD model as a hidden variable, denoted {v_i};
(2b) the M view images {x_1, ..., x_M} of a CAD model together with the class label y ∈ {1, ..., N} of the CAD model are defined as a training sample, where N is the total number of CAD model classes and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to the definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:
where R is the neural network weight parameter, ŷ is the class label predicted by the neural network, and P(ŷ = y | x_i) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN is y;
(3) constructing and training a convolutional neural network CNN:
(3a) on the basis of the existing AlexNet network, an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer are added to obtain a 16-layer convolutional neural network CNN, wherein:
the Eltwise1 layer is used for fusing the corresponding positions of the feature maps of the Conv3 and Conv4 layers in the AlexNet network;
the fc_a1 layer is used for mapping the Eltwise1 feature maps into a feature vector;
the fc_a2 layer maps the Pool5 features of the AlexNet network into a feature vector;
the Eltwise2 layer is used for fusing the corresponding positions of the feature maps of the fc_a1 layer, the fc_a2 layer and the Eltwise1 layer;
(3b) the multi-view images of each CAD model in the training set are input into the convolutional network; forward computation and backward propagation of the convolutional neural network CNN are iterated to train the network and optimize the network parameters R until the loss function J(θ) of the network is less than or equal to 0.0001, giving the trained neural network CNN;
(4) testing the network:
the multi-view images of each CAD model in the ModelNet10 test set are input into the trained neural network, and the accuracy of object classification and pose estimation is counted.
Compared with the prior art, the invention has the following advantages:
1. Because the invention fuses, element by element, feature maps at different depths of the convolutional neural network, the fused feature map contains both the rich pose information of the shallow feature map and the abstract, well-defined classification information of the deep feature map, which improves detection accuracy.
2. The method generates corresponding multi-view images for each 3D CAD model in the data set, i.e. converts the 3D sample data into 2D multi-view images; this dimensionality reduction lowers the complexity of the data, reduces the computation required for feature extraction, and accelerates detection.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram illustrating two predefined viewpoint strategies in the present invention;
FIG. 3 is a structural diagram of the convolutional neural network CNN constructed in the present invention.
Detailed Description
The following describes examples and effects of the present invention in further detail with reference to the accompanying drawings.
Referring to FIG. 1, the implementation steps of the invention are as follows:
step 1, obtaining a multi-view image of the CAD model.
The CAD model of each sample in the ModelNet10 dataset is preprocessed with the two viewpoint strategies in turn.
As shown in FIG. 2(a), the first preprocessing strategy uniformly sets 12 predefined viewpoints on the viewing circle of the CAD model: one axis is fixed as the rotation axis, and an observation point is set every 30 degrees on the viewing circle of the object, so that 12 images from different viewpoints are obtained for each CAD model over the 360-degree circle;
as shown in fig. 2(b), the second preprocessing strategy is to place the CAD model in the center of the regular dodecahedron, set 20 vertices of the regular dodecahedron to predefined viewpoints, and acquire images corresponding to the CAD model at each of the 20 predefined viewpoints.
Step 2, constructing the joint-detection mathematical model from the multi-view images obtained by preprocessing each CAD model in the data set.
(2a) Take the pose label of each CAD model as a hidden variable, denoted {v_i};
(2b) the M view images {x_1, ..., x_M} of a CAD model and the class label y ∈ {1, ..., N} of the CAD model are defined as a training sample, where N is the total number of CAD model classes, x_i is a view image, and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to the definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:
where R is the neural network weight parameter, ŷ is the class label predicted by the neural network, and P(ŷ = y | x_i) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN is y;
where the superscript (i) indexes the input image x_i, k denotes the class of image x_i, and j indicates that image x_i is observed from the j-th predefined viewpoint.
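The printed optimization objective appears in the original patent only as an image and is not reproduced in this text. One plausible form for a single training sample, consistent with the definitions above (latent view labels v_i, weight parameters R, and the Softmax class probability) but written here as an assumption rather than taken from the patent, is:

```latex
\max_{R} \; \sum_{i=1}^{M} \; \max_{v_i \in \{1,\dots,M\}} \; \log P\!\left(\hat{y} = y \,\middle|\, x_i, v_i; R\right)
```

that is, the weights R are chosen so that, under the best assignment of the hidden view labels v_i, the probability that the Softmax layer assigns the true class y to each view image x_i is as large as possible; in practice the objective is summed over all training samples.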
Step 3, constructing the convolutional neural network CNN.
(3a) Construct the 16-layer convolutional neural network CNN shown in FIG. 3. The 16 layers are, in order: the first convolutional layer Conv1, the first pooling layer Pool1, the second convolutional layer Conv2, the second pooling layer Pool2, the third convolutional layer Conv3, the fourth convolutional layer Conv4, the first feature-fusion layer Eltwise1, the fifth convolutional layer Conv5, the fifth pooling layer Pool5, the first fully-connected layer fc_a1, the second fully-connected layer fc_a2, the third fully-connected layer fc6, the fourth fully-connected layer fc7, the second feature-fusion layer Eltwise2, the fifth fully-connected layer fc8, and the classification layer Softmax. The feature-extraction details of each layer are as follows (an illustrative code sketch of the whole architecture is given after step (3a16)):
(3a1) The image of size 227 × 227 pixels is input into the first convolutional layer Conv1 and convolved with kernels of size 11 × 11 pixels at a stride of 4 pixels; with 96 convolution kernels in total, 96 feature maps of size 55 × 55 pixels are obtained;
(3a2) the 96 feature maps output by the first convolutional layer Conv1 are input into the first pooling layer Pool1 and max-pooled with a pooling block of size 3 × 3 pixels and a stride of 2 pixels, giving 96 feature maps of size 27 × 27 pixels;
(3a3) the 96 feature maps output by the first pooling layer Pool1 are input into the second convolutional layer Conv2 and convolved with kernels of size 5 × 5 pixels at a stride of 1 pixel; with 256 convolution kernels in total, 256 feature maps of size 27 × 27 pixels are obtained;
(3a4) the 256 feature maps output by the second convolutional layer Conv2 are input into the second pooling layer Pool2 and max-pooled with a pooling block of size 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 13 × 13 pixels;
(3a5) the 256 feature maps output by the second pooling layer Pool2 are input into the third convolutional layer Conv3 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; with 384 convolution kernels in total, 384 feature maps of size 13 × 13 pixels are obtained;
(3a6) the 384 feature maps output by the third convolutional layer Conv3 are input into the fourth convolutional layer Conv4 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; with 384 convolution kernels in total, 384 feature maps of size 13 × 13 pixels are obtained;
(3a7) the feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are input into the first fusion layer Eltwise1 for feature-map fusion, giving 384 feature maps of size 13 × 13 pixels;
(3a8) the 384 feature maps output by the fourth convolutional layer Conv4 are input into the fifth convolutional layer Conv5 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; with 256 convolution kernels in total, 256 feature maps of size 13 × 13 pixels are obtained;
(3a9) the 256 feature maps output by the fifth convolutional layer Conv5 are input into the fifth pooling layer Pool5 and max-pooled with a pooling block of size 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 6 × 6 pixels;
(3a10) the 384 feature maps output by the first fusion layer Eltwise1 are input into the first fully-connected layer fc_a1 and mapped into a feature vector of size 1 × 4096;
(3a11) the 256 feature maps output by the fifth pooling layer Pool5 are input into the second fully-connected layer fc_a2 and mapped into a feature vector of size 1 × 4096;
(3a12) the 256 feature maps output by the fifth pooling layer Pool5 are input into the third fully-connected layer fc6 and mapped into a feature vector of size 1 × 4096;
(3a13) the feature vector of size 1 × 4096 output by the third fully-connected layer fc6 is input into the fourth fully-connected layer fc7 for further feature extraction, giving a feature vector of size 1 × 4096;
(3a14) the feature vectors of the first fully-connected layer fc_a1, the second fully-connected layer fc_a2 and the fourth fully-connected layer fc7 are input into the second fusion layer Eltwise2 and fused, giving a feature vector of size 1 × 4096;
(3a15) the feature vector of size 1 × 4096 output by the second fusion layer Eltwise2 is input into the fifth fully-connected layer fc8 and mapped into a feature vector of size 1 × (11 × M), where M is the number of multi-view images and the symbol "×" denotes multiplication;
(3a16) the feature vector of size 1 × (11 × M) is input into the classification layer Softmax to obtain the class probabilities of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label;
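To make the construction above concrete, the following sketch rebuilds the 16-layer network in PyTorch. The patent itself is implemented in Caffe, so this reconstruction is illustrative only: the padding values, the ReLU placement and the three-channel input follow the standard AlexNet configuration (they are not stated in the patent but are needed to reproduce the feature-map sizes 55, 27, 13 and 6), and the default num_views=20 is just one of the two viewpoint counts (12 or 20).

```python
import torch
import torch.nn as nn

class FusionCNN(nn.Module):
    """Illustrative reconstruction of the 16-layer fusion network of steps (3a1)-(3a16)."""
    def __init__(self, num_classes=11, num_views=20):  # 11 follows the 1 x (11 x M) output of (3a15)
        super().__init__()
        self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)     # 227 -> 55
        self.pool1 = nn.MaxPool2d(3, stride=2)                      # 55 -> 27
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)   # 27 -> 27
        self.pool2 = nn.MaxPool2d(3, stride=2)                      # 27 -> 13
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1)  # 13 -> 13
        self.conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1)  # 13 -> 13
        self.conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1)  # 13 -> 13
        self.pool5 = nn.MaxPool2d(3, stride=2)                      # 13 -> 6
        self.relu  = nn.ReLU(inplace=True)
        self.fc_a1 = nn.Linear(384 * 13 * 13, 4096)  # maps the Eltwise1 feature maps to a vector, (3a10)
        self.fc_a2 = nn.Linear(256 * 6 * 6, 4096)    # maps the Pool5 feature maps to a vector, (3a11)
        self.fc6   = nn.Linear(256 * 6 * 6, 4096)    # (3a12)
        self.fc7   = nn.Linear(4096, 4096)           # (3a13)
        self.fc8   = nn.Linear(4096, num_classes * num_views)  # (3a15)

    def forward(self, x):                   # x: (batch, 3, 227, 227)
        c1 = self.pool1(self.relu(self.conv1(x)))
        c2 = self.pool2(self.relu(self.conv2(c1)))
        c3 = self.relu(self.conv3(c2))
        c4 = self.relu(self.conv4(c3))
        eltwise1 = c3 + c4                  # Eltwise1: element-wise fusion of Conv3 and Conv4, (3a7)
        c5 = self.pool5(self.relu(self.conv5(c4)))
        fa1 = self.relu(self.fc_a1(torch.flatten(eltwise1, 1)))
        fa2 = self.relu(self.fc_a2(torch.flatten(c5, 1)))
        f6  = self.relu(self.fc6(torch.flatten(c5, 1)))
        f7  = self.relu(self.fc7(f6))
        eltwise2 = fa1 + fa2 + f7           # Eltwise2: fuse shallow and deep feature vectors, (3a14)
        return self.fc8(eltwise2)           # logits of size num_classes * num_views, fed to Softmax
```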
and 4, training the convolutional neural network CNN.
(3b1) In the forward-propagation stage, a training sample is taken from the training set and its multi-view images are input to the input layer of the convolutional neural network CNN; after feature extraction and feature mapping, the Softmax layer outputs the final result;
(3b2) in the back-propagation stage, the difference between the actual output of the CNN and the ideal output of the training sample is calculated, and the weight parameters R of the CNN are adjusted by back propagation so as to minimize the error;
(3b3) the operations of (3b1) and (3b2) are repeated until the loss function J(θ) of the convolutional neural network CNN is less than or equal to 0.0001, giving the trained neural network.
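A minimal sketch of the training loop of steps (3b1)-(3b3) is given below, assuming the FusionCNN sketch shown earlier. The cross-entropy loss (standing in for J(θ)), the SGD optimizer and learning rate, and the dummy tensors standing in for the rendered multi-view images (with targets encoded as a combined class-and-view index in [0, 11·M)) are assumptions; only the stopping threshold J(θ) ≤ 0.0001 is taken from step (3b3).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the rendered multi-view images (assumption for illustration):
# 8 images of 227 x 227 pixels with combined class-and-view targets in [0, 11 * 20).
dummy = TensorDataset(torch.randn(8, 3, 227, 227), torch.randint(0, 11 * 20, (8,)))
train_loader = DataLoader(dummy, batch_size=4)

model = FusionCNN(num_classes=11, num_views=20)
criterion = torch.nn.CrossEntropyLoss()                 # stands in for the loss J(theta)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

loss_value = float("inf")
for epoch in range(1000):                               # upper bound on iterations (assumption)
    for images, targets in train_loader:
        logits = model(images)                          # (3b1) forward propagation
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()                                 # (3b2) back-propagate the error
        optimizer.step()                                # adjust the weight parameters R
        loss_value = loss.item()
    if loss_value <= 1e-4:                              # (3b3) stop once J(theta) <= 0.0001
        break
```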
Step 5, testing the network.
The multi-view images of each CAD model in the ModelNet10 test set are input into the trained neural network, which outputs the class label and pose label it predicts;
and the numbers of CAD models in the test set whose predicted class labels and pose labels are wrong are counted as percentages of all CAD models in the test set, from which the object classification and pose estimation accuracies are obtained.
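For the testing step, the sketch below shows one way to read the class label and pose label from the 1 × (11 × M) output and to turn the error counts into the accuracy figures; the reshaping of the output into M blocks of 11 class scores is an assumption about the output layout, not stated explicitly in the patent.

```python
import torch

def predict_class_and_pose(logits, num_classes=11, num_views=20):
    """logits: tensor of shape (num_classes * num_views,) for one view image x_i."""
    probs = torch.softmax(logits.view(num_views, num_classes), dim=1)  # per-view class probabilities
    flat = torch.argmax(probs)                 # joint (view, class) cell with the highest probability
    view_label = int(flat // num_classes)      # pose label v_i
    class_label = int(flat % num_classes)      # predicted class label
    return class_label, view_label

def accuracy(num_wrong, num_total):
    """Accuracy in percent from the number of wrongly labelled test models."""
    return 100.0 * (num_total - num_wrong) / num_total
```

For example, accuracy(77, 1469) returns about 94.76, matching the classification accuracy reported in Table 1 below.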
The effect of the present invention is further described below through simulation.
1. Simulation conditions
The simulation experiments are run on a 64-bit Ubuntu operating system with an Intel Core i3 4.2 GHz CPU, 16.00 GB of memory and a GeForce GTX 1070 GPU; the deep-learning framework used is Caffe2.
2. Contents and results of the experiments
The ModelNet10 data set is used to train and test the network. It contains 4898 CAD models of 10 categories, of which 3429 form the training set and 1469 the test set; multi-view images are generated for every CAD model in the data set.
The multi-view images of the test-set samples are input into the trained convolutional network; the neural network predicts a wrong class label for 77 CAD models and a wrong pose label for 609 CAD models. The classification and pose estimation accuracies of the network are computed from these counts and compared with several existing detection methods, as shown in the following table:
TABLE 1
Method | Classification accuracy (%) | Pose estimation accuracy (%)
---|---|---
The invention | 94.76 | 58.52
RotationNet | 94.38 | 58.33
MVCNN | 92.10 | -
FusionNet | 90.80 | -
Here RotationNet is a rotation-iteration algorithm, MVCNN is a multi-view merging algorithm, and FusionNet is a feature-fusion algorithm; these are advanced prior-art object recognition and pose estimation methods.
As can be seen from Table 1, the method provided by the invention, which fuses feature maps from layers of different depths of the network, improves the accuracy of both classification and pose estimation.
Claims (5)
1. A method for object classification and pose estimation based on a neural network, comprising the following steps:
(1) obtaining a training set and a testing set, and setting images corresponding to the CAD model:
3429 CAD models are taken out from a ModelNet10 data set to be used as a training set, and 1469 CAD models are taken out to be used as a test set;
for the CAD model of each sample in the ModelNet10 dataset, two strategies are performed in sequence: in the first, 12 predefined viewpoints are uniformly arranged on the viewing circle of the CAD model, and an image of the CAD model is collected at each of the 12 predefined viewpoints; in the second, the CAD model is placed at the center of a regular dodecahedron, the 20 vertices of the dodecahedron are set as predefined viewpoints, and an image of the CAD model is collected at each of the 20 predefined viewpoints;
(2) constructing a mathematical model of joint detection according to a multi-view image obtained by preprocessing each CAD model in the data set:
(2a) the view label of each CAD model is used as a hidden variable, denoted {v_i};
(2b) the M view images {x_1, ..., x_M} of a CAD model together with the class label y ∈ {1, ..., N} of the CAD model are defined as a training sample, where N is the total number of CAD model classes and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to the definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:
where R is the neural network weight parameter, ŷ is the class label predicted by the neural network, and P(ŷ = y | x_i) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN is y;
(3) constructing and training a convolutional neural network CNN:
(3a) on the basis of the existing AlexNet network, an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer are added to obtain a 16-layer convolutional neural network CNN, wherein:
the Eltwise1 layer is used for fusing the corresponding positions of the feature maps of the Conv3 and Conv4 layers in the AlexNet network;
the fc_a1 layer is used for mapping the Eltwise1 feature maps into a feature vector;
the fc_a2 layer maps the Pool5 features of the AlexNet network into a feature vector;
the Eltwise2 layer is used for fusing the corresponding positions of the feature maps of the fc_a1 layer, the fc_a2 layer and the Eltwise1 layer;
(3b) the multi-view images of each CAD model in the training set are input into the convolutional network; forward computation and backward propagation of the convolutional neural network CNN are iterated to train the network and optimize the network parameters R until the loss function J of the network is less than or equal to 0.0001, giving the trained neural network CNN;
(4) testing the network: inputting the multi-view images of each CAD model in the ModelNet10 test set into the trained neural network, and counting the accuracy of object classification and pose estimation.
2. The method of claim 1, wherein the first preprocessing strategy in step (1) is to uniformly set 12 predefined viewpoints on the view circle of the CAD model by fixing an axis as a rotation axis and setting a viewpoint every 30 degrees on the view circle of the object, i.e. obtaining images of 12 different views corresponding to each CAD model on the view circle of 360 degrees.
3. The method of claim 1, wherein the problem is optimized in step (2c) by:
wherein the superscript (i) indexes the input image x_i, k denotes the class of image x_i, j indicates that image x_i is observed from the j-th predefined viewpoint, and R is the neural network weight parameter.
4. The method according to claim 1, wherein the convolutional neural network CNN comprising 16 layers is constructed in step (3a) by the following steps:
(3a1) The image of size 227 × 227 pixels is input into the first convolutional layer Conv1 and convolved with kernels of size 11 × 11 pixels at a stride of 4 pixels; with 96 convolution kernels in total, 96 feature maps of size 55 × 55 pixels are obtained;
(3a2) the 96 feature maps output by the first convolutional layer Conv1 are input into the first pooling layer Pool1 and max-pooled with a pooling block of size 3 × 3 pixels and a stride of 2 pixels, giving 96 feature maps of size 27 × 27 pixels;
(3a3) the 96 feature maps output by the first pooling layer Pool1 are input into the second convolutional layer Conv2 and convolved with kernels of size 5 × 5 pixels at a stride of 1 pixel; with 256 convolution kernels in total, 256 feature maps of size 27 × 27 pixels are obtained;
(3a4) the 256 feature maps output by the second convolutional layer Conv2 are input into the second pooling layer Pool2 and max-pooled with a pooling block of size 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 13 × 13 pixels;
(3a5) the 256 feature maps output by the second pooling layer Pool2 are input into the third convolutional layer Conv3 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; with 384 convolution kernels in total, 384 feature maps of size 13 × 13 pixels are obtained;
(3a6) the 384 feature maps output by the third convolutional layer Conv3 are input into the fourth convolutional layer Conv4 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; with 384 convolution kernels in total, 384 feature maps of size 13 × 13 pixels are obtained;
(3a7) the feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are input into the first fusion layer Eltwise1 for feature-map fusion, giving 384 feature maps of size 13 × 13 pixels;
(3a8) the 384 feature maps output by the fourth convolutional layer Conv4 are input into the fifth convolutional layer Conv5 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; with 256 convolution kernels in total, 256 feature maps of size 13 × 13 pixels are obtained;
(3a9) the 256 feature maps output by the fifth convolutional layer Conv5 are input into the fifth pooling layer Pool5 and max-pooled with a pooling block of size 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 6 × 6 pixels;
(3a10) the 384 feature maps output by the first fusion layer Eltwise1 are input into the first fully-connected layer fc_a1 and mapped into a feature vector of size 1 × 4096;
(3a11) the 256 feature maps output by the fifth pooling layer Pool5 are input into the second fully-connected layer fc_a2 and mapped into a feature vector of size 1 × 4096;
(3a12) the 256 feature maps output by the fifth pooling layer Pool5 are input into the third fully-connected layer fc6 and mapped into a feature vector of size 1 × 4096;
(3a13) the feature vector of size 1 × 4096 output by the third fully-connected layer fc6 is input into the fourth fully-connected layer fc7 for further feature extraction, giving a feature vector of size 1 × 4096;
(3a14) the feature vectors of the first fully-connected layer fc_a1, the second fully-connected layer fc_a2 and the fourth fully-connected layer fc7 are input into the second fusion layer Eltwise2 and fused, giving a feature vector of size 1 × 4096;
(3a15) the feature vector of size 1 × 4096 output by the second fusion layer Eltwise2 is input into the fifth fully-connected layer fc8 and mapped into a feature vector of size 1 × (11 × M), where M is the number of multi-view images and the symbol "×" denotes multiplication;
(3a16) the feature vector of size 1 × (11 × M) is input into the classification layer Softmax to obtain the class probabilities of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label.
5. The method of claim 1, wherein the Convolutional Neural Network (CNN) is trained in step (3b) as follows:
(3b1) in the forward-propagation stage, a training sample is taken from the training set and its multi-view images are input to the input layer of the convolutional neural network CNN; after feature extraction and feature mapping, the Softmax layer outputs the final result;
(3b2) in the back-propagation stage, the difference between the actual output of the CNN and the ideal output of the training sample is calculated, and the weight parameters R of the CNN are adjusted by back propagation so as to minimize the error;
(3b3) the operations of (3b1) and (3b2) are repeated until the loss function J of the convolutional neural network CNN is less than or equal to 0.0001.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810243399.4A CN108491880B (en) | 2018-03-23 | 2018-03-23 | Object classification and pose estimation method based on neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810243399.4A CN108491880B (en) | 2018-03-23 | 2018-03-23 | Object classification and pose estimation method based on neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108491880A (en) | 2018-09-04
CN108491880B (en) | 2021-09-03
Family
ID=63319473
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810243399.4A Active CN108491880B (en) | 2018-03-23 | 2018-03-23 | Object classification and pose estimation method based on neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108491880B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902675B (en) * | 2018-09-17 | 2021-05-04 | 华为技术有限公司 | Object pose acquisition method and scene reconstruction method and device |
CN109493417B (en) * | 2018-10-31 | 2023-04-07 | 深圳大学 | Three-dimensional object reconstruction method, device, equipment and storage medium |
CN111191492B (en) * | 2018-11-15 | 2024-07-02 | 北京三星通信技术研究有限公司 | Information estimation, model retrieval and model alignment methods and devices |
CN109598339A (en) * | 2018-12-07 | 2019-04-09 | 电子科技大学 | A kind of vehicle attitude detection method based on grid convolutional network |
CN109903332A (en) * | 2019-01-08 | 2019-06-18 | 杭州电子科技大学 | A kind of object's pose estimation method based on deep learning |
CN109934864B (en) * | 2019-03-14 | 2023-01-20 | 东北大学 | Residual error network deep learning method for mechanical arm grabbing pose estimation |
CN109978907A (en) * | 2019-03-22 | 2019-07-05 | 南京邮电大学 | A kind of sitting posture of student detection method towards household scene |
CN111860039B (en) * | 2019-04-26 | 2022-08-02 | 四川大学 | Cross-connection CNN + SVR-based street space quality quantification method |
CN110322510B (en) * | 2019-06-27 | 2021-08-27 | 电子科技大学 | 6D pose estimation method using contour information |
CN112396077B (en) * | 2019-08-15 | 2024-08-02 | 瑞昱半导体股份有限公司 | Full-connection convolutional neural network image processing method and circuit system |
CN110728187B (en) * | 2019-09-09 | 2022-03-04 | 武汉大学 | Remote sensing image scene classification method based on fault tolerance deep learning |
CN110728192B (en) * | 2019-09-16 | 2022-08-19 | 河海大学 | High-resolution remote sensing image classification method based on novel characteristic pyramid depth network |
CN110728222B (en) * | 2019-09-30 | 2022-03-25 | 清华大学深圳国际研究生院 | Pose estimation method for target object in mechanical arm grabbing system |
CN111126441B (en) * | 2019-11-25 | 2023-04-07 | 西安工程大学 | Construction method of classification detection network model |
CN111259735B (en) * | 2020-01-08 | 2023-04-07 | 西安电子科技大学 | Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network |
CN111325166B (en) * | 2020-02-26 | 2023-07-07 | 南京工业大学 | Sitting posture identification method based on projection reconstruction and MIMO neural network |
EP3885970A1 (en) * | 2020-03-23 | 2021-09-29 | Toyota Jidosha Kabushiki Kaisha | System for processing an image having a neural network with at least one static feature map |
CN111738220B (en) * | 2020-07-27 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Three-dimensional human body posture estimation method, device, equipment and medium |
CN112163477B (en) * | 2020-09-16 | 2023-09-22 | 厦门市特种设备检验检测院 | Escalator pedestrian pose target detection method and system based on Faster R-CNN |
CN112381879B (en) * | 2020-11-16 | 2024-09-06 | 跨维(深圳)智能数字科技有限公司 | Object posture estimation method, system and medium based on image and three-dimensional model |
CN112528941B (en) * | 2020-12-23 | 2021-11-19 | 芜湖神图驭器智能科技有限公司 | Automatic parameter setting system based on neural network |
CN112634367A (en) * | 2020-12-25 | 2021-04-09 | 天津大学 | Anti-occlusion object pose estimation method based on deep neural network |
CN112857215B (en) * | 2021-01-08 | 2022-02-08 | 河北工业大学 | Monocular 6D pose estimation method based on regular icosahedron |
CN113129370B (en) * | 2021-03-04 | 2022-08-19 | 同济大学 | Semi-supervised object pose estimation method combining generated data and label-free data |
CN113705480B (en) * | 2021-08-31 | 2024-08-02 | 新东方教育科技集团有限公司 | Gesture recognition method, device and medium based on gesture recognition neural network |
CN114742212A (en) * | 2022-06-13 | 2022-07-12 | 南昌大学 | Electronic digital information resampling rate estimation method |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102375831B (en) * | 2010-08-13 | 2014-09-10 | 富士通株式会社 | Three-dimensional model search device and method thereof and model base generation device and method thereof |
US20160327653A1 (en) * | 2014-02-03 | 2016-11-10 | Board Of Regents, The University Of Texas System | System and method for fusion of camera and global navigation satellite system (gnss) carrier-phase measurements for globally-referenced mobile device pose determination |
WO2017015390A1 (en) * | 2015-07-20 | 2017-01-26 | University Of Maryland, College Park | Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition |
CN106372648B (en) * | 2016-10-20 | 2020-03-13 | 中国海洋大学 | Plankton image classification method based on multi-feature fusion convolutional neural network |
CN106845510B (en) * | 2016-11-07 | 2020-04-07 | 中国传媒大学 | Chinese traditional visual culture symbol recognition method based on depth level feature fusion |
CN106845515B (en) * | 2016-12-06 | 2020-07-28 | 上海交通大学 | Robot target identification and pose reconstruction method based on virtual sample deep learning |
CN107169421B (en) * | 2017-04-20 | 2020-04-28 | 华南理工大学 | Automobile driving scene target detection method based on deep convolutional neural network |
CN107330463B (en) * | 2017-06-29 | 2020-12-08 | 南京信息工程大学 | Vehicle type identification method based on CNN multi-feature union and multi-kernel sparse representation |
CN107527068B (en) * | 2017-08-07 | 2020-12-25 | 南京信息工程大学 | Vehicle type identification method based on CNN and domain adaptive learning |
CN107657249A (en) * | 2017-10-26 | 2018-02-02 | 珠海习悦信息技术有限公司 | Method, apparatus, storage medium and the processor that Analysis On Multi-scale Features pedestrian identifies again |
CN107808146B (en) * | 2017-11-17 | 2020-05-05 | 北京师范大学 | Multi-mode emotion recognition and classification method |
Also Published As
Publication number | Publication date |
---|---|
CN108491880A (en) | 2018-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108491880B (en) | Object classification and pose estimation method based on neural network | |
CN110837778B (en) | Traffic police command gesture recognition method based on skeleton joint point sequence | |
Cheng et al. | Jointly network: a network based on CNN and RBM for gesture recognition | |
CN109816725B (en) | Monocular camera object pose estimation method and device based on deep learning | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN110852182B (en) | Depth video human body behavior recognition method based on three-dimensional space time sequence modeling | |
CN108062569B (en) | Unmanned vehicle driving decision method based on infrared and radar | |
CN106951923B (en) | Robot three-dimensional shape recognition method based on multi-view information fusion | |
CN112801015B (en) | Multi-mode face recognition method based on attention mechanism | |
CN110032925B (en) | Gesture image segmentation and recognition method based on improved capsule network and algorithm | |
CN111476806B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN107705322A (en) | Motion estimate tracking and system | |
CN113436227A (en) | Twin network target tracking method based on inverted residual error | |
CN110674741A (en) | Machine vision gesture recognition method based on dual-channel feature fusion | |
Naseer et al. | CNN-based Object Detection via Segmentation capabilities in Outdoor Natural Scenes | |
CN110827304A (en) | Traditional Chinese medicine tongue image positioning method and system based on deep convolutional network and level set method | |
CN114821014A (en) | Multi-mode and counterstudy-based multi-task target detection and identification method and device | |
CN113743544A (en) | Cross-modal neural network construction method, pedestrian retrieval method and system | |
CN113870160B (en) | Point cloud data processing method based on transformer neural network | |
CN110135277B (en) | Human behavior recognition method based on convolutional neural network | |
CN109508686A (en) | A kind of Human bodys' response method based on the study of stratification proper subspace | |
Wu et al. | A cascaded CNN-based method for monocular vision robotic grasping | |
CN114187506B (en) | Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network | |
Lin et al. | Robot grasping based on object shape approximation and LightGBM | |
CN111428555A (en) | Joint-divided hand posture estimation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |