
CN108491880B - Object classification and pose estimation method based on neural network - Google Patents

Object classification and pose estimation method based on neural network

Info

Publication number
CN108491880B
CN108491880B
Authority
CN
China
Prior art keywords
layer
pixels
size
neural network
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810243399.4A
Other languages
Chinese (zh)
Other versions
CN108491880A (en)
Inventor
张向东 (Zhang Xiangdong)
张泽宇 (Zhang Zeyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810243399.4A priority Critical patent/CN108491880B/en
Publication of CN108491880A publication Critical patent/CN108491880A/en
Application granted granted Critical
Publication of CN108491880B publication Critical patent/CN108491880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object classification and pose estimation method based on a neural network, which mainly addresses the low accuracy of prior-art approaches that use a convolutional neural network for object detection and pose estimation. The implementation scheme is as follows: 1) obtain multi-view images of each CAD model in the data set; 2) construct a joint detection mathematical model from the multi-view images of the CAD models; 3) construct a convolutional neural network and train it with the multi-view images of the CAD models; 4) input the multi-view images of each CAD model in the test set into the trained network and output the class labels and pose labels it predicts. The invention fuses the shallow and deep feature maps of the neural network, so that the combined feature map retains both the rich pose information of the shallow layers and the strong classification information of the deep layers, improving the accuracy of classification and pose estimation. The method can be used for grasping by intelligent mechanical arms and robots.

Description

Object classification and pose estimation method based on neural network
Technical Field
The invention belongs to the field of artificial intelligence and relates to an object classification and pose estimation method which can be used for grasping by intelligent mechanical arms and robots.
Background
The convolutional neural network (CNN) is a feedforward neural network composed of convolutional layers, fully-connected layers, pooling layers and activation layers. Compared with a traditional fully-connected neural network, a CNN applies local connectivity and weight sharing, so that neurons on the same feature map share the same weights; this greatly reduces the number of network parameters and the complexity of the network. The activation function has also evolved from the sigmoid to the unilaterally suppressed ReLU, and such improvements allow artificial neurons to better approximate the activation characteristics of biological neurons. In addition, a CNN avoids complex image pre-processing such as hand-crafted feature extraction and data reconstruction, and can take the original image directly as input. Gradient descent and the chain rule of differentiation allow the network to iterate between forward propagation and backward propagation, continuously improving detection accuracy. Among the many deep learning frameworks, Caffe is a common one and is widely applied to video and image processing. Its modular design, separation of representation and implementation, convenient switching between GPU and CPU, and the Python and Matlab interfaces it provides make it convenient to adjust the network structure and train the network with Caffe.
In recent years, deep learning has made remarkable progress in image classification, object detection, semantic segmentation, instance segmentation and related tasks. A general vision system needs to solve two problems: object classification and object pose estimation, where pose estimation refers to estimating the pose of an object relative to the camera. Object pose estimation is crucial in many applications such as robotic grasping. However, object classification and pose estimation place contradictory demands on the learned features: a classification system must classify objects correctly regardless of their pose and therefore learns features that are invariant to the viewpoint, whereas pose estimation requires features that preserve the geometric and visual cues that distinguish different poses. For a convolutional neural network, the shallow feature maps tend to contain generic, class-agnostic features but preserve more differences between poses, while the deep feature maps are more abstract, with more pronounced class features, but their high level of abstraction obscures pose-specific information. Existing detection methods generally select the features of an intermediate layer, whose classification and pose estimation performance are both acceptable; this is a compromise and cannot make the accuracy of object detection and pose estimation optimal at the same time.
In 2015, the MVCNN method proposed by Hang Su et al. converts 3D sample data into 2D multi-view pictures, reducing the data dimensionality while preserving detection accuracy and simplifying processing; it extracts features from the pictures of all views of an object and merges the information of all views. In a real scene, however, the target object may be occluded or truncated, which makes it difficult to collect multi-view images of the object from all predefined viewpoints, so the method does not meet the requirements of practical scenes.
Disclosure of Invention
The invention aims to provide an object classification and pose estimation method based on a neural network that addresses the above defects of the prior art, so as to improve the accuracy of object detection and pose estimation, accelerate detection and meet the requirements of practical scenes.
The technical idea of the invention is as follows: shallow features and deep features in the convolutional neural network are fused to improve the accuracy of object detection and pose estimation, and detection is accelerated by iterating over the images of only part of the viewpoints of the detected object. The implementation scheme comprises the following steps:
(1) obtaining a training set and a test set, and generating the images corresponding to each CAD model:
3429 CAD models are taken from the ModelNet10 data set as the training set, and 1469 CAD models as the test set;
for the CAD model of each sample in the ModelNet10 dataset, two strategies are performed in sequence: the first sets 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located and collects an image of the CAD model at each of the 12 predefined viewpoints; the second places the CAD model at the center of a regular dodecahedron, takes the 20 vertices of the regular dodecahedron as predefined viewpoints, and collects an image of the CAD model at each of the 20 predefined viewpoints;
(2) constructing a mathematical model of joint detection according to a multi-view image obtained by preprocessing each CAD model in the data set:
(2a) taking the pose label of each view image of a CAD model as a hidden variable, denoted {v_i};
(2b) defining the M different-view images {x_1, ..., x_M} of a CAD model, together with the class label y ∈ {1, ..., N} of the CAD model, as one training sample, where N is the total number of CAD model classes and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to this definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P(ŷ_i = y | x_i, v_i; R)

where R is the neural network weight parameter, ŷ_i is the class label predicted by the neural network, and P(ŷ_i = y | x_i, v_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN equals y;
(3) constructing and training a convolutional neural network CNN:
(3a) on the basis of the existing AlexNet network, an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer are added to obtain a 16-layer convolutional neural network CNN, wherein:
the Eltwise1 layer fuses, position by position, the feature maps of the Conv3 layer and the Conv4 layer of the AlexNet network;
the fc_a1 layer maps the Eltwise1 feature maps into a feature vector;
the fc_a2 layer maps the Pool5 features of the AlexNet network into a feature vector;
the Eltwise2 layer fuses, position by position, the feature vectors of the fc_a1 layer, the fc_a2 layer and the fc7 layer;
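The element-wise fusion performed by the two Eltwise layers above can be sketched as follows. This is a minimal illustration assuming element-wise summation (the default operation of Caffe's Eltwise layer); the patent itself only states that corresponding positions of the feature maps are fused:

```python
import numpy as np

def eltwise_sum(*feature_maps):
    """Fuse feature maps of identical shape by element-wise (position-wise) summation."""
    assert all(fm.shape == feature_maps[0].shape for fm in feature_maps)
    return np.sum(feature_maps, axis=0)

# Example: fusing the Conv3 and Conv4 outputs (both 384 maps of 13 x 13 pixels).
conv3_out = np.random.randn(384, 13, 13).astype(np.float32)
conv4_out = np.random.randn(384, 13, 13).astype(np.float32)
eltwise1_out = eltwise_sum(conv3_out, conv4_out)  # still 384 x 13 x 13
```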
(3b) inputting the multi-view images {x_1, ..., x_M} of each CAD model in the training set into the convolutional network, iterating the forward computation and backward propagation of the convolutional neural network CNN to train the network, and optimizing the network parameter R until the loss function J(θ) of the network is less than or equal to 0.0001, to obtain the trained convolutional neural network CNN;
(4) testing the network:
inputting the multi-view images {x_1, ..., x_M} of each CAD model in the ModelNet10 test set into the trained neural network, and counting the accuracy of object classification and pose estimation.
Compared with the prior art, the invention has the following advantages:
1. Because the invention fuses, position by position, the elements of feature maps at different depths of the convolutional neural network, the fused feature map contains both the rich pose information of the shallow feature maps and the abstract, well-defined classification information of the deep feature maps, which improves detection accuracy.
2. The method generates corresponding multi-view images for each 3D CAD model in the data set, i.e. it converts the 3D sample data into 2D multi-view images; this dimensionality reduction lowers the complexity of the data, reduces the computation required for feature extraction and accelerates detection.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram illustrating two predefined viewpoint strategies in the present invention;
FIG. 3 is a structural diagram of the convolutional neural network CNN constructed in the present invention.
Detailed Description
The following describes examples and effects of the present invention in further detail with reference to the accompanying drawings.
Referring to FIG. 1, the implementation steps of the invention are as follows:
Step 1, obtaining the multi-view images of the CAD models.
The CAD model of each sample in the ModelNet10 dataset is preprocessed with the two strategies in turn.
As shown in FIG. 2(a), the first preprocessing strategy sets 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located: an axis is fixed as the rotation axis and an observation point is placed every 30 degrees on the viewing circle around the object, so that 12 images of different views of each CAD model are obtained over the 360-degree circle;
as shown in fig. 2(b), the second preprocessing strategy is to place the CAD model in the center of the regular dodecahedron, set 20 vertices of the regular dodecahedron to predefined viewpoints, and acquire images corresponding to the CAD model at each of the 20 predefined viewpoints.
Step 2, constructing the joint detection mathematical model from the multi-view images obtained by preprocessing each CAD model in the data set.
(2a) Taking the pose label of each view image of a CAD model as a hidden variable, denoted {v_i};
(2b) Defining the M different-view images {x_1, ..., x_M} of a CAD model, together with the class label y ∈ {1, ..., N} of the CAD model, as one training sample, where N is the total number of CAD model classes, the x_i are the view images, and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) According to this definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P(ŷ_i = y | x_i, v_i; R)

where R is the neural network weight parameter, ŷ_i is the class label predicted by the neural network, and P(ŷ_i = y | x_i, v_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN equals y;
will be provided with
Figure BDA0001605901570000054
Is marked as
Figure BDA0001605901570000055
The optimization problem is expressed in the form:
Figure BDA0001605901570000056
wherein (i) represents the input image xiAnd k denotes an image xiJ represents the image xiIs observed from the jth predefined viewpoint.
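As a toy illustration of this formulation (the numbers below are purely hypothetical; in the method the probabilities come from the Softmax layer of the CNN), the hidden view label of an image is the viewpoint hypothesis under which its ground-truth class receives the highest probability:

```python
import numpy as np

# Hypothetical Softmax outputs P_k^(i,j) for one image x_i: rows are the M = 3 viewpoint
# hypotheses j, columns are the N = 4 classes k.
P_ij = np.array([[0.10, 0.60, 0.20, 0.10],
                 [0.05, 0.85, 0.05, 0.05],
                 [0.30, 0.40, 0.20, 0.10]])
y = 1  # ground-truth class label of the CAD model

# Choose the hidden view label v_i that maximizes log P_y^(i, v_i).
v_i = int(np.argmax(np.log(P_ij[:, y])))
print(v_i, P_ij[v_i, y])  # 1 0.85
```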
Step 3, constructing the convolutional neural network CNN.
(3a) Constructing a convolutional neural network CNN containing 16 layers as shown in FIG. 3, wherein the 16 layers are a first convolutional layer Conv1, a first pooling layer Pool1, a second convolutional layer Conv2, a second pooling layer Pool2, a third convolutional layer Conv3, a fourth convolutional layer Conv4, a first feature fusion layer Eltwise1, a fifth convolutional layer Conv5, a fifth pooling layer Pool5, a first fully-connected layer fc _ a1, a second fully-connected layer fc _ a2, a third fully-connected layer fc6, a fourth fully-connected layer fc7, a second feature fusion layer Eltwise2, a fifth fully-connected layer fc8 and a classification layer Softmax in sequence, and feature extraction details of each layer are as follows:
(3a1) The image of size 227 × 227 pixels is input to the first convolutional layer Conv1 and convolved with kernels of size 11 × 11 pixels at a stride of 4 pixels; 96 kernels are used in total, giving 96 feature maps of size 55 × 55 pixels;
(3a2) The 96 feature maps output by the first convolutional layer Conv1 are input to the first pooling layer Pool1 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 96 feature maps of size 27 × 27 pixels;
(3a3) The 96 feature maps output by the first pooling layer Pool1 are input to the second convolutional layer Conv2 and convolved with kernels of size 5 × 5 pixels at a stride of 1 pixel; 256 kernels are used in total, giving 256 feature maps of size 27 × 27 pixels;
(3a4) The 256 feature maps output by the second convolutional layer Conv2 are input to the second pooling layer Pool2 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 13 × 13 pixels;
(3a5) The 256 feature maps output by the second pooling layer Pool2 are input to the third convolutional layer Conv3 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a6) The 384 feature maps output by the third convolutional layer Conv3 are input to the fourth convolutional layer Conv4 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a7) The feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are input to the first fusion layer Eltwise1 for feature map fusion, giving 384 feature maps of size 13 × 13 pixels;
(3a8) The 384 feature maps output by the fourth convolutional layer Conv4 are input to the fifth convolutional layer Conv5 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 256 kernels are used, giving 256 feature maps of size 13 × 13 pixels;
(3a9) The 256 feature maps output by the fifth convolutional layer Conv5 are input to the fifth pooling layer Pool5 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 6 × 6 pixels;
(3a10) The 384 feature maps output by the first fusion layer Eltwise1 are input to the first fully-connected layer fc_a1 and mapped into a feature vector of size 1 × 4096;
(3a11) The 256 feature maps output by the fifth pooling layer Pool5 are input to the second fully-connected layer fc_a2 and mapped into a feature vector of size 1 × 4096;
(3a12) The 256 feature maps output by the fifth pooling layer Pool5 are input to the third fully-connected layer fc6 and mapped into a feature vector of size 1 × 4096;
(3a13) The feature vector of size 1 × 4096 output by the third fully-connected layer fc6 is input to the fourth fully-connected layer fc7 for further feature extraction, giving a feature vector of size 1 × 4096;
(3a14) The feature vectors of the first fully-connected layer fc_a1, the second fully-connected layer fc_a2 and the fourth fully-connected layer fc7 are input to the second fusion layer Eltwise2 and fused, giving a feature vector of size 1 × 4096;
(3a15) The feature vector of size 1 × 4096 output by the second fusion layer Eltwise2 is input to the fifth fully-connected layer fc8 and mapped into a feature vector of size 1 × (11 × M), where M is the number of multi-view images and the symbol "×" denotes multiplication;
(3a16) The feature vector of size 1 × (11 × M) is input to the classification layer Softmax to obtain the class probabilities of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label;
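The layer-by-layer construction above can be summarized in code. Below is a minimal PyTorch sketch of the 16-layer structure; it is an illustrative approximation rather than the patent's Caffe network definition, and the padding values, the use of element-wise summation for the two fusion layers, the ReLU placement and the view-major ordering of the fc8 output are all assumptions chosen so that the feature-map sizes stated above work out:

```python
import torch
import torch.nn as nn

class MultiViewPoseNet(nn.Module):
    """Illustrative 16-layer network: AlexNet-style trunk plus Eltwise1, fc_a1, fc_a2, Eltwise2."""
    def __init__(self, num_classes=11, num_views=12):   # 11 outputs per view, as in (3a15)
        super().__init__()
        self.num_classes, self.num_views = num_classes, num_views
        self.conv1 = nn.Conv2d(3, 96, 11, stride=4)      # 227 -> 55
        self.pool1 = nn.MaxPool2d(3, stride=2)           # 55 -> 27
        self.conv2 = nn.Conv2d(96, 256, 5, padding=2)    # 27 -> 27
        self.pool2 = nn.MaxPool2d(3, stride=2)           # 27 -> 13
        self.conv3 = nn.Conv2d(256, 384, 3, padding=1)   # 13 -> 13
        self.conv4 = nn.Conv2d(384, 384, 3, padding=1)   # 13 -> 13
        self.conv5 = nn.Conv2d(384, 256, 3, padding=1)   # 13 -> 13
        self.pool5 = nn.MaxPool2d(3, stride=2)           # 13 -> 6
        self.fc_a1 = nn.Linear(384 * 13 * 13, 4096)      # from Eltwise1
        self.fc_a2 = nn.Linear(256 * 6 * 6, 4096)        # from Pool5
        self.fc6 = nn.Linear(256 * 6 * 6, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes * num_views)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: (B, 3, 227, 227)
        x = self.pool1(self.relu(self.conv1(x)))         # (3a1)-(3a2)
        x = self.pool2(self.relu(self.conv2(x)))         # (3a3)-(3a4)
        c3 = self.relu(self.conv3(x))                    # (3a5)
        c4 = self.relu(self.conv4(c3))                   # (3a6)
        eltwise1 = c3 + c4                               # (3a7) element-wise fusion
        p5 = self.pool5(self.relu(self.conv5(c4)))       # (3a8)-(3a9)
        fa1 = self.relu(self.fc_a1(eltwise1.flatten(1))) # (3a10)
        fa2 = self.relu(self.fc_a2(p5.flatten(1)))       # (3a11)
        f7 = self.relu(self.fc7(self.relu(self.fc6(p5.flatten(1)))))  # (3a12)-(3a13)
        eltwise2 = fa1 + fa2 + f7                        # (3a14) element-wise fusion
        scores = self.fc8(eltwise2)                      # (3a15) 11 * M outputs
        # (3a16) softmax over the 11 classes for each of the M viewpoint hypotheses
        return torch.softmax(scores.view(-1, self.num_views, self.num_classes), dim=2)

net = MultiViewPoseNet()
print(net(torch.randn(2, 3, 227, 227)).shape)  # torch.Size([2, 12, 11])
```

The Softmax output for one image is then an M × 11 table of class probabilities, from which the viewpoint hypothesis that gives the correct class its highest probability provides the pose label, as in step (3a16).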
and 4, training the convolutional neural network CNN.
(3b1) In the forward propagation stage, a training sample is taken from the training set and its multi-view images {x_1, ..., x_M} are input to the input layer of the convolutional neural network CNN; after feature extraction and feature mapping, the Softmax layer outputs the final result;
(3b2) In the back propagation stage, the difference between the actual output of the CNN and the ideal output for the training sample is computed, and the weight parameter R of the CNN is adjusted by back propagation so as to minimize the error;
(3b3) The operations of (3b1) and (3b2) are repeated until the loss function J(θ) of the convolutional neural network CNN is less than or equal to 0.0001, giving the trained neural network.
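A compact sketch of this train-until-threshold loop is given below. It is illustrative only: the patent does not specify the loss function, the optimizer or the learning rate, so a negative log-likelihood loss on the correct class under the labelled viewpoint hypothesis, SGD with momentum, and train_loader (a hypothetical data loader yielding batches of image, class-label and view-label tensors) are all assumptions:

```python
import torch
import torch.nn as nn

def train(net, train_loader, threshold=1e-4, lr=0.01):
    """Iterate forward and backward passes until the loss J(theta) drops to the threshold."""
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    criterion = nn.NLLLoss()
    loss = torch.tensor(float("inf"))
    while loss.item() > threshold:
        for images, class_labels, view_labels in train_loader:
            probs = net(images)                                    # forward pass: (B, M, 11)
            # probability of each sample's class under its labelled viewpoint hypothesis
            p = probs[torch.arange(len(images)), view_labels]      # (B, 11)
            loss = criterion(torch.log(p + 1e-12), class_labels)   # J(theta)
            optimizer.zero_grad()
            loss.backward()                                        # backward pass
            optimizer.step()                                       # adjust weight parameter R
            if loss.item() <= threshold:
                break
    return net
```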
Step 5, testing the network.
The multi-view images {x_1, ..., x_M} of each CAD model in the ModelNet10 test set are input into the trained neural network, which outputs the class label and the pose label it predicts;
the percentage of CAD models in the test set whose class label is predicted wrongly, and the percentage whose pose label is predicted wrongly, are counted relative to all CAD models in the test set; subtracting these error rates from 100% gives the object classification and pose estimation accuracies.
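A small sketch of this accuracy computation (illustrative; predictions and ground_truth are hypothetical lists of (class label, pose label) pairs for the test set):

```python
def classification_and_pose_accuracy(predictions, ground_truth):
    """Accuracy = 100% minus the percentage of test models whose label was predicted wrongly."""
    total = len(ground_truth)
    wrong_class = sum(p[0] != g[0] for p, g in zip(predictions, ground_truth))
    wrong_pose = sum(p[1] != g[1] for p, g in zip(predictions, ground_truth))
    return 100.0 * (1 - wrong_class / total), 100.0 * (1 - wrong_pose / total)

# With the error counts reported below (77 wrong class labels and 609 wrong pose labels
# out of 1469 test models) this gives 94.76% classification accuracy and roughly 58.5%
# pose estimation accuracy.
```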
The effects of the present invention are further described below in combination with simulation results:
1. Simulation conditions
The operating system used in the simulation experiment is a 64-bit Ubuntu system, the CPU is an Intel Core i3 at 4.2 GHz, the memory is 16.00 GB, the GPU is a GeForce GTX 1070, and the deep learning framework used is Caffe2.
2. Contents and results of the experiments
In the experiment, the ModelNet10 data set is used to train and test the network. The ModelNet10 data set contains 4898 CAD models of 10 categories, of which 3429 form the training set and 1469 form the test set; multi-view images are generated for each CAD model in the data set.
The multi-view images of the test-set samples are input into the trained convolutional network; the neural network predicts a wrong class label for 77 CAD models and a wrong pose label for 609 CAD models. The classification and pose estimation accuracies of the network are obtained from these counts and compared with several existing detection methods, as shown in the following table:
TABLE 1
Method          Classification accuracy (%)    Pose estimation accuracy (%)
The invention   94.76                          58.52
RotationNet     94.38                          58.33
MVCNN           92.10                          -
FusionNet       90.80                          -
Here RotationNet is a rotation-iteration algorithm, MVCNN is a multi-view merging algorithm, and FusionNet is a feature fusion algorithm; these are advanced object recognition and pose estimation methods in the prior art.
As can be seen from Table 1, the method provided by the present invention, which fuses feature maps from layers of different depths of the network, improves the accuracy of both classification and pose estimation.

Claims (5)

1. A method for object classification and pose estimation based on a neural network comprises the following steps:
(1) obtaining a training set and a test set, and generating the images corresponding to each CAD model:
3429 CAD models are taken from the ModelNet10 data set as the training set, and 1469 CAD models as the test set;
for the CAD model of each sample in the ModelNet10 dataset, two strategies are performed in sequence: the first sets 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located and collects an image of the CAD model at each of the 12 predefined viewpoints; the second places the CAD model at the center of a regular dodecahedron, takes the 20 vertices of the regular dodecahedron as predefined viewpoints, and collects an image of the CAD model at each of the 20 predefined viewpoints;
(2) constructing a mathematical model of joint detection according to a multi-view image obtained by preprocessing each CAD model in the data set:
(2a) taking the view label of each image of a CAD model as a hidden variable, denoted {v_i};
(2b) defining the M different-view images {x_1, ..., x_M} of a CAD model, together with the class label y ∈ {1, ..., N} of the CAD model, as one training sample, where N is the total number of CAD model classes and each view image x_i corresponds to a view label v_i ∈ {1, ..., M};
(2c) according to this definition of the training samples, the object recognition and pose estimation tasks are abstracted into the following optimization problem:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P(ŷ_i = y | x_i, v_i; R)

where R is the neural network weight parameter, ŷ_i is the class label predicted by the neural network, and P(ŷ_i = y | x_i, v_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN equals y;
(3) constructing and training a convolutional neural network CNN:
(3a) on the basis of the existing AlexNet network, an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer are added to obtain a 16-layer convolutional neural network CNN, wherein:
the Eltwise1 layer fuses, position by position, the feature maps of the Conv3 layer and the Conv4 layer of the AlexNet network;
the fc_a1 layer maps the Eltwise1 feature maps into a feature vector;
the fc_a2 layer maps the Pool5 features of the AlexNet network into a feature vector;
the Eltwise2 layer fuses, position by position, the feature vectors of the fc_a1 layer, the fc_a2 layer and the fc7 layer;
(3b) inputting the multi-view images {x_1, ..., x_M} of each CAD model in the training set into the convolutional network, iterating the forward computation and backward propagation of the convolutional neural network CNN to train the network, and optimizing the network parameter R until the loss function J of the network is less than or equal to 0.0001, to obtain the trained convolutional neural network CNN;
(4) testing the network:
inputting the multi-view images {x_1, ..., x_M} of each CAD model in the ModelNet10 test set into the trained neural network, and counting the accuracy of object classification and pose estimation.
2. The method of claim 1, wherein the first preprocessing strategy in step (1) sets the 12 predefined viewpoints uniformly on the viewing circle on which the CAD model is located by fixing an axis as the rotation axis and placing an observation point every 30 degrees on the viewing circle around the object, so that images of 12 different views of each CAD model are obtained over the 360-degree circle.
3. The method of claim 1, wherein the optimization problem in step (2c) is rewritten as follows:
denoting by P_k^{(i,j)} the probability that image x_i, observed from the j-th predefined viewpoint, is assigned class k, the optimization problem is expressed in the form:

    max_{R, {v_i}}  Σ_{i=1,...,M}  log P_y^{(i, v_i)}

where the index i identifies the input image x_i, k denotes the class of image x_i, j indicates that image x_i is observed from the j-th predefined viewpoint, and R is the neural network weight parameter.
4. The method according to claim 1, wherein the convolutional neural network CNN comprising 16 layers is constructed in step (3a) by the following steps:
(3a1) The image of size 227 × 227 pixels is input to the first convolutional layer Conv1 and convolved with kernels of size 11 × 11 pixels at a stride of 4 pixels; 96 kernels are used in total, giving 96 feature maps of size 55 × 55 pixels;
(3a2) The 96 feature maps output by the first convolutional layer Conv1 are input to the first pooling layer Pool1 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 96 feature maps of size 27 × 27 pixels;
(3a3) The 96 feature maps output by the first pooling layer Pool1 are input to the second convolutional layer Conv2 and convolved with kernels of size 5 × 5 pixels at a stride of 1 pixel; 256 kernels are used in total, giving 256 feature maps of size 27 × 27 pixels;
(3a4) The 256 feature maps output by the second convolutional layer Conv2 are input to the second pooling layer Pool2 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 13 × 13 pixels;
(3a5) The 256 feature maps output by the second pooling layer Pool2 are input to the third convolutional layer Conv3 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a6) The 384 feature maps output by the third convolutional layer Conv3 are input to the fourth convolutional layer Conv4 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 384 kernels are used in total, giving 384 feature maps of size 13 × 13 pixels;
(3a7) The feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are input to the first fusion layer Eltwise1 for feature map fusion, giving 384 feature maps of size 13 × 13 pixels;
(3a8) The 384 feature maps output by the fourth convolutional layer Conv4 are input to the fifth convolutional layer Conv5 and convolved with kernels of size 3 × 3 pixels at a stride of 1 pixel; 256 kernels are used, giving 256 feature maps of size 13 × 13 pixels;
(3a9) The 256 feature maps output by the fifth convolutional layer Conv5 are input to the fifth pooling layer Pool5 and max-pooled with a pooling block of 3 × 3 pixels and a stride of 2 pixels, giving 256 feature maps of size 6 × 6 pixels;
(3a10) The 384 feature maps output by the first fusion layer Eltwise1 are input to the first fully-connected layer fc_a1 and mapped into a feature vector of size 1 × 4096;
(3a11) The 256 feature maps output by the fifth pooling layer Pool5 are input to the second fully-connected layer fc_a2 and mapped into a feature vector of size 1 × 4096;
(3a12) The 256 feature maps output by the fifth pooling layer Pool5 are input to the third fully-connected layer fc6 and mapped into a feature vector of size 1 × 4096;
(3a13) The feature vector of size 1 × 4096 output by the third fully-connected layer fc6 is input to the fourth fully-connected layer fc7 for further feature extraction, giving a feature vector of size 1 × 4096;
(3a14) The feature vectors of the first fully-connected layer fc_a1, the second fully-connected layer fc_a2 and the fourth fully-connected layer fc7 are input to the second fusion layer Eltwise2 and fused, giving a feature vector of size 1 × 4096;
(3a15) The feature vector of size 1 × 4096 output by the second fusion layer Eltwise2 is input to the fifth fully-connected layer fc8 and mapped into a feature vector of size 1 × (11 × M), where M is the number of multi-view images and the symbol "×" denotes multiplication;
(3a16) The feature vector of size 1 × (11 × M) is input to the classification layer Softmax to obtain the class probabilities of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label.
5. The method of claim 1, wherein the Convolutional Neural Network (CNN) is trained in step (3b) as follows:
(3b1) in the forward propagation stage, a training sample is taken from the training set and its multi-view images {x_1, ..., x_M} are input to the input layer of the convolutional neural network CNN; after feature extraction and feature mapping, the Softmax layer outputs the final result;
(3b2) in the back propagation stage, the difference between the actual output of the CNN and the ideal output for the training sample is computed, and the weight parameter R of the CNN is adjusted by back propagation so as to minimize the error;
(3b3) repeating the operations of (3b1) and (3b2) until the loss function J of the convolutional neural network CNN is less than or equal to 0.0001.
CN201810243399.4A 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network Active CN108491880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810243399.4A CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810243399.4A CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Publications (2)

Publication Number Publication Date
CN108491880A CN108491880A (en) 2018-09-04
CN108491880B true CN108491880B (en) 2021-09-03

Family

ID=63319473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810243399.4A Active CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Country Status (1)

Country Link
CN (1) CN108491880B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902675B (en) * 2018-09-17 2021-05-04 华为技术有限公司 Object pose acquisition method and scene reconstruction method and device
CN109493417B (en) * 2018-10-31 2023-04-07 深圳大学 Three-dimensional object reconstruction method, device, equipment and storage medium
CN111191492B (en) * 2018-11-15 2024-07-02 北京三星通信技术研究有限公司 Information estimation, model retrieval and model alignment methods and devices
CN109598339A (en) * 2018-12-07 2019-04-09 电子科技大学 A kind of vehicle attitude detection method based on grid convolutional network
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN109934864B (en) * 2019-03-14 2023-01-20 东北大学 Residual error network deep learning method for mechanical arm grabbing pose estimation
CN109978907A (en) * 2019-03-22 2019-07-05 南京邮电大学 A kind of sitting posture of student detection method towards household scene
CN111860039B (en) * 2019-04-26 2022-08-02 四川大学 Cross-connection CNN + SVR-based street space quality quantification method
CN110322510B (en) * 2019-06-27 2021-08-27 电子科技大学 6D pose estimation method using contour information
CN112396077B (en) * 2019-08-15 2024-08-02 瑞昱半导体股份有限公司 Full-connection convolutional neural network image processing method and circuit system
CN110728187B (en) * 2019-09-09 2022-03-04 武汉大学 Remote sensing image scene classification method based on fault tolerance deep learning
CN110728192B (en) * 2019-09-16 2022-08-19 河海大学 High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN110728222B (en) * 2019-09-30 2022-03-25 清华大学深圳国际研究生院 Pose estimation method for target object in mechanical arm grabbing system
CN111126441B (en) * 2019-11-25 2023-04-07 西安工程大学 Construction method of classification detection network model
CN111259735B (en) * 2020-01-08 2023-04-07 西安电子科技大学 Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN111325166B (en) * 2020-02-26 2023-07-07 南京工业大学 Sitting posture identification method based on projection reconstruction and MIMO neural network
EP3885970A1 (en) * 2020-03-23 2021-09-29 Toyota Jidosha Kabushiki Kaisha System for processing an image having a neural network with at least one static feature map
CN111738220B (en) * 2020-07-27 2023-09-15 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN112163477B (en) * 2020-09-16 2023-09-22 厦门市特种设备检验检测院 Escalator pedestrian pose target detection method and system based on Faster R-CNN
CN112381879B (en) * 2020-11-16 2024-09-06 跨维(深圳)智能数字科技有限公司 Object posture estimation method, system and medium based on image and three-dimensional model
CN112528941B (en) * 2020-12-23 2021-11-19 芜湖神图驭器智能科技有限公司 Automatic parameter setting system based on neural network
CN112634367A (en) * 2020-12-25 2021-04-09 天津大学 Anti-occlusion object pose estimation method based on deep neural network
CN112857215B (en) * 2021-01-08 2022-02-08 河北工业大学 Monocular 6D pose estimation method based on regular icosahedron
CN113129370B (en) * 2021-03-04 2022-08-19 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113705480B (en) * 2021-08-31 2024-08-02 新东方教育科技集团有限公司 Gesture recognition method, device and medium based on gesture recognition neural network
CN114742212A (en) * 2022-06-13 2022-07-12 南昌大学 Electronic digital information resampling rate estimation method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375831B (en) * 2010-08-13 2014-09-10 富士通株式会社 Three-dimensional model search device and method thereof and model base generation device and method thereof
US20160327653A1 (en) * 2014-02-03 2016-11-10 Board Of Regents, The University Of Texas System System and method for fusion of camera and global navigation satellite system (gnss) carrier-phase measurements for globally-referenced mobile device pose determination
WO2017015390A1 (en) * 2015-07-20 2017-01-26 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN106372648B (en) * 2016-10-20 2020-03-13 中国海洋大学 Plankton image classification method based on multi-feature fusion convolutional neural network
CN106845510B (en) * 2016-11-07 2020-04-07 中国传媒大学 Chinese traditional visual culture symbol recognition method based on depth level feature fusion
CN106845515B (en) * 2016-12-06 2020-07-28 上海交通大学 Robot target identification and pose reconstruction method based on virtual sample deep learning
CN107169421B (en) * 2017-04-20 2020-04-28 华南理工大学 Automobile driving scene target detection method based on deep convolutional neural network
CN107330463B (en) * 2017-06-29 2020-12-08 南京信息工程大学 Vehicle type identification method based on CNN multi-feature union and multi-kernel sparse representation
CN107527068B (en) * 2017-08-07 2020-12-25 南京信息工程大学 Vehicle type identification method based on CNN and domain adaptive learning
CN107657249A (en) * 2017-10-26 2018-02-02 珠海习悦信息技术有限公司 Method, apparatus, storage medium and the processor that Analysis On Multi-scale Features pedestrian identifies again
CN107808146B (en) * 2017-11-17 2020-05-05 北京师范大学 Multi-mode emotion recognition and classification method

Also Published As

Publication number Publication date
CN108491880A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN108491880B (en) Object classification and pose estimation method based on neural network
CN110837778B (en) Traffic police command gesture recognition method based on skeleton joint point sequence
Cheng et al. Jointly network: a network based on CNN and RBM for gesture recognition
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN108062569B (en) Unmanned vehicle driving decision method based on infrared and radar
CN106951923B (en) Robot three-dimensional shape recognition method based on multi-view information fusion
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN107705322A (en) Motion estimate tracking and system
CN113436227A (en) Twin network target tracking method based on inverted residual error
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
Naseer et al. CNN-based Object Detection via Segmentation capabilities in Outdoor Natural Scenes
CN110827304A (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolutional network and level set method
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN113870160B (en) Point cloud data processing method based on transformer neural network
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN109508686A (en) A kind of Human bodys' response method based on the study of stratification proper subspace
Wu et al. A cascaded CNN-based method for monocular vision robotic grasping
CN114187506B (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
Lin et al. Robot grasping based on object shape approximation and LightGBM
CN111428555A (en) Joint-divided hand posture estimation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant