Selective Embedding with Gated Fusion for 6D Object Pose Estimation

Neural Processing Letters

Abstract

Deep learning methods for 6D object pose estimation from RGB and depth (RGB-D) images have been successfully applied to robot grasping. Fusing the RGB and depth modalities remains one of the main difficulties: previous works mostly concatenate the two types of features without considering their different contributions to pose estimation. We propose a selective embedding with gated fusion structure, called SEGate, which adaptively adjusts the weights of the RGB and depth features. Furthermore, we aggregate local point cloud features according to the distances between points, so that nearby points contribute strongly to a local feature while distant points contribute little. Experiments show that our approach achieves state-of-the-art performance on both the LineMOD and YCB-Video datasets and is more robust when estimating the poses of occluded objects.
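As a rough illustration of the adaptive weighting idea described above, the following is a minimal sketch of a channel-wise gate that fuses per-point RGB and geometry features, assuming a PyTorch-style interface. The module name, layer sizes, and the sigmoid convex combination are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a learned gate that adaptively weights RGB vs. depth
# (geometry) features per channel, in the spirit of the gated fusion described
# in the abstract. Dimensions and layer choices are assumptions for this example.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # The gate sees both modalities and predicts per-channel weights in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, depth_feat: (N_points, dim) features for the same sampled points.
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=-1))
        # Convex combination: g emphasizes the RGB feature, (1 - g) the depth feature,
        # so the network can lean on colour where geometry is unreliable and vice versa.
        return g * rgb_feat + (1.0 - g) * depth_feat

# Example: fuse features for 500 sampled points.
fused = GatedFusion(128)(torch.randn(500, 128), torch.randn(500, 128))
print(fused.shape)  # torch.Size([500, 128])
```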

References

  1. Munaro M, Menegatti E (2014) Fast RGB-D people tracking for service robots. Auton Robot 37(3):227–242

  2. Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P, Lepetit V (2011) Gradient response maps for real-time detection of textureless objects. IEEE Trans PAMI 34(5):876–888

  3. Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, Navab N (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In: Asian conference on computer vision, pp 548–562

  4. Besl PJ, McKay ND (1992) Method for registration of 3-D shapes. In: Sensor fusion IV: control paradigms and data structures, vol 1611, pp 586–606

  5. Drost B, Ulrich M, Navab N, Ilic S (2010) Model globally, match locally: efficient and robust 3D object recognition. In: IEEE computer society conference on computer vision and pattern recognition, pp 998–1005

  6. Papazov C, Burschka D (2010) An efficient ransac for 3d object recognition in noisy and occluded scenes. In: Asian conference on computer vision, pp 135–148

  7. Hinterstoisser S, Lepetit V, Rajkumar N, Konolige K (2016) Going further with point pair features. In: European conference on computer vision, pp 834–848

  8. Kiforenko L, Drost B, Tombari F, Kruger N, Buch AG (2018) A performance evaluation of point pair features. Comput Vis Image Underst 166:66–80

  9. Schnabel R, Wahl R, Klein R (2007) Efficient RANSAC for point-cloud shape detection. Comput Graph Forum 26(2):214–226

  10. Aldoma A, Marton ZC, Tombari F, Wohlkinger W, Potthast C, Zeisl B, Vincze M (2012) Tutorial: point cloud library: three-dimensional object recognition and 6 dof pose estimation. IEEE Robot Automation Mag 19(3):80–91

  11. Aldoma A, Tombari F, Stefano LD, Vincze M (2012) A global hypotheses verification method for 3d object recognition. In: European conference on computer vision, pp 511–524

  12. Guo Y, Bennamoun M, Sohel F, Lu M, Wan J, Kwok NM (2016) A comprehensive performance evaluation of 3D local feature descriptors. Int J Comput Vis 116(1):66–89

  13. Doumanoglou A, Kouskouridas R, Malassiotis S, Kim TK (2016) Recovering 6D object pose and predicting next-best-view in the crowd. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3583–3592

  14. Tejani A, Kouskouridas R, Doumanoglou A, Tang D, Kim TK (2017) Latent-class hough forests for 6 DoF object pose estimation. IEEE Trans PAMI 40(1):119–132

  15. Brachmann E, Krull A, Michel F, Gumhold S, Shotton J, Rother C (2014) Learning 6d object pose estimation using 3d object coordinates. In: European conference on computer vision, pp 536–551

  16. Brachmann E, Michel F, Krull A, Yang MY, Gumhold S (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3364–3372

  17. Rangaprasad AS (2017) Probabilistic approaches for pose estimation, Carnegie Mellon University

  18. Rad M, Lepetit V (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: Proceedings of the IEEE international conference on computer vision, pp 3828–3836

  19. Kehl W, Manhardt F, Tombari F, Ilic S, Navab N (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In: Proceedings of the IEEE international conference on computer vision, pp 1521–1529

  20. Xiang Y, Schmidt T, Narayanan V, Fox D (2017) Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. Preprint arXiv:1711.00199

  21. Li C, Bai J, Hager GD (2018) A unified framework for multi-view multi-class object pose estimation. In: Proceedings of the european conference on computer vision (ECCV), pp 254–269

  22. Wang C, Xu D, Zhu Y, Martín-Martín R, Lu C, Fei-Fei L (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3343–3352

  23. Suwajanakorn S, Snavely N, Tompson JJ, Norouzi M (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In: Advances in neural information processing systems, pp 2059–2070

  24. Tremblay J, To T, Sundaralingam B, Xiang Y, Fox D, Birchfield S (2018) Deep object pose estimation for semantic robotic grasping of household objects. Preprint arXiv:1809.10790

  25. Kendall A, Grimes M, Cipolla R (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE international conference on computer vision, pp 2938–2946

  26. Song S, Xiao J (2016) Deep sliding shapes for amodal 3d object detection in rgb-d images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 808–816

  27. Li C, Lu B, Zhang Y, Liu H, Qu Y (2018) 3D reconstruction of indoor scenes via image registration. Neural Process Lett 48(3):1281–1304

  28. Qi CR, Liu W, Wu C, Su H, Guibas LJ (2018) Frustum pointnets for 3d object detection from rgb-d data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 918–927

  29. Zhou Y, Tuzel O (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4490–4499

  30. Guo D, Li W, Fang X (2017) Capturing temporal structures for video captioning by spatio-temporal contexts and channel attention mechanism. Neural Process Lett 46(1):313–328

  31. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  32. Park J, Woo S, Lee JY, Kweon IS (2018) Bam: bottleneck attention module. Preprint arXiv:1807.06514

  33. Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the european conference on computer vision (ECCV), pp 3–19

  34. Wojek C, Walk S, Roth S, Schiele B (2011) Monocular 3D scene understanding with explicit occlusion reasoning. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1993–2000

  35. Xu Y, Zhou X, Liu P, Xu H (2019) Rapid pedestrian detection based on deep omega-shape features with partial occlusion handing. Neural Process Lett 49(3):923–937

  36. Sanyal R, Ahmed SM, Jaiswal M, Chaudhury KN (2017) A scalable ADMM algorithm for rigid registration. IEEE Signal Process Lett 24(10):1453–1457

  37. Eitel A, Springenberg JT, Spinello L, Riedmiller M, Burgard W (2015) Multimodal deep learning for robust RGB-D object recognition. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 681–687

  38. Wang W, Neumann U (2018) Depth-aware cnn for rgb-d segmentation. In: Proceedings of the european conference on computer vision (ECCV), pp 135–150

  39. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141

  40. Bell S, Lawrence Zitnick C, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883

  41. Cheng Y, Cai R, Li Z, Zhao X, Huang K (2017) Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3029–3037

  42. Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 652–660

  43. Qi CR, Yi L, Su H, Guibas LJ (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in neural information processing systems, pp 5099–5108

  44. Sundermeyer M, Marton ZC, Durner M, Brucker M, Triebel R (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In: Proceedings of the european conference on computer vision (ECCV), pp 699–715

  45. Xu D, Anguelov D, Jain A (2018) Pointfusion: deep sensor fusion for 3d bounding box estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 244–253

  46. Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N, Lepetit V (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: International conference on computer vision, pp 858–865

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61231010.

Author information

Corresponding author

Correspondence to Rongke Liu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The LineMOD dataset contains more than 18,000 real images across 13 video sequences, each of which contains several low-textured objects. It is a classic benchmark that is widely used by both traditional approaches and recent deep learning methods. Figure 7 shows the qualitative results for all remaining objects on the LineMOD dataset.

Fig. 8 Qualitative results of severely occluded objects on the YCB-Video dataset

Table 4 Quantitative evaluation of 6D pose (ADD [20]) on the YCB-Video dataset

The YCB-Video dataset consists of 21 objects in 92 RGB-D video sequences, for a total of 133,827 frames with 3 to 9 objects per frame. Because many of its objects are heavily occluded, it is often used to evaluate pose estimation under severe occlusion. Figure 8 compares the estimated poses of the most heavily occluded objects on the YCB-Video dataset. As shown in Fig. 8, PoseCNN+ICP and DenseFusion perform poorly on the severely occluded objects (tomato_soup_can, bowl, cracker_box, wood_block, large_clamp and pudding_box), whereas our approach remains more robust to the pose estimation of heavily occluded objects.

Table 4 lists the results for all 21 objects on the YCB-Video dataset under the ADD metrics. Our SEGate method outperforms DenseFusion by 1.3% on the ADD(< 2 cm) metric and is essentially on par with it on the ADD(AUC) metric. PoseCNN+ICP performs slightly better on the AUC metric, but its ICP refinement makes it extremely time consuming. Note that the MEAN values of ADD(AUC) and ADD(< 2 cm) in Table 4 are computed over all objects in the dataset jointly, i.e., the area under the accuracy curve and the fraction of poses with error below 2 cm across the whole dataset, rather than the average of the 21 per-object values.
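For reference, the sketch below illustrates how the ADD error [20] and the two summary statistics reported in Table 4 (accuracy at a 2 cm threshold and the area under the accuracy–threshold curve) are commonly computed. This is an illustrative implementation rather than the evaluation code used in the paper; the 10 cm maximum threshold for the AUC follows the PoseCNN protocol, and the errors in the usage example are synthetic placeholders.

```python
# Hedged sketch of the ADD metric and its ADD(< 2 cm) / ADD(AUC) summaries.
import numpy as np

def add_error(model_pts, R_gt, t_gt, R_pred, t_pred):
    """Mean distance (metres) between model points under the GT and predicted poses."""
    pts_gt = model_pts @ R_gt.T + t_gt
    pts_pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(pts_gt - pts_pred, axis=1).mean()

def add_accuracy(errors, threshold=0.02):
    """Fraction of frames whose ADD error is below the threshold (default 2 cm)."""
    return (np.asarray(errors) < threshold).mean()

def add_auc(errors, max_threshold=0.10, steps=1000):
    """Area under the accuracy curve as the threshold sweeps from 0 to max_threshold."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracies = [(errors < t).mean() for t in thresholds]
    # Normalize so a perfect method (zero error everywhere) scores 1.0.
    return np.trapz(accuracies, thresholds) / max_threshold

# Toy usage with synthetic per-frame errors in metres (placeholders, not paper results).
errs = np.abs(np.random.randn(200)) * 0.02
print(add_accuracy(errs), add_auc(errs))
```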

Cite this article

Sun, S., Liu, R., Du, Q. et al. Selective Embedding with Gated Fusion for 6D Object Pose Estimation. Neural Process Lett 51, 2417–2436 (2020). https://doi.org/10.1007/s11063-020-10198-8
