DOI: 10.1609/aaai.v37i1.25120
Research article

Weakly supervised 3D multi-person pose estimation for large-scale scenes based on monocular camera and single LiDAR

Published: 07 February 2023

Abstract

Depth estimation is usually ill-posed and ambiguous for monocular camera-based 3D multi-person pose estimation. Since LiDAR can capture accurate depth information in long-range scenes, it can benefit both the global localization of individuals and 3D pose estimation by providing rich geometric features. Motivated by this, we propose a monocular-camera and single-LiDAR-based method for 3D multi-person pose estimation in large-scale scenes, which is easy to deploy and insensitive to light. Specifically, we design an effective fusion strategy to take advantage of multi-modal input data, including images and point clouds, and make full use of temporal information to guide the network to learn natural and coherent human motions. Without relying on any 3D pose annotations, our method exploits the inherent geometric constraints of point clouds for self-supervision and utilizes 2D keypoints on images for weak supervision. Extensive experiments on public datasets and our newly collected dataset demonstrate the superiority and generalization capability of our proposed method.
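The abstract names two supervision signals: 2D keypoints on images for weak supervision and point-cloud geometric constraints for self-supervision. As a rough illustration only (not the authors' implementation), these might be sketched as two loss terms: a reprojection loss that projects predicted 3D joints into the image with the camera intrinsics, and a one-directional Chamfer-style term pulling predicted joints toward the LiDAR points on the body. The function names, the Chamfer form, and the weights below are assumptions.

```python
import numpy as np

def reprojection_loss(joints_3d, keypoints_2d, K):
    """Weak supervision (assumed form): project predicted 3D joints in the
    camera frame with intrinsics K and compare to detected 2D keypoints."""
    proj = (K @ joints_3d.T).T           # (J, 3) homogeneous image coords
    uv = proj[:, :2] / proj[:, 2:3]      # perspective divide
    return np.mean(np.linalg.norm(uv - keypoints_2d, axis=1))

def chamfer_self_supervision(joints_3d, points):
    """Self-supervision (assumed form): each body LiDAR point should lie
    near some predicted joint (one-directional Chamfer distance)."""
    # Pairwise distances between N points and J joints via broadcasting.
    d = np.linalg.norm(points[:, None, :] - joints_3d[None, :, :], axis=-1)
    return np.mean(d.min(axis=1))

def total_loss(joints_3d, keypoints_2d, points, K, w_proj=1.0, w_cd=0.1):
    """Combined objective; the weights are illustrative, not from the paper."""
    return (w_proj * reprojection_loss(joints_3d, keypoints_2d, K)
            + w_cd * chamfer_self_supervision(joints_3d, points))
```

Both terms vanish when the predicted joints project exactly onto the detected keypoints and coincide with the point cloud, which is the sense in which no 3D pose annotation is needed.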


Cited By

  • (2024) ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling. ACM Transactions on Graphics, 43(6): 1-14. DOI: 10.1145/3687991
  • (2024) SATPose: Improving Monocular 3D Pose Estimation with Spatial-aware Ground Tactility. In Proceedings of the 32nd ACM International Conference on Multimedia, 6192-6201. DOI: 10.1145/3664647.3681654
  • (2024) HmPEAR: A Dataset for Human Pose Estimation and Action Recognition. In Proceedings of the 32nd ACM International Conference on Multimedia, 2069-2078. DOI: 10.1145/3664647.3681055
  • (2023) MM-Fi. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 18756-18768. DOI: 10.5555/3666122.3666944


Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN:978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press


Qualifiers

  • Research-article
  • Research
  • Refereed limited


